This repository has been archived by the owner on May 8, 2024. It is now read-only.

Recent agent changes broke postgres CI on windows #218

Closed
anarazel opened this issue Mar 4, 2022 · 12 comments


anarazel commented Mar 4, 2022

Hi,

Something between agent version 1.73.2-f0bbc44 and 1.73.6-6fc7e4b broke our windows tests.

Successful run:
https://cirrus-ci.com/task/5737468287778816

Failed rerun:
https://cirrus-ci.com/task/5239235262283776

It looks like processes started in one script are now terminated. It might be that postgres doesn't do the right magic dance to fully detach from the starting console on windows - but it's not obvious what would be required, we've experimented with different CreateProcess() flags without success so far.

Regards,

Andres

@fkorotkov (Contributor)

Sorry about that! There were some changes in #217 related to how scripts are executed, and this particular line might have affected your build. We'll try to investigate.

In the meantime, you can add a CIRRUS_AGENT_VERSION environment variable to your Windows tasks to work around the issue and explicitly use the 1.73.2 version of the agent for now.

@anarazel (Author)

anarazel commented Mar 5, 2022

> There were some changes in #217 related to how scripts are executed and this particular line might have affected your build. We'll try to investigate.

Yea, I was guessing it might be related, and staring at the code trying to figure out what it's trying to do ;)

It'd be good if stuff like killing processes were noted in the debug log.

> In the meantime you can add CIRRUS_AGENT_VERSION environment variable to your Windows tasks to work around the issue and explicitly use the 1.73.2 version of the agent for now.

Ah, neat.

@anarazel (Author)

anarazel commented Mar 5, 2022

Ah, found it, I think: On windows it looks like you create a job for each script: https://github.com/cirruslabs/cirrus-ci-agent/blob/master/internal/executor/shellcommands_windows.go#L39
but then the recent commit does a ShellCommands.kill whenever the script is done:
6fc7e4b#diff-6380403c4cdbed39c01cb2da63297533395138b38147b47aa9504af47a41fdf4R76

In contrast to before, now the entire job is killed, which includes all child processes. Basically undoing #179, afaics?

@anarazel (Author)

anarazel commented Mar 5, 2022

And because the job is created without JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK, we can't even escape the job easily...

@edigaryev (Contributor)

Hi Andres!

I'm sorry that these changes have broken your build. However, I currently don't see any other way around this, as we can't guarantee both (1) Cmd.Wait() termination without #179 and (2) collecting all the data from os.Pipe without killing these child processes.

By the way, the underlying issue for these fixes we've had to implement in the agent is explained in golang/go#23019.

Have you considered switching to the background_script instruction instead?

@anarazel (Author)

anarazel commented Mar 5, 2022

> However, I don't currently see any other way around this, as we can't guarantee both (1) Cmd.Wait() termination without #179 and (2) collecting all the data from os.Pipe without killing these child processes.

The problem isn't caused by killing the direct child processes. They're all dead already. The problem is that on windows you're killing the entire job, including children of child processes, and you're not allowing commands to "escape" the job (see comment about JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK above).

Note that on !windows you're just killing the process group, rather than something like a job (which'd be more equivalent to a cgroup). You actually can create child processes as a process group on windows too. Have you considered creating the shell with CREATE_NEW_PROCESS_GROUP and then killing that process group, rather than a job?

> Have you considered switching to the background_script instruction instead?

Yes, but it doesn't work without a fair bit of added complication/fragility. The command only returns when we're ready to execute the next steps.

@edigaryev (Contributor)

Hi Andres! Sorry for the late reply.

I've managed to reproduce your issue and I think that I understand the problem you're facing better now.

I understand that in your case the simplest solution would be to use a blocking pg_ctl -w start with a non-background script instruction, but this implies that the agent will let commands escape the job (which is, again, what you want).

The problem with this is that it is not ideal for non-ephemeral environments with Persistent Worker, where a single compute environment is re-used across build runs.

To me, it still seems that the proper solution would be to call pg_isready repeatedly somewhere in between startcreate_background_script and test_pl_script until it succeeds.

However, I've realized there might be a middle-ground approach, see #220. It should keep the old behavior for background processes that don't hold the standard output, yet still ensure that the agent retrieves all the output from the processes it has spawned, which is vital in e.g. cache instruction's fingerprint_script.

@anarazel (Author)

anarazel commented Mar 9, 2022

Hi,

Thanks for working on this.

> The problem with this is that it is not ideal for non-ephemeral environments with Persistent Worker, where a single compute environment is re-used across build runs.

That makes sense - but the !windows case currently isn't protected against that at all? Process groups are obviously easy to "escape" from, and a lot of software does so (including pg_ctl). It seems that for Windows the right approach would be to have one job across the whole test run and kill that at the end? For Linux you'd presumably need a cgroup. No idea for the BSDs or macOS.

> To me, it still seems that the proper solution would be to call pg_isready repeatedly somewhere in between startcreate_background_script and test_pl_script until it succeeds.

That'd kind of work, but it's a fair bit more complicated than desirable :/. Besides needing to loop etc., there's an added complication: currently the code to drop admin rights etc. lives in pg_ctl, which returns after starting postgres. The daemon itself refuses to run with admin rights. We don't currently have tools to wait until the server has shut down.

Partially that's something we should improve in our tooling, admittedly...

> However, I've realized there might be a middle-ground approach, see #220. It should keep the old behavior for background processes that don't hold the standard output, yet still ensure that the agent retrieves all the output from the processes it has spawned, which is vital in e.g. the cache instruction's fingerprint_script.

Sounds good.

@anarazel (Author)

> However, I've realized there might be a middle-ground approach, see #220. It should keep the old behavior for background processes that don't hold the standard output, yet still ensure that the agent retrieves all the output from the processes it has spawned, which is vital in e.g. the cache instruction's fingerprint_script.

A run for a branch of postgres that doesn't specify the agent version ended up using agent version 1.74.0-b55ebe7, which afaics contains #220, and it still fails: https://cirrus-ci.com/task/6021634623537152

Reading through later iterations of that change, it looks like I'd have to set CIRRUS_ESCAPING_PROCESSES? Would be nice to document that, if so.

I still don't understand the equivalence between killing jobs on windows and killing process groups everywhere else. Process groups can be escaped (and are, by typical daemons) with a single setsid(); jobs cannot.

@edigaryev (Contributor)

> That makes sense - but the !windows case currently isn't protected against that at all? Process groups are obviously easy to "escape" from, and a lot of software does so (including pg_ctl).

You're right: the decision to use jobs on Windows was one of convenience, because the process group implementation there is lacking (see nodejs/node#3617).

Afterwards, I rationalized this as a step towards better resource controls, but since it's not possible to use cgroups conveniently across different UNIX-like OSes, full process controls are currently lacking.

> Seems that for windows the right approach would be to have one job across the whole test run and kill that at the end?

Unfortunately this wouldn't work, because we need to ensure that there are no processes holding their end of our pipe FD at the end of each instruction, otherwise we'd wait indefinitely for the pipe to EOF. This pipe is used to collect the standard output/error and is crucial for cache fingerprint scripts, for example (because missing output would result in a different hash).

@edigaryev (Contributor)

> Reading through later iterations of that change, it looks like I'd have to set CIRRUS_ESCAPING_PROCESSES? Would be nice to document that, if so.

That's right. Docs on the way: cirruslabs/cirrus-ci-docs#984.

BTW, regarding JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK as an alternative solution.

Did you mean JOB_OBJECT_LIMIT_BREAKAWAY_OK (https://docs.microsoft.com/en-us/windows/win32/procthread/job-objects#managing-processes-in-jobs)? Because the silent one seems to apply only to parents created directly by the agent, which in our case would be the shell. So we wouldn't kill anything except the shell, and then it makes no sense to use jobs at all.

And as for JOB_OBJECT_LIMIT_BREAKAWAY_OK, I haven't found any CLI tool that would allow starting programs with this flag. Do you have any in mind?

@anarazel (Author)

Just FTR: CIRRUS_ESCAPING_PROCESSES did fix the issue.
