This repository has been archived by the owner on May 8, 2024. It is now read-only.

Recent agent changes broke postgres CI on windows #218

Closed
anarazel opened this issue Mar 4, 2022 · 12 comments


anarazel commented Mar 4, 2022

Hi,

Something between agent version 1.73.2-f0bbc44 and 1.73.6-6fc7e4b broke our windows tests.

Successful run:
https://cirrus-ci.com/task/5737468287778816

Failed rerun:
https://cirrus-ci.com/task/5239235262283776

It looks like processes started in one script are now terminated. It might be that postgres doesn't do the right magic dance to fully detach from the starting console on windows - but it's not obvious what would be required, we've experimented with different CreateProcess() flags without success so far.

Regards,

Andres

@fkorotkov (Contributor)

Sorry about that! There were some changes in #217 related to how scripts are executed, and this particular line might have affected your build. We'll try to investigate.

In the meantime, you can add a CIRRUS_AGENT_VERSION environment variable to your Windows tasks to work around the issue and explicitly use the 1.73.2 version of the agent for now.

@anarazel (Author)

anarazel commented Mar 5, 2022

> There were some changes in #217 related to how scripts are executed and this particular line might have affected your build. We'll try to investigate.

Yea, I was guessing it might be related, and staring at the code trying to figure out what it's trying to do ;)

It'd be good if stuff like killing processes were noted in the debug log.

> In the meantime you can add CIRRUS_AGENT_VERSION environment variable to your Windows tasks to work around the issue and explicitly use the 1.73.2 version of the agent for now.

Ah, neat.

@anarazel (Author)

anarazel commented Mar 5, 2022

Ah, found it, I think: On windows it looks like you create a job for each script: https://github.com/cirruslabs/cirrus-ci-agent/blob/master/internal/executor/shellcommands_windows.go#L39
but then the recent commit does a ShellCommands.kill whenever the script is done:
6fc7e4b#diff-6380403c4cdbed39c01cb2da63297533395138b38147b47aa9504af47a41fdf4R76

In contrast to before, now the entire job is killed, which includes all child processes. Basically undoing #179, afaics?

@anarazel (Author)

anarazel commented Mar 5, 2022

And because the job is created without JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK, we can't even escape the job easily...

@edigaryev (Contributor)

Hi Andres!

I'm sorry that these changes have broken your build. However, I currently don't see any other way around this, as we can't guarantee both (1) Cmd.Wait() termination without #179 and (2) collecting all the data from os.Pipe without killing these child processes.

By the way, the underlying issue for these fixes we've had to implement in the agent is explained in golang/go#23019.

Have you considered switching to the background_script instruction instead?

@anarazel (Author)

anarazel commented Mar 5, 2022

> However, I don't currently see any other way around this, as we can't guarantee both (1) Cmd.Wait() termination without #179 and (2) collecting all the data from os.Pipe without killing these child processes.

The problem isn't caused by killing the direct child processes. They're all dead already. The problem is that on windows you're killing the entire job, including children of child processes, and you're not allowing commands to "escape" the job (see comment about JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK above).

Note that on !windows you're just killing the process group, rather than something like a job (which'd be more equivalent to a cgroup). You actually can create child processes as a process group on windows too. Have you considered creating the shell with CREATE_NEW_PROCESS_GROUP and then killing that process group, rather than a job?

> Have you considered switching to the background_script instruction instead?

Yes, but it doesn't work without a fair bit of added complication/fragility. The command only returns when we're ready to execute the next steps.

@edigaryev (Contributor)

Hi Andres! Sorry for the late reply.

I've managed to reproduce your issue and I think that I understand the problem you're facing better now.

I understand that in your case the simplest solution would be to use a blocking pg_ctl -w start with a non-background script instruction, but this implies that the agent will let commands escape the job (which is, again, what you want).

The problem with this is that it is not ideal for non-ephemeral environments with Persistent Worker, where a single compute environment is re-used across build runs.

To me, it still seems that the proper solution would be to call pg_isready repeatedly somewhere in between startcreate_background_script and test_pl_script until it succeeds.

However, I've realized there might be a middle-ground approach, see #220. It should keep the old behavior for background processes that don't hold the standard output, yet still ensure that the agent retrieves all the output from the processes it has spawned, which is vital in e.g. cache instruction's fingerprint_script.

@anarazel (Author)

anarazel commented Mar 9, 2022

Hi,

Thanks for working on this.

> The problem with this is that it is not ideal for non-ephemeral environments with Persistent Worker, where a single compute environment is re-used across build runs.

That makes sense - but the !windows case currently isn't protected against that at all? Process groups are obviously easy to "escape" from, and a lot of software does so (including pg_ctl). It seems that for Windows the right approach would be to have one job across the whole test run and kill that at the end? For Linux you'd presumably need a cgroup. No idea for the BSDs or macOS.

> To me, it still seems that the proper solution would be to call pg_isready repeatedly somewhere in between startcreate_background_script and test_pl_script until it succeeds.

That'd kind of work, but it's a fair bit more complicated than desirable :/. Besides needing to loop etc., there's an added complication: currently the code to drop admin rights etc. lives in pg_ctl, which returns after starting postgres. The daemon itself refuses to run with admin rights. We don't currently have tools to wait until the server has shut down.

Partially that's something we should improve in our tooling, admittedly...

> However, I've realized there might be a middle-ground approach, see #220. It should keep the old behavior for background processes that don't hold the standard output, yet still ensure that the agent retrieves all the output from the processes it has spawned, which is vital in e.g. the cache instruction's fingerprint_script.

Sounds good.

@anarazel (Author)

> However, I've realized there might be a middle-ground approach, see #220. It should keep the old behavior for background processes that don't hold the standard output, yet still ensure that the agent retrieves all the output from the processes it has spawned, which is vital in e.g. the cache instruction's fingerprint_script.

A run for a branch of postgres that doesn't specify the agent version ended up using agent version 1.74.0-b55ebe7, which afaics contains #220, and it still fails: https://cirrus-ci.com/task/6021634623537152

Reading through later iterations of that change, it looks like I'd have to set CIRRUS_ESCAPING_PROCESSES? Would be nice to document that, if so.

I still don't understand the equivalence between killing jobs on windows and killing process groups everywhere else. Process groups can be escaped (and are, by typical daemons) with a single setsid(); jobs cannot.

@edigaryev (Contributor)

> That makes sense - but the !windows case currently isn't protected against that at all? Process groups are obviously easy to "escape" from, and a lot of software does so (including pg_ctl).

You're right: the decision to use jobs on Windows was one of convenience, because the process group implementation there is lacking (see nodejs/node#3617).

Afterwards, I rationalized this as a step towards better resource controls, but since it's not possible to use cgroups conveniently across different UNIX-like OSes, full process controls are currently lacking.

> Seems that for windows the right approach would be to have one job across the whole test run and kill that at the end?

Unfortunately this wouldn't work, because we need to ensure that there are no processes holding their end of our pipe FD at the end of each instruction, otherwise we'd wait indefinitely for the pipe to EOF. This pipe is used to collect the standard output/error and is crucial for cache fingerprint scripts, for example (because missing output would result in a different hash).

@edigaryev (Contributor)

> Reading through later iterations of that change, it looks like I'd have to set CIRRUS_ESCAPING_PROCESSES? Would be nice to document that, if so.

That's right. Docs on the way: cirruslabs/cirrus-ci-docs#984.

BTW, regarding JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK as an alternative solution.

Did you mean JOB_OBJECT_LIMIT_BREAKAWAY_OK (https://docs.microsoft.com/en-us/windows/win32/procthread/job-objects#managing-processes-in-jobs)? Because the silent one seems to apply only to parents created directly by the agent, which in our case would be the shell. So we wouldn't kill anything except the shell, and then it makes no sense to use jobs at all.

And as for JOB_OBJECT_LIMIT_BREAKAWAY_OK, I haven't found any CLI tool that would allow starting programs with this flag. Do you have any in mind?

@anarazel (Author)

Just FTR: CIRRUS_ESCAPING_PROCESSES did fix the issue.
