Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] kill/crash conan process leaves a lock file containing count -1 #11033

Closed
cr-dani-koretsky opened this issue Apr 14, 2022 · 3 comments
Closed
Milestone

Comments

@cr-dani-koretsky
Copy link

cr-dani-koretsky commented Apr 14, 2022

Environment Details (include every applicable attribute)

  • Operating System+version: Windows Server 2012
  • Compiler+version: VS 2017 MSBuild (17:21:17 Microsoft (R) Build Engine version 16.9.0+5e4b48a27 for .NET Framework)
  • Conan version: 1.44.1
  • Python version: 2.7.14

Steps to reproduce (Include if Applicable)

  1. Run a conan install command on many packages
  2. Kill the conan process in the middle - specifically while a lock file remains
  3. Try to run another conan process later - it will fail
  4. Try to run "remove locks" -> it will succeed but the count still remains at -1 [minus one]

Logs (Executed commands with output) (Include/Attach if Applicable)

16:43:59  gtest/1.10.0@google/stable is locked by another concurrent conan process, wait...
16:43:59  If not the case, quit, and do 'conan remove --locks'
17:20:31  conanfile.txt: Installing package

Basically - even running conan remove --locks doesn't help.

Is anyone aware of such a bug?

Executive summary

Conan dying mid-operation sometimes leaks lock files, even conan remove --locks doesn't fix it (and too late anyways, after jobs are indefinitely stuck until intervention).

I think a simple "pid stored in the lock file" can prevent such infinite deadlocks.
Bonus - we avoid the sketchy conan remove --locks command which doesn't even help here.

Our only workaround currently is to delete the lock file itself directly..

I'm thinking of the Linux Daemon style classic "If the pid is not alive - just take over the file".
WDYT?

@cr-dani-koretsky cr-dani-koretsky changed the title [bug] conan kill process leaves a lock file containing count -1 [bug] kill/crash conan process leaves a lock file containing count -1 Apr 14, 2022
@memsharded
Copy link
Member

Hi @cr-dani-koretsky

It is a well known fact that a crash will leave the locks there, and a conan remove --locks is necessary, and still there might be cases when things are not fully fixed.

Yes, we have considered different alternatives. Any daemon has been completely discarded. It is extremely challenging to maintain such functionality, specially in multi-platform (windows, linux, mac, sun, freebsd...). Too much effort for the value. There are some multi-platform issues involved, like for example Windows reusing PIDs. Also the fact that many tasks can take a lot of time, from minutes to almost hours (like building a large library), doesn't help, because that mean that any approach based in timeouts will eventually fail for some users.

For Conan 2.0 we have completely removed concurrency at the moment. It was challenging enough to implement the multi-revision cache and get rid of the short-paths implementation. We will eventually go back to the cache concurrency issue, but most likely after 2.0 (what we call 2.X). We might try to leverage DB sync capabilities, or some new implementation of readers-writer that apparently has been contributed to fasteners library, and that might get rid of the problem, managing that at the system level (instead of the app level that is what we are doing now)

@memsharded memsharded added this to the 2.X milestone Apr 15, 2022
@cr-dani-koretsky
Copy link
Author

cr-dani-koretsky commented Apr 19, 2022

Hi @memsharded !

All of your points are valid and good ideas.
I think any improvement to "conan killed dont break all future conans" would help.

You are correct that PIDs could get re-used.

Let me know if I understood correct, the flow you are describing:

  1. Conan ran with PID X and killed/crashed
  2. Other processes grab same PID

Wouldn't the following avoid all of these problems?

  1. On conan execution - check if the PID in the lock file is alive
  2. If alive -> check if it is a CONAN process (since nothing else should maintain ownership of CONAN lock files)
  3. Check that it is not US (in case we are the ones who grabbed the same PID)

Even if we avoid implementing check No. 2 and only implement No. 3 (in case someone uses conan not via commandline conan) -> we are just "making the fix incomplete" but we're still making things WAY better here..

We would still identify an "abandoned lock" in the 99% of cases where same PID didn't yet get re-used by ANY process that is not US.

For most busy machines (multiple builds per day/week) - I think PID re-use will be a much rarer issue than "any killed conan in the past leaves locks forever".

WDYT?

P.S.
About conan remove locks that doesn't work - could we perhaps fix this?
For our temporary solution we could start with running conan remove locks before every new conan - however currently even this doesn't release locks, it leaves them in a bad state that conan still considers locked..

@memsharded
Copy link
Member

Closing in favor of #15840 where the Conan 2 cache concurrency future progress will be reported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants