Zincati fails to update nodes: Too many open files #1608
Comments
That release has the new Zincati release: https://github.com/coreos/zincati/releases/tag/v0.0.26. Possibly related to one of those dependency bumps?
OK and Zincati v0.0.26 just got promoted in today's
It includes Zincati v0.0.26 which may be a culprit in coreos/fedora-coreos-tracker#1608 where updates appear to break in some situations. Let's pause until we know more.
It's an old issue, but this looks similar in that it's also about async reqwest: seanmonstar/reqwest#386. I didn't see anything obvious, though, in the reqwest release notes (for the range we went through in that Zincati release).
This problem is real:
In a quick skim of things, I think this is fixed by lucab/libsystemd-rs@ee15505.
Looks like we have a window of a few days where Zincati hasn't yet run out of its allotment of open files:
I imagine for all
Zincati bump in coreos/zincati#1118.
I think there have been other cases in the past where Zincati got into a state where a simple restart would have allowed the system to progress with updating. I wonder if we shouldn't do something like this to just force-restart the service periodically:
diff --git a/dist/systemd/system/zincati.service b/dist/systemd/system/zincati.service
index 1837a09..ae6fdda 100644
--- a/dist/systemd/system/zincati.service
+++ b/dist/systemd/system/zincati.service
@@ -22,6 +22,7 @@ Type=notify
 ExecStart=/usr/libexec/zincati agent ${ZINCATI_VERBOSITY}
 Restart=on-failure
 RestartSec=10s
+RuntimeMaxSec=2w
 
 [Install]
 WantedBy=multi-user.target
If we set the time to something shorter, it could possibly be a roundabout way to fix coreos/zincati#928.
I know this would feel gross/icky. The truth of the matter is that we want code that is bug-free, but in reality we can't foresee or catch everything, and in this component of the system it's important to have an escape hatch.
The
|
I can confirm that coreos/zincati#1118 fixes this issue:
|
Seems reasonable to me, though I would bump it to e.g. 3 weeks so it's safely longer than our usual release interval. IOW, we usually shouldn't be running for longer than 3 weeks, and if we are, that might indicate something is wrong and restarting Zincati is worth a try.
We've seen some issues in the past where a simple restart of the zincati daemon would have allowed systems to continue updating. Let's periodically restart the zincati daemon to handle cases like this in the future, which we can't always foresee. The most recent example is coreos/fedora-coreos-tracker#1608.
The fix for this went into This was a fast-track to get the stable update out within a day of it shipping to end users. By getting it out to users this fast the Zincati client on the node should not have run out of open files allotment and should be able to update the system from |
Started a hackmd for the coreos-status communication: https://hackmd.io/RcLX0wjNTE-BheouqnYt_Q?edit |
The fix for this went into |
The fix for this went into |
If you were on The logs look like this (this example is from a
So you'll also need to apply the workaround. |
`38.20231027.2.0` was the last 38 release of `testing`. It also happens to be the first release with the Zincati problem [1]. To avoid this problem, we'll make the 38->39 update barrier (the one that satisfies https://docs.fedoraproject.org/en-US/fedora-coreos/update-barrier-signing-keys/) be `38.20231014.2.0` rather than `38.20231027.2.0`. [1] coreos/fedora-coreos-tracker#1608
Describe the bug
Several `next` nodes running `39.20231022.1.0` did not update as part of the latest update cycle. Zincati reported errors.
Restarting zincati fixed the problem.
Reproduction steps
Haven't spent time trying to reproduce.
Examining the fleet I observe:
Nodes that updated to `39.20231022.1.0` on/around 2023-10-25 18:35:58 UTC and have remained up seem to be affected.
Expected behavior
Node updates like it always has.
Actual behavior
Node does not update. Zincati reports errors.
System details
Butane or Ignition config
No response
Additional information
The node updated on Oct 25th, and by Oct 29th a "Too many open files" error was observed, which has repeated up until recently. No other system functionality was impacted during this time.
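For anyone wanting to check whether a node is heading toward the same error, here is a minimal sketch, assuming a Linux host with `/proc` mounted, that compares a process's open file descriptor count with its soft limit. The helper names (`open_fd_count`, `soft_nofile_limit`, `fd_usage`) are hypothetical and not part of Zincati; reading another user's `/proc/<pid>/fd` typically requires root.

```python
import os
from pathlib import Path
from typing import Optional, Tuple


def open_fd_count(pid: int) -> int:
    """Count open file descriptors: each entry in /proc/<pid>/fd is one descriptor."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_nofile_limit(pid: int) -> Optional[int]:
    """Return the soft 'Max open files' limit, or None if it is unlimited."""
    # /proc/<pid>/limits contains a row of the form:
    #   Max open files            <soft>               <hard>               files
    for line in Path(f"/proc/{pid}/limits").read_text().splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    raise RuntimeError("'Max open files' row not found")


def fd_usage(pid: int) -> Tuple[int, Optional[int]]:
    """Return (open descriptor count, soft limit) for the given process."""
    return open_fd_count(pid), soft_nofile_limit(pid)


if __name__ == "__main__":
    import sys

    # With no argument, inspect this script's own process (illustration only).
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    used, limit = fd_usage(pid)
    print(f"pid {pid}: {used} open file descriptors, soft limit {limit}")
```

On a node, the PID to pass in could be obtained with `systemctl show --property MainPID --value zincati.service`. A count that climbs steadily toward the limit between checks would be consistent with a descriptor leak like the one reported here.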