-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bridge: rewrite fsreplace1 as AsyncChannel #20278
bridge: rewrite fsreplace1 as AsyncChannel #20278
Conversation
This bridge crash is unrelated, but may be interesting? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I suppose the conversion to asyncchannel is necessary to support large file uploads in c-files? That part is nice.
I'm less sure about the NamedTemporaryFile, though. The "compute temp path" code is/was gross, but the umask and tmp
scoping hacks are as well. I have some questions about it, but if the with
context seriously doesn't clean up when exceptions happen, it's not worth having IMHO.
src/cockpit/channels/filesystem.py
Outdated
await loop.run_in_executor(None, tmp.write, data) | ||
data = await self.read() | ||
|
||
os.fdatasync(tmp.fileno()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is that necessary? The commit message doesn't give the "why".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't call fdatasync() then we're not guaranteed to have the new file's data on disk before the rename takes effect, and we could end up with a zero-length file.
This is how g_file_set_contents()
does it, for example. See this comment: https://gitlab.gnome.org/GNOME/glib/-/blob/704d0bb297cff6fc53beca48ccc2cfd061f0bf5e/glib/gfileutils.c#L1094
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so this isn't about rename() not working on "only in write cache" data (that would surprise me, as it's entirely an inode-level operation), but just generally making sure about data integrity, i.e. strong promises about that API ("if it succeeds, it's on disk"). This isn't documented. I don't mind adding it, it just wasn't clear why it's necessary for the renaming to work (... as it isn't). Thanks! (I'd prefer a second commit with an explanation, but not insisting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not yet clear to me that we accept this proposal PR in principle yet, so it would indeed be useful to have this if we stay with the old approach.
I'd feel pretty bad about calling that outside of a thread, though — it can block for seconds... and calling it in a thread in a way that works in callback-based code would start to get a bit hairy...
src/cockpit/channels/filesystem.py
Outdated
os.fchmod(tmp.fileno(), stat.S_IMODE(buf.st_mode)) | ||
os.fchown(tmp.fileno(), buf.st_uid, buf.st_gid) | ||
os.rename(tmp.name, path) | ||
tmp = None # don't need to delete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, is that safe with a with
statement? This looks like it could seriously crash the contex's __exit__()
. It apparently doesn't right now, but this is too hackish for my taste.
I just tested
>>> import tempfile, os
>>> with tempfile.NamedTemporaryFile() as tmp:
... os.unlink(tmp.name)
and that works fine. There is no situation where we would ever want to keep the temporary file around, it should be removed in all cases. So it seems that can be cleaned up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that also work in Python 3.6? At some point this gave FileNotFoundError
. They recently improved the situation with a new delete_on_close
kwarg in Python 3.12, so I wouldn't be surprised if there were other cleanups on this point.
We could do the delete and suppress the message, but there's a theoretical (albeit effectively 0) possibility that we're deleting a file that we didn't create in that case:
- we create the tempfile, write write, fsync
- rename it into place over the real filename
- at this exact moment another process comes along and creates a new temp file and just happens to get exactly the same name as the one we used
- we
os.unlink()
what we think is our filename, but is actually the one created by the other process
So effectively, I'm using tmp = None
here to mean "don't delete this".
As for the context manager vs. None
thing, this isn't actually a problem. There are two things involved in a context manager:
- the context manager itself (which is captured by the
with
block to have__exit__()
called later) - the return value of the
__enter__()
function of the context manager — this is the thing that gets optionally captured by the optionalas
clause
We're nulling out the second copy, but not the first. In this case, I think it's the same object, but to be honest I'm not sure. In either case, it doesn't matter — with
holds its own copy and will .close()
it (but not delete it).
I had a different version of this code that kept a separate variable which was the name of the file that we should delete when exiting the function. I could try that again if you feel it is better. I ended up liking this version more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. Makes sense to keep an internal copy, so that the code can mess around with the variable.
I tested it on py3.6 (rhel 8.1.0) and that indeed fails with a FileNotFoundError
.
All that umask/weird scoping here feels like that NamedTemporaryFile isn't really the API that we want? Replacing one hack (tmpname calculation) with two others (smaller, but harder to understand) doesn't really feel like an improvement to me.
Just to gauge: I don't like it, but not a veto. I just wanted to understand how this works. If you like it, then go ahead, but I wouldn't mind keeping the current approach for this (the async conversion is really nice, though, but that's independent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there's another consideration here which I didn't explicitly state that made me lean more towards NamedTemporaryFile
. I really dislike the chown/chmod tricks (as you know), and I particularly dislike the idea that we might be in a situation where either:
- we chown/chmod late, in which case we might temporarily reveal sensitive data which ought not to be revealed (which is a security problem)
- we chown/chmod early, in which case we might have some file with special bits like setuid in a half-written state (which is a security problem)
So if we're going to chown/chmod
, then I specifically like that NamedTemporaryFile
creates a file as the current user, with a maximally locked down set of permissions, and we only change the file mode/owner after it's fully written. It actually legit seems like the best way.
But ya — the inability to deal with umask nicely from Python is a long-documented and awful problem. There's a similar problem where it's very difficult to create executable files with the proper permissions because open()
doesn't support the executable bit and you can't chmod()
after the fact because of the frustration of determining the value of the umask. The solution in that case is the same, and it sucks — lose all of the benefits of the standard library and write low-level code via os.open()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: we can do the "start with minimal permissions" fairly easily in the current code as well if we change it to
fd = os.open(..., 0o666 if stat is None else 0600)
and moving the chown to closing.
self._tempfile.close() | ||
self._tempfile = None | ||
# otherwise, spool data into a temporary file until EOF then rename into place... | ||
with tempfile.NamedTemporaryFile(dir=dirname, prefix=f'.{basename}-', delete=False) as tmp: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The context tmp
here looks like it shadows the tmp
from above which his already in scope. I experimented with this and it seems it actually merely overwrites it, and tmp
is still available after the with
. But still, this is hard to follow and unclean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya. I buy that argument. It was also a bit difficult to convince mypy
to like this.
src/cockpit/channels/filesystem.py
Outdated
if tmp is not None: | ||
os.unlink(tmp.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above -- doing that should be the whole point of using a with
context? I seriously doubt my Python knowledge right now..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We create the NamedTempFile()
with delete=True
, which means that the contextmanager will close the tempfile, but not delete it. This would all be so much better if we could use delete_on_close=False
, but it's only in 3.12 :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO delete_on_close=True
is the whole reason why we want to use NamedTemporaryFile. I.e. always remove the temp file, and just rename it to the final name in the happy path. I'd consider a rewrite using =False
as making it worse..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From where I'm sitting: context managers are nice, but so are finally:
clauses. They're essentially interchangeable.
NamedTemporaryFile
, in the best case, does ~4 things:
- creates a unique name
- gives you a file-like object for interacting with it
- closes it when you're done
- deletes it when you're done
The last one is not working for us, so we have to go manual on that one.... but I'm happy to use the other 3...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for evaluating all the options. So the summary is "they all suck, Linux doesn't have a good API for this, let's just pick the least bad one" 😅
So, I'm ok with this, we just need to figure out the fsreplace/fswatch race below.
57ef746
to
cad44f4
Compare
Looking at the dump unit test failure:
It reproduces fine with The crash happens because the
but the second time after the whole test is done and calling hooks.after(() => cockpit.spawn(["rm", "-f", filename])); i.e. the problem is that the watch somehow sticks around even after the watch channel got closed with This primarily concerns fswatch, not fsreplace, but it can't be a concidence that making this channel async interferes with fswatch? I'm staring at the output with this patch, which adds some debug logs and also avoids the configChanged() crash on null: diff --git pkg/kdump/config-client.js pkg/kdump/config-client.js
index 066c8697e..c25ed2d3a 100644
--- pkg/kdump/config-client.js
+++ pkg/kdump/config-client.js
@@ -43,6 +43,7 @@ export class ConfigFile {
this._fileHandle = cockpit.file(filename, { superuser });
this._fileHandle.watch((rawContent) => {
+ console.log("XXXk ConfigFile", filename, "changed; content", rawContent);
this._parseText(rawContent);
});
}
diff --git pkg/kdump/test-config-client.js pkg/kdump/test-config-client.js
index d4a099621..f74e51c87 100644
--- pkg/kdump/test-config-client.js
+++ pkg/kdump/test-config-client.js
@@ -52,7 +52,10 @@ QUnit.module("kdump", hooks => {
filename = filename.trim();
});
- hooks.after(() => cockpit.spawn(["rm", "-f", filename]));
+ hooks.after(() => {
+ console.log("XXXX hooks.after, removing", filename);
+ cockpit.spawn(["rm", "-f", filename]);
+ });
QUnit.test("config_update", function(assert) {
const done = assert.async();
@@ -60,6 +63,9 @@ QUnit.module("kdump", hooks => {
const dataWasChanged = new Promise(resolve => { this.dataWasChangedResolve = resolve });
let config;
const configChanged = (event, settings) => {
+ console.log("XXXX configChanged called", JSON.stringify(settings));
+ if (settings === null)
+ return;
assert.equal(settings._internal.foo.value, "moo", "value changed correctly");
assert.equal("key" in settings._internal, false, "setting with comment deleted correctly");
assert.equal("will" in settings._internal, false, "setting without comment deleted correctly");
@@ -80,11 +86,16 @@ QUnit.module("kdump", hooks => {
delete config.settings._internal.will;
config.settings._internal.hooray = { value: "value" };
config.addEventListener('kdumpConfigChanged', configChanged);
+ console.log("XXXX listener added");
config.write(config.settings)
.then(() => {
+ console.log("XXXX wrote new config");
// Close watch channel
config.close();
- dataWasChanged.then(done);
+ dataWasChanged.then(() => {
+ console.log("XXXX dataWasChanged resolved, ending test");
+ done();
+ });
});
});
}); and this is confusing:
i.e. I know we can't rely on the order of the console.log() relative to the pybridge debug logs, but I would at least assume/hope that the ordering of the XXXXes should be reliable. Which means that the configChanged callback with Also, the debug logs in the kdump client shows that the file watch is really called with a null value. However, the file doesn't just get deleted -- when I comment out the So the main issue is that the watching catches a file state which is "absent", which suggests that this isn't atomic? I looked for bridge debug messages. I get
which should be the atomic rename, and after that it reads the file. But then there's a second and third
which I don't know how to interpret. Somehow this new fsreplace and fswatch don't get along.. With current main, there are these filesystem debug messages:
while with this PR there are these:
which is mostly identical except for an extra read. So it's not additional inotify events or so. |
A-ha! With more debugging around cockpit.js I see that this happens:
So the issue is that if (message.problem) {
const error = new BasicError(message.problem, message.message);
console.log("XXXX try_read problem", JSON.stringify(message));
fire_watch_callbacks(null, null, error);
dfd.reject(error);
return;
} From the perspective of It certainly would be nice if we could guarantee that calling --- pkg/kdump/test-config-client.js
+++ pkg/kdump/test-config-client.js
@@ -83,6 +83,7 @@ QUnit.module("kdump", hooks => {
config.write(config.settings)
.then(() => {
// Close watch channel
+ config.removeEventListener('kdumpConfigChanged', configChanged);
config.close();
dataWasChanged.then(done); With that, the test works reliably. |
The watch channel can emit events even after `.close()` is called, which can cause issues in the `kdump/test-config-client.js` QUnit test if the timing of the file write changes (which it is about to do, in a upcoming commit which reimplements fsreplace1 asynchronously). Removing the event listener is a sure way to ensure we won't get further events.
Our previous reason for avoiding tempfile.NamedTemporaryFile() was that we didn't want to chown() after the fact, but we've crossed that bridge already since 72027c1, so we may as well go all the way. Rewrite things as a single run() function using NamedTemporaryFile as a context manager. This is a fairly substantial simplification. We also make the channel properly asynchronous by dispatching the actual write to an executor. We also add fdatasync(), which we were missing before. We need to make a minor adjustment to AsyncChannel to support sending close arguments, similar to the way we handle them on GeneratorChannel.
cad44f4
to
4454f89
Compare
So this unfortunately landed already, as part of #20315. I ignored the red testing-farm checks on CentOS 9 there because I figured that it wasn't possible that they were relevant to the work there, but this change is quite a lot more invasive, and the same checks are red here. Let's re-run those to find out if this is what caused that.... |
... yup, looks like testing farm was just flaking. So then this is taken care of in #20315. |
Our previous reason for avoiding tempfile.NamedTemporaryFile() was that we didn't want to chown() after the fact, but we've crossed that bridge already since 72027c1, so we may as well go all the way.
Rewrite things as a single run() function using NamedTemporaryFile as a context manager. This is a fairly substantial simplification. We also make the channel properly asynchronous by dispatching the actual write to an executor. We also add fdatasync(), which we were missing before.