WIP: Deleting stale locks #1246
Conversation
@@ -70,6 +70,10 @@ def __init__(self, repository, key, manifest, path=None, sync=True, do_files=Fal
        self.key = key
        self.manifest = manifest
        self.path = path or os.path.join(get_cache_dir(), repository.id_str)
        if os.environ.get('BORG_UNIQ_HOSTNAME'):
uniq is the unix command, but I guess we'd rather spell it correctly as "unique".
Another thing that came to my mind:
if we create and store a UUID client-side, we could use that and not need the user to assert uniqueness of the hostname.
Whatever UUID you store, it becomes non-unique once you clone a VM or something, right? I thought that was the main reason for not making this the default; otherwise we could just switch to such UUIDs (generated and stored in .cache/borg on startup if not there already) and be done with it.
the hostname is also the same after cloning, right?
But maybe it is more easily remembered to be changed than some borg UUID. Hmm...
Frequently enough the hostname is NOT something that is cloned with a VM, because it's obtained via a reverse DNS lookup, for example.
E.g. I have a single subdir that holds an OS instance that I export over NFS; I then run tens of virtual machines out of it (nfsroot), and every one of those virtual machines has a different hostname because of a different DNS record.
We could also try to read the MAC of the default network interface, which would be cross-platform.
Not sure if the MAC address is a good idea. It is not always unique even for real hardware (although it should be), and finding out and accessing the default interface might be rather platform-specific. Same goes for machine-id, it seems: it is there on systemd machines, but BSD, OS X, Windows?
The real trick is how to determine whether our local unique ID has become stale and is no longer unique (because a VM was cloned or whatever).
Another question, if we store our own ID: where do we store it? In the home dir? But that could be a network share and therefore visible from multiple nodes, making it non-unique as well.
In which cases? I'm talking about the hardware MAC here, not the MAC actually used for networking (which can be changed at the software level).
from uuid import getnode
I guess the MAC (no matter if it was software-overridden) is a good proxy for a unique ID; otherwise things would just not work on the network, and we only care about all of this for networked access anyway.
Obtaining some unique (or maybe generated) ID depending on various opaque factors (= uuid.getnode) is IMHO a bad match for this. Sticking with the hostname is clear and easy to understand and verify.
@enkore If two MACs on the same network are the same, then they've got bigger problems. Also, they'll be much more independent than hostnames. We can specify an override in the config if you'd like.
We cannot specify an override in the config because we lock before we have a chance of reading the repo config.
getnode's documentation is rather unclear on this. I read it as
Note added by @ThomasWaldmann: it stores the ID in a module-global variable after determining it. It is not persistent.
@enkore I'd assume the cache is in memory. Also, this is Python's recommended way of doing this specific task; do you have a better way? Hostname conflicts aren't that rare, especially given that the default Linux hostname is localhost, and that most cloud hosting services probably share hostnames across VMs.
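As a side note on what `uuid.getnode` actually does (this is stdlib behavior, not code from this PR): if no hardware address can be determined, it falls back to a random 48-bit number with the multicast bit set, so the value is only stable within one process. A small sketch:

```python
import uuid

node = uuid.getnode()  # 48-bit int; the hardware MAC if one could be found

# Per the stdlib docs, the fallback is a RANDOM number with the multicast
# bit (least significant bit of the first octet, i.e. bit 40) set, so it
# changes between runs -- useless as a persistent unique ID.
is_random_fallback = bool(node & (1 << 40))
print(hex(node), '(random fallback)' if is_random_fallback else '(hardware MAC)')
```

This is exactly the ambiguity discussed above: the caller cannot rely on the value being either stable or tied to real hardware without checking that bit.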
        # Just nuke the stale locks early on load
        if self.ok_to_kill_zombie_locks:
            for key in (SHARED, EXCLUSIVE):
this is ... um, inelegant.
I wonder if there's an easier way to do these checks.
The env var could be BORG_UUID=... so users can put anything they like there, and borg would just use that string, if given, as the unique ID.
I feel like BORG_UUID could be a separate thing altogether, with its own patch, because the changes are going to be mostly orthogonal to this patch other than one extra check for the unique_hostname var.
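A minimal sketch of the BORG_UUID idea as floated here (all names are hypothetical, this is not code from the PR): the env var wins if set; otherwise a UUID is generated once and persisted under the cache dir. Note that this inherits the clone problem discussed above, since the stored file gets copied along with a VM image:

```python
import os
import uuid

def get_client_id(cache_dir: str) -> str:
    """Hypothetical helper: BORG_UUID overrides; otherwise generate a
    UUID once and persist it in the cache directory. A cloned VM would
    carry the same stored ID, so this is only as unique as the clone."""
    override = os.environ.get('BORG_UUID')
    if override:
        return override
    id_path = os.path.join(cache_dir, 'client-id')
    try:
        with open(id_path) as f:
            return f.read().strip()
    except FileNotFoundError:
        client_id = str(uuid.uuid4())
        os.makedirs(cache_dir, exist_ok=True)
        with open(id_path, 'w') as f:
            f.write(client_id)
        return client_id
```

The env-var override sidesteps the "can't read the repo config before locking" problem mentioned above, since the environment is available before any lock is taken.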
            return False
        try:
            # This may not work in Windows.
Should work in principle; on Windows you don't really send signals, but open a process and then call various functions on the process handle. Python shouldn't be different; opening the process for a dead PID fails and should raise OSError. The error number will be different from ESRCH ...
Doesn't matter too much; this branch doesn't support Windows.
I have read that kill(pid, 0) at first used to kill a process on Windows for good, and that this was later changed so that it started to return an error at all times.
So this needs to be verified.
I'd have to look at the implementation to see whether the signal is checked for its value. If it's not, any signal number may terminate a process, yes...
static PyObject *
os_kill_impl(PyModuleDef *module, pid_t pid, Py_ssize_t signal)
#ifndef MS_WINDOWS
/* ... POSIX implementation elided ... */
#else /* !MS_WINDOWS */
{
    PyObject *result;
    DWORD sig = (DWORD)signal;
    DWORD err;
    HANDLE handle;

    /* Console processes which share a common console can be sent CTRL+C or
       CTRL+BREAK events, provided they handle said events. */
    if (sig == CTRL_C_EVENT || sig == CTRL_BREAK_EVENT) {
        if (GenerateConsoleCtrlEvent(sig, (DWORD)pid) == 0) {
            /* ... error handling elided ... */
        }
        /* ... */
    }

    /* If the signal is outside of what GenerateConsoleCtrlEvent can use,
       attempt to open and terminate the process. */
    handle = OpenProcess(PROCESS_ALL_ACCESS, FALSE, (DWORD)pid);
    if (handle == NULL) {
        err = GetLastError();
        return PyErr_SetFromWindowsErr(err);
    }

    if (TerminateProcess(handle, sig) == 0) {
        err = GetLastError();
        result = PyErr_SetFromWindowsErr(err);
    } else {
        Py_INCREF(Py_None);
        result = Py_None;
    }

    CloseHandle(handle);
    return result;
}
#endif /* !MS_WINDOWS */
/* (the enclosing #ifdef HAVE_KILL is elided above) */
#endif /* HAVE_KILL */
So this would terminate a live process with exit code = 0 for signal = 0.
Anyway, that's something for the windows branch to handle...
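On the POSIX side, the kill(pid, 0) existence check this thread is about can be sketched like this (`process_alive` is a hypothetical helper name, not from the patch):

```python
import errno
import os

def process_alive(pid: int) -> bool:
    """POSIX-only sketch: signal 0 performs error checking on the target
    PID without actually delivering a signal (see kill(2))."""
    if pid <= 0:
        return False  # 0 / negative PIDs address process groups, not one process
    try:
        os.kill(pid, 0)
    except OSError as err:
        if err.errno == errno.ESRCH:
            return False   # no such process -> the lock holder is gone
        if err.errno == errno.EPERM:
            return True    # process exists but belongs to another user
        raise
    return True
```

The EPERM branch matters: a lock held by another user's live process must not be treated as stale just because we lack permission to signal it.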
Since the Windows port is in the works, I feel like potentially questionable Windows-related bits should at least be highlighted to be checked as the port progresses.
Yes, leave the comments in :)
Logic-wise this looks sound to me. A bit of "nitpicking" was already mentioned; I would add that using both "zombie lock" and "stale lock" to refer to the same thing is confusing, so use only "stale lock". The unique-host-thingy is, as @verygreen says, orthogonal to this patch and doesn't seem to be end-of-discussion'd yet. So I'd say we move that to another ticket.
It's stale lock on the outside, zombie lock on the inside, if you noticed.
    def load(self):
        try:
            with open(self.path) as f:
                data = json.load(f)

                # Just nuke the stale locks early on load
                if self.ok_to_kill_zombie_locks:
BTW, this here means that we won't do any stale lock detection if BORG_UNIQUE_HOSTNAME is not set, so no message would be printed hinting at this possibility, unlike the other check, which is a bit inconsistent.
Do we want the message here, or is it OK to omit it and let people learn about it through the documentation, I wonder?
"what to do with the hostname field?"-discussion -> #1253 |
If the BORG_UNIQUE_HOSTNAME shell variable is set, stale locks in both cache and repository are deleted. A stale lock is defined as a lock originating from the same hostname as us and corresponding to a PID that no longer exists. This fixes borgbackup#562
    thread_id = 0
    return _hostname, _pid, thread_id


def check_lock_stale(host, pid, thread):
is_lock_stale
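A rough sketch of what such a check amounts to, combining the hostname comparison with the PID liveness test (hypothetical code, not the patch itself; POSIX-only, and using the suggested BORG_UNIQUE_HOSTNAME spelling for the opt-in variable):

```python
import errno
import os
import socket

def is_lock_stale(host: str, pid: int, thread: int) -> bool:
    """Illustrative sketch: a lock is stale only if the user opted in,
    the lock claims OUR hostname (uniqueness asserted by the user), and
    its PID no longer exists."""
    if not os.environ.get('BORG_UNIQUE_HOSTNAME'):
        return False   # without the opt-in, never steal anyone's lock
    if host != socket.gethostname():
        return False   # could be a live process on another node
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing delivered
    except OSError as err:
        return err.errno == errno.ESRCH  # no such process -> stale
    return False       # process is alive, so the lock is still valid
```

The hostname guard is what makes the whole scheme opt-in: without asserted hostname uniqueness, a same-named host elsewhere could wrongly conclude the lock holder is dead.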
BTW, just FYI: I did another round of researching 3rd-party lockfile libs. There is still nothing great out there; everything I found has obvious issues.
BTW, master or 1.0-maint?
we need both, I think.
I'm thinking master first; maybe see it in action in at least an RC before backporting?
Superseded by #1674
If the BORG_UNIQ_HOSTNAME shell variable is set, stale locks
in both cache and repository are deleted.
I wanted to add a parameter to the repo config, but alas, we lock
it before the read (makes sense, really), so the only two options
are the shell variable and a command line switch.
Do we want the switch? What would the name be? Any good ways
to enable this permanently for those wishing so, if the config is
off-limits?
Currently, if the option is set, we detect whether a lock is of a
known format and delete it.
We only update the roster if we are taking an exclusive lock. Should
we care to prune stale locks from the roster if we are only getting
a shared lock, or can we leave it be until such a time as an exclusive
locker comes around?
Also, I guess I need feedback on other named things and also my
weak python-fu ;)
This fixes #562
eventually ;)