WIP: Deleting stale locks #1246
Conversation
@@ -70,6 +70,10 @@ def __init__(self, repository, key, manifest, path=None, sync=True, do_files=Fal
        self.key = key
        self.manifest = manifest
        self.path = path or os.path.join(get_cache_dir(), repository.id_str)
        if os.environ.get('BORG_UNIQ_HOSTNAME'):
uniq is the unix command, but I guess we'd rather spell it correctly as "unique".
Another thing that came to my mind:
if we create and store a UUID client-side, we could use that and not need the user to assert uniqueness of the hostname.
Whatever UUID you store, it becomes non-unique once you clone a VM or something, right? I thought that was the main reason for not making this the default; otherwise we could just switch to such UUIDs (generated and stored in .cache/borg on startup if not there already) and be done with it.
the hostname is also the same after cloning, right?
But maybe it is more easily remembered to be changed than some borg UUID. Hmm...
Frequently enough the hostname is NOT something that is cloned with a VM, because it's obtained via a reverse DNS lookup, for example.
E.g. I have a single subdir that holds an OS instance that I export over NFS; I then run tens of virtual machines out of it (nfsroot), and every one of those virtual machines has a different hostname because of a different DNS record.
We could also try to read the MAC of the default network interface, which would be cross-platform.
Not sure if the MAC address is a good idea. It is not always unique even for real hardware (although it should be), and finding out and accessing the default interface might be rather platform-specific. Same goes for machine-id, it seems: it is there on systemd machines, but BSD, OS X, Windows?
The real trick is how to determine whether our local unique ID has become stale and is no longer unique (because a VM was cloned or whatever).
Another question, if we store our own ID: where do we store it? In the home dir? But that could be a network share and therefore visible from multiple nodes, making it non-unique as well.
In which cases? I'm talking about the hardware MAC here, not the MAC actually used for networking (which can be changed at the software level).
from uuid import getnode
I guess the MAC (no matter if it was software-overridden) is a good proxy for a unique ID; otherwise things would just not work on the network, and we only care about all of this for networked access anyway.
Obtaining some unique (or maybe generated) ID depending on various opaque factors (= uuid.getnode) is IMHO a bad match for this. Sticking with the hostname is clear and easy to understand and verify.
@enkore If two MACs on the same network are the same, then they've got bigger problems. Also, they'll be much more independent than hostnames. We can specify an override in the config if you'd like.
We cannot specify an override in the config because we lock before we have a chance of reading the repo config.
getnode's documentation is rather unclear on this. I read it as
Note added by @ThomasWaldmann: it stores the ID in a module-global variable after determining it. It is not persistent.
@enkore I'd assume the cache is in memory. Also, this is Python's recommended way of doing this specific task; do you have a better way? Hostname conflicts aren't that rare, especially given that the default Linux hostname is localhost, and that most cloud hosting services probably share hostnames across VMs.
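As a side note on what `uuid.getnode` actually does (this is stdlib behavior, not code from this PR): if no hardware address can be determined, it falls back to a random 48-bit number with the multicast bit set, so the value is only stable within one process. A small sketch:

```python
import uuid

node = uuid.getnode()  # 48-bit int; the hardware MAC if one could be found

# Per the stdlib docs, the fallback is a RANDOM number with the multicast
# bit (least significant bit of the first octet, i.e. bit 40) set, so it
# changes between runs -- useless as a persistent unique ID.
is_random_fallback = bool(node & (1 << 40))
print(hex(node), '(random fallback)' if is_random_fallback else '(hardware MAC)')
```

This is exactly the ambiguity discussed above: the caller cannot rely on the value being either stable or tied to real hardware without checking that bit.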
        # Just nuke the stale locks early on load
        if self.ok_to_kill_zombie_locks:
            for key in (SHARED, EXCLUSIVE):
this is ... um, inelegant.
I wonder if there's an easier way to do these checks.
The env var could be BORG_UUID=... so users can put anything they like there, and borg would just use that string, if given, as the unique ID.
I feel like BORG_UUID could be a separate thing altogether, with its own patch, because the changes are going to be mostly orthogonal to this patch other than one extra check for the unique_hostname var.
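A minimal sketch of the BORG_UUID idea as floated here (all names are hypothetical, this is not code from the PR): the env var wins if set; otherwise a UUID is generated once and persisted under the cache dir. Note that this inherits the clone problem discussed above, since the stored file gets copied along with a VM image:

```python
import os
import uuid

def get_client_id(cache_dir: str) -> str:
    """Hypothetical helper: BORG_UUID overrides; otherwise generate a
    UUID once and persist it in the cache directory. A cloned VM would
    carry the same stored ID, so this is only as unique as the clone."""
    override = os.environ.get('BORG_UUID')
    if override:
        return override
    id_path = os.path.join(cache_dir, 'client-id')
    try:
        with open(id_path) as f:
            return f.read().strip()
    except FileNotFoundError:
        client_id = str(uuid.uuid4())
        os.makedirs(cache_dir, exist_ok=True)
        with open(id_path, 'w') as f:
            f.write(client_id)
        return client_id
```

The env-var override sidesteps the "can't read the repo config before locking" problem mentioned above, since the environment is available before any lock is taken.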
            return False
        try:
            # This may not work in Windows.
Should work in principle; on Windows you don't really send signals, but open a process and then call various functions on the process handle. Python shouldn't be different; opening the process for a dead PID fails and should raise OSError. The error number will be different from ESRCH ...
Doesn't matter too much; this branch doesn't support Windows.
I have read that kill(pid, 0) at first used to kill a process on Windows for good, and that this was later changed so that it started to return an error at all times.
So this needs to be verified.
I'd have to look at the implementation to see whether the signal is checked for its value. If it's not, any signal number may terminate a process, yes...
static PyObject *
os_kill_impl(PyModuleDef *module, pid_t pid, Py_ssize_t signal)
#ifndef MS_WINDOWS
/* ... POSIX implementation elided ... */
#else /* !MS_WINDOWS */
{
    PyObject *result;
    DWORD sig = (DWORD)signal;
    DWORD err;
    HANDLE handle;

    /* Console processes which share a common console can be sent CTRL+C or
       CTRL+BREAK events, provided they handle said events. */
    if (sig == CTRL_C_EVENT || sig == CTRL_BREAK_EVENT) {
        if (GenerateConsoleCtrlEvent(sig, (DWORD)pid) == 0) {
            /* ... error handling elided ... */
        }
        /* ... */
    }

    /* If the signal is outside of what GenerateConsoleCtrlEvent can use,
       attempt to open and terminate the process. */
    handle = OpenProcess(PROCESS_ALL_ACCESS, FALSE, (DWORD)pid);
    if (handle == NULL) {
        err = GetLastError();
        return PyErr_SetFromWindowsErr(err);
    }

    if (TerminateProcess(handle, sig) == 0) {
        err = GetLastError();
        result = PyErr_SetFromWindowsErr(err);
    } else {
        Py_INCREF(Py_None);
        result = Py_None;
    }

    CloseHandle(handle);
    return result;
}
#endif /* !MS_WINDOWS */
/* (the enclosing #ifdef HAVE_KILL is elided above) */
#endif /* HAVE_KILL */
So this would terminate a live process with exit code = 0 for signal = 0.
Anyway, that's something for the windows branch to handle...
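On the POSIX side, the kill(pid, 0) existence check this thread is about can be sketched like this (`process_alive` is a hypothetical helper name, not from the patch):

```python
import errno
import os

def process_alive(pid: int) -> bool:
    """POSIX-only sketch: signal 0 performs error checking on the target
    PID without actually delivering a signal (see kill(2))."""
    if pid <= 0:
        return False  # 0 / negative PIDs address process groups, not one process
    try:
        os.kill(pid, 0)
    except OSError as err:
        if err.errno == errno.ESRCH:
            return False   # no such process -> the lock holder is gone
        if err.errno == errno.EPERM:
            return True    # process exists but belongs to another user
        raise
    return True
```

The EPERM branch matters: a lock held by another user's live process must not be treated as stale just because we lack permission to signal it.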
Since the Windows port is in the works, I feel like potentially questionable Windows-related bits should at least be highlighted to be checked as the port progresses.
Yes, leave the comments in :)
Logic-wise this looks sound to me. A bit of "nitpicking" was already mentioned; I would add that using both "zombie lock" and "stale lock" to refer to the same thing is confusing, so use only "stale lock". The unique-host-thingy is, as @verygreen says, orthogonal to this patch and doesn't seem to be end-of-discussion'd yet. So I'd say we move that to another ticket.
It's stale lock on the outside, zombie lock on the inside, if you noticed.
    def load(self):
        try:
            with open(self.path) as f:
                data = json.load(f)

                # Just nuke the stale locks early on load
                if self.ok_to_kill_zombie_locks:
BTW, this here means that we won't do any stale lock detection if BORG_UNIQUE_HOSTNAME is not set, so no message would be printed hinting at this possibility, unlike the other check, which is a bit inconsistent.
Do we want the message here, or is it OK to omit it and let people learn about it through the documentation, I wonder?
"what to do with the hostname field?"-discussion -> #1253 |
If the BORG_UNIQUE_HOSTNAME shell variable is set, stale locks in both cache and repository are deleted. A stale lock is defined as a lock originating from the same hostname as us and corresponding to a PID that no longer exists. This fixes borgbackup#562
    thread_id = 0
    return _hostname, _pid, thread_id


def check_lock_stale(host, pid, thread):
is_lock_stale
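A rough sketch of what such a check amounts to, combining the hostname comparison with the PID liveness test (hypothetical code, not the patch itself; POSIX-only, and using the suggested BORG_UNIQUE_HOSTNAME spelling for the opt-in variable):

```python
import errno
import os
import socket

def is_lock_stale(host: str, pid: int, thread: int) -> bool:
    """Illustrative sketch: a lock is stale only if the user opted in,
    the lock claims OUR hostname (uniqueness asserted by the user), and
    its PID no longer exists."""
    if not os.environ.get('BORG_UNIQUE_HOSTNAME'):
        return False   # without the opt-in, never steal anyone's lock
    if host != socket.gethostname():
        return False   # could be a live process on another node
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing delivered
    except OSError as err:
        return err.errno == errno.ESRCH  # no such process -> stale
    return False       # process is alive, so the lock is still valid
```

The hostname guard is what makes the whole scheme opt-in: without asserted hostname uniqueness, a same-named host elsewhere could wrongly conclude the lock holder is dead.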
BTW, just FYI: I did another round of researching 3rd-party lockfile libs. There is still nothing great out there; everything I found has obvious issues.
BTW, master or 1.0-maint?
we need both, I think.
I'm thinking master first; maybe see it in action in at least an RC before backporting?
Superseded by #1674
If the BORG_UNIQ_HOSTNAME shell variable is set, stale locks
in both cache and repository are deleted.
I wanted to add a parameter to the repo config, but alas, we lock
it before the read (makes sense, really), so the only two options
are the shell variable and a command line switch.
Do we want the switch? What would the name be? Any good ways
to enable this permanently for those wishing so, if the config is
off-limits?
Currently, if the option is set, we detect whether a lock is of a
known format and delete it.
We only update the roster if we are taking an exclusive lock. Should
we care to prune stale locks from the roster if we are only getting
a shared lock, or can we leave it be until such a time as an exclusive
locker comes around?
Also, I guess I need feedback on other named things and also my
weak python-fu ;)
This fixes #562
eventually ;)