Repoctl is backing-up files without any good-reason #58

Closed
PedroHLC opened this issue Sep 8, 2020 · 39 comments

@PedroHLC

PedroHLC commented Sep 8, 2020

Hi @cassava,

I just lost the entire repository twice today; the files were moved to the backup directory.

The only commands used were repoctl update and repoctl add <some kernels>.
I'm using the 0.21 release.

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

This time repoctl --debug status -mca doesn't show anything wrong 😢

@cassava
Owner

cassava commented Sep 8, 2020

So sorry! I'm looking into it.

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

Luckily I logged the moment it backed up everything this second time:

Copying and adding to repository: linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst{,.sig}
Adding package to database: /srv/http/chaotic-aur/x86_64/linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst
error: read package /srv/http/chaotic-aur/x86_64/linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst: invalid input: magic number mismatch.

@cassava
Owner

cassava commented Sep 8, 2020

Could you post the output of repoctl version?

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

repoctl version

repoctl version 0.21 (30 August, 2020)
Copyright 2016-2020, Ben Morgan <cassava@iexu.de>

You may find repoctl on the Internet at
    https://github.com/cassava/repoctl
Please report any bugs you may encounter.

The source code of repoctl is licensed under the MIT license.

Current configuration:
    columnate = false
    color = "auto"
    quiet = false

    current_profile = "default"
    default_profile = "default"

    [profiles.default]
        repo = "/srv/http/chaotic-aur/x86_64/chaotic-aur.db.tar.zst"
        add_params = []
        rm_params = []
        ignore_aur = []
        require_signature = false
        backup = true
        backup_dir = "/srv/http/chaotic-aur/archive/"
        interactive = false
        pre_action = ""
        post_action = ""

@cassava
Owner

cassava commented Sep 8, 2020

That looks like the output of the bug that already got fixed... hmm. I wonder if the repoctl-git version is different from the one I packaged.

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

😅 this was with aur.archlinux.org/packages/repoctl

@cassava
Owner

cassava commented Sep 8, 2020

Oh boy...

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

I got it from pacman's cache:
https://lonewolf.pedrohlc.com/.hidden/repoctl-0.21-1-x86_64.pkg.tar.zst

@cassava
Owner

cassava commented Sep 8, 2020

Ah, you might consider using repoctl-0.21-3.

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

I may have missed updating it 😅 I'll have to wait for the recompilation cycle to reach the letter R.

@cassava
Owner

cassava commented Sep 8, 2020

My guess is that might be it. I might have messed up the PKGBUILD for the go module migration, which could result in local Go modules being used instead of the vendored ones... This was before I followed the updated Arch Go packaging guidelines for modules.

@cassava
Owner

cassava commented Sep 8, 2020

And the error you describe here is one that got fixed in one of the dependencies of repoctl. So there might be that mismatch.

Also, I downloaded the package that caused the trouble and followed your procedure locally and didn't have any trouble, so at least I can't reproduce it with 0.21-2 and 0.21-3.

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

One thing I noticed: when the magic number mismatch happens with repoctl update, no files are backed up and it just fails; when it happens with repoctl add, it's catastrophic.
Another thing: what happens when a file hasn't finished being written? I have some async tasks, and they may be trying to add files before they are completely written...

EDIT: I've updated to 0.21-3, I'll keep you posted...

@cassava
Owner

cassava commented Sep 8, 2020

Oh interesting, I should look into that. I think I'm starting to understand the backup behavior. Has to do with how repoctl reads all data first and then tries to act on it.
I think I need to add a separate use-case for "new file exists and I can't read it", because I did not consider this originally. This would solve both problems at once.
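
As an illustration of the "new file exists and I can't read it" case, here is a minimal Python sketch of a scan step that keeps unreadable files in their own bucket instead of letting them influence later decisions. This is not repoctl's code (repoctl is written in Go); the names scan_packages and read_pkg are hypothetical, and read_pkg stands in for whatever routine parses a package file.

```python
from typing import Any, Callable, Iterable, List, Tuple


def scan_packages(
    paths: Iterable[str],
    read_pkg: Callable[[str], Any],
) -> Tuple[List[Tuple[str, Any]], List[Tuple[str, Exception]]]:
    """Read every path once and split the results into two buckets.

    readable   -> (path, parsed package) pairs that are safe to act on
    unreadable -> (path, error) pairs that should only be reported or skipped
    """
    readable: List[Tuple[str, Any]] = []
    unreadable: List[Tuple[str, Exception]] = []
    for path in paths:
        try:
            readable.append((path, read_pkg(path)))
        except (OSError, ValueError) as err:
            # Assumption: the hypothetical reader signals a bad or
            # half-written file with OSError or ValueError.
            unreadable.append((path, err))
    return readable, unreadable
```

The point of the split is that any later step (adding, removing, backing up) only ever sees the readable bucket, so a single unparseable file cannot change what happens to the other packages.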

@cassava
Owner

cassava commented Sep 8, 2020

Also, a partially-written archive would pose some problems, because repoctl only reads as much of a package as it needs to; currently I'm relying on repo-add to handle the case of an incomplete file.

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

Yeah, repo-add failing and consequently repoctl exiting with a failure code too is enough. That's how it has been working for the past year, and how it would work with pure repo-add too. The server will reattempt it later on failures...

@cassava
Owner

cassava commented Sep 8, 2020

So one thing I could definitely do is have repoctl add verify the packages before copying them to the repository. Currently it just copies them over and trusts in repo-add.
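
A rough Python sketch of the "verify before copying" idea follows. It is not the change that landed in repoctl; it only performs the cheapest possible check, namely that the file starts with the zstd frame magic number (0xFD2FB528, stored little-endian). The function names has_zstd_magic and add_if_plausible are made up for this example.

```python
import shutil
from pathlib import Path
from typing import Optional, Union

# zstd frame magic number 0xFD2FB528 as it appears on disk (little-endian).
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def has_zstd_magic(path: Union[str, Path]) -> bool:
    """Return True if the first four bytes of the file are the zstd magic."""
    with open(path, "rb") as f:
        return f.read(4) == ZSTD_MAGIC


def add_if_plausible(src: Union[str, Path], repo_dir: Union[str, Path]) -> Optional[Path]:
    """Copy src into repo_dir only if it passes the header check.

    Returns the destination path, or None if the file was rejected.
    """
    src = Path(src)
    if not has_zstd_magic(src):
        return None  # reject before anything touches the repository
    dest = Path(repo_dir) / src.name
    shutil.copy2(src, dest)
    return dest
```

A header check like this only catches empty or very short files. A file that was cut off later still passes, so a real verification step would have to decompress the whole archive (for example with zstd -t) and look for .PKGINFO.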

@cassava
Owner

cassava commented Sep 8, 2020

Alright then that change is now on master with 4822d1f.

@cassava
Owner

cassava commented Sep 8, 2020

Luckily I logged the moment it backed up everything this second time:

Copying and adding to repository: linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst{,.sig}
Adding package to database: /srv/http/chaotic-aur/x86_64/linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst
error: read package /srv/http/chaotic-aur/x86_64/linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst: invalid input: magic number mismatch.

And you're saying that adding this package to database actually caused all the packages in the repository to be backed up?

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

Alright then that change is now on master with 4822d1f.

Moved to it 👍

By the end of the recompilation cycles, I'll let you know if something goes wrong.
After changing to -3 no error happened yet!

And you're saying that adding this package to database actually caused all the packages in the repository to be backed up?

Yeah, the full logs have a bunch of "Backing up..." lines after this.

@cassava
Owner

cassava commented Sep 8, 2020

So bizarre that I can't get that backup-behavior reproduced at all... :-(

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

I have been using 0.21 since #57 was closed, and it only happened now (and then again, but after building 800 packages in a short period of time).
To me that sounds like a race condition.

@cassava
Owner

cassava commented Sep 8, 2020

Do you run repoctl update while building other packages?

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

I do. The server has 40 vCPUs and I don't like to leave any of them idle 😊, so it's a chaos of down, update, and, when files come from a third cluster, add.

@cassava
Owner

cassava commented Sep 8, 2020

Ok... that explains a lot. 😄

This would have been very useful to know earlier. So far I haven't considered the ramifications of parallel updates and adds. This is a tricky one.

@cassava
Owner

cassava commented Sep 8, 2020

Actually I'd also be interested in hearing any pain-points you might have in building that many packages.

For example: I've always found the situation difficult where you need to build newer dependencies that then need to be installable for the next makepkg -s command.

@cassava
Owner

cassava commented Sep 8, 2020

I think it would be better to create a new issue specifically for the use-case "Support parallel execution of repoctl".

@PedroHLC
Author

PedroHLC commented Sep 8, 2020

😅 Somehow I managed that: my first infra has a "batch" command, and I execute it like this:
chaotic-batchbuild somepackage anotherpackage -- apackagethatdepends
I also wrapped the "add" command so that it waits for a lock to be deleted before running a secondary repoctl update.

(And the second one has a db-bump command that does almost the same)
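
For readers wondering what such a lock wrapper can look like, here is a minimal Python sketch of a "wait until the lock file is gone" scheme. It is a guess at the general shape, not the actual chaotic-aur wrapper; lock_file, its parameters, and the lock path in the usage note are all hypothetical.

```python
import contextlib
import os
import time
from typing import Iterator, Optional


@contextlib.contextmanager
def lock_file(
    path: str,
    poll_interval: float = 0.1,
    timeout: Optional[float] = None,
) -> Iterator[None]:
    """Hold an exclusive lock represented by the existence of a file.

    O_CREAT | O_EXCL makes creation fail if the file already exists, so only
    one process can create it; everyone else polls until it is deleted.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            break
        except FileExistsError:
            if deadline is not None and time.monotonic() >= deadline:
                raise TimeoutError(f"lock still held: {path}")
            time.sleep(poll_interval)
    try:
        # Record the owner's PID to make stale locks easier to diagnose.
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        yield
    finally:
        os.unlink(path)  # deleting the file releases the lock
```

A wrapper would then run something like subprocess.run(["repoctl", "update"], check=True) inside a with lock_file("/run/lock/repo.lock"): block. One known weakness of this scheme is that a crashed process leaves the lock file behind, which is why some setups prefer flock(2)-style locks instead.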

@PedroHLC
Author

PedroHLC commented Sep 9, 2020

Sometimes I still get:

error: read package /srv/http/chaotic-aur/x86_64/hamsket-git-r1222.fe82ff7-1-x86_64.pkg.tar.zst: invalid input: magic number mismatch.
Adding package to database: /srv/http/chaotic-aur/x86_64/gstreamer0.10-base-0.10.36-13-x86_64.pkg.tar.zst
Adding package to database: /srv/http/chaotic-aur/x86_64/gstreamer0.10-base-plugins-0.10.36-13-x86_64.pkg.tar.zst

Sometimes it's uglier:

error: read package /srv/http/chaotic-aur/x86_64/gnome-shell-extension-xrdesktop-git-0.14.0.29.9c5c0c3-1-any.pkg.tar.zst: cannot find file ".PKGINFO".
error: read package /srv/http/chaotic-aur/x86_64/mkinitcpio-openswap-0.1.0-3-any.pkg.tar.zst: cannot find file ".PKGINFO".
error: read package /srv/http/chaotic-aur/x86_64/pango-anydesk-1:1.43.0-3-x86_64.pkg.tar.zst: invalid input: magic number mismatch.
error: read package /srv/http/chaotic-aur/x86_64/perl-authen-simple-0.5-9-any.pkg.tar.zst: cannot find file ".PKGINFO".
error: read package /srv/http/chaotic-aur/x86_64/qomui-git-0.8.2.r22.23650ab-1-x86_64.pkg.tar.zst: invalid input: magic number mismatch.
error: read package /srv/http/chaotic-aur/x86_64/ripcord-arch-libs-0.4.26-1-x86_64.pkg.tar.zst: invalid input: magic number mismatch.
error: read package /srv/http/chaotic-aur/x86_64/woeusb-ng-0.2.5-3-any.pkg.tar.zst: invalid input: magic number mismatch.
Adding package to database: /srv/http/chaotic-aur/x86_64/tpmmanager-0.8.1-8-x86_64.pkg.tar.zst

But these packages are still being added to the repo...

As the catastrophic event seems to have ceased, I'm closing this issue.

PedroHLC closed this as completed Sep 9, 2020

@cassava
Owner

cassava commented Sep 9, 2020

Hey @PedroHLC, the errors you are seeing there are to be expected when repoctl reads tar.zst files that are still being written.

cannot find file ".PKGINFO" happens when the Zst decompression succeeds far enough that the TAR reader can start processing the archive, but the reader doesn't find the .PKGINFO file that is supposed to be in the TAR.

invalid input: magic number mismatch happens when the Zst decompression fails because not enough of the file has been written.

If repoctl encounters these files, it should just ignore them.

Further final thoughts from me:

  • Since this is hard to replicate, one way to reproduce this might be to truncate files at a certain number of bytes.
  • Optimally, files that are in the process of being written or copied should be given an extension that repoctl ignores.
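
Both suggestions can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names are invented, and whether repoctl actually skips files ending in ".part" is not confirmed anywhere in this thread, so the suffix would have to be one that the tool is known to ignore.

```python
import os
import shutil
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def truncated_copy(src: PathLike, dst: PathLike, nbytes: int) -> None:
    """Write only the first nbytes of src to dst, imitating a half-written file."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read(nbytes))


def publish_atomically(src: PathLike, repo_dir: PathLike, suffix: str = ".part") -> Path:
    """Copy src into repo_dir under a temporary name, then rename it into place.

    The ".part" suffix is an assumption; it must be something the repository
    tool does not treat as a package.
    """
    src = Path(src)
    final = Path(repo_dir) / src.name
    tmp = final.with_name(final.name + suffix)
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # make sure the data is on disk before renaming
    # Within one filesystem the rename is atomic, so readers see either
    # no file or the complete file under the final name.
    os.replace(tmp, final)
    return final
```

truncated_copy gives a repeatable way to produce test inputs of a chosen length, while publish_atomically avoids exposing a partial file under its final name in the first place.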

Quick question: Do you run repo add and repo update in parallel?

@PedroHLC
Author

PedroHLC commented Sep 9, 2020

Do you run repo add and repo update in parallel?

I observed it today, and repoctl is not running in parallel. My lock wrapper is working, and it has probably been that way for the entire past year. I just don't avoid partially written files. But I'm considering using the same lock file for the copying operations...

@PedroHLC
Author

Sadly it happened once more, with repoctl add (and not running parallel).

@cassava
Owner

cassava commented Sep 28, 2020

Good to know that it can also happen by itself! Debugging data-race issues is really, really hard, because a lot of behavior is just undefined, which can mean basically anything. But if it happens without any other instance running in parallel, then I might just have a chance to observe it myself.

If you ever manage to reproduce it reliably, that is of course the absolute best, but from the sound of it that doesn't happen.

Do you know if anything else was running at the same time, e.g. Pacman? I opted to not use libalpm, the Pacman library, because it was always annoying to have to recompile a tool like cower every time I updated pacman. But that means that I had to come up with the database reading myself, which isn't as battle-tested as that from Pacman.

@PedroHLC
Author

PedroHLC commented Sep 28, 2020

Sadly it took me 40hrs to notice the packages were gone 😅
Thankfully one of the mirrors isn't syncing package deletions, and I've been using it as a backup.

@PedroHLC
Author

PedroHLC commented Sep 28, 2020

I had one entry showing as mixxx_beta-git: updated( -> r6814-1) in repoctl status.

This package wasn't even appearing in my dump with tar -tv --zstd -f chaotic-aur.db.tar.zst | awk '/^d/{print $6}'. And it was built near the time things went crazy.

I've added it with repo-add and now it's in the database and doesn't show in repoctl status anymore...

Do you know if anything else was running at the same time, e.g. Pacman?

It shouldn't be running, except inside some containers...

@AladW

AladW commented Apr 9, 2022

Over the years I've also noticed (and had reports of) local repository databases suddenly becoming empty. I never found out the cause either. In my case, aur-build does not seem to have data races either (all built packages are written to a random, private directory before being mv'd to the local repository, and repo-add has its own locking mechanism, which I presume (?) to be functional).
