
Support backing up Zarrs #134

Merged · 21 commits, merged into draft on May 3, 2022
Conversation

@jwodder (Member) commented Mar 10, 2022

Closes #127.

To do:

  • Integrate sync_zarr() into the rest of the backup code
  • Add more tests
  • Update the populate command to support populating Zarr datasets as well
  • Set --zarr-target in backups2datalad-cron
  • Test on sample zarrs in the main archive (could encounter cases of not yet complete checksums etc)

@jwodder jwodder marked this pull request as ready for review March 22, 2022 16:19
@jwodder (Member Author) commented Mar 22, 2022

@yarikoptic This should be finished, but the old tests are failing because of dandi/dandi-archive#961, and one of the new tests is failing because of dandi/dandi-archive#965.

@yarikoptic yarikoptic self-requested a review March 28, 2022 13:50
@yarikoptic (Member) left a review comment:

magnificent ;) It takes me a while to read through ... not sure I ever reached the bottom ;) I left some questions where I had a peek at the code, and also:

  • did you test on some zarrs in the archive (the last checkbox in the original description)?

I think I would prefer to merge it and just "deploy", but would like to know first that it was tried. I guess offloading the backup to Dropbox -- that you didn't try, did you?

tools/backups2datalad/util.py (review thread, resolved)
tools/backups2datalad/util.py (review thread, outdated)
tools/backups2datalad/zarr.py (review thread, resolved)

Review thread on this code in tools/backups2datalad/zarr.py:

name, isdir=False, checksum=listing.checksums[name]
)

async def stat(self, entry: RemoteZarrEntry) -> Tuple[str, ZarrEntryStat]:
Member:

oh -- we do it for each key? That might be expensive/slow, won't it?
Is there some API to query a number of keys at once, or would it just be serialized into individual requests? (Hard to believe there isn't; I didn't check the S3 API.)

jwodder (Member Author) commented Apr 6, 2022:

There does not appear to be an API for querying multiple keys from S3. I'm not clear on what you mean by "it would just be serialized into individual ones".

Moreover, note that just getting the S3 key for a Zarr entry would still require an individual HEAD request to see where the /zarr/{zarr_id}.zarr/{path} endpoint redirects to.
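
(For illustration, a minimal sketch of that per-entry lookup, assuming the endpoint answers a HEAD request with a redirect whose Location header is the S3 URL; the zarr ID and path are placeholders:)

# -I sends a HEAD request; without -L curl does not follow the redirect,
# so the Location header shows where the entry lives on S3
curl -sI "https://api.dandiarchive.org/api/zarr/{zarr_id}.zarr/{path}" | grep -i '^location:'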

Member:

Unless it could/would be implemented quickly -- I think it is OK to proceed here as is for now, but in the long run we will need a more efficient API for that.
Please file an issue (or a PR, if you see a straightforward way to implement it) with dandi-archive to expose the necessary information in dandi-api without requiring per-file querying within a Zarr -- such querying would have a huge (and avoidable) impact on dandi-api! Another part of the motivation is that /zarr/{zarr_id}.zarr/{path} is there to provide "sparse" access to a zarr for zarr clients, so they would be unlikely to traverse the entire zarr folder. In our case, we would.

Possibly related is dandi/dandi-archive#925 -- a "midpoint" solution could be to expose that information in the zarr_content_read endpoint; although it is not recursive, most likely it would be a good fit.

Member:

Moreover, note that just getting the S3 key for a Zarr entry would still require an individual HEAD request to see where the /zarr/{zarr_id}.zarr/{path} endpoint redirects to.

Worse comes to worst (no information on the redirect to the S3 URL), we can hardcode the assumption of a 1-to-1 mapping into the S3 bucket. I believe the zarr implementation doesn't care about versionIds ATM, does it? So we would need to get a versioned URL via an S3 request anyway :-/
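
(A sketch of that per-key S3 request, assuming the public dandiarchive bucket with a zarr/{zarr_id}/{path} key layout and versioning enabled; the key is a placeholder:)

# HEAD the key anonymously; the response should include the current VersionId (and ETag)
aws s3api head-object --no-sign-request --bucket dandiarchive --key "zarr/{zarr_id}/{path}"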

jwodder (Member Author):

I posted a comment on that issue.

Member:

Could you elaborate on how you're envisioning this new implementation working?

For a zarr folder dataset, establish a sync of that folder directly from S3, not going through the API. And then just sweep through the final state and registerurl's pointing to the API for completeness.

FWIW, it could even use the good old https://github.com/datalad/datalad-crawler/ extension to do that with the simple_s3 pipeline (I can show how if we decide to go this route -- it is old code but seems to generally work), or maybe git-annex supports it natively now? https://git-annex.branchable.com/todo/import_tree/ is marked as done -- please check whether you can import a tree from an S3 special remote, and whether the URLs would be versioned, etc. If it works -- let's go the native git-annex way. If not -- we can go the crawler way. It is quite efficient at getting only updates (it uses timestamps on keys, iirc, to filter out unrelated changes), and can also establish commits for intermediate states whenever needed.
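
(An untested sketch of that git-annex import-tree route, assuming the S3 special remote accepts importtree=yes together with exporttree=yes and anonymous access to the public bucket; the remote name and prefix are illustrative:)

git annex initremote dandi-s3 type=S3 encryption=none bucket=dandiarchive fileprefix=zarr/{zarr_id}/ exporttree=yes importtree=yes signature=anonymous
git annex import master --from dandi-s3   # build a remote tracking tree from the bucket listing
git merge dandi-s3/master

(Whether the resulting URLs come out versioned is exactly the open question above.)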

jwodder (Member Author):

Wouldn't this rely on the assumption that an entry foo/bar/baz in a Zarr is located at /zarr/{zarr_id}/foo/bar/baz on S3? That strikes me as an implementation detail.

Member:

It indeed is an implementation detail. But ATM we are not creating some generic tool which people would re-use for other DANDI archive setups (which aren't even supported). It might indeed make the setup of testing trickier, though.
I am OK either way, but I am afraid that with the current implementation approach updates might be tricky/expensive (requiring either caching (fscacher) of the entire-tree zarr checksum or some other way to assess that nothing changed; or maybe recording the state/checksum locally in the commit message and then relying on git status to say that it is all clean).

Just FTR here is how to establish and crawl a prefix in the bucket using datalad-crawler (needs to be pip install'ed):

datalad create /tmp/test-crawl
cd /tmp/test-crawl
datalad crawl-init --save --template=simple_s3 bucket=dandiarchive prefix=zarr/001e3b6d-26fb-463f-af28-520a25680ab4/2/ to_http=1
datalad crawl

Cons (besides the implementation detail) of the crawler: it proceeds through all those files serially, so it might take a while. We might want to get back into the crawler and add parallelization support. Should be quite doable IMHO.

I think the best would be to try how your current implementation works on some proper zarr in staging to assess how/if it would scale to backing up even larger zarrs, and in particular to rerunning it to fetch updates. In parallel we can compare to running the crawler on the same zarr and see how well that one copes with the original fetch and then an "update".

Member:

If we are considering other tools, why not just use s5cmd instead of datalad to crawl the prefix? It's significantly faster for that kind of ls, even with ETag retrieval.

Member:

We need to register keys and URLs for those files in annex... actually, ideally not even download the files in that step (I forget if we paid attention to that here), but just mint/register keys based on the ETag in git-annex, similarly to how we do it for regular assets, and then download/backup in a separate step. I don't think that "crawl"ing is what would be the bottleneck. It would be the rest of the interaction with git-annex + making updates efficient (i.e. instrumenting the interaction with S3 to only react to new stuff, as datalad-crawler does, although I believe it gets a list of all keys/versions first... would need to check).
So -- not sure if use of s5cmd would mitigate anything here.
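
(A minimal sketch of that "register without downloading" idea, assuming the entry's ETag is a plain MD5 (no multipart upload) so it can back an MD5E key; the key, path, and versioned URL are illustrative:)

# hypothetical values for one zarr entry
KEY="MD5E-s1024--0123456789abcdef0123456789abcdef.zarray"
URL="https://dandiarchive.s3.amazonaws.com/zarr/{zarr_id}/0/.zarray?versionId={versionId}"
git annex fromkey --force "$KEY" 0/.zarray   # create the annexed pointer without downloading
git annex registerurl "$KEY" "$URL"          # record where the content can be fetched from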

tools/backups2datalad/zarr.py (review thread, resolved)
tools/backups2datalad/zarr.py (review thread, resolved)
@jwodder (Member Author) commented Apr 6, 2022

@yarikoptic No, I did not test on any Zarrs in the main archive. It's my understanding that all of the Zarrs on production are huge.

@jwodder (Member Author) commented Apr 14, 2022

@yarikoptic I tried backing up https://gui-staging.dandiarchive.org/dandiset/101233 with this script. The most recent run took 138 minutes and failed at this point with:

install(error): /Users/jwodder/dartmouth/dandisets/tools/downloads/ds/101233/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-1_SPIM.ngff (dataset) [Failed to clone from any candidate source URL. Encountered errors per each url were:
- downloads/zarr/94fdcde1-afa8-437f-8d7b-c75967295843
  CommandError: 'git -c diff.ignoreSubmodules=none clone --progress downloads/zarr/94fdcde1-afa8-437f-8d7b-c75967295843 /Users/jwodder/dartmouth/dandisets/tools/downloads/ds/101233/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-1_SPIM.ngff' failed with exitcode 128 [err: 'Cloning into '/Users/jwodder/dartmouth/dandisets/tools/downloads/ds/101233/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-1_SPIM.ngff'...
fatal: failed to copy file to '/Users/jwodder/dartmouth/dandisets/tools/downloads/ds/101233/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-1_SPIM.ngff/.git/objects/01/d63330e82304979b8899d6522c121ef2ad63ae': No such file or directory']]

When I checked afterwards, .../downloads/ds/101233/sub-MITU01/ses-20210521h17m17s06/micr existed but was empty. Does DataLad require the target directory to already exist when calling clone()?

I tried running py-spy on the backup process, but it got killed at some point, likely due to running out of memory.

Also note that I had to make an update to the code in order to handle a change in git-annex 10.20220222, and the updated code does not work with older versions of git-annex. Do you need the script to work with older versions, or can you just update the version of git-annex used when the script is actually run?

@yarikoptic (Member) commented:

Does DataLad require the target directory to already exist when calling clone()?

It must not. Exit code 128 is a segv, iirc? Something freaked git out, I guess.

We can use newer git annex if needed but would need to make sure that other commands are compatible.

Out of memory would sound suspicious, but maybe there was a git/git-annex process explosion? We used to have some issues like that on OSX in the past, but I thought they were addressed.

@jwodder (Member Author) commented Apr 18, 2022

@yarikoptic Are you saying that clone() does or doesn't require the target to exist? I tried a potential fix in which the parent directory of the Zarr path was created in the Dandiset dataset before cloning the Zarr dataset; that failed with the same error as before, except this time the Zarr and its filetree looked successfully cloned.

Out of memory would sound suspicious, but may be there was git/git-annex processes explosion?

I don't think py-spy tracks subprocesses, and there should have been only 36 git-annex processes for most of the backup script's runtime anyway. I suspect that two hours of constant function calls just happened to produce too much data for py-spy to handle.

@yarikoptic (Member) commented:

@yarikoptic Are you saying that clone() does or doesn't require the target to exist?

As with git clone, it doesn't require the target to exist. It permits the target to exist if it is an empty directory or a dataset which already has the desired clone (but errors out on a clone of something else).

Here are quick examples to demonstrate the above statement:
lena:/tmp
$> ls -ld dandi; datalad clone ///dandi
ls: cannot access 'dandi': No such file or directory
install(ok): /tmp/dandi (dataset)                                                                     
(dev3) 1 17783.....................................:Mon 18 Apr 2022 11:16:22 AM EDT:.
lena:/tmp
$> ls -ld dandi; datalad clone ///dandi
drwx------ 5 yoh yoh 4096 Apr 18 11:16 dandi/
(dev3) 1 17783.....................................:Mon 18 Apr 2022 11:16:27 AM EDT:.
lena:/tmp
$> rm -rf dandi; mkdir dandi; ls -ld dandi; datalad clone ///dandi
drwx------ 2 yoh yoh 4096 Apr 18 11:16 dandi/
install(ok): /tmp/dandi (dataset)                                                                     
(dev3) 1 17784.....................................:Mon 18 Apr 2022 11:16:55 AM EDT:.
lena:/tmp
$> rm -rf dandi; mkdir dandi; ls -ld dandi; datalad clone ///dandi/dandisets dandi; datalad clone ///dandi
drwx------ 2 yoh yoh 4096 Apr 18 11:17 dandi/
[INFO   ] scanning for unlocked files (this may take some time)                                       
[INFO   ] access to 1 dataset sibling dandi-dandisets-dropbox not auto-enabled, enable with:          
| 		datalad siblings -d "/tmp/dandi" enable -s dandi-dandisets-dropbox 
install(ok): /tmp/dandi (dataset)
install(error): /tmp/dandi (dataset) [target path already exists and not empty, refuse to clone into target path]

But note that the "directory" issue refers to a directory in .git/objects:

failed with exitcode 128 [err: 'Cloning into '/Users/jwodder/dartmouth/dandisets/tools/downloads/ds/101233/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-1_SPIM.ngff'...
fatal: failed to copy file to '/Users/jwodder/dartmouth/dandisets/tools/downloads/ds/101233/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-1_SPIM.ngff/.git/objects/01/d63330e82304979b8899d6522c121ef2ad63ae': No such file or directory']]

So it is not about datalad clone and the target dir -- it is something about git, or maybe the filesystem running out of space or inodes, or the like.

Out of memory would sound suspicious, but may be there was git/git-annex processes explosion?

I don't think py-spy tracks subprocesses, and there should have been only 36 git-annex processes for most of the backup script's runtime anyway. I suspect that two hours of constant function calls just happened to produce too much data for py-spy to handle.

I most often use py-spy top to attach to a running process. I barely ever run py-spy for a very long-running process, since the produced flamegraph would be hard to analyze. py-spy record also takes --pid, so it should be possible to sample a running process for a bit. There is also --subprocesses, but I don't remember if I have used it. I had in mind just monitoring the number of git-annex processes, as you already did (using top or straight ps), so that is not the problem.

Overall -- do you observe the same error when trying as dandi on drogon? (btrfs doesn't have an inode limit)
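
(For reference, a minimal sketch of the sampling and process-monitoring suggested above; the PID, duration, and output path are placeholders:)

# sample an already-running backup process for a minute, including child processes
py-spy record --pid 12345 --duration 60 --subprocesses -o /tmp/backup-profile.svg
# rough count of live git-annex processes
ps ax | grep -c '[g]it-annex'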

@yarikoptic (Member) commented:

BTW!!! Given

$> python3 -c 'print(len("/Users/jwodder/dartmouth/dandisets/tools/downloads/ds/101233/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-1_SPIM.ngff/.git/objects/01/d63330e82304979b8899d6522c121ef2ad63ae"))'
231

we might be at the limit of what OSX has for max path length! Try just mkdir -p /Users/jwodder/dartmouth/dandisets/tools/downloads/ds/101233/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-1_SPIM.ngff/.git/objects/01/d63330e82304979b8899d6522c121ef2ad63ae, and also

$> datalad wtf -S system | grep max_path_le
  - max_path_length: 260

could give us an answer, although that number is really a "ballpark" (run it under /Users/jwodder/dartmouth/dandisets/tools/downloads/ or nearby).

But also it points to the limit we might hit on drogon as well with all such long names/nesting, hm...

@jwodder (Member Author) commented Apr 18, 2022

@yarikoptic The mkdir command succeeds without error. datalad wtf -S system | grep max_path_le gives 306.

@yarikoptic (Member) commented:

OK, I would run it so that it falls into pdb at that point of failure, try to do the clone manually from the CLI exactly as datalad tries it, and go from there.

@jwodder (Member Author) commented Apr 18, 2022

@yarikoptic Doing datalad --dbg clone ../../../../../zarr/8a0b96c3-7b9f-4528-ae25-87e3e8641535/ sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-2_SPIM.ngff (which I'm pretty sure is the equivalent of the clone that failed on the most recent run) succeeded.

@yarikoptic (Member) commented:

Well, it is not equivalent, so I guess the devil might be in the details. Could you just try doing exactly that git clone, at the point of failure, at the location of the process at the moment of failure? Note that it is also likely to be the git bundled with the git-annex installation (if it is bundled on OSX; I can't check ATM since our non-M1 OSX datalad laptop needs manual TLC after a recent power outage).

@jwodder (Member Author) commented Apr 18, 2022

@yarikoptic I ran the exact same backup script on drogon, backing up to a folder in /mnt/backup, and that succeeded (runtime: 112 minutes).

@yarikoptic (Member) commented:

OK, so it must be something about OSX and/or git on that system, and you can successfully benchmark etc. on drogon. And was that 112 minutes for a single zarr, or for how many zarrs in the dandiset? How long does an update right afterwards take (again 112 minutes)?

CI tests are red since they seem to still need the more recent git-annex you noticed (Invalid option --json), so please update the CI setup.

@jwodder (Member Author) commented Apr 18, 2022

@yarikoptic

  • It was 112 minutes for a single Dandiset containing nine Zarrs, each containing 37481 files
  • Running the script again took 65 minutes.
  • The CI installs git-annex from neurodebian, which is apparently out of date

@yarikoptic (Member) commented:

  • The CI installs git-annex from neurodebian, which is apparently out of date

With some incompatibility changes which were mitigated only in datalad 0.16.x, I was a bit reluctant to update the version in NeuroDebian... OK -- updated now to 10.20220322-1~ndall+1.

  • Running the script again took 65 minutes.

Eh, which suggests that such a workflow would hardly be usable for frequent updates in an archive-wide setting. Where is it on drogon? I will see how fast it is to crawl/recrawl using datalad-crawler straight from S3 (even without any optimization, but with fetching data ATM).

@jwodder (Member Author) commented Apr 18, 2022

@yarikoptic The backup is in /mnt/backup/dandi/tmp/dandisets-zarr.

@jwodder (Member Author) commented May 3, 2022

@yarikoptic Ping. Updates?

@yarikoptic (Member) commented:

@yarikoptic This should be finished, but the old tests are failing because of dandi/dandi-archive#961, and one of the new tests is failing because of dandi/dandi-archive#965.

I have merged dandi/dandi-archive#1074, boosting dandischema to 0.7.1 on the server. In principle that should address the issue, but we would also need a release of dandi-cli so this code also uses a "congruent" version?

@@ -1,16 +1,17 @@
# Python ~= 3.8
# git-annex >= 8.20210903
# git-annex >= 10.20220222
@yarikoptic (Member) commented May 3, 2022:

upgrading in dandisets env: git-annex 8.20211028-alldep_h27987b5_100 --> 10.20220322-alldep_hc98582e_100

@yarikoptic (Member) commented:

Thank you @jwodder. OK, let's proceed with this since I am not sure there is an alternative. I have upgraded git-annex, haven't spotted any major showstopper in the diff, and you confirmed that it worked on sample dandisets, so please proceed with merging/deploying it on drogon and monitoring an initial run. Maybe keep the cron job skipping 000108, and run a dedicated one in parallel for 000108 so we do not need to wait for it to complete while keeping the rest still nicely backed up. After 000108 is backed up, we could bring it into the cron job "pool" of backups.

@jwodder jwodder merged commit 23a9192 into draft May 3, 2022
@jwodder jwodder deleted the gh-127 branch May 3, 2022 18:16
@jwodder (Member Author) commented May 3, 2022

@yarikoptic After merging, running the script on everything other than 000108 took about an hour. I then ran the script on just 000108 twice, and both times it crashed after about seven minutes due to a connection timeout.

Also, note that I've currently disabled the cronjob.

@yarikoptic (Member) commented:

  • connection timeout to what? something to fix up/retry for?
  • can't we have cron job running with exclude 000108 and meanwhile mess around/troubleshoot/fix for the zarr backups of 000108?

@jwodder (Member Author) commented May 4, 2022

@yarikoptic

connection timeout to what?

Either dandiarchive or the S3 bucket, I couldn't tell which.

something to fix up/retry for?

All async HTTP requests are already automatically retried on errors.

can't we have cron job running with exclude 000108 and meanwhile mess around/troubleshoot/fix for the zarr backups of 000108?

I'm concerned about what would happen if the cronjob tried to save changes to the superdataset while the 000108 dataset was in the middle of an update.

@yarikoptic (Member) commented:

@yarikoptic

connection timeout to what?

Either dandiarchive or the S3 bucket, I couldn't tell which.

something to fix up/retry for?

All async HTTP requests are already automatically retried on errors.

Without seeing the timeout traceback, I can't tell whether retries happened for the connection timeout (thus something must have really been down) or whether we need to expand the range of what we retry for. If the issue persists -- please file an issue with details if you need some feedback/ideas, or just address it if it is something obvious.

can't we have cron job running with exclude 000108 and meanwhile mess around/troubleshoot/fix for the zarr backups of 000108?

I'm concerned about what would happen if the cronjob tried to save changes to the superdataset while the 000108 dataset was in the middle of an update.

I think you would already invoke the save only for the dandisets which were processed in that call, thus excluding 000108: https://github.com/dandi/dandisets/blob/HEAD/tools/backups2datalad/datasetter.py#L89 . The only conflict could happen if two processes try to save the superdataset at the same time, but that is unlikely, so I would not worry about it.

@jwodder (Member Author) commented May 5, 2022

@yarikoptic Based on the logs, numerous HEAD requests to /zarr/{zarr_id}.zarr/{path} (starting with https://api.dandiarchive.org/api/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695.zarr/1/.zarray) resulted in a connection timeout.

I've re-enabled the cronjob.

@satra (Member) commented May 5, 2022

just a thought, for backup, would it be better to just talk directly to s3 like s5cmd does? this would also significantly reduce api load for zarr files.

@jwodder (Member Author) commented May 5, 2022

I tried backing up 000108 again, and this time the script ran for ten minutes before dying from a connection timeout error.

Successfully merging this pull request may close these issues:

Provide support for zarr (.zarr/.ngff)

3 participants