Support backing up Zarrs #134
Conversation
Force-pushed from 4c4bb6c to bdf460d
@yarikoptic This should be finished, but the old tests are failing because of dandi/dandi-archive#961, and one of the new tests is failing because of dandi/dandi-archive#965.
magnificent ;) it takes me a while to read through ... not sure if I will ever reach the bottom ;) Left some questions where I had a peek at the code, and also:
- did you test on some zarrs in the archive (the last checkbox in the original description)?
I think I would prefer to merge it and just "deploy", but would like to know first that it was tried. I guess offload to backup on dropbox -- that you didn't try, did you?
```python
                name, isdir=False, checksum=listing.checksums[name]
            )

    async def stat(self, entry: RemoteZarrEntry) -> Tuple[str, ZarrEntryStat]:
```
oh -- we do it for each key? that might be expensive/slow, wouldn't it?
is there some API to query a number of keys at once, or would it just be serialized into individual requests? (hard to believe; didn't check the S3 API)
There does not appear to be an API for querying multiple keys from S3. I'm not clear on what you mean by "it would just be serialized into individual ones".

Moreover, note that just getting the S3 key for a Zarr entry would still require an individual HEAD request to see where the /zarr/{zarr_id}.zarr/{path} endpoint redirects to.
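To illustrate the cost being discussed -- a minimal sketch of resolving a single entry's S3 location, assuming a `requests`-style client; the base URL and entry path are placeholders, and the real script's HTTP client may differ:

```python
import requests

API_BASE = "https://api.dandiarchive.org/api"  # assumed base URL
zarr_id = "001e3b6d-26fb-463f-af28-520a25680ab4"  # example Zarr ID
entry_path = "foo/bar/0.0.0"  # hypothetical entry path

# One HEAD request per entry: don't follow the redirect, just read
# the Location header to learn the underlying S3 URL.
resp = requests.head(
    f"{API_BASE}/zarr/{zarr_id}.zarr/{entry_path}", allow_redirects=False
)
s3_url = resp.headers["Location"]
print(s3_url)
```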
Unless it could/would be implemented quickly -- I think it is ok to proceed here as-is for now, but in the long run we will need a more efficient API for that.

Please file an issue (or a PR if you see a straightforward way to implement it) with dandi-archive to expose the necessary information in dandi-api without requiring a per-file query for each entry in a Zarr -- such querying would have a huge (and avoidable) impact on dandi-api! Another part of the motivation is that /zarr/{zarr_id}.zarr/{path} is there to provide "sparse" access to a zarr for zarr clients, so they would be unlikely to traverse the entire zarr folder. In our case, we would.

Possibly related is dandi/dandi-archive#925 -- a "mid point" solution could be to expose that information in that zarr_content_read endpoint, but since it is not recursive, most likely it would not be a good fit.
> Moreover, note that just getting the S3 key for a Zarr entry would still require an individual HEAD request to see where the /zarr/{zarr_id}.zarr/{path} endpoint redirects to.

worst comes to worst (no information on the redirect to the S3 URL), we can hardcode the assumption of a 1-to-1 mapping into the S3 bucket. I believe the zarr implementation doesn't care about versionIds ATM, does it? So we would need to get a versionId'ed URL via an S3 request anyways :-/
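A sketch of what that hardcoded assumption could look like with boto3 -- the bucket name and the `zarr/{zarr_id}/{path}` key layout are the assumed 1-to-1 mapping from this discussion, not a documented guarantee:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

bucket = "dandiarchive"  # assumed bucket
# Assumed 1-to-1 mapping of Zarr entries onto S3 keys:
key = "zarr/001e3b6d-26fb-463f-af28-520a25680ab4/foo/bar/0.0.0"

# Assuming the bucket allows public reads, an unsigned client suffices.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
info = s3.head_object(Bucket=bucket, Key=key)  # one HEAD: ETag + VersionId
versioned_url = (
    f"https://{bucket}.s3.amazonaws.com/{key}?versionId={info['VersionId']}"
)
print(versioned_url)
```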
I posted a comment on that issue.
> Could you elaborate on how you're envisioning this new implementation working?
For a zarr folder dataset, establish a sync of that folder directly from S3, not going through the API. And then just sweep through the final state and registerurl's pointing to the API for completeness.

FWIW, it could even use the good old https://github.com/datalad/datalad-crawler/ extension to do that with the simple_s3 pipeline (I can show how if we decide to go this route -- it is old code but seems to generally work), or maybe git-annex supports it natively now? https://git-annex.branchable.com/todo/import_tree/ is marked as done -- please check if you can import a tree from an S3 special remote, and whether the URLs would be versioned etc. If it works -- let's go the native git-annex way. If not -- we can go the crawler way. It is quite efficient at getting only updates (it uses timestamps on keys IIRC to filter out unrelated changes), and it can also establish commits for intermediate states whenever needed.
Wouldn't this rely on the assumption that an entry foo/bar/baz in a Zarr is located at /zarr/{zarr_id}/foo/bar/baz on S3? That strikes me as an implementation detail.
it indeed is an implementation detail. But ATM we are not creating some generic tool which people would re-use for other DANDI archive setups (which aren't even supported). Indeed, it might make the setup of testing trickier though.

I am ok either way, but I am afraid that with the current implementation approach updates might be tricky/expensive (they require either caching the zarr checksum of the entire tree or some other way to assess that nothing changed; or maybe recording the state/checksum locally in the commit message and then relying on git status to say that it is all clean).

Just FTR, here is how to establish and crawl a prefix in the bucket using datalad-crawler (needs to be pip install'ed):

```
datalad create /tmp/test-crawl
cd /tmp/test-crawl
datalad crawl-init --save --template=simple_s3 bucket=dandiarchive prefix=zarr/001e3b6d-26fb-463f-af28-520a25680ab4/2/ to_http=1
datalad crawl
```

Cons of the crawler (besides the implementation detail): it proceeds through all those files serially, so it might take a while. We might want to get back into the crawler and add parallelization support. Should be quite doable IMHO.

I think the best would be to try how your current implementation works on some proper zarr in staging, to assess how/if it would scale to backing up even larger zarrs, and in particular how rerunning to fetch updates behaves. In parallel we can compare against running the crawler on the same zarr and see how well it copes with the original fetch and then an "update".
if we are considering other tools, why not just use s5cmd instead of datalad to crawl the prefix? (it's significantly faster for that kind of ls, even with etag retrieval.)
we need to register keys and URLs for those files in annex... actually, ideally not even download the files in that step (forgot if we paid attention to that here), but just mint/register keys based on the ETag in git-annex, similarly to how we do it for regular assets, and then download/backup in a separate step. I don't think the "crawl"ing is what would be the bottleneck. It would be the rest of the interaction with git-annex + making updates efficient (i.e. instrumenting the interaction with S3 to only react to new stuff, as datalad-crawler does, although I believe it does fetch a list of all keys/versions first... would need to check).

So -- not sure if use of s5cmd would mitigate anything here.
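For the "register without downloading" step, `git annex addurl --fast` is the closest off-the-shelf primitive -- it records a file against its URL without fetching content. Note this mints a URL-derived key rather than the ETag-based key discussed above, so it is only a sketch of the shape of the step; the paths and versionId are hypothetical:

```python
import subprocess

def register_entry(repo: str, file_path: str, url: str) -> None:
    """Record `url` for `file_path` in the annex without downloading it."""
    subprocess.run(
        ["git", "annex", "addurl", "--fast", f"--file={file_path}", url],
        cwd=repo,
        check=True,
    )

# Hypothetical usage with a versioned S3 URL:
register_entry(
    "/tmp/test-crawl",  # hypothetical dataset path
    "foo/bar/0.0.0",
    "https://dandiarchive.s3.amazonaws.com/zarr/001e3b6d-26fb-463f-af28-520a25680ab4/foo/bar/0.0.0?versionId=abc123",
)
```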
@yarikoptic No, I did not test on any Zarrs in the main archive. It's my understanding that all of the Zarrs on production are huge.
@yarikoptic I tried backing up https://gui-staging.dandiarchive.org/dandiset/101233 with this script. The most recent run took 138 minutes and failed at this point with:
When I checked afterwards, I tried running py-spy on the backup process, but it got killed at some point, likely due to running out of memory. Also note that I had to make an update to the code in order to handle a change in git-annex 10.20220222, and the updated code does not work with older versions of git-annex. Do you need the script to work with older versions, or can you just update the version of git-annex used when the script is actually run?
It must not. Exit code 128 is segv, iirc? Something freaked git out, I guess. We can use a newer git-annex if needed, but would need to make sure that the other commands are compatible. Out of memory would sound suspicious, but maybe there was a git/git-annex process explosion? We used to have some issues like that on OSX in the past, but I thought they were addressed.
@yarikoptic Are you saying that
I don't think py-spy tracks subprocesses, and there should have been only 36 git-annex processes for most of the backup script's runtime anyway. I suspect that two hours of constant function calls just happened to produce too much data for py-spy to handle.
as with … here are quick examples to demonstrate the above statement:

```
lena:/tmp
$> ls -ld dandi; datalad clone ///dandi
ls: cannot access 'dandi': No such file or directory
install(ok): /tmp/dandi (dataset)
(dev3) 1 17783.....................................:Mon 18 Apr 2022 11:16:22 AM EDT:.
lena:/tmp
$> ls -ld dandi; datalad clone ///dandi
drwx------ 5 yoh yoh 4096 Apr 18 11:16 dandi/
(dev3) 1 17783.....................................:Mon 18 Apr 2022 11:16:27 AM EDT:.
lena:/tmp
$> rm -rf dandi; mkdir dandi; ls -ld dandi; datalad clone ///dandi
drwx------ 2 yoh yoh 4096 Apr 18 11:16 dandi/
install(ok): /tmp/dandi (dataset)
(dev3) 1 17784.....................................:Mon 18 Apr 2022 11:16:55 AM EDT:.
lena:/tmp
$> rm -rf dandi; mkdir dandi; ls -ld dandi; datalad clone ///dandi/dandisets dandi; datalad clone ///dandi
drwx------ 2 yoh yoh 4096 Apr 18 11:17 dandi/
[INFO ] scanning for unlocked files (this may take some time)
[INFO ] access to 1 dataset sibling dandi-dandisets-dropbox not auto-enabled, enable with:
| datalad siblings -d "/tmp/dandi" enable -s dandi-dandisets-dropbox
install(ok): /tmp/dandi (dataset)
install(error): /tmp/dandi (dataset) [target path already exists and not empty, refuse to clone into target path]
```

but note that the "directory" issue refers to a directory in …, so it is not about ….

I most often use … overall -- do you observe the same error while trying as …?
BTW!!! Given … we might be at the limit of what OSX has for max path length! Try just … -- it could give us an answer, although that number is really a "ballpark" (run it under …). But it also points to the limit we might hit on drogon as well with all such long names/nesting, hm...
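A quick way to check the path-length suspicion -- a sketch that walks the dataset and reports the longest absolute path (PATH_MAX is commonly 1024 on OSX and 4096 on Linux; the root below is a placeholder):

```python
import os

root = "/tmp/test-backup"  # placeholder: wherever the Zarr dataset lives

# Longest absolute path in the tree, to compare against the OS limit.
longest = max(
    (
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    ),
    key=len,
    default=root,
)
print(len(longest), longest)
```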
@yarikoptic The …
ok, I would have run with … falling into …
@yarikoptic Doing …
well, it is not equivalent, so I guess the devil might be in the details. Could you just try to do exactly that …
@yarikoptic I ran the exact same backup script on drogon, backing up to a folder in …
ok, so it must be something about OSX and/or git on that system, and you can successfully benchmark etc. on drogon. And is 112 minutes for a single zarr, or for however many zarrs are in the dandiset? How long does it take to update right after (again 112 minutes?)? CI tests are red since they seem to still need the more recent git-annex you noticed:
with some incompatibility changes which were mitigated only in datalad 0.16.x, I was a bit reluctant to update the version in neurodebian... ok -- updated now to …
eh, which suggests that such a workflow would hardly be usable for any frequent updates in an archive-wide setting. Where is it on drogon? I will see how fast it is to crawl/recrawl using datalad-crawler straight from S3 (even without any optimization, but with fetching data ATM).
@yarikoptic The backup is in …
@yarikoptic Ping. Updates?
I have merged dandi/dandi-archive#1074, boosting dandischema to 0.7.1 on the server. In principle that should address the issue, but we would also need a release of dandi-cli so this code uses a "congruent" version?
```diff
@@ -1,16 +1,17 @@
 # Python ~= 3.8
-# git-annex >= 8.20210903
+# git-annex >= 10.20220222
```
upgrading in the dandisets env: git-annex 8.20211028-alldep_h27987b5_100 --> 10.20220322-alldep_hc98582e_100
Thank you @jwodder. OK, let's proceed with this since I am not sure if there is an alternative. I have upgraded git-annex, spotted no major show-stopper in the diff, and you confirmed that it worked on sample dandisets, so please proceed with merging/deploying it on …
@yarikoptic After merging, running the script on everything other than 000108 took about an hour. I then ran the script on just 000108 twice, and both times it crashed after about seven minutes due to a connection timeout. Also, note that I've currently disabled the cronjob.
Either dandiarchive or the S3 bucket, I couldn't tell which.
All async HTTP requests are already automatically retried on errors.
I'm concerned about what would happen if the cronjob tried to save changes to the superdataset while the 000108 dataset was in the middle of an update.
without seeing the timeout traceback I can't tell whether retries happened for the connection timeout (in which case something must have been really down) or whether we need to expand the range of what we retry for. If the issue persists -- please file an issue with details if you need some feedback/ideas, or just address it if it is something obvious.

I think you would already invoke it only for the dandisets which were processed in that call, thus excluding 000108: https://github.com/dandi/dandisets/blob/HEAD/tools/backups2datalad/datasetter.py#L89 . The only conflict could happen if two processes try to save the superdataset at the same time, but that is unlikely, so I would not worry about it.
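For reference, widening what gets retried could look roughly like this -- a sketch using `tenacity` and `aiohttp`, which may or may not match how the script's retry logic is actually structured:

```python
import asyncio

import aiohttp
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# Retry connection errors and timeouts, not just HTTP-level failures.
@retry(
    retry=retry_if_exception_type(
        (aiohttp.ClientConnectionError, asyncio.TimeoutError)
    ),
    wait=wait_exponential(multiplier=1, max=30),
    stop=stop_after_attempt(5),
)
async def head(session: aiohttp.ClientSession, url: str) -> dict:
    async with session.head(url, allow_redirects=False) as resp:
        resp.raise_for_status()
        return dict(resp.headers)
```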
@yarikoptic Based on the logs, numerous HEAD requests to …

I've re-enabled the cronjob.
just a thought, for backup, would it be better to just talk directly to s3 like s5cmd does? this would also significantly reduce api load for zarr files. |
I tried backing up 000108 again, and this time the script ran for ten minutes before dying from a connection timeout error. |
Closes #127.

To do:

- Integrate `sync_zarr()` into the rest of the backup code
- Update the `populate` command to support populating Zarr datasets as well
- Set `--zarr-target` in `backups2datalad-cron`
- Test on some Zarrs in the archive