Support backing up Zarrs #134
Conversation
Force-pushed from 4c4bb6c to bdf460d
@yarikoptic This should be finished, but the old tests are failing because of dandi/dandi-archive#961, and one of the new tests is failing because of dandi/dandi-archive#965.
magnificent ;) it takes me a while to read through ... not sure if I will ever reach the bottom ;) Left some questions where I had a peek at the code, and also:
- did you test on some zarrs in the archive (the last checkbox in the original description)?
I think I would prefer to merge it and just "deploy", but would like to know first that it was tried. I guess offload to backup on dropbox -- that you didn't try, did you?
```python
                name, isdir=False, checksum=listing.checksums[name]
            )

    async def stat(self, entry: RemoteZarrEntry) -> Tuple[str, ZarrEntryStat]:
```
oh -- we do it for each key? that might be expensive/slow, wouldn't it?
is there some API to query a number of keys at once, or would it just be serialized into individual requests? (hard to believe; didn't check the S3 API)
There does not appear to be an API for querying multiple keys from S3. I'm not clear on what you mean by "it would just be serialized into individual ones".

Moreover, note that just getting the S3 key for a Zarr entry would still require an individual HEAD request to see where the /zarr/{zarr_id}.zarr/{path} endpoint redirects to.
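To illustrate the cost being discussed -- a minimal sketch of resolving a single entry's S3 location, assuming a `requests`-style client; the base URL and entry path are placeholders, and the real script's HTTP client may differ:

```python
import requests

API_BASE = "https://api.dandiarchive.org/api"  # assumed base URL
zarr_id = "001e3b6d-26fb-463f-af28-520a25680ab4"  # example Zarr ID
entry_path = "foo/bar/0.0.0"  # hypothetical entry path

# One HEAD request per entry: don't follow the redirect, just read
# the Location header to learn the underlying S3 URL.
resp = requests.head(
    f"{API_BASE}/zarr/{zarr_id}.zarr/{entry_path}", allow_redirects=False
)
s3_url = resp.headers["Location"]
print(s3_url)
```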
Unless it could/would be implemented quickly -- I think it is ok to proceed here as-is for now, but in the long run we will need a more efficient API for that.

Please file an issue (or a PR if you see a straightforward way to implement it) with dandi-archive to expose the necessary information in dandi-api without requiring a per-file query for each entry in a Zarr -- such querying would have a huge (and avoidable) impact on dandi-api! Another part of the motivation is that /zarr/{zarr_id}.zarr/{path} is there to provide "sparse" access to a zarr for zarr clients, so they would be unlikely to traverse the entire zarr folder. In our case, we would.

Possibly related is dandi/dandi-archive#925 -- a "mid point" solution could be to expose that information in that zarr_content_read endpoint, but since it is not recursive, most likely it would not be a good fit.
> Moreover, note that just getting the S3 key for a Zarr entry would still require an individual HEAD request to see where the /zarr/{zarr_id}.zarr/{path} endpoint redirects to.

worst comes to worst (no information on the redirect to the S3 URL), we can hardcode the assumption of a 1-to-1 mapping into the S3 bucket. I believe the zarr implementation doesn't care about versionIds ATM, does it? So we would need to get a versionId'ed URL via an S3 request anyways :-/
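A sketch of what that hardcoded assumption could look like with boto3 -- the bucket name and the `zarr/{zarr_id}/{path}` key layout are the assumed 1-to-1 mapping from this discussion, not a documented guarantee:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

bucket = "dandiarchive"  # assumed bucket
# Assumed 1-to-1 mapping of Zarr entries onto S3 keys:
key = "zarr/001e3b6d-26fb-463f-af28-520a25680ab4/foo/bar/0.0.0"

# Assuming the bucket allows public reads, an unsigned client suffices.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
info = s3.head_object(Bucket=bucket, Key=key)  # one HEAD: ETag + VersionId
versioned_url = (
    f"https://{bucket}.s3.amazonaws.com/{key}?versionId={info['VersionId']}"
)
print(versioned_url)
```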
I posted a comment on that issue.
> Could you elaborate on how you're envisioning this new implementation working?
For a zarr folder dataset, establish a sync of that folder directly from S3, not going through the API. And then just sweep through the final state and registerurl's pointing to the API for completeness.

FWIW, it could even use the good old https://github.com/datalad/datalad-crawler/ extension to do that with the simple_s3 pipeline (I can show how if we decide to go this route -- it is old code but seems to generally work), or maybe git-annex supports it natively now? https://git-annex.branchable.com/todo/import_tree/ is marked as done -- please check if you can import a tree from an S3 special remote, and whether the URLs would be versioned etc. If it works -- let's go the native git-annex way. If not -- we can go the crawler way. It is quite efficient at getting only updates (it uses timestamps on keys IIRC to filter out unrelated changes), and it can also establish commits for intermediate states whenever needed.
Wouldn't this rely on the assumption that an entry foo/bar/baz in a Zarr is located at /zarr/{zarr_id}/foo/bar/baz on S3? That strikes me as an implementation detail.
it indeed is an implementation detail. But ATM we are not creating some generic tool which people would re-use for other DANDI archive setups (which aren't even supported). Indeed, it might make the setup of testing trickier though.

I am ok either way, but I am afraid that with the current implementation approach updates might be tricky/expensive (they require either caching the zarr checksum of the entire tree or some other way to assess that nothing changed; or maybe recording the state/checksum locally in the commit message and then relying on git status to say that it is all clean).

Just FTR, here is how to establish and crawl a prefix in the bucket using datalad-crawler (needs to be pip install'ed):

```
datalad create /tmp/test-crawl
cd /tmp/test-crawl
datalad crawl-init --save --template=simple_s3 bucket=dandiarchive prefix=zarr/001e3b6d-26fb-463f-af28-520a25680ab4/2/ to_http=1
datalad crawl
```

Cons of the crawler (besides the implementation detail): it proceeds through all those files serially, so it might take a while. We might want to get back into the crawler and add parallelization support. Should be quite doable IMHO.

I think the best would be to try how your current implementation works on some proper zarr in staging, to assess how/if it would scale to backing up even larger zarrs, and in particular how rerunning to fetch updates behaves. In parallel we can compare against running the crawler on the same zarr and see how well it copes with the original fetch and then an "update".
if we are considering other tools, why not just use s5cmd instead of datalad to crawl the prefix? (it's significantly faster for that kind of ls, even with etag retrieval.)
we need to register keys and URLs for those files in annex... actually, ideally not even download the files in that step (forgot if we paid attention to that here), but just mint/register keys based on the ETag in git-annex, similarly to how we do it for regular assets, and then download/backup in a separate step. I don't think the "crawl"ing is what would be the bottleneck. It would be the rest of the interaction with git-annex + making updates efficient (i.e. instrumenting the interaction with S3 to only react to new stuff, as datalad-crawler does, although I believe it does fetch a list of all keys/versions first... would need to check).

So -- not sure if use of s5cmd would mitigate anything here.
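For the "register without downloading" step, `git annex addurl --fast` is the closest off-the-shelf primitive -- it records a file against its URL without fetching content. Note this mints a URL-derived key rather than the ETag-based key discussed above, so it is only a sketch of the shape of the step; the paths and versionId are hypothetical:

```python
import subprocess

def register_entry(repo: str, file_path: str, url: str) -> None:
    """Record `url` for `file_path` in the annex without downloading it."""
    subprocess.run(
        ["git", "annex", "addurl", "--fast", f"--file={file_path}", url],
        cwd=repo,
        check=True,
    )

# Hypothetical usage with a versioned S3 URL:
register_entry(
    "/tmp/test-crawl",  # hypothetical dataset path
    "foo/bar/0.0.0",
    "https://dandiarchive.s3.amazonaws.com/zarr/001e3b6d-26fb-463f-af28-520a25680ab4/foo/bar/0.0.0?versionId=abc123",
)
```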
@yarikoptic No, I did not test on any Zarrs in the main archive. It's my understanding that all of the Zarrs on production are huge.
@yarikoptic I tried backing up https://gui-staging.dandiarchive.org/dandiset/101233 with this script. The most recent run took 138 minutes and failed at this point with:
When I checked afterwards, I tried running py-spy on the backup process, but it got killed at some point, likely due to running out of memory. Also note that I had to make an update to the code in order to handle a change in git-annex 10.20220222, and the updated code does not work with older versions of git-annex. Do you need the script to work with older versions, or can you just update the version of git-annex used when the script is actually run?
It must not. Exit code 128 is segv, iirc? Something freaked git out, I guess. We can use a newer git-annex if needed, but would need to make sure that the other commands are compatible. Out of memory would sound suspicious, but maybe there was a git/git-annex process explosion? We used to have some issues like that on OSX in the past, but I thought they were addressed.
@yarikoptic Are you saying that
I don't think py-spy tracks subprocesses, and there should have been only 36 git-annex processes for most of the backup script's runtime anyway. I suspect that two hours of constant function calls just happened to produce too much data for py-spy to handle.
as with … here are quick examples to demonstrate the above statement:

```
lena:/tmp
$> ls -ld dandi; datalad clone ///dandi
ls: cannot access 'dandi': No such file or directory
install(ok): /tmp/dandi (dataset)
(dev3) 1 17783.....................................:Mon 18 Apr 2022 11:16:22 AM EDT:.
lena:/tmp
$> ls -ld dandi; datalad clone ///dandi
drwx------ 5 yoh yoh 4096 Apr 18 11:16 dandi/
(dev3) 1 17783.....................................:Mon 18 Apr 2022 11:16:27 AM EDT:.
lena:/tmp
$> rm -rf dandi; mkdir dandi; ls -ld dandi; datalad clone ///dandi
drwx------ 2 yoh yoh 4096 Apr 18 11:16 dandi/
install(ok): /tmp/dandi (dataset)
(dev3) 1 17784.....................................:Mon 18 Apr 2022 11:16:55 AM EDT:.
lena:/tmp
$> rm -rf dandi; mkdir dandi; ls -ld dandi; datalad clone ///dandi/dandisets dandi; datalad clone ///dandi
drwx------ 2 yoh yoh 4096 Apr 18 11:17 dandi/
[INFO ] scanning for unlocked files (this may take some time)
[INFO ] access to 1 dataset sibling dandi-dandisets-dropbox not auto-enabled, enable with:
| datalad siblings -d "/tmp/dandi" enable -s dandi-dandisets-dropbox
install(ok): /tmp/dandi (dataset)
install(error): /tmp/dandi (dataset) [target path already exists and not empty, refuse to clone into target path]
```

but note that the "directory" issue refers to a directory in …, so it is not about ….

I most often use … overall -- do you observe the same error while trying as …?
BTW!!! Given … we might be at the limit of what OSX has for max path length! Try just … -- it could give us an answer, although that number is really a "ballpark" (run it under …). But it also points to the limit we might hit on drogon as well with all such long names/nesting, hm...
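A quick way to check the path-length suspicion -- a sketch that walks the dataset and reports the longest absolute path (PATH_MAX is commonly 1024 on OSX and 4096 on Linux; the root below is a placeholder):

```python
import os

root = "/tmp/test-backup"  # placeholder: wherever the Zarr dataset lives

# Longest absolute path in the tree, to compare against the OS limit.
longest = max(
    (
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    ),
    key=len,
    default=root,
)
print(len(longest), longest)
```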
@yarikoptic The …
ok, I would have run with … falling into …
@yarikoptic Doing …
well, it is not equivalent, so I guess the devil might be in the details. Could you just try to do exactly that …
@yarikoptic I ran the exact same backup script on drogon, backing up to a folder in …
ok, so it must be something about OSX and/or git on that system, and you can successfully benchmark etc. on drogon. And is 112 minutes for a single zarr, or for however many zarrs are in the dandiset? How long does it take to update right after (again 112 minutes?)? CI tests are red since they seem to still need the more recent git-annex you noticed:
with some incompatibility changes which were mitigated only in datalad 0.16.x, I was a bit reluctant to update the version in neurodebian... ok -- updated now to …
eh, which suggests that such a workflow would hardly be usable for any frequent updates in an archive-wide setting. Where is it on drogon? I will see how fast it is to crawl/recrawl using datalad-crawler straight from S3 (even without any optimization, but with fetching data ATM).
@yarikoptic The backup is in …
@yarikoptic Ping. Updates?
I have merged dandi/dandi-archive#1074, boosting dandischema to 0.7.1 on the server. In principle that should address the issue, but we would also need a release of dandi-cli so this code uses a "congruent" version?
```diff
@@ -1,16 +1,17 @@
 # Python ~= 3.8
-# git-annex >= 8.20210903
+# git-annex >= 10.20220222
```
upgrading in the dandisets env: git-annex 8.20211028-alldep_h27987b5_100 --> 10.20220322-alldep_hc98582e_100
Thank you @jwodder. OK, let's proceed with this since I am not sure if there is an alternative. I have upgraded git-annex, spotted no major show-stopper in the diff, and you confirmed that it worked on sample dandisets, so please proceed with merging/deploying it on …
@yarikoptic After merging, running the script on everything other than 000108 took about an hour. I then ran the script on just 000108 twice, and both times it crashed after about seven minutes due to a connection timeout. Also, note that I've currently disabled the cronjob.
Either dandiarchive or the S3 bucket, I couldn't tell which.
All async HTTP requests are already automatically retried on errors.
I'm concerned about what would happen if the cronjob tried to save changes to the superdataset while the 000108 dataset was in the middle of an update.
without seeing the timeout traceback I can't tell whether retries happened for the connection timeout (in which case something must have been really down) or whether we need to expand the range of what we retry for. If the issue persists -- please file an issue with details if you need some feedback/ideas, or just address it if it is something obvious.

I think you would already invoke it only for the dandisets which were processed in that call, thus excluding 000108: https://github.com/dandi/dandisets/blob/HEAD/tools/backups2datalad/datasetter.py#L89 . The only conflict could happen if two processes try to save the superdataset at the same time, but that is unlikely, so I would not worry about it.
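For reference, widening what gets retried could look roughly like this -- a sketch using `tenacity` and `aiohttp`, which may or may not match how the script's retry logic is actually structured:

```python
import asyncio

import aiohttp
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# Retry connection errors and timeouts, not just HTTP-level failures.
@retry(
    retry=retry_if_exception_type(
        (aiohttp.ClientConnectionError, asyncio.TimeoutError)
    ),
    wait=wait_exponential(multiplier=1, max=30),
    stop=stop_after_attempt(5),
)
async def head(session: aiohttp.ClientSession, url: str) -> dict:
    async with session.head(url, allow_redirects=False) as resp:
        resp.raise_for_status()
        return dict(resp.headers)
```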
@yarikoptic Based on the logs, numerous HEAD requests to …

I've re-enabled the cronjob.
just a thought, for backup, would it be better to just talk directly to s3 like s5cmd does? this would also significantly reduce api load for zarr files. |
I tried backing up 000108 again, and this time the script ran for ten minutes before dying from a connection timeout error. |
Closes #127.

To do:

- Integrate `sync_zarr()` into the rest of the backup code
- Update the `populate` command to support populating Zarr datasets as well
- Set `--zarr-target` in `backups2datalad-cron`
- Test on some Zarrs in the archive