"instantiate" dandisets from the "backup" #243

Closed
yarikoptic opened this issue Sep 17, 2020 · 12 comments · Fixed by #244

Comments

@yarikoptic
Member

To provide extensive testing for #226 on the dandisets we already have in the archive, we would need to download them all. But that would be increasingly prohibitive.

On the drogon backup server we already have a datalad dataset with the backup of S3.

The idea is to "instantiate" dandisets present in the archive as directories with symlinks (or possibly actual files via cp --reflink=always, since it's a BTRFS CoW filesystem!) in some location on the drive, where those "symlinks" would point into the asset store located under /mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore .

The catch is that the asset store uses its own UUID, which is not the id of girder's "file". So we would need to either follow the redirect from dandiarchive's girder at https://girder.dandiarchive.org/api/v1/file/{file['id']}/download to get the actual asset id:

$> curl -I https://girder.dandiarchive.org/api/v1/file/5f176584f63d62e1dbd06946/download     
HTTP/1.1 303 See Other
Server: nginx/1.14.0 (Ubuntu)
Date: Thu, 17 Sep 2020 14:42:28 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 1652
Connection: keep-alive
Allow: DELETE, GET, HEAD, OPTIONS, PATCH, POST, PUT
Girder-Request-Uid: 6e2b2cbc-c6a3-4265-8068-14151b94f9cc
Location: https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVEYHMC7MS%2F20200917%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200917T144228Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=FwoGZXIvYXdzEBAaDKv1lZXvP9wFZRzEdCK%2FAZFXw8ch9QU9XsbYJneN4%2BIZTHUkUdu9P8xVvYlNNECKMEA25TTmbsKywS5YSDkTeY6x%2F67QDDHhbGRH89XanXXUejXHSk%2F5vU8MajEq0WV2iGMkpbYTUw9lFIlCAXnprmcDLd7LyTWCBi9tWpycrXD8YSUto3VUXG%2FTMjHOx4%2FG8CGi3I%2F1m3siPX7SQexDrmK7YpGI0jxEVYxF9sVvUtKeYF3PZWyX1b6KB0t%2BOOy4UCL%2FPRhW8gYvtHO%2F2EnxKIrrjfsFMi1WZZe2Ye%2FEJi6jx1xSE6nG%2B%2BdKQ%2BdHoigP06wBwHoLSaCdyIhVkoNGyk%2BEMt8%3D&X-Amz-Signature=cf09a8c3a24040939b756080de942bc79f171675ab70a4530a13e271b4adbe09
Strict-Transport-Security: max-age=63072000

to get that girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc path which is on drogon:

$> ls -l /mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc                 
lrwxrwxrwx 1 yoh yoh 125 Jul 21 18:00 /mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc -> ../../../.git/annex/objects/z8/3w/MD5E-s18792--33318fd510094e4304868b4a481d4a5a/MD5E-s18792--33318fd510094e4304868b4a481d4a5a

which would work but be somewhat inefficient (we could cache those, since the mapping should not change), or load the entire table back from mongodb and get all the mappings (more work - probably not).
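
A minimal sketch of pulling the asset store path out of such a redirect, assuming the Location header always has the form `https://<bucket host>/girder-assetstore/xx/yy/<uuid>?<signature params>`. The name `assetstore_relpath` is made up for illustration and is not part of dandi; the query string in the example is shortened.

```python
from urllib.parse import urlsplit


def assetstore_relpath(location_url):
    """Return the key within the bucket, e.g. 'girder-assetstore/74/0f/...'.

    Assumption: the path component of the redirect URL is exactly the key in
    the S3 bucket, and the query string only carries signature parameters.
    """
    path = urlsplit(location_url).path.lstrip("/")
    if not path.startswith("girder-assetstore/"):
        raise ValueError(f"unexpected redirect target: {location_url}")
    return path


url = (
    "https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/"
    "740feade0d784acc8ec76bb7834d80dc?X-Amz-Algorithm=AWS4-HMAC-SHA256"
)
print(assetstore_relpath(url))
# prints: girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc
```

The HTTP request that fetches the Location header is left out on purpose, so the parsing step can be checked offline.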

So I think the course of action could be to

  • add an option "add_resolved_url" to GirderCli.get_dandiset_and_assets so it would add resolved URLs like above "https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/740feade0d784acc..." to the returned records of the assets
  • add dandi instantiate --assetstore PATH -o TOPPATH DANDISET_ID command (present only in DANDI_DEVEL mode) which would just go through all the assets of the dandiset and perform aforementioned cp -L --reflink=always {assetstore}/{path-within-assetstore-fromurl}

I think it should work quite fast and would be very efficient, since no heavy data transfer would be happening and no new space would be consumed (besides filesystem-level metadata for the CoW-copied files).
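
The copy step from the second bullet could look roughly like this. It is only a sketch: `build_cp_command` and `instantiate_asset` are hypothetical names, and it assumes `relpath` is the path within the asset store (e.g. `74/0f/740f...`) taken from the resolved URL.

```python
import subprocess
from pathlib import Path


def build_cp_command(assetstore, relpath, dest):
    """Build the cp invocation for one asset.

    -L dereferences the git-annex symlink in the backup; --reflink=always
    asks for a CoW clone and fails on filesystems that cannot provide one.
    """
    src = Path(assetstore) / relpath
    return ["cp", "-L", "--reflink=always", str(src), str(dest)]


def instantiate_asset(assetstore, relpath, dest):
    """Copy one asset into place, creating parent directories first."""
    Path(dest).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_cp_command(assetstore, relpath, dest), check=True)


cmd = build_cp_command(
    "/mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore",
    "74/0f/740feade0d784acc8ec76bb7834d80dc",
    "000027/sub-RAT123/sub-RAT123.nwb",
)
print(" ".join(cmd))
```

Keeping the command construction separate from the `subprocess.run` call makes it possible to inspect what would be executed without touching the filesystem.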

@yarikoptic
Member Author

Actually it might be much easier: we don't need to tune get_dandiset_and_assets or add a new command, just add a devel --assetstore option to the download command and do cp instead of download if the asset is found in the assetstore.
Records returned by get_dandiset_and_assets already include girder.id :

$> dandi ls -r https://dandiarchive.org/dandiset/000027/draft      
2020-09-17 11:15:53,755 [    INFO] Traversing remote dandisets (000027) recursively
- ...
- attrs:
    ctime: '2020-07-21T22:00:36.362000+00:00'
    mtime: '2020-07-21T17:31:55.283394-04:00'
    size: 18792
  girder:
    id: 5f176584f63d62e1dbd06946
...
  name: sub-RAT123.nwb
  path: /sub-RAT123/sub-RAT123.nwb
...
  type: file

so it will just be a helper function in download to resolve girder id to the path in the asset store, ideally cached.
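
A sketch of what such a cached helper might look like, under the assumption that the caller supplies a function returning the redirect URL for a girder id. `make_cached_resolver` and `fake_fetch` are hypothetical names, and the cache here is in-memory only (it does not persist between runs).

```python
from functools import lru_cache
from urllib.parse import urlsplit


def make_cached_resolver(fetch_location):
    """Wrap fetch_location(girder_id) -> redirect URL with an in-memory cache.

    fetch_location is supplied by the caller (hypothetical here); it is only
    invoked the first time a given girder id is seen.
    """

    @lru_cache(maxsize=None)
    def resolve(girder_id):
        return urlsplit(fetch_location(girder_id)).path.lstrip("/")

    return resolve


calls = []


def fake_fetch(girder_id):
    # Stand-in for the real HTTP request, so the sketch runs offline.
    calls.append(girder_id)
    return (
        "https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/"
        "740feade0d784acc8ec76bb7834d80dc?X-Amz-Expires=3600"
    )


resolve = make_cached_resolver(fake_fetch)
resolve("5f176584f63d62e1dbd06946")
resolve("5f176584f63d62e1dbd06946")
print(len(calls))  # prints 1: the second lookup is served from the cache
```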

@jwodder
Member

jwodder commented Sep 17, 2020

Is it really necessary to implement this as a download option? It's not something that any normal user would need, and it would probably be cleaner if the whole thing was just a script that used dandi as a library.

@yarikoptic
Member Author

Sure -- could just be an outside script. I just thought it might be simpler to just implement it within dandi-cli as a DANDI_DEVEL option.

@jwodder
Member

jwodder commented Sep 17, 2020

Is there a recommended way to get a list of all Dandiset IDs? I know I can use Girder's /dandi endpoint, but it lacks decent pagination support, and I'm not sure if one of the other API components has a better endpoint.

Also, should there be some sort of handling of Dandiset versions?

@jwodder
Member

jwodder commented Sep 17, 2020

Problem: The Python on drogon is 3.5, yet this library requires 3.6.

@yarikoptic
Member Author

Please just install miniconda in your HOME with any suitable Python.

re list of dandisets:

In [11]: from dandi import girder                                                                                

In [12]: cl = girder.get_client("https://girder.dandiarchive.org")                                               

In [13]: [r['name'] for r in cl.listFolder("5e59bb0af19e820ab6ea6c62", parentFolderType='collection')]           
Out[13]: 
['000003',
 '000004',
 '000005',
 '000006',
 '000007',
 '000008',
 '000009',
 '000010',
 '000011',
 '000012',
 '000013',
 '000015',
...

@yarikoptic
Member Author

re miniconda: copy-pasteable example from http://handbook.datalad.org/en/latest/intro/installation.html?highlight=hpc#linux-machines-with-no-root-access-e-g-hpc-systems

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
# acknowledge license, keep everything at default
$ conda install -c conda-forge datalad

@jwodder
Member

jwodder commented Sep 17, 2020

The asset for 000020/sub-746220081/sub-746220081_ses-751268829_icephys.nwb (which should be at girder-assetstore/47/dc/47dc1e65b27441f28d5e7d9cf1109c12) appears to be missing from the backup for some reason, making the script fail. Should failures of the cp command just be ignored?

@yarikoptic
Member Author

hm... for now please add an option for that -- for the "investigate metadata compliance" it is ok to miss a few files. BUT eventually we need to figure it out. (backup runs daily, so it might be that there were some changes to that dandiset today? will check later)
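
Such an option might be sketched as follows. `copy_all`, `ignore_errors` and `fake_copy` are illustrative names only, not what the actual script uses; the `copy` callable is assumed to raise `OSError` or `subprocess.CalledProcessError` when an asset cannot be copied.

```python
import subprocess


def copy_all(pairs, copy, ignore_errors=False):
    """Run copy(src, dest) for each pair and return the sources that failed.

    With ignore_errors=False the first failure is re-raised, so the run stops.
    """
    failed = []
    for src, dest in pairs:
        try:
            copy(src, dest)
        except (OSError, subprocess.CalledProcessError) as exc:
            if not ignore_errors:
                raise
            print(f"skipping {src}: {exc}")
            failed.append(src)
    return failed


def fake_copy(src, dest):
    # Stand-in for the real cp call; pretends one asset is absent.
    if src.endswith("47dc1e65b27441f28d5e7d9cf1109c12"):
        raise FileNotFoundError(src)


pairs = [
    ("girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc", "a.nwb"),
    ("girder-assetstore/47/dc/47dc1e65b27441f28d5e7d9cf1109c12", "b.nwb"),
]
failed = copy_all(pairs, fake_copy, ignore_errors=True)
print(failed)
```

Returning the list of failed sources keeps a record of what was skipped, which helps when tracking down why assets are missing from the backup.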

@jwodder
Member

jwodder commented Sep 18, 2020

The script has finished running, taking 20 minutes and 22 seconds. Should I commit it to a top-level tools/ directory in this repository or place it somewhere else?

@yarikoptic
Member Author

awesome! yes please - commit it under tools/.

Due to dandi/dandiarchive-legacy#491, though, we are lacking dandiset.yaml in each one of those. Could you please adjust the script to use dandi download --download dandiset.yaml when instantiating them, so we get them "more complete"?

@jwodder
Member

jwodder commented Sep 18, 2020

Done. Pull request: #244

yarikoptic added a commit that referenced this issue Sep 21, 2020
Script for "instantiating" Dandisets from asset store
2 participants