"instantiate" dandisets from the "backup" #243

yarikoptic · 2020-09-17T14:50:06Z

To provide extensive testing for #226 on the dandisets we already have in the archive, we would need to download them all, which is becoming increasingly prohibitive.

On the drogon backup server we already have a datalad dataset with a backup of S3.

The idea is to "instantiate" the dandisets present in the archive as directories of symlinks (or even actual files via cp --reflink=always, since it's a BTRFS CoW filesystem!) at some location on the drive, where those "symlinks" would point into the asset store located under /mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore .

The catch is that the asset store uses its own UUID, which is not the id of girder's "file". So we would need to either follow the redirect that dandiarchive's girder issues for https://girder.dandiarchive.org/api/v1/file/{file['id']}/download to get the actual asset id:

$> curl -I https://girder.dandiarchive.org/api/v1/file/5f176584f63d62e1dbd06946/download     
HTTP/1.1 303 See Other
Server: nginx/1.14.0 (Ubuntu)
Date: Thu, 17 Sep 2020 14:42:28 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 1652
Connection: keep-alive
Allow: DELETE, GET, HEAD, OPTIONS, PATCH, POST, PUT
Girder-Request-Uid: 6e2b2cbc-c6a3-4265-8068-14151b94f9cc
Location: https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVEYHMC7MS%2F20200917%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200917T144228Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=FwoGZXIvYXdzEBAaDKv1lZXvP9wFZRzEdCK%2FAZFXw8ch9QU9XsbYJneN4%2BIZTHUkUdu9P8xVvYlNNECKMEA25TTmbsKywS5YSDkTeY6x%2F67QDDHhbGRH89XanXXUejXHSk%2F5vU8MajEq0WV2iGMkpbYTUw9lFIlCAXnprmcDLd7LyTWCBi9tWpycrXD8YSUto3VUXG%2FTMjHOx4%2FG8CGi3I%2F1m3siPX7SQexDrmK7YpGI0jxEVYxF9sVvUtKeYF3PZWyX1b6KB0t%2BOOy4UCL%2FPRhW8gYvtHO%2F2EnxKIrrjfsFMi1WZZe2Ye%2FEJi6jx1xSE6nG%2B%2BdKQ%2BdHoigP06wBwHoLSaCdyIhVkoNGyk%2BEMt8%3D&X-Amz-Signature=cf09a8c3a24040939b756080de942bc79f171675ab70a4530a13e271b4adbe09
Strict-Transport-Security: max-age=63072000

to get that girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc path which is on drogon:

$> ls -l /mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc                 
lrwxrwxrwx 1 yoh yoh 125 Jul 21 18:00 /mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc -> ../../../.git/annex/objects/z8/3w/MD5E-s18792--33318fd510094e4304868b4a481d4a5a/MD5E-s18792--33318fd510094e4304868b4a481d4a5a

which would work but would be somewhat inefficient (we could cache the results since the mapping should not change), or load the entire table back from mongodb and get all the mappings (more work, so probably not).
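
A minimal Python sketch of the redirect-based lookup described above. It assumes the Location header always has the form shown in the curl output; both function names are hypothetical and not part of dandi-cli.

```python
import http.client
from typing import Optional
from urllib.parse import urlsplit


def assetstore_path_from_location(location: str) -> str:
    """Return the path of a redirect target, without the leading slash
    and without the signed query string."""
    return urlsplit(location).path.lstrip("/")


def resolve_girder_file(
    file_id: str, host: str = "girder.dandiarchive.org"
) -> Optional[str]:
    """Send a HEAD request (http.client does not follow redirects) and map
    the Location header to a path relative to the backup root.
    Returns None if the response carries no Location header."""
    conn = http.client.HTTPSConnection(host, timeout=30)
    try:
        conn.request("HEAD", f"/api/v1/file/{file_id}/download")
        location = conn.getresponse().getheader("Location")
    finally:
        conn.close()
    return assetstore_path_from_location(location) if location else None
```

Only the path component is used, so the expiring X-Amz-* query parameters do not matter for the lookup.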

So I think the course of action could be to

1. add an option "add_resolved_url" to GirderCli.get_dandiset_and_assets so that it adds resolved URLs such as the above "https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/740feade0d784acc..." to the returned asset records
2. add a dandi instantiate --assetstore PATH -o TOPPATH DANDISET_ID command (present only in DANDI_DEVEL mode) which would just go through all the assets of the dandiset and perform the aforementioned cp -L --reflink=always {assetstore}/{path-within-assetstore-fromurl}

I think it should work quite fast and be very efficient, since no heavy data transfer would happen and no new space would be consumed (besides filesystem-level metadata for the CoW-copied files).
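
The per-asset copy step could look roughly like this in Python. This is a sketch only: the helper names are made up for illustration, and the cp flags are the ones quoted above. Note that --reflink=always makes cp fail on filesystems without CoW support rather than fall back to a full copy.

```python
import subprocess
from pathlib import Path
from typing import List


def reflink_cp_command(assetstore: Path, rel_path: str, dest: Path) -> List[str]:
    """Build the cp invocation for one asset.

    rel_path is the path within the asset store, e.g. "74/0f/<uuid>".
    -L dereferences the git-annex symlink so the file content is copied.
    """
    return ["cp", "-L", "--reflink=always", str(assetstore / rel_path), str(dest)]


def instantiate_asset(assetstore: Path, rel_path: str, dest: Path) -> None:
    """Create the destination directory and reflink-copy one asset into it."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(reflink_cp_command(assetstore, rel_path, dest), check=True)
```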

yarikoptic · 2020-09-17T15:17:17Z

Actually it might be much easier: we don't need to tune get_dandiset_and_assets or add a new command, just add a devel --assetstore option to the download command and do cp instead of a download if the file is found in the assetstore.
Records returned by get_dandiset_and_assets already include girder.id:

$> dandi ls -r https://dandiarchive.org/dandiset/000027/draft      
2020-09-17 11:15:53,755 [    INFO] Traversing remote dandisets (000027) recursively
- ...
- attrs:
    ctime: '2020-07-21T22:00:36.362000+00:00'
    mtime: '2020-07-21T17:31:55.283394-04:00'
    size: 18792
  girder:
    id: 5f176584f63d62e1dbd06946
...
  name: sub-RAT123.nwb
  path: /sub-RAT123/sub-RAT123.nwb
...
  type: file

so it will just be a helper function in download to resolve girder id to the path in the asset store, ideally cached.
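
One way such a cached helper might look, as a sketch: the class name and the JSON cache file are assumptions for illustration, and the actual resolver (for example one that follows the girder redirect) is passed in as a callable.

```python
import json
from pathlib import Path
from typing import Callable, Dict


class CachedResolver:
    """Memoize girder file id -> asset store path lookups in a JSON file."""

    def __init__(self, resolve: Callable[[str], str], cache_file: Path):
        self._resolve = resolve
        self._cache_file = cache_file
        # Load mappings saved by a previous run, if any.
        self._cache: Dict[str, str] = (
            json.loads(cache_file.read_text()) if cache_file.exists() else {}
        )

    def __call__(self, girder_id: str) -> str:
        if girder_id not in self._cache:
            self._cache[girder_id] = self._resolve(girder_id)
            # Persist after every new lookup so interrupted runs keep progress.
            self._cache_file.write_text(json.dumps(self._cache))
        return self._cache[girder_id]
```

Persisting the mapping is reasonable only because, as noted earlier in the thread, the mapping is not expected to change.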

jwodder · 2020-09-17T16:54:42Z

Is it really necessary to implement this as a download option? It's not something that any normal user would need, and it would probably be cleaner if the whole thing was just a script that used dandi as a library.

yarikoptic · 2020-09-17T17:00:59Z

Sure -- could just be an outside script. I just thought it might be simpler to just implement it within dandi-cli as a DANDI_DEVEL option.

jwodder · 2020-09-17T17:04:58Z

Is there a recommended way to get a list of all Dandiset IDs? I know I can use Girder's /dandi endpoint, but it lacks decent pagination support, and I'm not sure if one of the other API components has a better endpoint.

Also, should there be some sort of handling of Dandiset versions?

jwodder · 2020-09-17T17:23:20Z

Problem: The Python on drogon is 3.5, yet this library requires 3.6.

yarikoptic · 2020-09-17T18:44:24Z

Please just install miniconda in your HOME with any suitable Python.

re list of dandisets:

In [11]: from dandi import girder                                                                                

In [12]: cl = girder.get_client("https://girder.dandiarchive.org")                                               

In [13]: [r['name'] for r in cl.listFolder("5e59bb0af19e820ab6ea6c62", parentFolderType='collection')]           
Out[13]: 
['000003',
 '000004',
 '000005',
 '000006',
 '000007',
 '000008',
 '000009',
 '000010',
 '000011',
 '000012',
 '000013',
 '000015',
...

yarikoptic · 2020-09-17T18:45:28Z

re miniconda: a copy-pasteable example from http://handbook.datalad.org/en/latest/intro/installation.html?highlight=hpc#linux-machines-with-no-root-access-e-g-hpc-systems

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
# acknowledge license, keep everything at default
$ conda install -c conda-forge datalad

jwodder · 2020-09-17T19:25:50Z

The asset for 000020/sub-746220081/sub-746220081_ses-751268829_icephys.nwb (which should be at girder-assetstore/47/dc/47dc1e65b27441f28d5e7d9cf1109c12) appears to be missing from the backup for some reason, making the script fail. Should failures of the cp command just be ignored?

yarikoptic · 2020-09-18T00:10:06Z

hm... for now please add an option for that -- for the "investigate metadata compliance" purpose it is ok to miss a few files. BUT eventually we need to figure it out. (The backup runs daily, so it might be that there were some changes to that dandiset today? Will check later.)
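
A sketch of what such an option could do, in Python. The ignore_errors flag and the function name are hypothetical, and shutil.copyfile stands in for the cp call here so the example stays self-contained.

```python
import shutil
from pathlib import Path
from typing import Iterable, List, Tuple


def copy_assets(
    pairs: Iterable[Tuple[Path, Path]], ignore_errors: bool = False
) -> List[Path]:
    """Copy each (source, destination) pair.

    With ignore_errors=True, failed sources are collected and returned
    instead of aborting the run, so they can be investigated later.
    """
    failed: List[Path] = []
    for src, dest in pairs:
        try:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src, dest)
        except OSError:
            if not ignore_errors:
                raise
            failed.append(src)
    return failed
```

Returning the list of failures keeps a record of which assets were missing from the backup.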

jwodder · 2020-09-18T12:58:56Z

The script has finished running, taking 20 minutes and 22 seconds. Should I commit it to a top-level tools/ directory in this repository or place it somewhere else?

yarikoptic · 2020-09-18T14:54:34Z

awesome! yes please - commit it under tools/.

Due to dandi/dandiarchive-legacy#491, though, we are lacking dandiset.yaml in each one of those. Could you please adjust the script to use dandi download --download dandiset.yaml to instantiate all of them so we get them "more complete"?

jwodder · 2020-09-18T15:30:29Z

Done. Pull request: #244

Script for "instantiating" Dandisets from asset store

jwodder mentioned this issue Sep 18, 2020

Script for "instantiating" Dandisets from asset store #244

Merged

yarikoptic closed this as completed in #244 Sep 21, 2020

yarikoptic added a commit that referenced this issue Sep 21, 2020

Merge pull request #244 from dandi/gh-243

05dff2d

Script for "instantiating" Dandisets from asset store

yarikoptic mentioned this issue Oct 2, 2020

Instantiate DataLad dandisets from backup #250

Closed
