"instantiate" dandisets from the "backup" #243
Comments
actually it might be much easier --- we don't need to tune `GirderCli.get_dandiset_and_assets`, since the girder id is already in the records we get back:

```
$> dandi ls -r https://dandiarchive.org/dandiset/000027/draft
2020-09-17 11:15:53,755 [ INFO] Traversing remote dandisets (000027) recursively
- ...
- attrs:
    ctime: '2020-07-21T22:00:36.362000+00:00'
    mtime: '2020-07-21T17:31:55.283394-04:00'
    size: 18792
  girder:
    id: 5f176584f63d62e1dbd06946
  ...
  name: sub-RAT123.nwb
  path: /sub-RAT123/sub-RAT123.nwb
  ...
  type: file
```

so it will just be a helper function in `download` to resolve a girder id to the path in the asset store, ideally cached.
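A minimal sketch of what such a cached helper could look like (purely illustrative; the function name, the redirect behaviour, and the backup mount point are assumptions, not part of dandi's API): ask girder for the download URL without following the redirect, take the S3 key from the `Location` header, and map it onto the local backup.

```python
# Hypothetical helper (not part of dandi): resolve a girder file id to its path
# under the local S3 backup by inspecting the redirect girder issues for downloads.
from functools import lru_cache
from pathlib import Path
from urllib.parse import urlparse

import requests

GIRDER_API = "https://girder.dandiarchive.org/api/v1"
# Assumed mount point of the backup on drogon (taken from this issue).
BACKUP_ROOT = Path("/mnt/backup/dandi/dandiarchive-s3-backup")


@lru_cache(maxsize=None)
def girder_id_to_backup_path(file_id: str) -> Path:
    """Map a girder file id to the corresponding file in the assetstore backup."""
    # Request the download URL but do not follow the redirect; the Location
    # header points at the S3 key, e.g.
    # https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/740feade...
    r = requests.get(f"{GIRDER_API}/file/{file_id}/download", allow_redirects=False)
    r.raise_for_status()
    s3_key = urlparse(r.headers["Location"]).path.lstrip("/")
    return BACKUP_ROOT / s3_key  # .../girder-assetstore/74/0f/740feade...
```

Caching matters here only to avoid re-hitting girder for files seen before; since the id-to-asset mapping should not change, the cache never needs invalidation.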
Is it really necessary to implement this as a download option? It's not something that any normal user would need, and it would probably be cleaner if the whole thing was just a script that used dandi as a library.
Sure -- could just be an outside script. I just thought it might be simpler to implement it within `download`.
Is there a recommended way to get a list of all Dandiset IDs? I know I can use Girder's … Also, should there be some sort of handling of Dandiset versions?
Problem: The Python on drogon is 3.5, yet this library requires 3.6.
Please just install miniconda in your home directory.

re list of dandisets: …
re miniconda: cut-and-pasteable example from http://handbook.datalad.org/en/latest/intro/installation.html?highlight=hpc#linux-machines-with-no-root-access-e-g-hpc-systems
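On getting the list of Dandiset IDs, one possible approach is a short girder-client sketch (hedged: the "drafts" collection name and the folder-per-dandiset layout are assumptions about the legacy Girder archive, not a documented dandi interface):

```python
# Hypothetical sketch: enumerate Dandiset IDs by listing folders under the
# legacy Girder instance.  Collection name and layout are assumptions.
import girder_client

gc = girder_client.GirderClient(apiUrl="https://girder.dandiarchive.org/api/v1")

# Find the collection that holds one folder per dandiset ...
drafts = next(c for c in gc.listCollection() if c["name"] == "drafts")
# ... and take the folder names (e.g. "000027") as the dandiset identifiers.
dandiset_ids = sorted(
    f["name"] for f in gc.listFolder(drafts["_id"], parentFolderType="collection")
)
print(dandiset_ids)
```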
The asset for … could not be found in the asset store.
hm... for now please add an option for that -- for the "investigate metadata compliance" use case it is ok to miss a few files. BUT eventually we need to figure it out. (backup runs daily, so it might be that there were some changes to that dandiset today? will check later)
The script has finished running, taking 20 minutes and 22 seconds. Should I commit it to a top-level … ?
awesome! yes please -- commit it under …

Due to dandi/dandiarchive-legacy#491, though, we are lacking dandiset.yaml in each one of those. Could you please adjust the script to use …
Done. Pull request: #244
Script for "instantiating" Dandisets from asset store
To provide extensive testing for #226 on dandisets we already have in the archive, we would need to download them all, but that would be increasingly prohibitive.
On the drogon backup server we already have a datalad dataset with the backup of S3. The idea is to "instantiate" dandisets present in the archive as directories with symlinks (or actual files via `cp --reflink=always`, since it is a BTRFS CoW filesystem!) into some location on the drive, with those "symlinks" pointing into the asset store located under `/mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore`.

The catch is that the asset store uses its own UUIDs, which are not the ids of girder's "file" records. So we would need to either follow the redirect from dandiarchive's girder at `https://girder.dandiarchive.org/api/v1/file/{file['id']}/download` to get the actual asset id, i.e. the `girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc` path which is on drogon -- that would work, although somewhat inefficiently (we could cache those mappings, since they should not change) -- or load the entire table back from mongodb and get all the mappings up front (more work -- probably not).
So I think the course of action could be to

1. tune `GirderCli.get_dandiset_and_assets` so it would add resolved URLs like the above (`https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/740feade0d784acc...`) to the returned asset records;
2. add a `dandi instantiate --assetstore PATH -o TOPPATH DANDISET_ID` command (present only in `DANDI_DEVEL` mode) which would just go through all the assets of the dandiset and perform the aforementioned `cp -L --reflink=always {assetstore}/{path-within-assetstore-from-url}`.

I think it should work quite fast and be very efficient, since no heavy data transfer would happen and no new space would be consumed (besides filesystem-level metadata for the CoW-copied files).
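A rough sketch of what the instantiate loop could look like, assuming the asset records shown in the `dandi ls -r` output above and a resolver like the cached helper sketched earlier in the thread; names and structure here are illustrative, not the actual dandi CLI:

```python
# Hypothetical sketch of the instantiate step: for every asset of a dandiset,
# resolve its girder id to a file under the assetstore backup and make a
# copy-on-write copy into the target tree.  Asset record layout follows the
# `dandi ls -r` output shown above.
import subprocess
from pathlib import Path


def instantiate_dandiset(assets, resolve_backup_path, topdir: Path) -> None:
    for asset in assets:
        src = resolve_backup_path(asset["girder"]["id"])
        dest = topdir / asset["path"].lstrip("/")
        dest.parent.mkdir(parents=True, exist_ok=True)
        # -L follows symlinks at the source; --reflink=always asks BTRFS for a
        # CoW clone, so no data is duplicated and no extra space is consumed.
        subprocess.run(
            ["cp", "-L", "--reflink=always", str(src), str(dest)],
            check=True,
        )
```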