Script download and install of data sets #221
Comments
Should this be in the RPZ file or in a "wrapper" that indicates how to combine RPZ file and data files? The same RPZ file and associated data might live in different storages and each of them might want to link to their own copy of the data.
I don't have strong opinions about this, but here's one way you might think about it. Python's setuptools lets you specify optional "extra" dependencies through the extras_require argument:

    extras_require={
        'docs': ['numpydoc', 'sphinx!=1.3.1', 'sphinx_rtd_theme',
                 'matplotlib >= 1.5'],
        'numba': ['numba >= 0.25'],
        'display': ['matplotlib >= 1.5'],
    }

You then install them by naming the extra you want at install time. I'm not sure how this would play out in RPZ land. I could imagine a simple interface where you can install a bare-bones rpz using the normal install procedure, but if you want the heavy-weight optional dependencies (e.g., datasets hosted on S3 or something), you can install those with an extra flag like pip does. These would be treated as external dependencies, and not bundled within the rpz, so you'd have to have some part of the config that specifies how to collect the external dependencies.
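To make the extras idea a bit more concrete, here is a minimal Python sketch of how named groups of external resources could be selected at unpack time. Everything in it is an assumption for illustration: `EXTERNAL_RESOURCES`, `select_resources`, the group names, and the example.org URLs are hypothetical and are not part of ReproZip or of the RPZ format.

```python
# Hypothetical sketch: none of these names exist in ReproZip.
from typing import Dict, Iterable, List

# Each "extra" names a group of external resources that are NOT bundled
# in the .rpz file. The URLs below are placeholders.
EXTERNAL_RESOURCES: Dict[str, List[str]] = {
    "data": ["https://example.org/datasets/input.csv"],
    "models": ["https://example.org/models/weights.bin"],
}


def select_resources(
    extras: Iterable[str],
    available: Dict[str, List[str]] = EXTERNAL_RESOURCES,
) -> List[str]:
    """Return the locations to fetch for the requested extras.

    Order follows the order of the requested extras; duplicates are dropped.
    """
    selected: List[str] = []
    for extra in extras:
        if extra not in available:
            raise KeyError(f"unknown extra: {extra!r}")
        for location in available[extra]:
            if location not in selected:
                selected.append(location)
    return selected
```

With no extras requested, nothing is fetched, which matches the "bare-bones by default" behaviour described above.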
I actually have a The difference here is that setup.py lists dependencies by name and not location, and that is my issue here. Optionally the RPZ file could identify these missing input files by hash, but putting the location into the RPZ package (more or less meant to be immutable) can be discussed. @VickySteeves comments?
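As a rough illustration of identifying input files by hash rather than by location, the following Python sketch checks which expected inputs are absent or do not match their recorded digest. The choice of SHA-256 and the helper names `sha256_of` and `unsatisfied_inputs` are assumptions made for this example; the thread does not say which hash or layout ReproZip would use.

```python
# Hypothetical sketch: SHA-256 and these helper names are assumptions.
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data sets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unsatisfied_inputs(expected: Dict[str, str], root: Path) -> List[str]:
    """Return the names whose file is absent under root or has another hash.

    `expected` maps a relative file name to its expected SHA-256 hex digest.
    """
    unsatisfied = []
    for name, expected_hash in expected.items():
        candidate = root / name
        if not candidate.is_file() or sha256_of(candidate) != expected_hash:
            unsatisfied.append(name)
    return unsatisfied
```

Because only the digest is recorded, the package itself stays unchanged no matter where a given copy of the data lives.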
@bmcfee I like the idea of adding "external dependencies" to ReproZip, and this is particularly useful for big datasets that do not need to be packed. I think the specification of these datasets/resources could come in the RPZ file (whoever packed the application informs ReproZip how to obtain these datasets/resources), but it should also be possible for the user, while unpacking, to associate these resources with their own copy of the data. The unpackers could then automatically download the datasets while setting up the environment, if users choose to do so via a flag.
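One way to picture the "packer provides defaults, user may substitute their own copy" behaviour is a small merge step at unpack time, sketched below in Python. The function `resolve_locations` and the resource names are hypothetical; this is not an existing ReproZip API, only an illustration of the precedence rule.

```python
# Hypothetical sketch: resolve_locations is not part of ReproZip.
from typing import Dict, Optional


def resolve_locations(
    packaged: Dict[str, str],
    overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Merge the packer's default locations with the user's overrides.

    Overrides win for resources the package declares; an override for a
    resource the package does not declare is rejected as a likely typo.
    """
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(packaged))
    if unknown:
        raise ValueError(f"override for undeclared resources: {unknown}")
    return {name: overrides.get(name, default) for name, default in packaged.items()}
```

A user with a local mirror would pass something like `{"census": "/mnt/mirror/census.csv"}` as `overrides`, and every other resource would keep the location recorded by the packer.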
Also related: #220
Have you discussed the definition of remote file resources further? ResearchObjects do this, AFAIK, but I wonder if there is a useful shared "specification" here, e.g. a YAML file that tells a platform "I need access to these remote sources", and then the platform can say if it can manage that (in a performant way). IMO this would be useful for o2r's reproducibility service and Binder's, too. I've thought about this because with remote sensing data, you won't just put 1PB of data into a research compendium (see also this poster). If you're interested in a discussion, I can dig up more. IIRC this was also discussed in the Open Science Infrastructure Working Group calls.
It would be useful to have a way in the config file to automatically download data sets or other resources without including them in the rpz file.