Add gammapy download command #1369
This PR adds functionality to download the latest version of datasets and notebooks from
The functionality is added as a sub-command
```
$ gammapy extra
Downloading files  [####################################]  100%
The datasets and notebooks may be found in folder: /Users/jer/Desktop/extras
Process finished.
```

- In order to access the datasets from scripts you need to have the `GAMMAPY_EXTRA` shell environment variable set. You should have the following line in your `.bashrc` or `.profile` file:

```
export GAMMAPY_EXTRA=/Users/jer/Desktop/extras
```

```
$ du -hs extras/
141M    extras/
```
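A minimal sketch of what such a download sub-command could do internally (helper names are hypothetical; the actual PR uses Click and its own progress bar, which may differ):

```python
import os
import sys
from urllib.request import urlretrieve  # network call, only used in download_all()

def render_progress(done, total, width=36):
    """Render a simple text progress bar string, e.g. [####----]  50%."""
    frac = done / total if total else 1.0
    filled = int(width * frac)
    return "[{}{}] {:3.0f}%".format("#" * filled, "-" * (width - filled), frac * 100)

def download_all(files, outdir):
    """Download (url, relative_path) pairs into outdir, showing progress.
    Hypothetical sketch, not the PR's actual implementation."""
    for i, (url, relpath) in enumerate(files, start=1):
        dest = os.path.join(outdir, relpath)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        urlretrieve(url, dest)
        sys.stdout.write("\rDownloading files  " + render_progress(i, len(files)))
    print("\nThe datasets and notebooks may be found in folder: " + outdir)
    print("Process finished.")
```

The list of `(url, relative_path)` pairs would come from whatever file-listing mechanism the command settles on, which is the main point of discussion below.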
This PR is related to issue #1131
@Bultako - Thank you!
I think this is the way to go; offer functions or a CLI command to download example files, instead of telling all users as a first step to git clone gammapy-extra. I do have several questions about versioning, when to re-fetch, when to clobber, etc.; but in any case we're not going to sort out everything in a first PR, we have to put something in and then improve iteratively. Still, I'll make some comments now before we merge a first version.
@Bultako - If you haven't seen it - recently I added a first tutorial notebook where we don't bundle and access data from gammapy-extra, but just download directly:
IMO it would be better if you change the code here a bit to
A few points on the implementation: can you please use
The other suggestion I would make is to use
How important is the use of
Please rename the file to
I only had a brief look, but it seems that you have two functions that do two things and operate on a global list of filenames to store state. I think a class that stores this state and has two methods (besides
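The suggested refactoring could be sketched roughly like this (class and method names are hypothetical, not taken from the PR): a small class holds the file list as instance state instead of a module-level global, with one method to collect filenames and one to download them.

```python
class DownloadProcess:
    """Hypothetical sketch: keep the file list as instance state
    instead of a module-level global."""

    def __init__(self, repo_url, outdir):
        self.repo_url = repo_url
        self.outdir = outdir
        self.files = []  # state formerly kept in a global list

    def build_file_list(self, listing):
        """First task: collect the files to fetch. Here the listing is
        already in memory; the PR scans the GitHub API instead."""
        self.files = [entry["path"] for entry in listing if entry["type"] == "file"]

    def download_files(self):
        """Second task: fetch every collected file (network code elided)."""
        for path in self.files:
            url = self.repo_url + "/" + path
            # ... fetch url into self.outdir ...
```

This keeps the two tasks separate while making the shared state explicit and testable.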
May 16, 2018: changed the title from *Download data and notebooks with CLI gammapy extra* to *Download data and notebooks with CLI gammapy download*
May 17, 2018
As agreed, I will proceed by atomic PRs, converging toward a refined CLI that addresses all the needs in #1419. At this moment
```
$ time gammapy download notebooks
Downloading files  [####################################]  100%
The files have been downloaded in folder gammapy-extra
Process finished.

real    0m21.949s
user    0m2.001s
sys     0m0.449s

$ time gammapy download datasets
Downloading files  [####################################]  100%
The files have been downloaded in folder gammapy-extra
Process finished.

real    4m32.957s
user    0m6.904s
sys     0m1.922s
```
I have made the modifications related to the review of this PR.
@Bultako - Some comments inline.
Also: what is this JSON API call about? I think it's possible to hit URLs directly to fetch files from Github, no?
If it's really needed, please add a comment linking to a GitHub docs page that explains why hitting the API and getting JSON is done.
It has been used from the very beginning of this PR. See my first comment at the top of this page :) We will certainly need it later to retrieve different versions of notebooks, when an
Re: Where should datasets be copied?
We proposed in PIG 4 to use stable datasets and versioned tutorials. So I would go for the folder structure below, where versioned tutorials and stable datasets will co-exist, and use
```
tutorials
 |- v9.0
 |- v8.0
 |- eec5874
datasets
```
There are still many things to do in order to converge on PIG 4.
I still don't understand why we need to hit the API and get a JSON response.
We have to figure out the version tag or hash anyways, and once we know it we can just retrieve those files by hitting URLs like this on Github:
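For illustration (the file path below is hypothetical), once the tag or hash is known, a file can be fetched without any API call by building a direct raw URL, which follows GitHub's fixed `raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}` scheme:

```python
def raw_url(owner, repo, ref, path):
    """Build a direct download URL for a file on GitHub.
    `ref` is a branch name, tag, or commit hash."""
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        owner, repo, ref, path
    )

# e.g. a notebook pinned to a specific commit (path is illustrative):
url = raw_url("gammapy", "gammapy-extra", "eec5874", "notebooks/first_steps.ipynb")
```

The catch, as the discussion below works out, is that this requires knowing the file paths in advance; it does not list a folder's contents.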
This seems like a good idea to me.
Maybe change from
We could add as many git tags in gammapy-extra as we want.
I think it might be useful to have one level of indirection and something like e.g.
We can assume that gammapy.org will always be there and that we can control index files served from there. At the moment it's using https://github.com/gammapy/gammapy-webpage and we can just use Github to add / change anything there.
Well, it also serves to retrieve the subfolders and files inside a given folder of the repo. If we find a way to scan content inside a folder of
Moreover, this is also mentioned in the PIG.
Ah, OK, now I get it: you're using the GitHub API to list directory content.
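For reference, the call in question is GitHub's repository-contents endpoint (`GET https://api.github.com/repos/{owner}/{repo}/contents/{path}`), which returns a JSON array with one entry per item in the folder. A sketch of walking such a response (the sample payload below is illustrative, not a captured API response):

```python
import json

# Illustrative payload shaped like a GitHub contents-API response:
SAMPLE = json.loads("""[
  {"name": "first_steps.ipynb", "path": "notebooks/first_steps.ipynb",
   "type": "file",
   "download_url": "https://raw.githubusercontent.com/gammapy/gammapy-extra/master/notebooks/first_steps.ipynb"},
  {"name": "images", "path": "notebooks/images", "type": "dir",
   "download_url": null}
]""")

def split_listing(listing):
    """Separate a contents-API listing into file URLs to fetch and
    sub-folder paths that need a further API call (recursion)."""
    files = [e["download_url"] for e in listing if e["type"] == "file"]
    subdirs = [e["path"] for e in listing if e["type"] == "dir"]
    return files, subdirs
```

This is what makes the API attractive here: the repo itself is the single source of truth for what exists in a folder.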
I thought we would have explicit listings like https://astropy.stsci.edu/data/file_index.json or https://github.com/gammapy/gammapy-extra/blob/master/notebooks/notebooks.yaml and thus would not have to hit the GitHub API. This would also allow complete control over what users see / get via
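The index-file alternative would look roughly like this (the index layout and keys are hypothetical, loosely modelled on the `notebooks.yaml` idea; JSON is used here just to avoid a YAML dependency):

```python
import json

# Hypothetical static index, served e.g. from gammapy.org, listing
# exactly the files users should get for a given version:
INDEX = json.loads("""{
  "version": "v0.8",
  "files": [
    {"path": "notebooks/first_steps.ipynb",
     "url": "https://raw.githubusercontent.com/gammapy/gammapy-extra/v0.8/notebooks/first_steps.ipynb"}
  ]
}""")

def urls_for(index):
    """Return the download URLs listed in the index. No GitHub API call
    is needed, and maintainers fully control what is listed."""
    return [entry["url"] for entry in index["files"]]
```

The trade-off, raised below, is that such a file can drift out of sync with the actual repo content, producing 404s.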
@Bultako - Clearly I didn't read the latest version of the PIG, and unfortunately I don't have time.
I think the index file would be the better direction to go in, but I also suggested that for DataStore, and people didn't like the index files so much; so whatever you do here is OK with me.
Just to mention one other example: conda also uses JSON index files and static web servers to deliver versioned packages; they have a spec at https://conda.io/docs/user-guide/tasks/build-packages/package-spec.html Obviously we don't need a complex spec, but at some point within the next year we should stabilise the versioning / download scheme, so that older versions of Gammapy will still work even a year or two after they are released. If we find serious limitations we can always develop a new index file format and access that for newer Gammapy versions, keeping the old one on gammapy.org unchanged. I just don't think we should assume GitHub Pages exists forever.
But completely OK to merge this any time, small PRs are good!
I have decided to continue using the GitHub API to scan and fetch the list of files in
The main reason is the high risk that content in the GitHub repos differs from the content listed in external content-listing files, leading to 404 errors in the download process. I have experienced this problem when building a content-listing file for the
The content in folders
Later on, when the content in these folders gets more stable, we could think about using content-listing files and getting rid of the GitHub API.
I'm merging this PR now before extending
Sep 3, 2018
Works for me, thanks!
Minor comment: the CLI description isn't quite accurate. I suggest putting "Download datasets and notebooks" on the first line so that it fits in the summary, and then adjusting the rest of the description to be more accurate (it currently says things will be put in the gammapy-extra folder).