New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gammapy download command #1369

Merged
merged 5 commits into from Sep 3, 2018

Conversation

3 participants
@Bultako
Member

Bultako commented Apr 10, 2018

This PR adds a functionality to download last version of datasets and notebooks from gammapy-extra GitHub repo. The user does not need to have git installed, it makes use of GitHub REST API and request standard Python library to retrieve the files.

The functionality is added as a sub-command extra of gammapy CLI.

$ gammapy extra
Downloading files  [####################################]  100%             
The datasets and notebooks may be found in folder: /Users/jer/Desktop/extras
Process finished.
-
In order to access datasets from scripts you need to have GAMMAPY_EXTRA shell environment variable set.
You should have the following line in you .bashrc or .profile files:
export GAMMAPY_EXTRA=/Users/jer/Desktop/extras

$ du -hs extras/
141M	extras/

This PR is related with issue #1131

@cdeil cdeil self-assigned this Apr 12, 2018

@cdeil cdeil added the feature label Apr 12, 2018

@cdeil cdeil added this to the 0.8 milestone Apr 12, 2018

@cdeil

This comment has been minimized.

Member

cdeil commented Apr 12, 2018

@Bultako - Thank you!

I think this is the way to go; offer functions or a CLI command to download example files, instead of telling all users as first step to git clone gammapy-extra. I do have several questions about versioning or when to re-fetch or clobber etc; but in any case we're not going to sort out everything in a first PR, but have to put in something and then improve iteratively. But still I'll make some comments now before we merge a first version.

@Bultako - If you haven't seen it - recently I added a first tutorial notebook where we don't bundle and access data from gammapy-extra, but just download directly:
http://docs.gammapy.org/dev/notebooks/hgps.html#Download-Data
Ideally later the code there to download would be replaced by a call to a utility function or CLI call into Gammapy

IMO it would be better in you change the code here a bit to gammapy download and even if for now only files from gammapy-extra are fetched; that eventually will be just one of the things that are pre-scripted to fetch. HGPS would be another example; Gamma-Cat yet another that we currently already use in the Gammapy docs.

A few points on the implementation: can you please use from gammapy.extern.pathlib import Path and use Path instead of the os.path functions throughout? Of course it's a matter of taste, but I personally prefer it and we use it consistently in Gammapy, so shouldn't start to use os.path now here.

The other suggestion I would make is to use from gammapy.extern.six.moves.urllib.request import urlretrieve instead of the chunk-based code you use now for download. I.e. prefer code simplicity over efficiency.

How important is the use of requests? If it's easy to do with urllib, please do. If it's annoying, then please add a mention of requests as a new optional dependency in setup.py and environment-dev.yml (even if it is already installed as a dependency; let's be explicit).

@cdeil

This comment has been minimized.

Member

cdeil commented Apr 12, 2018

Please rename the file to download.py.

I only had a brief look, but it seems that you have two functions that do two things and operate on a global list of filenames to store something. I think a class that stores this state and has two methods (besides __init__) might be a little bit easier to read / maintain, especially if download.py grows to a few 100 lines over time to also download a few other things, or do local checks what files are already available or things like that.

@Bultako Bultako force-pushed the Bultako:datasets branch from 5cfcfb7 to 5b8e2b8 Apr 16, 2018

@cdeil

This comment has been minimized.

Member

cdeil commented May 3, 2018

@Bultako - I plan to release Gammapy v0.8 on Monday. Do you have time for this or should we move the milestone to v0.9?

@Bultako

This comment has been minimized.

Member

Bultako commented May 3, 2018

@cdeil
I prefer to re-design a little bit the thing and refactor functionalities (CLI args) and code in order to get a more versatile CLI that would download any dataset/collection/directory from whatever public GitHub repo. Monday is a bit short if we consider the review process..
So it's better to leave it for 0.9

@cdeil cdeil modified the milestones: 0.8, 0.9 May 3, 2018

@Bultako Bultako changed the title from Download data and notebooks with CLI gammapy extra to Download data and notebooks with CLI gammapy download May 17, 2018

@Bultako Bultako force-pushed the Bultako:datasets branch from 5b8e2b8 to d78bcaf Aug 3, 2018

@Bultako Bultako force-pushed the Bultako:datasets branch from d78bcaf to 4d036f3 Aug 15, 2018

Bultako added a commit to Bultako/gammapy that referenced this pull request Aug 18, 2018

@Bultako

This comment has been minimized.

Member

Bultako commented Aug 18, 2018

As agreed, I will proceed by atomic PRs, converging to a refined CLI addressing all needs in #1419. At this moment gammapy download may be seen (mainly for those not using git) as a mean to gather the folders notebooks or/and datasets from the gammapy-extra Github repo. It puts the content in a local $CWD/gammapy-extra folder.

$ time gammapy download notebooks
Downloading files  [####################################]  100%             
The files have been downloaded in folder gammapy-extra
Process finished.

real	0m21.949s
user	0m2.001s
sys	0m0.449s

$ time gammapy download datasets
Downloading files  [####################################]  100%             
The files have been downloaded in folder gammapy-extra
Process finished.

real	4m32.957s
user	0m6.904s
sys	0m1.922s

I have made the modifs related with the review of this PR.

  • change script name to download.py
  • change command name to gammapy download
  • add subcommands gammapy download notebooks and gammapy download datasets
  • use from gammapy.extern.pathlib import Path
  • use from gammapy.extern.six.moves.urllib.request import urlretrieve
  • refactor code into a class managing the process of downloading

@Bultako Bultako force-pushed the Bultako:datasets branch from 47e2dd6 to 973972a Aug 18, 2018

Bultako added a commit to Bultako/gammapy that referenced this pull request Aug 18, 2018

@Bultako Bultako force-pushed the Bultako:datasets branch from 973972a to d4115df Aug 18, 2018

Bultako added a commit to Bultako/gammapy that referenced this pull request Aug 18, 2018

@Bultako Bultako closed this Aug 18, 2018

@Bultako Bultako deleted the Bultako:datasets branch Aug 18, 2018

@Bultako Bultako restored the Bultako:datasets branch Aug 18, 2018

@Bultako Bultako reopened this Aug 18, 2018

@Bultako Bultako force-pushed the Bultako:datasets branch from d4115df to 195bdb3 Aug 18, 2018

Bultako added a commit to Bultako/gammapy that referenced this pull request Aug 18, 2018

@Bultako Bultako force-pushed the Bultako:datasets branch from 195bdb3 to 43bac92 Aug 20, 2018

Bultako added a commit to Bultako/gammapy that referenced this pull request Aug 20, 2018

@cdeil

This comment has been minimized.

Member

cdeil commented Aug 20, 2018

@Bultako - I'll not be available much in the near future. Please always assign and ping @adonath for review or comments or if you have any questions.

Looking forward to trying this gammapy download ...

Bultako added a commit to Bultako/gammapy that referenced this pull request Aug 20, 2018

@cdeil

@Bultako - Some comments inline.

Also: what is this JSON API call about? I think it's possible to hit URLs directly to fetch files from Github, no?
If it's really needed, please add a comment to some Github docs page that explains why hitting the API and getting JSON is done.

apigitUrl = 'https://api.github.com/repos/gammapy/gammapy-extra/git/trees/master:'
rawgitUrl = 'https://raw.githubusercontent.com/gammapy/gammapy-extra/master/'
localfolder = Path('./gammapy-extra')

This comment has been minimized.

@cdeil

cdeil Aug 22, 2018

Member

Expose this folder where files are put as a command line option (either on the gammapy download click group, or on the notebooks and datasets click commands?

I'm not a fan of the default name gammapy-extra. We will fetch notebooks and data from other places, i.e. get away from gammapy-extra a bit, no?

Not sure what to suggest here: there needs to be a decision whether notebooks and data have to be next to each other or can be separate. If they have to be in the same folder, then we could use ../datasets/something from the tutorial notebooks, if it can be separate we'd use an env var to access data files, probably called GAMMAPY_DATA.

import click
import sys
import requests

This comment has been minimized.

@cdeil

cdeil Aug 22, 2018

Member

How useful is requests here?

It's currently not used in Gammapy or installed as a dependency.
I.e. this is blocking this PR from going in:
https://ci.appveyor.com/project/cdeil/gammapy/build/1.0.4278/job/2fev6npu271cehix#L1360

If you can rewrite this to just use urllib without too much trouble, even if a little painful, please do.
Alternatively, requests must be added on the CI config files and setup.py and install docs as a required dependency.

downloadproc.go()
def create_folder(folder):

This comment has been minimized.

@cdeil

cdeil Aug 22, 2018

Member

I think this whole create_folder can be deleted, and you just put folder.mkdir(parents=True, exist_ok=True) and let it do whatever it does?

@cdeil cdeil assigned cdeil and unassigned adonath Aug 22, 2018

@Bultako Bultako force-pushed the Bultako:datasets branch from d417529 to 167f9a8 Aug 23, 2018

@Bultako Bultako force-pushed the Bultako:datasets branch from 9014a3d to 6648927 Aug 23, 2018

@Bultako

This comment has been minimized.

Member

Bultako commented Aug 23, 2018

@cdeil

What is this JSON API call about?

It has been used from the very beginning in this PR. See my first comment on this PR at the top of this page :) We will certainly need it later to retrieve different versions of notebooks, when an version or hash param will be implemented as command line option. I've added a comment at the top of download.py script that references the API. I could add more info to the developers docs later as we continue with PIG 4. #1419

Re: Where datasets should be copied ?

We proposed in PIG 4 to use stable datasets and versioned tutorials. So I would go for the tree-folder structure below where versioned tutorials and stable datasets will co-exist, and use ../datasets/something in the tutorials.

- tutorials |
            |- v9.0
            |- v8.0
            |- eec5874
            |- datasets

There are still many things to do in order to converge into PIG 4.
We can proceed by atomic PRs or continue within this PR.
It's up to you when to cut :)

@cdeil

This comment has been minimized.

Member

cdeil commented Aug 23, 2018

I still don't understand why we need to hit the API and get a JSON response.

We have to figure out the version tag or hash anyways, and once we know it we can just retrieve those files by hitting URLs like this on Github:

curl -O https://raw.githubusercontent.com/gammapy/gammapy-extra/e1558d74c5ccbfd18015fdf7608873bf11fd75bb/notebooks/hgps.ipynb

I would go for the tree-folder structure

This seems like a good idea to me.

Maybe change from tutorials to gammapy-tutorials or something with gammapy in the name?
Makes it easier for people to find / recognise what it is, no?

@cdeil

This comment has been minimized.

Member

cdeil commented Aug 23, 2018

We could add as many git tags in gammapy-extra as we want.
We could also have a JSON or YAML config file that matches stable versions to git tags somewhere if that's easier to maintain.

I think it might be useful to have one level of indirection and something like e.g. http://gammapy.org/data/download.json that gammapy download uses, so that we can update / redirect even for older versions of Gammapy, in case there were mistakes or URLs changed.

We can assume that gammapy.org will always be there and that we can control index files served from there. At the moment it's using https://github.com/gammapy/gammapy-webpage and we can just use Github to add / change anything there.

@Bultako

This comment has been minimized.

Member

Bultako commented Aug 23, 2018

I still don't understand why we need to hit the API and get a JSON response.

Well, it also serves to retrieve of the subfolders and files inside from a given folder of the repo. If we find a way to scan content inside a folder of https://raw.githubusercontent.com we could certainly get rid of the GH API for now.

Re: http://gammapy.org/data/download.json

👍

Moreover, this is also mentioned in the PIG.
I've thought to implement it later, but I could it as the next step.

@cdeil

This comment has been minimized.

Member

cdeil commented Aug 23, 2018

Ah, OK, now I get it: you using the Github API to list directory content.

I thought we would have explicit listings like https://astropy.stsci.edu/data/file_index.json or https://github.com/gammapy/gammapy-extra/blob/master/notebooks/notebooks.yaml and thus not have to hit the Github API. This would also allow complete control over what users see / get via gammapy/download, e.g. we could hide that way broken notebooks, or have extra files in folders that aren't downloaded.

@Bultako - Clearly I didn't read the latest version of the PIG, and unfortunately I don't have time.
Could you please proceed at your own pace and merge yourself as long as the PRs don't cause CI to break (more than it already is)? @adonath or I will try to find time next week to try it out and give some feedback.

I think the index file would be the better direction to go in, but I also thought that for DataStore and people didn't like the index files so much, so whatever you do here is OK with me.

@cdeil

This comment has been minimized.

Member

cdeil commented Aug 23, 2018

Just to mention one other example: conda is also using JSON index files and static web servers to deliver versioned packages; they have a spec at https://conda.io/docs/user-guide/tasks/build-packages/package-spec.html Obviously we don't need a complex spec, but at some point within the next year we should stabilise on the versioning / download scheme so that using older versions of Gammapy will still work even a year or two after they are released. If we find serious limitations we can always develop a new index file format and access that for newer Gammapy versions, keeping the old one on gammapy.org unchanged. I just don't think we should assume Github pages exists forever.

But completely OK to merge this any time, small PRs are good!

@Bultako Bultako force-pushed the Bultako:datasets branch from 6648927 to 6cd5fb0 Aug 23, 2018

@Bultako

This comment has been minimized.

Member

Bultako commented Aug 23, 2018

I've found it's really easy to export the content of a folder recursively into a json file with tree -J folder, so in next commit before merging I will directly hit the generated json file and get rid of the GH API.

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 3, 2018

I have decided to continue using Github API to scan and fetch the list of files in datasets and notebooks folders, and not to use content-listing files.

The main reason is the high risk to get content in Github repos different from content listed in external content-listing files, and hence 404 errors in the download process. I have experienced this problem when building a content-listing file for the notebooks folder, which turns out to be different from the real content in the Github repo because of the declarations found in .gitignore file.

The content in folders datasets and notebooks changes very often at this moment, which means that we would need to maintain the content-listing files updated within every git commit (gammapy download will accept hash/tag as param to download a specific version of the Github repo)

Later on, when content gets more stable in these folders we could think on using content-listing files and get rid of Github API.

I'm merging this PR now before extending gammapy download with more functionalities.

@Bultako Bultako merged commit 06fded3 into gammapy:master Sep 3, 2018

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details

Documentation automation moved this from To Do to Done Sep 3, 2018

@Bultako Bultako deleted the Bultako:datasets branch Sep 3, 2018

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 3, 2018

🎉

Works for me, thanks!

Minor comment: The CLI description isn't quite accurate. Suggest to put "Download datasets and notebooks" on the first line so that it fits in the summary, and then to adjust the rest of the description to be more accurate (currently says things will be put in gammapy-extra folder).

(gammapy-dev) hfm-1804a:gammapy deil$ gammapy
Usage: gammapy [OPTIONS] COMMAND [ARGS]...

  Gammapy command line interface (CLI).

  Gammapy is a Python package for gamma-ray astronomy.

  Use ``--help`` to see available sub-commands, as well as the available
  arguments and options for each sub-command.

  For further information, see http://gammapy.org/ and
  http://docs.gammapy.org/

  Examples
  --------

  $ gammapy --help
  $ gammapy --version
  $ gammapy info --help
  $ gammapy info

Options:
  --log-level [debug|info|warning|error]
                                  Logging verbosity level
  --ignore-warnings               Ignore warnings?
  --version                       Print version and exit
  -h, --help                      Show this message and exit.

Commands:
  check     Run checks for Gammapy
  download  Download datasets and notebooks from the...
  image     Analysis - 2D images
  info      Display information about Gammapy
(gammapy-dev) hfm-1804a:gammapy deil$ gammapy download --help
Usage: gammapy download [OPTIONS] COMMAND [ARGS]...

  Download datasets and notebooks from the gammapy-extra Github repository
  into a folder gammapy-extra created in the current working directory.

Options:
  --folder TEXT  Folder where the files will be copied.
  -h, --help     Show this message and exit.

Commands:
  datasets   Download datasets
  notebooks  Download notebooks
@Bultako

This comment has been minimized.

Member

Bultako commented Sep 3, 2018

Done in a6038fb

@Bultako Bultako referenced this pull request Sep 5, 2018

Merged

Improve gammapy download #1763

@cdeil cdeil changed the title from Download data and notebooks with CLI gammapy download to Add gammapy download command Nov 23, 2018

@cdeil cdeil modified the milestones: 0.9, 0.8 Nov 23, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment