Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve gammapy download #1988

Merged
merged 13 commits into from Jan 25, 2019
Merged

Improve gammapy download #1988

merged 13 commits into from Jan 25, 2019

Conversation

@Bultako
Copy link
Member

@Bultako Bultako commented Jan 18, 2019

This PR contains several improvements and new functionalities for gammapy download.

  • Code base refactoring
  • Tries to solve race condition issues in parallell downloading processes
  • Adds MD5 hash checking to skip fetching previously dowloaded datasets
  • Adds functionality to download example python scripts
  • Adds functionality to download versioned datasets

It addresses issues and comments #1880, #1836 (comment), e41add2#r30632763, #172, #1994 (comment)

@cdeil cdeil added this to the 0.10 milestone Jan 18, 2019
@cdeil cdeil self-assigned this Jan 18, 2019
@cdeil
Copy link
Member

@cdeil cdeil commented Jan 18, 2019

I see that you include code to download the Gammapy conda env file via gammapy download.
Maybe we should remove that (the ComputePlan.getenvironment method)?

The current instructions (https://docs.gammapy.org/0.9/getting-started.html) are to first

curl -O https://gammapy.org/download/install/gammapy-0.9-environment.yml
conda env create -f gammapy-0.9-environment.yml
conda activate gammapy-0.9

and only then:

gammapy download tutorials
cd gammapy-tutorials
export GAMMAPY_DATA=$PWD/datasets

So downloading the gammapy-0.9-environment.yml again in gammapy download tutorials and printing to the console (in ParallelDownload.show_info) that the user should create and activate the env again isn't needed and will probably confuse users.

I can see the appeal of putting a copy of gammapy-0.9-environment.yml in the gammapy-tutorials repo, in case someone wants to recreate the env later. But overall I feel it might be best to remove the handling of gammapy-0.9-environment.yml completely from gammapy download.

@Bultako - What do you think?

@Bultako
Copy link
Member Author

@Bultako Bultako commented Jan 21, 2019

@cdeil

Note that the environment YAML file is fetched only when the user asks for the notebooks or scripts linked to a specific stable release using the release flag. In the case you describe, the environment file will not be downloaded because the release flag is not present at any moment in the instructions we provide. The information displayed in the terminal after execution of gammapy download tutorials is be the following:

*** You might want to declare GAMMAPY_DATA env variable
export GAMMAPY_DATA=/Users/jer/Desktop/gammapy-tutorials/datasets

So, I'm +1 to continue downloading the environment YAML file only when a particular release for the tutorials is asked on purpose by the user, so he/she can reproduce the versioned tutorials in a specific conda virtual env. Does this make sense?

@Bultako Bultako force-pushed the Bultako:gp-download branch from e7f67bc to 8838be3 Jan 21, 2019
@Bultako Bultako force-pushed the Bultako:gp-download branch from 55b0317 to c4a3456 Jan 21, 2019
@cdeil
Copy link
Member

@cdeil cdeil commented Jan 22, 2019

So, I'm +1 to continue downloading the environment YAML file only when a particular release for the tutorials is asked on purpose by the user, so he/she can reproduce the versioned tutorials in a specific conda virtual env. Does this make sense?

Yes, that does make sense, we can keep as-is if you like it.

I'm about 50:50 on this, but would probably remove it, because it's not really needed, and not offering it means less and simpler code for gammapy download, and also fewer options / documentation for the end user. Note that at the moment it's a hidden feature, not advertised to users. Probably we should add a section "How to Upgrade" at https://docs.gammapy.org/0.9/getting-started.html .
IMO the recommendation should just be to execute the same commands for the different version the users wants, we shouldn't explain the download --release option, it will just confuse users which way to use?

@Bultako - Should I do a final review here, or are you still working on this branch?

@Bultako
Copy link
Member Author

@Bultako Bultako commented Jan 22, 2019

I think version management in general has never been easy for most users, worst for newcomers. gammapy download is not the exception, it is complex in its design, it offers quite a collection of features (dealing with versioning in different kinds of material is only one), trying to be highly flexible in the combination of options from the user interface. This complexity in the design reflects on the code, I agree it's not easy to maintain right now. And surely we are lacking of documentation about all this.

But I think version management (the --release option) is the feature I would keep the most (you can see that I have added version management for the datasets..) There are other features i.e. fetching specific files, only notebooks or scripts or their combination with the datasets needed, that could be simplified and I'm +1 to do it. (maybe not in this PR?)

@Bultako - Should I do a final review here, or are you still working on this branch?

You can proceed with the review.

Copy link
Member

@cdeil cdeil left a comment

I left some inline comments.

A lot of the code complexity (lots of if / else) comes from having the different download tasks (notebooks, data, scripts) handled by the same class ComputePlan. I think having those as separate classes would simplify things, but I understand there's currently a bit of coupling, so probably not easily possible.

Otherwise most of the code complexity (lots of if / else) comes from the release option, but again, it's probably not easy to make it better.

Is it possible to improve the test coverage for this code a bit?
E.g. after we drop Python 2, we could try using the little nicer executor for concurrent download, and ideally doing such code changes and refactorings should be possible without fear, i.e. if tests pass, the new implementation is OK.

if release:
plan.getenvironment()
fl = plan.getfilelist()
if fl:

This comment has been minimized.

@cdeil

cdeil Jan 22, 2019
Member

Are all these if fl needed? Maybe it's possible to just make ParallelDownload work also for an empty list (and be a no-op, or at most create folders) and then the if clauses can be removed?

downdatasets.setup()
downdatasets.files()
downdatasets.run()
plan = ComputePlan(src, out, release, "notebooks")

This comment has been minimized.

@cdeil

cdeil Jan 22, 2019
Member

Did you consider calling the other cli functions here, instead of duplicating the code?
Like in this example.
(sorry if we already discussed it and I missed your reply)

with open(path, "rb") as f:
for byte_block in iter(lambda: f.read(4096), b""):
md5_hash.update(byte_block)
md5 = md5_hash.hexdigest()

This comment has been minimized.

@cdeil

cdeil Jan 22, 2019
Member

We don't have large files. Suggest to use Path.read_bytes like here to hash them. Example.

a `gammapy-notebooks` folder created at the current working directory.
"""Download notebooks, scripts anda datasets.
Download notebooks published as tutorials, example python scripts and the

This comment has been minimized.

@cdeil

cdeil Jan 22, 2019
Member

Maybe use a bullet list for the description of the options? IMO it makes text more readable.
Note that with click you need to use \b to prevent re-wrapping:
https://click.palletsprojects.com/en/7.x/documentation/#preventing-rewrapping

created at the current working directory. The option `notebooks` will
download by default the jupyter notebook files used in the tutorilas into
a `gammapy-notebooks` folder created at the current working directory.
"""Download notebooks, scripts anda datasets.

This comment has been minimized.

@cdeil

cdeil Jan 22, 2019
Member

Typo: anda -> and



def parse_imagefiles(notebookslist):
imagefiles = {}

This comment has been minimized.

@cdeil

cdeil Jan 22, 2019
Member

I think using generators ("yield") in the parse_ functions would save a few lines and be a bit simpler.
Note that if you need lists or dicts instead of generators, you can always call list or dict on the return value from the generator function. It's not about efficiency in this case, but (minor) simpler / shorter code in the function making the list.

label = "im: " + im
path = "images/" + im + ".png"
url = str(Path(record["url"]).parent)
url = url.replace(":/", "://")

This comment has been minimized.

@cdeil

cdeil Jan 22, 2019
Member

Why url.replace(":/", "://")? Is this a data entry bug in the input file? Can it be fixed there and this line removed?

@cdeil cdeil changed the title Improvements and new additions for gammapy download Improve gammapy download Jan 23, 2019
@Bultako
Copy link
Member Author

@Bultako Bultako commented Jan 24, 2019

@cdeil
I have addressed all your comments in the review.
I have also made a small refactoring of the code to get something simpler, but I don't think I have really achieved this goal. If you can have a quick look now, with the distance that I do not have, that would be nice.

There are still the test_*.py scripts to make so to improve the coverage and other thing to do could be to declare which scripts to deliver with gammapy download scripts. At this moment the index file for the scripts is really short.
https://github.com/gammapy/gammapy/blob/master/examples/scripts.yaml
Maybe we could disable this feature?

@Bultako Bultako force-pushed the Bultako:gp-download branch from 6a497fe to 0a98042 Jan 25, 2019
@cdeil
Copy link
Member

@cdeil cdeil commented Jan 25, 2019

Maybe we could disable this feature?

I would just leave it activated, and then in the coming weeks add a useful script or two.

Everything in Gammapy is work in progress, and I would always just default to working towards good v1.0 trying to make progress as fast as possible, instead of spending effort to ship "clean" v0.10 or other intermediate releases.

@Bultako Bultako force-pushed the Bultako:gp-download branch from 08718f4 to 19621ab Jan 25, 2019
@Bultako Bultako force-pushed the Bultako:gp-download branch from 19621ab to e72b864 Jan 25, 2019
@Bultako Bultako merged commit 7ba40d7 into gammapy:master Jan 25, 2019
3 of 4 checks passed
3 of 4 checks passed
Codacy/PR Quality Review Not up to standards. This pull request quality could be better.
Details
Scrutinizer Analysis: 5 new issues, 18 updated code elements – Tests: passed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
gammapy.gammapy Build #20190125.9 succeeded
Details
@Bultako Bultako deleted the Bultako:gp-download branch Jan 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants