Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PIG 4 - Setup for tutorial notebooks and data #1419

Merged
merged 6 commits into from Oct 4, 2018
Merged

Conversation

@cdeil
Copy link
Member

@cdeil cdeil commented May 16, 2018

This PIG proposes changes to our setup for tutorial notebooks and example datasets.

It is the result of @Bultako and me discussing this yesterday, as well as many previous discussions, at least these: #1237, #1369, #700, #431, #405, #228

@Bultako @registerrier @joleroi @robertazanin @kosack - If you have time, please review this short proposal and leave a comment with your thoughts.

@cdeil cdeil added this to the 0.9 milestone May 16, 2018
@cdeil cdeil added this to To Do in DOCUMENTATION via automation May 16, 2018
@joleroi
Copy link
Contributor

@joleroi joleroi commented May 16, 2018

Thanks a lot for tackling this long-standing issue. Here me first thoughts after reading the 🐷

👍 to

  • move notebooks to gammapy/notebooks
  • not store the output cells (especially because it will enable useful diffs)
  • use stable datasets in notebooks

About the shipping of notebooks and data files.

  • Naively it seems easier to me to ship the notebooks with gammapy. That way you never have to think about versioning.
  • It's ok for me to download the data with gammapy download. Versioning shouldn't be an issue if - as you propose - the datasets never change. The download of the data could also happen inside the notebook, but I don't know how this works for the documentation (rendered views of the notebooks)

Loading

@cdeil
Copy link
Member Author

@cdeil cdeil commented May 16, 2018

@joleroi - Thanks!

Naively it seems easier to me to ship the notebooks with gammapy. That way you never have to think about versioning.

If you ship notebooks with the code, they will land in site-packages. Most users don't know what site-packages is or have to care and certainly shouldn't run the notebooks from site-packages. One could try to put a copy into $HOME on install, but that's difficult, I don't think we should (I can point out problems if someone is proposing this).

So overall shipping the notebooks with the code would mean that gammapy download (or whatever the command is called) makes a copy from site-packages to $HOME instead of reaching to the internet.

It is not my preference, but certainly a valid option, that should be better described in the PIG.

During lunch @Bultako and I discussed another question: we think probably it would be nicer for users if gammapy download gives rendered notebooks, with the output filled. One option to achieve this would be to have rendered notebooks in gammapy/notebooks (accepting the size increase over time in the code repo, which is not terrible), or to download a rendered version, e.g. from docs.gammapy.org.

I will mostly be offline for the next 2.5 weeks and don't have time this week to continue working on this PIG. So @Bultako - either you update the PIG at some point to reflect discussion / make clarifications, or we just collect comments here and then re-visit this in June.

Loading

datasets. If an issue is found or something is improved, a separate new dataset
should be added, instead of changing the existing one. So versioning of example
datasets is not needed.

Copy link
Contributor

@registerrier registerrier May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that only datasets at DL3 can be stored. Reduced datasets such as spectra and reduced IRFs have to be versioned, since they were produced by a version of gammapy.
This raises some questions about how to organize notebooks and tests.

Loading

Copy link
Contributor

@registerrier registerrier May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more explicit, we should get rid of $GAMMAPY_EXTRA/datasets/hess-crab4_pha/ for instance.
And tests should first create spectral files that will be used later by others tests.

Loading

Copy link
Member Author

@cdeil cdeil May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be higher-level datasets that are stable, not just DL3.
E.g. with joint-crab we will have published and archived OGIP spectra.
From Fermi-LAT or HGPS or gamma-cat for a given version we have catalogs.
These files can be used from tutorials.

Loading

output and try that workflow for a while, see if we like it. Since notebooks
aren't very large, and updates are not needed very often, we think changing to
notebooks with output would also be acceptable in the ``gammapy`` repository.

Copy link
Contributor

@registerrier registerrier May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably more natural to look at notebooks from the docs. It therefore seems reasonable not to commit the output.

Is there a way to prevent someone from committing notebooks with output cells?

Loading

Copy link
Member Author

@cdeil cdeil May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to prevent someone from committing notebooks with output cells?

Developer documentation and code review. Let's not put effort into more automation like git commit hooks or bots that review PRs. Those things we might have in the future if we get an engineer at some point, but for now it's overkill.

Loading

couple the version with ``gammapy`` somehow, e.g. via a git submodule pointer,
or via a config file in one of the repos or on gammapy.org with the version to
be used. The mono repo approach seems simpler and better.

Copy link
Contributor

@registerrier registerrier May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for notebooks in gammapy

Loading

versions will no longer work. For this specific issue, a level of indirection
like a ``gammapy.org/config.json`` file might help. As you can see, this part of
the proposal is just a draft, feedback welcome!

Copy link
Contributor

@registerrier registerrier May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I see the point.
If notebooks are moved to gammapy to avoid versioning issues, it would be counter-productive to rely on an additional gammapy downloadcommand. The later could be a practical way to retrieve gammapy-extra though.

Loading

Copy link
Member Author

@cdeil cdeil May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that gammapy.version always knows it's version. So with one line you can make the URL to fetch the right file from Github or docs.gammapy.org. So introducing gammapy download does resolve the version coupling issue.

As @joleroi said, there is the option to ship the notebooks with Gammapy. But then probably there you want the rendered version with output filled for users. So when would that rendering happen? IMO it's easier if it happens as part of the docs build and then notebooks are fetched from docs.gammapy.org. But it's an open question, more comments on this welcome.

Loading

@cdeil
Copy link
Member Author

@cdeil cdeil commented May 16, 2018

@registerrier - Thanks for your inline comments. I left quick replies.

Note that Github by default hides inline comments if the corresponding lines are edited. I.e. for PIG discussions it might be better to leave non-inline comments to have the discussion better visible later.

Loading

@Bultako
Copy link
Member

@Bultako Bultako commented May 16, 2018

So @Bultako - either you update the PIG at some point to reflect discussion / make clarifications, or we just collect comments here and then re-visit this in June.

@cdeil
Let's gather all comments and discussion here, and proceed with the final PIG update when you'll be back.

Loading

@Bultako
Copy link
Member

@Bultako Bultako commented May 16, 2018

Some clarifications for the comments made.

Notebooks in gammapy repo may be considered as notebooks for developers, they have no outputs cells, and they are there not only for fixing version-coupling with code but also to ease diffing comparison when changes are made. But.. I would prefer to download executed notebooks with output cells. I think it is better for users to open a filled notebook and not kind of a blank slate, I guess they would prefer to see what is happening at the end of the notebook.

My suggestion to fix this is to trigger execution of the notebooks in the user laptop, automatically after download, and warn the user if his environment differs from the environment that has been used to make that notebook, and also tells her that all notebooks downloaded are also reproducible. I think we need an environment.yml in the gammapy repo, with the exact versions of the packs. It is not only a matter of the right gammapy version, but also about the version of all the other packs used.

I would not ship notebooks with the code, I prefer to use gammapy download, which will be versatile enough to download not only the whole collections of notebooks, but also specific ones, as well as any public GitHub folder, dataset collections or a specific dataset, accepting also a specific commit hash as argument. (all files in public GitHub repos)

Re: versioned reduced products/datasets

If it is really needed, I would suggest to store the different versions in any GitHub repo following a nomenclature in the free-folder structure. The names of the folder is up to you. For example /datasets/hess/spectra/v0.6/ in gammapy-extra repo, or /results/fit/gammapy/0.7/fermi and /results/fit/gammapy/0.8/fermi in the joint-crab repo.

gammapy download may accept GitHub URLs to download. I see this as a matter of personal organization of the digital assets we need for a specific work.

Loading

@cdeil
Copy link
Member Author

@cdeil cdeil commented May 17, 2018

@Bultako - I agree with everything, except:

My suggestion to fix this is to trigger execution of the notebooks in the user laptop, automatically after download

That's a bad user experience. You want to get started with tutorials, and then you have to wait for half an hour for all the notebooks to execute before you can start.
If we execute / render notebooks on docs build, and then gammapy download fetches from docs.gammapy.org, they get notebooks in seconds and can start.
For devs, for the dev version, I think working in gammapy/notebooks is fine.

If it is really needed, I would suggest to store the different versions in any GitHub repo following a nomenclature in the free-folder structure.

I really don't think this is needed. Suggest to just keep it simple and assume the files are stable and put them e.g. in datasets/blabla and then if something really changes, add datasets/blabla2. But I would not start putting datasets/blabla/v1/ from the start.

Loading

@Bultako
Copy link
Member

@Bultako Bultako commented May 17, 2018

Re: notebooks with outputs

I have given a second thought to all this, and it seems far more complex than how I saw things yesterday.. The main point is that notebooks do have inputs and outputs datasets, and some of them may take several minutes to run in the whole. So, in the process of building output cells automatically, they will need datasets and produce others.

If notebooks may be considered as executable recipes, now I see it is fair enough if we provide just the recipe without the income ingredients/datasets needed neither the produced datasets. As @registerrier said we already provide executed fixed notebooks in the docs.

Loading

@Bultako
Copy link
Member

@Bultako Bultako commented May 17, 2018

@cdeil
It seems our last comments crossed :)

If we execute / render notebooks on docs build, and then gammapy download fetches from docs.gammapy.org, they get notebooks in seconds and can start.

I would download the empty notebooks from gammmapy/notebooks in gammapy GitHub repo. I think that f we download from any other place we will lose the synchronicity with the code, that it is what we are trying to solve with this PIG, no?

Loading

@cdeil
Copy link
Member Author

@cdeil cdeil commented May 17, 2018

I would download the empty notebooks from gammmapy/notebooks in gammapy GitHub repo. I think that f we download from any other place we will lose the synchronicity with the code, that it is what we are trying to solve with this PIG, no?

Docs are versioned, I think this would work just fine.
My concern with Github is that it might not always be there, but gammapy.org (and thus docs.gammapy.org) should be the stable URL that we own forever.
I also think considering gammapy.org/data/config.json might be a good idea.
In principle, any of the solutions we're discussing will work, so OK to go with what you prefer.

Concerning the question of notebook execution: would you be OK to have local notebook execution as an option (e.g. --execute or --render) and have it False by default, i.e. by default gammapy download just downloads notebooks and is fast?

Loading

@joleroi
Copy link
Contributor

@joleroi joleroi commented May 17, 2018

Just a comment: I actually prefer empty tutorial notebooks. Then I can go through each cell and see what it does. Are you really sure that users want to

open a filled notebook and not kind of a blank slate

? I mean for sure in the docs, but not locally when you actually want to work with the notebooks

Loading

@Bultako
Copy link
Member

@Bultako Bultako commented May 17, 2018

@joleroi
from my small experience of doing some notebooks for two other colleagues, when they do not know what the notebook does, they search for the plots or the tables produced in order to have a small glimpse of what it is the whole bunch about. of course, it depends very much on the people..

Loading

@robertazanin
Copy link
Contributor

@robertazanin robertazanin commented May 17, 2018

I am + 1 for having empty notebooks in gammapy repository and then you have the complete version of the them with the plots on the docs (that's more than enough from the first glimpse).

Did I understand correctly that the gammapy download would just be to download data sets? Or would this include also the output files of the notebooks?

Loading

@cdeil
Copy link
Member Author

@cdeil cdeil commented May 17, 2018

Did I understand correctly that the gammapy download would just be to download data sets? Or would this include also the output files of the notebooks?

Download input data files and notebooks. Not output data files from the notebooks.

Most notebooks will not write output files, but if some do, those files would just appear locally if someone runs the notebooks.

I think we need a more detailed proposal and description how download works, i.e. where it puts the notebooks and the data files. I was expecting this PIG to just be to quickly discuss the main points, and then that discussion on details is done in #1237 where this is implemented. @Bultako - It could still be done in either place, whatever you think is more efficient, i.e. to flesh out the proposal here in the PIG, or to just implement your preferred solution in the PIG and refer to it more directly from the PIG.

Loading

@Bultako
Copy link
Member

@Bultako Bultako commented May 17, 2018

Did I understand correctly that the gammapy download would just be to download data sets? Or would this include also the output files of the notebooks?

Depending on the arguments passed to gammapy download, you will download:

  • any file or folder present in a public GitHub repository, no need to have git installed.
  • the set of *.ipynb files (notebooks) present in the /docs/0.x/_static/notebooks/ folder of the gammapy_docs GitHub repo.
  • a specific *.ipynb file present in /docs/0.x/_static/notebooks/ folder of the gammapy_docs GitHub repo.

If one specific dataset is needed to execute a notebook, it will have to be downloaded inside the notebook using gammapy download. I would prefer to refine the details of how gammapy download works, arguments and destination folders in #1369.

Loading

@cdeil
Copy link
Member Author

@cdeil cdeil commented May 17, 2018

If one specific dataset is needed to execute a notebook, it will have to be downloaded inside the notebook using gammapy download.

But what if I want to just download all example data and notebooks and go on an island?
There should be one or two quick commands for that, no?

I had assumed that from notebooks we would either access data via a GAMMAPY_DATA variable, or by relative path.

So this discussion shows that it's good that we made one PIG to discuss example data and notebook together: we have to know what options (if more than one) to allow where data can go.

I'm not sure what is best here ...

Loading

@Bultako
Copy link
Member

@Bultako Bultako commented May 17, 2018

But what if I want to just download all example data and notebooks and go on an island?
There should be one or two quick commands for that, no?

Ooh, that would be cool to go on an island.., that's a good reason for me :)
So, what about this? gammapy download tutorial will download all notebooks and the input datasets related. All files will be placed inside a gammapy-tutorial folder created in the working directory.

Loading

@cdeil
Copy link
Member Author

@cdeil cdeil commented May 17, 2018

I'd suggest "tutorials" with "s", but otherwise 👍 .
Also: it's always possible to add more options, e.g. download only some files etc, but it's easier to add options than to try to foresee things, i.e. just start with a PR that does the minimal thing.

Loading

@robertazanin
Copy link
Contributor

@robertazanin robertazanin commented May 18, 2018

but wasn't it useful also one command to download only the data sample?
This is for instance what I would do.
I personally believe that having tutorials in two different repository will strongly confuse people.

Loading

@Bultako
Copy link
Member

@Bultako Bultako commented May 18, 2018

@robertazanin

All datasets needed to run the notebooks are shipped with the notebooks using command gammapy download tutorials, in case you only want to download the whole sample dataset or a specific dataset you could do it also with gammapy download using a different argument.

Users do not need to know in which repo the tutorials are placed. There will be a single entry for static HTML tutorials in http://docs.gammapy.org and a single command gammapy download tutorials to download the whole bunch of notebooks and datasets needed to execute them.

If we mirror repos or not is just part of internals mechanisms for synchronicity that users should not be aware of.

Loading

@joleroi
Copy link
Contributor

@joleroi joleroi commented May 18, 2018

So, what about this? gammapy download tutorial will download all notebooks and the input datasets related. All files will be placed inside a gammapy-tutorial folder created in the working directory.

👍

Loading

@cdeil cdeil mentioned this pull request Jul 26, 2018
4 tasks
@Bultako Bultako changed the title PIG: Setup for tutorial notebooks and data PIG 4: Setup for tutorial notebooks and data Aug 16, 2018
These modifs try to make clarifications on most of the aspects discussed in gammapy#1419
@Bultako Bultako dismissed stale reviews from registerrier and joleroi via 23b87bd Aug 16, 2018
Add reference `nbstripout`
@Bultako Bultako changed the title PIG 4: Setup for tutorial notebooks and data PIG 4 - Setup for tutorial notebooks and data Aug 20, 2018
@cdeil
Copy link
Member Author

@cdeil cdeil commented Sep 21, 2018

The implementation of this PIG is almost complete.

It's another case where we used a PIG as a design document / for discussion, and then just went ahead and implemented it.

I think that's OK, but of course we should follow the procedure and ask for formal approvement by CC.
@registerrier - Maybe in this case we just do it in Madrid? Or should I email the CC list and we do it now to finish this up?

Just to note, e.g. with Python PEPs it's often the same: someone implements it first and only once almost complete and working and people tried it out it's approved.

Loading

@cdeil
Copy link
Member Author

@cdeil cdeil commented Oct 4, 2018

This 🐷 is approved by Gammapy coordination committee. 🎉
I'll merge and make a follow-up commit in master to change the status.

Loading

@cdeil cdeil merged commit f7bc3cf into gammapy:master Oct 4, 2018
3 of 4 checks passed
Loading
DOCUMENTATION automation moved this from To Do to Done Oct 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
DOCUMENTATION
  
Done
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants