New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move tutorial notebooks to the Gammapy repository #1791

Merged
merged 4 commits into from Sep 15, 2018

Conversation

2 participants
@Bultako
Member

Bultako commented Sep 13, 2018

This PR continues with the process of syncing Gammapy codebase and documentation tutorials, as it is discussed in PIG 4 - Setup for tutorial notebooks and data - #1419

The tutorials folder in Gammapy repo will be the place where notebooks published in the documentation will be stored, stripped of output cells. It will also be used to download the tutorials pack with gammapy download tutorials CLI. The notebooks folder in Gammapy-extra repo will still be there, and may be used by developers for any other purposes.

  • Flags test and published have been removed from the config file tutorials/notebook.yaml
  • All notebooks published in the documentation must be in the tutorials folder.
  • All notebooks in the tutorials folder must pass regression tests with test_notebooks.py.

A subsequent PR will follow to modify the documentation process accordingly, as well as for downloading tutorials from the Gammapy repo with gammapy download tutorials (added in #1786)

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 13, 2018

@Bultako - Thank you for all your work!

Do you have time to join the Gammapy call tomorrow morning?

We have to discuss whether to merge big change before or after v0.8.

There's still some work ongoing on the new fitting / modeling, as well as tutorials, and also checks / fixes for the new 3D analysis using CTA and HESS test data.

@Bultako Bultako self-assigned this Sep 13, 2018

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 13, 2018

@cdeil
Yes - tomorrow I can talk about all this :)

@Bultako Bultako added the docs label Sep 13, 2018

@cdeil cdeil added this to To Do in Documentation via automation Sep 14, 2018

@cdeil cdeil added this to the 0.8 milestone Sep 14, 2018

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 15, 2018

I'm merging this one.

From now on any updates on notebooks published in the documentation http://docs.gammapy.org/dev/tutorials.html#notebooks should be done in the tutorials folder of the Gammapy repo.

@Bultako Bultako closed this Sep 15, 2018

Documentation automation moved this from To Do to Done Sep 15, 2018

@Bultako Bultako reopened this Sep 15, 2018

Documentation automation moved this from Done to In Progress Sep 15, 2018

@Bultako Bultako merged commit 60b28a8 into gammapy:master Sep 15, 2018

1 of 4 checks passed

Scrutinizer Installing Code
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
Codacy/PR Quality Review Up to standards. A positive pull request.
Details

Documentation automation moved this from In Progress to Done Sep 15, 2018

@Bultako Bultako deleted the Bultako:move-tutorials branch Sep 15, 2018

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 15, 2018

@Bultako - Thanks!

Can we delete https://github.com/gammapy/gammapy-extra/tree/master/notebooks ?
(probably after moving / rescuing some files from there before deleting it)

Having the old and the new would be a constant source of confusion for developers for the coming weeks, no? And also users are currently directed to the old: http://docs.gammapy.org/dev/tutorials.html#set-up

So as far as I can see, there's now no turning back and we should go full steam ahead and finalise the changes to the new setup this week before we ship v0.8?

The real deadline for v0.8 is end of next week (to have a day or two to build conda packages and do some testing), because Sep 24 we have the Gammapy tutorial at the CTA consortium meeting.

@cdeil cdeil changed the title from Move tutorials notebooks from Gammapy-extra repo to Move tutorial notebooks to the Gammapy repository Sep 15, 2018

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 15, 2018

I see the CI docs build is red because of these warnings that need an edit in spectrum_models.ipynb:

/Users/deil/work/code/gammapy/docs/notebooks/spectrum_models.ipynb.rst: WARNING: file not found: 'api/gammapy.utils.modeling.Parameter.rst'
/Users/deil/work/code/gammapy/docs/notebooks/spectrum_models.ipynb.rst: WARNING: file not found: 'api/gammapy.utils.modeling.Parameters.rst'

And I'm not sure where to apply the edit.

@Bultako - I'm +1 to just do the full change to the new scheme now and delete the notebooks in gammapy-extra/notebooks. If there's something I can help with to do this in the next days, let me know.

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 15, 2018

Can we delete https://github.com/gammapy/gammapy-extra/tree/master/notebooks ?

We still need it, I will set the Binder in Gammapy repo and adapt the Sphinx doc building process in the next PRs.

Afterwards I would keep the notebooks folder as a place to store contributions of any kind in the notebooks format, but not related to the documentation. Note that the new folder in Gammapy repo is called tutorials and I think it should be seen as kind-of enhanced RST files docs.

So as far as I can see, there's now no turning back and we should go full steam ahead and finalise the changes to the new setup this week before we ship v0.8?

Yes.
But it "only" affects documentation. :)

Re: Travis CI Sphinx warnings.
And I'm not sure where to apply the edit.

I have the next PRs for doc-building nearly ready, coming this morning.
Once merged, modifs in docs-notebooks will have to be done in Gammapy repo.

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 15, 2018

So I'm already allowed to edit and add a HESS data notebook in the gammapy/tutorials today?
And you'll remove the old and now duplicate version of those notebooks in gammapy-extra in the next days, or if you need a copy there, generate that copy from the one in the Gammapy repo?

I just want to work on the Gammapy notebooks a bit, don't feel rushed to finish the changes in the setup.

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 15, 2018

So I'm already allowed to edit and add a HESS data notebook in the gammapy/tutorials today?

Yes, go ahead.
The main new steps to follow in the process are:

  • create your new notebook in your local Gammapy-extra/notebooks git folder
  • access datasets as ../datasets
  • once the notebook is ready to be in tutorials, strip the output cells and upload to Gammapy repo.

Ok, if you want to remove the duplicated notebooks of Gammapy-extra because of source of confusion, but I would keep the notebooks folder with other (still) non-published notebooks.

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 15, 2018

Ok, if you want to remove the duplicated notebooks of Gammapy-extra because of source of confusion

Yes, please. We can do it now, no? You or me?

but I would keep the notebooks folder with other (still) non-published notebooks.

Yes, that's a good idea to keep that folder for now. I just don't want duplicated notebooks to avoid confusion.

work in your local Gammapy-extra/notebooks git folder
access datasets as ../datasets
once the notebook is ready to be in tutorials, strip the output cells and upload to Gammapy repo.

I think it would be much nicer if I could just work on notebooks in-place, in the gammapy/tutorials folder. That's then the same workflow as for code or RST docs. I edit files in-place, and then before committing just run some command to test / build / polish. Working like that would be my preference, but then we should access datasets via an env var GAMMAPY_DATA, pointing to the gammapy download data folder.

Another alternative would be that I do gammapy download tutorials like a user, and work on the notebooks or author new ones there. Then I could access data always via ../../datasets, i.e. two levels up. But also there I think it's non-ideal, because of the extra step to copy notebooks back, and only there can do e.g. git diff to see what I changed.

So to summarise, my suggestion would be: developers just work on the notebook directly in the gammapy repo, and data is accessed via GAMMAPY_DATA. For notebooks that are in the gammapy-extra repo, it's similar: devs work on them in-place there and access data via GAMMAPY_DATA or in rare cases where example data files are accessed that aren't appropriate to ship via GAMMAPY_DATA, access them via GAMMAPY_EXTRA like we did until now.

Although, thinking about this some more, probably we'll still get the version coupling issues for the notebooks we leave in the gammapy-extra repo. Why not move them to gammapy/dev/notebooks now to also get the version-coupling benefit? Note that gammapy/dev is basically a playground for devs for notes and scripts, just like gammapy-extra is. For that part of Gammapy, i.e. the non-official notebooks, I think it doesn't matter so much what is in gammapy-extra or gammapy/dev, but given that file sizes are small now for notebooks, the need to have them in "the other" repo has disappeared.

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 15, 2018

work in your local Gammapy-extra/notebooks git folder

Basically I'm pushing against this workflow because I think it'll be a source of confusion, to have gammapy-extra/notebooks where previously all notebooks were under version control, and now some are and some aren't, and devs have to remember to copy things back and forth manually with gammapy/tutorials all the time. It's too easy to start editing a copy of a notebook in gammapy-extra/notebooks, not realising that someone else already edited the copy in gammapy/tutorials, and then because notebook edits are mostly impossible to merge, there is pain when one goes and tries to make the pull request.

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 15, 2018

I tried out gammapy download tutorials just now.

It worked and looks great.

Two suggestions:

  1. hess-crab4_pha are old and bad files. I'll make a new gammapy-extra/datasets/joint-crab now, which has the equivalent new version of the four Crab runs from HESS (not the other data for now) and update all Gammapy tests and notebooks to use that, and then delete gammapy-extra/datasets/hess-crab4_pha.

  2. It would be great if the download command added the few files from https://github.com/gammapy/gammapy-fermi-lat-data that we access in the tutorial to the fermi_3fhl folder. Then again, users would just get all example data without having to know about yet another repo on Github and having to clone it and set another env var. I realise this might be non-trivial to set up, to have one local "dataset" that consists of files from several remote URLs. If this is not easy to achieve, we can find another solution here, e.g. to put a copy of the files in gammapy-extra. I think it's a few 10s of MB, so still OK.

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 15, 2018

Re: removing notebooks in Gammapy-extra

I would wait until Monday to have Binder in Gammapy repo.

work on notebooks in-place, in the gammapy/tutorials folder

Yes, now you can do it, see last PR #1795
Surely, we still need to polish the docs-editing workflow a little bit. It's coming, it will be fine soon :)
I would avoid using env vars in the tutorials, and use ../datasets to access data.

move gammapy/dev to gammapy-extra/notebooks

I fully agree with it. You can do it now.

work in your local Gammapy-extra/notebooks git folder

This was just a very-short term patch/suggestion for right now, with last PR #1795 devs creating/updating tutorial-notebooks (and having local symmlink datasets) will work on gammapy/tutorials folder.

download https://github.com/gammapy/gammapy-fermi-lat-data

We could modif notebooks accessing these datasets (I think it is only one) to access data with ../datasets and copy the datasets used in gammapy-extra/datasets, no?

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 15, 2018

This all sounds good. So the main remaining point of discussion is how to access data.

You prefer relative paths or symlinks, I prefer an env var GAMMAPY_DATA.

@Bultako - Do you feel strongly about this? Are there pros of the relative path / link solution?

I think either way will work, but env var is more general. It will work for notebooks and datasets in any location and tutorial notebooks and tests and everyone can access via the env var (like we currently do in Gammapy tests and also in notebooks that aren't necessarily in fixed locations. Also note that e.g. for the 10 GB of data from the CTA first data challenge we used the recommendation to use an environment variable CTADATA both from Gammapy and ctools and in the official instructions here:
http://docs.gammapy.org/dev/notebooks/cta_1dc_introduction.html#Getting-the-1DC-data
https://forge.in2p3.fr/projects/data-challenge-1-dc-1/wiki/Getting_data#Environment-variables
I've seen many people set and use the env var successfully, even on Windows, I think env vars are well-known and Gammapy users will see them sooner or later anyways for larger datasets, like the CTA DC-1 data, or in HESS the archive of all HESS data in FITS format etc. There relative paths aren't a good solution, often the data is on a shared drive at the institute.

So you see, I have a clear preference for an env var. But it's not really super strong, so @Bultako or @adonath or @registerrier or others - if you prefer no env var, we can do that.

(I also hate the default overwrite=False in Gammapy, but that was preferred by the majority of Gammapy core devs, so we put it)

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 15, 2018

Concerning the Fermi-LAT data: if there were an index file that has a local path and remote URL for each file, then it would be OK to have a mix of remote URLs. Something like this:

dataset_name: "fermi_3fhl"
files:
    - path: gll_iem_v06_cutout.fits
      url: https://github.com/gammapy/gammapy-extra/raw/master/datasets/fermi_3fhl/gll_iem_v06_cutout.fits
    - path: fermi_3fhl_exposure_cube_hpx.fits.gz
      url: https://github.com/gammapy/gammapy-fermi-lat-data/raw/master/3fhl/allsky/fermi_3fhl_exposure_cube_hpx.fits.gz
@cdeil

This comment has been minimized.

Member

cdeil commented on dfe64bf Sep 15, 2018

@Bultako - Can you please put this code in a if __name__ == '__main__': main()?

I just accidentally typed pytest -v instead of pytest -v gammapy and then pytest imported and ran the top-level code:
https://gist.github.com/cdeil/1113fb18928761486403f274d31dcdc7#file-gistfile1-txt-L44

Basically import should never do anything, especially not things like os.unlink.

Me might still want to add some default config so that pytest by default only collects tests in the gammapy folder? If this is possible in setup.cfg, we should put it there to avoid an extra config file. I can have a look at that aspect later tonight. I'm offline now.

This comment has been minimized.

Member

Bultako replied Sep 15, 2018

Done in 5b16581

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 17, 2018

Re: access data in tutorials

I still prefer the relative path ../datasets, I'm only talking about clean tutorials living now in Gammapy repo, and delivered by gammapy download tutorials There is no need to declare env vars to run the tutorials in different systems (win, bash, tcsh, etc.). Why needing this additional step, when the tutorials are delivered together with the data in a fixed tree-folder structure? Regression tests pass also with the relative ../datasets path.

If there are use cases for a strong need to run the tutorials in a different folder, or work with big datasets that better not to deliver with gammapy download tutorials, then I see the cons of working with relative paths. And in the case of huge datasets, we would have a small issue because we would need to provide these in an alternative way. I would try to answer this question from the user's point of view, not the developer's point of view that creates/modify tutorials. So maybe, we could take a decision all together during the code sprint?

Re: Fermi-LAT data url in index file

It would be very easy to implement if we do the datasets declaration in notebooks.yaml in the following way.

- name: fermi_lat
  requires: 
  datasets:
    - gll_iem_v06_cutout:
        repo: gammapy-extra 
        path: datasets/fermi_3fhl/gll_iem_v06_cutout.fits
    - fermi_3fhl_exposure_cube_hpx:
        repo: ggammapy-fermi-lat-data
        path: 3fhl/allsky/fermi_3fhl_exposure_cube_hpx.fits.gz
@cdeil

This comment has been minimized.

Member

cdeil commented Sep 17, 2018

in notebooks.yaml in the following way

I'd really suggest to avoid something like repo: gammapy-extra that is git and Github specific.
We should get ready to have a setup that can be hosted by CTA, make it easy for them to select / adopt / maintain Gammapy as CTA ST.
So suggest to just use url for each file, it is simpler and seems more likely that data files will be delivered by a static web server than through git & github.

To avoid a huge notebooks.yaml I think it's enough to note that dataset "name" there and have the dataset files listed in a separate file. No need to have very fine-grained control and list exactly which files from which dataset are accessed in each notebook IMO.

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 17, 2018

@cdeil

I agree with your last comment, but getting rid of Github API in the gammapy dowmload will take more than my proposed 5min-solution. I'll make the Binder for Gammapy repo and then try to remove Github API at some moment this week.

@cdeil

This comment has been minimized.

Member

cdeil commented Sep 17, 2018

Maybe I could help by writing a small Python script that makes an index file listing the files for each dataset with complete URLs? I think to be maintainable it has to be a combination of listing datasets / files manually, and then for loops to generate all entries, so that results are always the same without relying on what the person that runs it has on their laptop and without having to hit the network.
Then download could be changed to use that.

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 17, 2018

I have thought of using the treecommand to build this index file easily.
For example, doing a tree -ifF ./gammapy-extra/datasets to scan the content of gammapy-extra and tree -ifF ./gammapy-fermi-lat-data/ for gammapy-fermi-lat-data, etc.. We could group these two contents in different sections of the index file, with the base url defined at the header of these sections..

Then gammapy download datasets could fetch (as it is the case now) also specific files, and not only entire folders as I think to understand from what you say. To maintain this lookup index file, we would need to just run this build_data_idx.py with no manual edition of the index files.

Maybe I'm still missing something here..

@Bultako

This comment has been minimized.

Member

Bultako commented Sep 18, 2018

I have seen that Gammapy code-base uses env vars very often to build catalogs or access datasets. So, I think it makes no sense to fix datasets access with relative paths in the cells of tutorial notebooks when env vars are needed in the code base. I would then leave the notebooks with the env vars (i.e. $GAMMAPY_EXTRA, $GAMMA_CAT, $GAMMAPY_FERMI_LAT_DATA, $FERMI_DIFFUSE_DIR, etc.)

Because of the big size of the data involved in these env vars/ catalogs I would add a flag option to gammapy download tutorials to enable/disable the datasets used in the tutorials, and inform at the end of the download process of the need of having a list of env vars declared in order to reproduce and play with tutorials.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment