Custom tutorial notebooks #2

Closed
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,17 @@ Fixes:
------
- ...

0.2.0 (2021-03-30)
===================

Changes:
--------
- Custom tutorial notebooks are added when executing the Docker image entrypoint

Fixes:
------
- na

0.1.0 (2021-02-19)
===================

11 changes: 9 additions & 2 deletions Dockerfile
@@ -55,10 +55,17 @@ ADD https://raw.githubusercontent.com/jupyter/docker-stacks/master/base-notebook
RUN chmod a+rx /usr/local/bin/start.sh /usr/local/bin/start-singleuser.sh /usr/local/bin/start-notebook.sh /usr/local/bin/fix-permissions; \
chmod a+r /etc/jupyter/jupyter_notebook_config.py

+# Prepare script and tutorial notebooks folder
+COPY download-notebooks.sh /download-notebooks.sh
+RUN mkdir -p /notebook_dir/tutorial-notebooks
+# Change notebook folder's permission so user jenkins can execute the script and add notebooks to the tutorial folder
+RUN chmod a+rx /download-notebooks.sh ; \
+chmod -R a+rwx /notebook_dir/tutorial-notebooks
+
# problem running start-notebook.sh when being root
# the jupyter/base-notebook image also do not default to root user so we do the same here
USER jenkins

# follow jupyter/base-notebook image so config in jupyterhub is simpler
-# start notebook in conda environment to have working jupyter extensions
-CMD ["conda", "run", "-n", "birdy", "/usr/local/bin/start-notebook.sh"]
+# download tutorial-notebooks and start notebook in conda environment to have working jupyter extensions
+CMD ["/bin/bash", "-c", "/download-notebooks.sh && conda run -n birdy /usr/local/bin/start-notebook.sh --SingleUserNotebookApp.default_url=/lab"]
Collaborator

OK, so the tutorial notebooks are downloaded only when the personal Jupyter server starts. Users will therefore only get notebook updates when they destroy and restart their personal server.

So we lose the automatic notebook deploy feature, but in return the image is standalone, with no dependency on an external process to fetch the notebooks. There are pros and cons, and I think I see where you're going with this.

Why --SingleUserNotebookApp.default_url=/lab? Isn't that already the default when launched by JupyterHub? Is this Jupyter server meant to be launched standalone, outside of the JupyterHub in PAVICS? I am okay with this, it does no harm, I am just curious why it was needed.

Collaborator Author

Just to clarify the --SingleUserNotebookApp.default_url=/lab part, although this code might change given the latest feedback: when I changed the CMD line in the Dockerfile to include the download-notebooks.sh script call, jupyter-notebook was started instead of jupyter-lab when I started the image via the birdhouse-deploy stack.

I am not 100% clear on why changing just that line would affect whether notebook or lab is used in birdhouse, and I had trouble fixing it. Maybe it is just the way the CMD is written in the Dockerfile. I know the behavior differs depending on whether the CMD line is written in exec form (like the original code) or in shell form.

Anyway, adding the SingleUserNotebookApp.default_url argument fixed the problem by making sure the right Jupyter interface was used (lab instead of notebook).
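One way to sidestep the exec-form versus `bash -c` question would be to leave the original exec-form CMD untouched and move the download step into a small wrapper that ends with `exec "$@"`. The sketch below is illustrative only: the function name `prestart_then_exec` is hypothetical, and whether this restores JupyterLab as the default under birdhouse-deploy has not been verified in this thread.

```shell
# Hypothetical wrapper: run a pre-start step, then replace the current
# process with the command that was passed in, arguments untouched.
prestart_then_exec() {
    prestart=$1
    shift
    # A failed download should not prevent the server from starting.
    "$prestart" || echo "warning: $prestart failed, continuing" >&2
    exec "$@"
}

# Demo in a subshell so that exec does not replace the calling shell.
( prestart_then_exec true echo "server would start here" )
```

In an image, a script built around this function could serve as the ENTRYPOINT (called as `prestart_then_exec /download-notebooks.sh "$@"`), so the CMD, or whatever command replaces it at run time, is passed through as-is.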

Collaborator

> So we lose the automatic notebook deploy feature, but in return the image is standalone, with no dependency on an external process to fetch the notebooks. There are pros and cons, and I think I see where you're going with this.

@tlvu Actually, this solution was made without considering autodeploy, since we thought fetching only at image startup would be enough. I have to admit that finding a compromise between standalone images and autodeploy would be wonderful.

After a great discussion with @dbyrns, the idea came up of triggering a scheduled task that calls the actual custom image with a "FETCH_NOTEBOOKS" switch. We would keep the custom image's specific dependencies inside the Docker image, while making sure the autodeploy updates the notebooks, let's say at midnight. Caching the notebooks on a shared volume would then avoid useless requests. The "FETCH_NOTEBOOKS" switch would ensure that only one fetch is done, instead of having 20 containers fetching the same data at the same time.
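The switch described above could be sketched as follows. This is a minimal illustration, not the implementation that was eventually written: the function names and the accepted values of FETCH_NOTEBOOKS (1, true, yes) are assumptions, and the fetch command is passed in by the caller.

```shell
# Succeed only when the FETCH_NOTEBOOKS switch is turned on.
# The accepted spellings (1/true/yes) are an assumption of this sketch.
fetch_enabled() {
    case "${FETCH_NOTEBOOKS:-}" in
        1|true|yes) return 0 ;;
        *) return 1 ;;
    esac
}

# fetch_if_enabled FETCH_CMD [ARGS...]
# Run the fetch command only when the switch is on; otherwise rely on the
# notebooks already cached on the shared volume.
fetch_if_enabled() {
    if fetch_enabled; then
        "$@"
    else
        echo "FETCH_NOTEBOOKS not set, using cached notebooks"
    fi
}

# Scheduled task: switch on, so the fetch command runs.
( FETCH_NOTEBOOKS=true; fetch_if_enabled echo "fetching notebooks" )
# Regular user container: switch off, so no network request is made.
( FETCH_NOTEBOOKS=; fetch_if_enabled echo "fetching notebooks" )
```

With this shape, only the scheduled run pays for the download; the user containers read whatever is already on the shared volume.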

Collaborator

> After a great discussion with @dbyrns, the idea came up of triggering a scheduled task that calls the actual custom image with a "FETCH_NOTEBOOKS" switch. We would keep the custom image's specific dependencies inside the Docker image, while making sure the autodeploy updates the notebooks, let's say at midnight. Caching the notebooks on a shared volume would then avoid useless requests. The "FETCH_NOTEBOOKS" switch would ensure that only one fetch is done, instead of having 20 containers fetching the same data at the same time.

@matprov @dbyrns

I like this. So we remove the download-notebooks.sh script call from the Dockerfile CMD but leave the script in the image, and call the script from a cronjob instead. Yeah, that would work and preserve "the burden of mapping notebooks with the right image is moved to the image curator." from #2 (comment), which I agree with as well.

The cronjob will write to $JUPYTERHUB_USER_DATA_DIR/tutorial-notebooks/[eo,nlp] (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/env.local.example#L232) on disk, which will be volume-mounted read-only to /notebook_dir/tutorial-notebooks inside the Jupyter environment and made available to the end user.
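The cron-side write described above could look roughly like this. It is a sketch under assumptions: `publish_notebooks` is a hypothetical helper, and the demo uses a throwaway directory where a real job would target $JUPYTERHUB_USER_DATA_DIR/tutorial-notebooks/eo or .../nlp.

```shell
# publish_notebooks SRC_DIR DEST_DIR
# Copy every *.ipynb found directly in SRC_DIR into DEST_DIR and drop
# write permission on the copies. Both paths are supplied by the caller.
publish_notebooks() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for nb in "$src"/*.ipynb; do
        [ -e "$nb" ] || continue    # the glob matched nothing
        target="$dest/$(basename "$nb")"
        rm -f "$target"             # replace any previous read-only copy
        cp "$nb" "$target"
        chmod a-w "$target"
    done
}

# Demo against throwaway directories instead of the real data dir.
demo=$(mktemp -d)
mkdir -p "$demo/checkout"
echo '{}' > "$demo/checkout/example.ipynb"
publish_notebooks "$demo/checkout" "$demo/tutorial-notebooks/eo"
ls -l "$demo/tutorial-notebooks/eo"
rm -rf "$demo"
```

Removing write permission on the host side is only a second safeguard here; the read-only volume mount is what protects the files inside the Jupyter container.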

Basically, instead of using our cronjob (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/components/scheduler/config.yml.template#L13-L27), which hardcodes our notebooks, you roll your own cronjob for your own notebooks. Yeah, that's perfect for now. Sorry, my current notebook autodeploy is not pluggable, so you cannot easily add your own notebooks.

If possible, try to use the deploy-data (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/deployment/deploy-data) mechanism, which was meant to "get some files from some git repos and put them somewhere". I was tired of repeatedly rewriting the same thing, so I created that generic script. It uses git pull instead of a direct wget or curl, so repeated runs only transfer what changed, which is very bandwidth-efficient for big downloads (its first use case was to autodeploy xclim and raven testdata to Thredds: https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/env.local.example#L167-L189). Basically, get some testdata files from some git repos and put them somewhere for Thredds to see. That is exactly the same use case as getting some notebooks from some git repos and putting them somewhere for Jupyter to see.
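The clone-once-then-pull idea can be sketched in a few lines of shell. This is not the deploy-data script, whose configuration format is not shown in this thread; `update_checkout` and the overridable `GIT` variable are hypothetical names used only for illustration.

```shell
# GIT can be overridden (for example in tests); it defaults to the real git.
GIT=${GIT:-git}

# update_checkout REPO_URL CHECKOUT_DIR
# First run: shallow clone. Later runs: fast-forward pull, which only
# transfers what changed since the previous run.
update_checkout() {
    repo_url=$1
    checkout_dir=$2
    if [ -d "$checkout_dir/.git" ]; then
        "$GIT" -C "$checkout_dir" pull --ff-only
    else
        "$GIT" clone --depth 1 "$repo_url" "$checkout_dir"
    fi
}

# Example call (not executed here because it needs network access):
# update_checkout https://github.com/Ouranosinc/pavics-sdi /tmp/pavics-sdi
```

Compared with downloading master.tar.gz on every run, the checkout persists between runs, so a nightly job mostly pays for the diff.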

You will also have to volume-mount either the eo or the nlp sub-folder under $JUPYTERHUB_USER_DATA_DIR/tutorial-notebooks, depending on the image, so you'll have to modify https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/config/jupyterhub/jupyterhub_config.py.template#L52-L56 for this. I hope you can find a way to keep that association with "the image curator" as well.

Collaborator

> @tlvu Actually, this solution was made without considering autodeploy, since we thought fetching only at image startup would be enough. I have to admit that finding a compromise between standalone images and autodeploy would be wonderful.

@matprov @dbyrns

By the way, I think we can have both standalone and autodeploy together without one breaking the other.

That FETCH_NOTEBOOKS flag, if enabled on docker run of the image, will perform the fetch, then launch the personal Jupyter server! So for our PAVICS stack, we will use that FETCH_NOTEBOOKS flag in the cronjob only, but other users running the image standalone, outside of the PAVICS stack, can enable the flag and get the notebooks, without autodeploy of course.

That makes for a cheap, quick demo! I am thinking of https://mybinder.org/, and I did configure Binder for our Jupyter env + notebooks, check this out: https://mybinder.org/v2/gh/Ouranosinc/PAVICS-e2e-workflow-tests/master (it is mentioned in the README as well: https://github.com/Ouranosinc/PAVICS-e2e-workflow-tests#launch-jupyter-notebook-server-using-binder)

Collaborator

> By the way, I think we can have both standalone and autodeploy together without one breaking the other.

@ChaamC I think you probably want to ignore that post from me for now and focus only on getting the auto notebook deploy working. I was just getting ahead of myself, as usual. We can improve it incrementally later. Just get the basics working first.

3 changes: 3 additions & 0 deletions README.md
@@ -3,6 +3,9 @@ Base Jupyter docker image for PAVICS.
Specialized component images will be built from this image for both CRIM and Ouranos
organizations.

Tutorial notebooks are included with the docker image. Each specialized image provides its own custom notebooks, and
all generic notebooks of the base image are downloaded and added when starting the Docker container from JupyterHub.

Repos for the component images :
* crim-ca/pavics-jupyter-images : https://github.com/crim-ca/pavics-jupyter-images
* Ouranosinc/pavics-jupyter-images : to be created
11 changes: 11 additions & 0 deletions download-notebooks.sh
@@ -0,0 +1,11 @@
#!/bin/sh

# Notebook directory used by JupyterHub in birdhouse-deploy
NOTEBOOK_DIR="/notebook_dir/tutorial-notebooks"

# Download notebooks required for the base image and add them to the notebook directory
wget -O - https://github.com/Ouranosinc/pavics-sdi/archive/master.tar.gz | \
    tar -xz --wildcards -C "$NOTEBOOK_DIR" --strip-components=4 "*/docs/source/notebooks/jupyter_extensions.ipynb"
Collaborator

The biggest missing feature of my previous implementation is the lack of a pluggable/customizable list of notebooks that would let each org's deployment of PAVICS choose what to deploy. A default list should of course be included for bootstrapping.

I do not see this pluggable feature here. Or maybe you can add more docs explaining how it could be done? Would each org override download-notebooks.sh with its own implementation?
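One possible shape for such a pluggable list, offered only as a sketch: the script reads notebook URLs from a plain-text file, a default file ships in the image, and an organisation points to its own. The NOTEBOOK_LIST variable, the /notebook-list.txt path, the function names and the one-URL-per-line format are all assumptions and are not part of this PR.

```shell
# Hypothetical location of the list; an org could mount or point to its own.
NOTEBOOK_LIST=${NOTEBOOK_LIST:-/notebook-list.txt}

# read_notebook_list LIST_FILE
# Print one entry per line, skipping blank lines and '#' comments.
read_notebook_list() {
    # "|| true": an empty or missing list simply yields no entries.
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" || true
}

# download_listed_notebooks LIST_FILE DEST_DIR
# Fetch every listed URL into DEST_DIR with wget.
download_listed_notebooks() {
    list_file=$1
    dest=$2
    read_notebook_list "$list_file" | while IFS= read -r url; do
        wget -q -P "$dest" "$url"
    done
}

# Demo: parse a temporary list without downloading anything.
list=$(mktemp)
cat > "$list" <<'EOF'
# default notebooks shipped with the base image
https://example.org/notebooks/first.ipynb

https://example.org/notebooks/second.ipynb
EOF
read_notebook_list "$list"
rm -f "$list"
```

The base image would then keep a single script, and each deployment would only override the data file rather than the script itself.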


# Remove write permission on the tutorial-notebooks
chmod -R 555 $NOTEBOOK_DIR/*