Custom tutorial notebooks #2

Closed · wants to merge 4 commits

Conversation

@ChaamC ChaamC (Collaborator) commented Mar 30, 2021

This PR adds support for customizing the tutorial notebooks included in the image.

  • A new script has been added (download-notebooks.sh). For now, it only downloads a single notebook; any other notebook we want to include can be added to this script with the appropriate download command (see the sketch after this list).

  • We will need to remove the volume mounted in birdhouse-deploy for the tutorial notebooks, since the behavior changed here: the notebooks are now downloaded directly into a folder in the Docker image.

  • We download the notebooks in the docker entrypoint in order to always have the latest notebook versions and to avoid having to rebuild the image just to update a notebook. This choice implies that the jenkins user has to run the download script, since it is the one running the docker entrypoint. This explains some of the manipulation of folder/file permissions: the jenkins user requires write access on the notebooks folder for the download script. A difference from before is that the entire folder /notebook_dir/tutorial-notebooks used to be read-only for jenkins, but now it is writable.
    I reset the files to read-only in the script, but users will still be able to create notebooks in the tutorial-notebooks folder when running a JupyterLab instance. I think that is okay, though, since this folder is no longer a volume mounted in birdhouse but a folder isolated inside the Docker container.
    The only caveat is that users could lose their work if they decide to create a notebook in that folder and work from there.
    Let me know if you have a suggestion to improve this, but I feel it is a restriction imposed by the choice of downloading the notebooks in the entrypoint.
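
For reference, here is a minimal sketch of the shape of download-notebooks.sh (the wget/tar line is the one reviewed further down; the NOTEBOOK_DIR value and the exact chmod call are illustrative assumptions):

#!/bin/sh
set -e
# folder inside the image that holds the tutorial notebooks (assumed path)
NOTEBOOK_DIR="/notebook_dir/tutorial-notebooks"

# Download notebooks required for the base image and add them to the notebook directory
wget -O - https://github.com/Ouranosinc/pavics-sdi/archive/master.tar.gz | \
    tar -xz --wildcards -C "$NOTEBOOK_DIR" --strip=4 "*/docs/source/notebooks/jupyter_extensions.ipynb"

# reset the downloaded notebooks to read-only so users cannot edit them in place
chmod a-w "$NOTEBOOK_DIR"/*.ipynb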

This PR is related to the pavics-jupyter-images PR: crim-ca/pavics-jupyter-images#3

download-notebooks.sh (outdated, resolved)
@matprov matprov (Collaborator) left a comment

lgtm, really looking forward to seeing this in action on a test instance!

@ChaamC ChaamC requested a review from tlvu March 31, 2021 12:16
@tlvu tlvu (Collaborator) commented Apr 1, 2021

Feedback on the PR description so far; feedback on the code will come later.

We will need to remove the volume mounted in birdhouse-deploy for the tutorial notebooks, since the behavior changed here: the notebooks are now downloaded directly into a folder in the Docker image.

The old way was meant to avoid repeatedly downloading the same list of notebooks for the many users on Jupyter.

Some notebooks might also (though this is clearly not preferred) need to come with data, so that data would also have to be downloaded repeatedly.

Just to explain the rationale of the previous approach; I am not against this new way either. Maybe we can offer both options?

We download the notebooks in the docker entrypoint in order to always have the latest notebook versions and to avoid having to rebuild the image just to update a notebook.

So the notebooks are "refreshed" only on user login? If a user keeps his session open for many days, which is the case right now, he won't receive the new version of a notebook?

With the current notebook autodeploy approach, users always get the latest notebooks, whether or not they log out and back in.

This choice implies that the jenkins user has to run the download script, since it is the one running the docker entrypoint. This explains some of the manipulation of folder/file permissions: the jenkins user requires write access on the notebooks folder for the download script. A difference from before is that the entire folder /notebook_dir/tutorial-notebooks used to be read-only for jenkins, but now it is writable.

Another reason we made the tutorial-notebooks folder read-only is to avoid fooling users into thinking their changes will be preserved, because they will be overridden the next time we perform the notebook autodeploy.

A user might find a notebook useful, start modifying it, and assume the change is persistent. With this new approach, will users lose their modifications on logout/login or on server restart?

I reset the files to read-only in the script, but users will still be able to create notebooks in the tutorial-notebooks folder when running a JupyterLab instance. I think that is okay, though, since this folder is no longer a volume mounted in birdhouse but a folder isolated inside the Docker container.

Oh, so the tutorial notebooks are still read-only for the users? That solves my concern above.

The only caveat is that users could lose their work if they decide to create a notebook in that folder and work from there.

Another way to lose their work is to restart their server. I have not looked at the code yet, but currently only the writable-workspace folder is persisted on disk and survives the personal Jupyter container being destroyed and restarted.

Let me know if you have a suggestion to improve this, but I feel it is a restriction imposed by the choice of downloading the notebooks in the entrypoint.

Totally agree with this one. At the time I implemented this, my generic deploy-data script was not yet available, so a "pluggable/customizable" list of tutorial notebooks was not possible.

Ouranos also needs some "custom" notebooks deployed only on our Prod instance, not on all instances, and I started using the new deploy-data for that; see bird-house/birdhouse-deploy-ouranos#8.

@tlvu tlvu (Collaborator) left a comment

I need a little bit more time to think about this. I see where you're going with this. You need a standalone mechanism to download the notebooks.

There is the question of persistence, since users are now allowed to create new notebooks under the tutorial-notebooks folder, and the question of auto-update of the notebooks as well.

But the bottom line is that each org might have different requirements. Persistence and notebook auto-update might be less important for you than for us, and that's okay.

I think we just have to ensure this new way is backward compatible with the previous way of deploying notebooks. You can use this way, and we can keep our old way, since we'd like to keep persistence and notebook auto-update.

I will take a closer look next week. It's an interesting idea. It could be used in the case where we want to give additional tutorial notebooks only to some specific users (not all users). I might be over-engineering here. It's getting late; more next week.

Dockerfile Outdated
# start notebook in conda environment to have working jupyter extensions
CMD ["conda", "run", "-n", "birdy", "/usr/local/bin/start-notebook.sh"]
# download tutorial-notebooks and start notebook in conda environment to have working jupyter extensions
CMD ["/bin/bash", "-c", "/download-notebooks.sh && conda run -n birdy /usr/local/bin/start-notebook.sh --SingleUserNotebookApp.default_url=/lab"]
Collaborator left a comment

Ok, so tutorial notebooks are downloaded only at the moment the personal Jupyter server starts. So users will only get notebook updates when they destroy and restart their personal server.

So we lose the auto notebook deploy feature, but in return it's standalone, with no dependency on an external process to fetch the notebooks. There are pros and cons; I probably see where you're going with this.

Why --SingleUserNotebookApp.default_url=/lab? Isn't that already the default when launched by JupyterHub? Is this Jupyter server meant to be launched standalone, outside of the JupyterHub in PAVICS? I am okay with this, it doesn't hurt, I am just curious why it was needed.

Collaborator Author left a comment

Just to clarify --SingleUserNotebookApp.default_url=/lab, although this code might change given the latest feedback: when I changed the CMD line in the Dockerfile to include the download-notebooks.sh script call, jupyter-notebook was started instead of jupyter-lab when I launched the image via the birdhouse-deploy stack.

I am not 100% clear on why just changing that line would affect using notebook vs lab in birdhouse, and I had trouble fixing it. Maybe it is just the way the CMD is written in the Dockerfile; I know the behavior differs depending on whether the CMD line is written in the exec form (like the original code) or in the shell form.

Anyway, adding the SingleUserNotebookApp.default_url argument fixed the problem by making sure the right Jupyter frontend was used (lab instead of notebook).
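
For illustration, here are the two CMD styles involved; this is only a sketch of the general difference, not a confirmed explanation of the notebook-vs-lab behavior:

# Exec form: Docker runs the listed executable directly, with exactly these arguments
CMD ["conda", "run", "-n", "birdy", "/usr/local/bin/start-notebook.sh"]

# Shell-wrapped form: the whole command line is handed to bash -c, so the inner
# start-notebook.sh only receives the arguments written inside that single string
CMD ["/bin/bash", "-c", "/download-notebooks.sh && conda run -n birdy /usr/local/bin/start-notebook.sh"]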

Collaborator left a comment

So we lose the auto notebook deploy feature, but in return it's standalone, with no dependency on an external process to fetch the notebooks. There are pros and cons; I probably see where you're going with this.

@tlvu Actually this solution was made without considering autodeploy, since we thought fetching only at image startup would be enough. I'll admit that finding a compromise between standalone images and autodeploy would be wonderful.

After a great discussion with @dbyrns, the idea of triggering a scheduled task that calls the actual custom image with a "FETCH_NOTEBOOKS" switch came up. We would keep the custom image's specific dependencies inside the Docker image, while making sure the autodeploy updates the notebooks, say at midnight. Then, caching the notebooks on a shared volume would make sure we avoid useless requests. The "FETCH_NOTEBOOKS" switch would ensure that only one fetch is done, instead of having 20 containers fetching the same data at the same time.
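
Roughly, the switch in the entrypoint could look like this (the FETCH_NOTEBOOKS name comes from the discussion above; the paths and start command are assumptions based on the current Dockerfile):

#!/bin/bash
# only fetch the notebooks when explicitly asked to, e.g. by the scheduled task,
# so a single container refreshes the shared cache instead of every user container
if [ "${FETCH_NOTEBOOKS}" = "true" ]; then
    /download-notebooks.sh
fi
# then start the personal Jupyter server as usual
exec conda run -n birdy /usr/local/bin/start-notebook.sh "$@"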

Collaborator left a comment

After a great discussion with @dbyrns, the idea of triggering a scheduled task that calls the actual custom image with a "FETCH_NOTEBOOKS" switch came up. We would keep the custom image's specific dependencies inside the Docker image, while making sure the autodeploy updates the notebooks, say at midnight. Then, caching the notebooks on a shared volume would make sure we avoid useless requests. The "FETCH_NOTEBOOKS" switch would ensure that only one fetch is done, instead of having 20 containers fetching the same data at the same time.

@matprov @dbyrns

I like this. So we remove the download-notebooks.sh script call from the Dockerfile CMD but leave the script in the image and call it from a cronjob instead. Yeah, that would work and preserve "the burden of mapping notebooks with the right image is moved to the image curator" from #2 (comment), which I agree with as well.

The cronjob will write to $JUPYTERHUB_USER_DATA_DIR/tutorial-notebooks/[eo,nlp] (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/env.local.example#L232) on disk, which will be volume-mounted read-only to /notebook_dir/tutorial-notebooks inside the Jupyter environment and available to the end user.

Basically, instead of using our cronjob (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/components/scheduler/config.yml.template#L13-L27), which hardcodes our notebooks, you roll your own cronjob for your own notebooks. That's perfect for now. Sorry, my current notebook autodeploy is not pluggable, so you cannot easily add your own notebooks.

If possible, try to use the deploy-data (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/deployment/deploy-data) mechanism, which was meant to "get some files from some git repos and put it somewhere". I was tired of repeatedly re-writing the same thing, so I created that generic script. It uses git pull instead of direct wget or curl, so it is extremely bandwidth-efficient for big downloads (its first use case was to autodeploy xclim and Raven test data to Thredds: https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/env.local.example#L167-L189), basically getting some test data files from some git repos and putting them somewhere for Thredds to see. That is exactly the same use case as getting some notebooks from some git repos and putting them somewhere for Jupyter to see.

Also, you will have to volume-mount either the eo or the nlp sub-folder under $JUPYTERHUB_USER_DATA_DIR/tutorial-notebooks depending on the image, so you'll have to modify https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/config/jupyterhub/jupyterhub_config.py.template#L52-L56 for this. I hope you can find a way so that this association is also handled by "the image curator".
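
To make the cronjob idea concrete, a hypothetical crontab entry could look like this (the schedule, image name and exact mount are assumptions; the host path follows the env.local.example convention above):

# refresh the eo notebooks daily into the host folder that jupyterhub_config.py
# volume-mounts read-only into each user's /notebook_dir/tutorial-notebooks
0 0 * * * root docker run --rm \
    -v $JUPYTERHUB_USER_DATA_DIR/tutorial-notebooks/eo:/notebook_dir/tutorial-notebooks \
    your-org/pavics-jupyter-eo:latest /download-notebooks.sh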

Collaborator left a comment

@tlvu Actually this solution was made without considering autodeploy, since we thought fetching only at image startup would be enough. I'll admit that finding a compromise between standalone images and autodeploy would be wonderful.

@matprov @dbyrns

By the way, I think we can have both standalone and autodeploy together without one breaking the other.

The FETCH_NOTEBOOKS flag, if enabled on docker run of the image, will perform the fetch and then launch the personal Jupyter server! So for our PAVICS stack, we will use the FETCH_NOTEBOOKS flag only in the cronjob, but other users running the image standalone outside the PAVICS stack can enable the flag and get the notebooks, without autodeploy of course.
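
For the standalone case, usage could be as simple as the following (image name hypothetical):

# fetch the tutorial notebooks at startup, then launch the personal Jupyter server
docker run --rm -p 8888:8888 -e FETCH_NOTEBOOKS=true your-org/pavics-jupyter-eo:latest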

Cheap for a quick demo! I am thinking of https://mybinder.org/, and I did configure Binder for our Jupyter env + notebooks; check this out: https://mybinder.org/v2/gh/Ouranosinc/PAVICS-e2e-workflow-tests/master (it is mentioned in the README as well: https://github.com/Ouranosinc/PAVICS-e2e-workflow-tests#launch-jupyter-notebook-server-using-binder).

Collaborator left a comment

By the way, I think we can have both standalone and autodeploy together without one breaking the other.

@ChaamC I think you probably want to ignore that post from me for now and only focus on getting the auto notebook deploy working. I was just getting ahead of myself, as usual. We can incrementally improve it later; just get the basics working first.


# Download notebooks required for the base image and add them to the notebook directory
wget -O - https://github.com/Ouranosinc/pavics-sdi/archive/master.tar.gz | \
tar -xz --wildcards -C $NOTEBOOK_DIR --strip=4 "*/docs/source/notebooks/jupyter_extensions.ipynb"
Collaborator left a comment

The biggest missing feature of my previous implementation is that there is no pluggable/customizable list of notebooks, so that each org's PAVICS deployment can choose what to deploy. A default list should be included for bootstrapping, of course.

I do not see this pluggable feature here either. Or maybe you can add more docs explaining how it could be done? Does each org override download-notebooks.sh with its own implementation?

@tlvu tlvu (Collaborator) commented Apr 6, 2021

Just thought of something, have you tried this solution in a PAVICS stack?

I have a feeling it will not work if you use /notebook_dir/tutorial-notebooks, since that dir will be volume-mounted read-only (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/config/jupyterhub/jupyterhub_config.py.template#L52-L56), so the download-notebooks.sh script will not be able to do its job.

Same issue with the matching crim-ca/pavics-jupyter-images#3: the notebooks added directly into the image will also be "overridden" by the volume-mount.

I understand the need to have an independent notebook provisioning mechanism, so maybe you should put it into a different folder instead (other than /notebook_dir/tutorial-notebooks).

@dbyrns dbyrns commented Apr 6, 2021

Hi @tlvu, I will try to explain the rationale behind this PR in a bit more detail.

  1. Custom images are a new concept, and each of them controls which packages are available.
  2. We need a way to provide some notebooks per image, since they will target specific dependencies.
  3. We also want to keep the ability for organizations to choose which notebooks to offer.

The solution we found makes sense because the image contributor knows the available packages, which solves 2). The new complexity of 3) would require a list of notebooks per image in the configuration. So we thought that, because organizations can already choose which images to offer, they no longer have to deal with the notebooks: they can offer an image or not. If they want to add more notebooks, they can offer their own image that inherits from a base image and adds those notebooks. We really tried to make that process easy, so that one can build an image on top of another, keeping its notebooks and adding some more (see the sketch below). This way, the responsibility for the offering stays in the instance configuration, but the burden of mapping notebooks to the right image is moved to the image curator.
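
As an illustration of that inheritance, an org-specific image could be as small as the following (image and folder names are hypothetical):

# build on a base custom image: its packages, notebooks and download script are inherited
FROM some-registry/pavics-jupyter-base:latest
# add the organization's own notebooks on top of the inherited ones
COPY org-notebooks/ /notebook_dir/tutorial-notebooks/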

What we overlooked is the autodeploy feature; I was not aware of it, my bad. I agree that it is a great feature and we absolutely want to keep it. Your concern about downloading data for each user is valid too. I will continue to think about it with my team.

@dbyrns dbyrns left a comment

Let's try to properly handle the auto-update situation before merging this PR.

@matprov matprov (Collaborator) left a comment

Related to autodeploy changes

@ChaamC ChaamC (Collaborator, Author) commented Apr 19, 2021

The PR has been updated with a new solution. Please refer to the other PR for more details.

@matprov @dbyrns @tlvu

@ChaamC ChaamC requested review from matprov, dbyrns and tlvu April 19, 2021 19:50
@dbyrns dbyrns left a comment

I think you should just create a new PR for jq/yq, since the description and most of the commits are now completely out of scope.
I would add a link to bird-house/birdhouse-deploy#150 saying that it solves what was attempted here, and then close this one.

@ChaamC ChaamC closed this Apr 20, 2021
@ChaamC ChaamC mentioned this pull request Apr 20, 2021