Custom tutorial notebooks #2

Closed · wants to merge 4 commits

Conversation

@ChaamC ChaamC (Collaborator) commented Mar 30, 2021

This PR adds support for customizing the tutorial notebooks included in the image.

  • A new script has been added (download-notebooks.sh). For now, it only downloads a single notebook; any other notebook we want to include can be added to this script with the appropriate download command (see the sketch after this list).

  • We will need to remove the volume mounted in birdhouse-deploy for the tutorial notebooks, since the behavior changed here: the notebooks are now downloaded directly into a folder in the Docker image.

  • We download the notebooks in the docker entrypoint in order to always have the latest notebook versions and to avoid having to rebuild the image just to update a notebook. This choice implies that the jenkins user has to run the download script, since it is the one running the docker entrypoint. This explains some of the manipulation of folder/file permissions: the jenkins user requires write access on the notebooks folder for the download script. A difference from before is that the entire folder /notebook_dir/tutorial-notebooks used to be read-only for jenkins, but now it is writable.
    I reset the files to read-only in the script, but users will still be able to create notebooks in the tutorial-notebooks folder when running a JupyterLab instance. I think that is okay, though, since this folder is no longer a volume mounted in birdhouse but a folder isolated inside the Docker container.
    The only caveat is that users could lose their work if they decide to create a notebook in that folder and work from there.
    Let me know if you have a suggestion to improve this, but I feel it is a restriction imposed by the choice of downloading the notebooks in the entrypoint.
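
For reference, here is a minimal sketch of the shape of download-notebooks.sh (the wget/tar line is the one reviewed further down; the NOTEBOOK_DIR value and the exact chmod call are illustrative assumptions):

#!/bin/sh
set -e
# folder inside the image that holds the tutorial notebooks (assumed path)
NOTEBOOK_DIR="/notebook_dir/tutorial-notebooks"

# Download notebooks required for the base image and add them to the notebook directory
wget -O - https://github.com/Ouranosinc/pavics-sdi/archive/master.tar.gz | \
    tar -xz --wildcards -C "$NOTEBOOK_DIR" --strip=4 "*/docs/source/notebooks/jupyter_extensions.ipynb"

# reset the downloaded notebooks to read-only so users cannot edit them in place
chmod a-w "$NOTEBOOK_DIR"/*.ipynb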

This PR is related to the pavics-jupyter-images PR: crim-ca/pavics-jupyter-images#3

download-notebooks.sh (outdated, resolved)
@matprov matprov (Collaborator) left a comment

lgtm, really looking forward to seeing this in action on a test instance!

@ChaamC ChaamC requested a review from tlvu March 31, 2021 12:16
@tlvu tlvu (Collaborator) commented Apr 1, 2021

Feedback on the PR description so far; feedback on the code will come later.

We will need to remove the volume mounted in birdhouse-deploy for the tutorial notebooks, since the behavior changed here: the notebooks are now downloaded directly into a folder in the Docker image.

The old way was meant to avoid repeatedly downloading the same list of notebooks for the many users on Jupyter.

Some notebooks might also (though this is clearly not preferred) need to come with data, so that data would also have to be downloaded repeatedly.

Just to explain the rationale of the previous approach; I am not against this new way either. Maybe we can offer both options?

We download the notebooks in the docker entrypoint in order to always have the latest notebook versions and to avoid having to rebuild the image just to update a notebook.

So the notebooks are "refreshed" only on user login? If a user keeps his session open for many days, which is the case right now, he won't receive the new version of a notebook?

With the current notebook autodeploy approach, users always get the latest notebooks, whether or not they log out and back in.

This choice implies that the jenkins user has to run the download script, since it is the one running the docker entrypoint. This explains some of the manipulation of folder/file permissions: the jenkins user requires write access on the notebooks folder for the download script. A difference from before is that the entire folder /notebook_dir/tutorial-notebooks used to be read-only for jenkins, but now it is writable.

Another reason we made the tutorial-notebooks folder read-only is to avoid fooling users into thinking their changes will be preserved, because they will be overridden the next time we perform the notebook autodeploy.

A user might find a notebook useful, start modifying it, and assume the change is persistent. With this new approach, will users lose their modifications on logout/login or on server restart?

I reset the files to read-only in the script, but users will still be able to create notebooks in the tutorial-notebooks folder when running a JupyterLab instance. I think that is okay, though, since this folder is no longer a volume mounted in birdhouse but a folder isolated inside the Docker container.

Oh, so the tutorial notebooks are still read-only for the users? That solves my concern above.

The only caveat is that users could lose their work if they decide to create a notebook in that folder and work from there.

Another way to lose their work is to restart their server. I have not looked at the code yet, but currently only the writable-workspace folder is persisted on disk and survives the personal Jupyter container being destroyed and restarted.

Let me know if you have a suggestion to improve this, but I feel it is a restriction imposed by the choice of downloading the notebooks in the entrypoint.

Totally agree with this one. At the time I implemented this, my generic deploy-data script was not yet available, so a "pluggable/customizable" list of tutorial notebooks was not possible.

Ouranos also needs some "custom" notebooks deployed only on our Prod instance, not on all instances, and I started using the new deploy-data for that; see bird-house/birdhouse-deploy-ouranos#8.

@tlvu tlvu (Collaborator) left a comment

I need a little bit more time to think about this. I see where you're going with this. You need a standalone mechanism to download the notebooks.

There is the question of persistence, since users are now allowed to create new notebooks under the tutorial-notebooks folder, and the question of auto-update of the notebooks as well.

But the bottom line is that each org might have different requirements. Persistence and notebook auto-update might be less important for you than for us, and that's okay.

I think we just have to ensure this new way is backward compatible with the previous way of deploying notebooks. You can use this way, and we can keep our old way, since we'd like to keep persistence and notebook auto-update.

I will take a closer look next week. It's an interesting idea. It could be used in the case where we want to give additional tutorial notebooks only to some specific users (not all users). I might be over-engineering here. It's getting late; more next week.

Dockerfile Outdated
# start notebook in conda environment to have working jupyter extensions
CMD ["conda", "run", "-n", "birdy", "/usr/local/bin/start-notebook.sh"]
# download tutorial-notebooks and start notebook in conda environment to have working jupyter extensions
CMD ["/bin/bash", "-c", "/download-notebooks.sh && conda run -n birdy /usr/local/bin/start-notebook.sh --SingleUserNotebookApp.default_url=/lab"]
Collaborator left a comment

Ok, so tutorial notebooks are downloaded only at the moment the personal Jupyter server starts. So users will only get notebook updates when they destroy and restart their personal server.

So we lose the auto notebook deploy feature, but in return it's standalone, with no dependency on an external process to fetch the notebooks. There are pros and cons; I probably see where you're going with this.

Why --SingleUserNotebookApp.default_url=/lab? Isn't that already the default when launched by JupyterHub? Is this Jupyter server meant to be launched standalone, outside of the JupyterHub in PAVICS? I am okay with this, it doesn't hurt, I am just curious why it was needed.

Collaborator Author left a comment

Just to clarify --SingleUserNotebookApp.default_url=/lab, although this code might change given the latest feedback: when I changed the CMD line in the Dockerfile to include the download-notebooks.sh script call, jupyter-notebook was started instead of jupyter-lab when I launched the image via the birdhouse-deploy stack.

I am not 100% clear on why just changing that line would affect using notebook vs lab in birdhouse, and I had trouble fixing it. Maybe it is just the way the CMD is written in the Dockerfile; I know the behavior differs depending on whether the CMD line is written in the exec form (like the original code) or in the shell form.

Anyway, adding the SingleUserNotebookApp.default_url argument fixed the problem by making sure the right Jupyter frontend was used (lab instead of notebook).
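
For illustration, here are the two CMD styles involved; this is only a sketch of the general difference, not a confirmed explanation of the notebook-vs-lab behavior:

# Exec form: Docker runs the listed executable directly, with exactly these arguments
CMD ["conda", "run", "-n", "birdy", "/usr/local/bin/start-notebook.sh"]

# Shell-wrapped form: the whole command line is handed to bash -c, so the inner
# start-notebook.sh only receives the arguments written inside that single string
CMD ["/bin/bash", "-c", "/download-notebooks.sh && conda run -n birdy /usr/local/bin/start-notebook.sh"]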

Collaborator left a comment

So we lose the auto notebook deploy feature, but in return it's standalone, with no dependency on an external process to fetch the notebooks. There are pros and cons; I probably see where you're going with this.

@tlvu Actually this solution was made without considering autodeploy, since we thought fetching only at image startup would be enough. I'll admit that finding a compromise between standalone images and autodeploy would be wonderful.

After a great discussion with @dbyrns, the idea of triggering a scheduled task that calls the actual custom image with a "FETCH_NOTEBOOKS" switch came up. We would keep the custom image's specific dependencies inside the Docker image, while making sure the autodeploy updates the notebooks, say at midnight. Then, caching the notebooks on a shared volume would make sure we avoid useless requests. The "FETCH_NOTEBOOKS" switch would ensure that only one fetch is done, instead of having 20 containers fetching the same data at the same time.
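
Roughly, the switch in the entrypoint could look like this (the FETCH_NOTEBOOKS name comes from the discussion above; the paths and start command are assumptions based on the current Dockerfile):

#!/bin/bash
# only fetch the notebooks when explicitly asked to, e.g. by the scheduled task,
# so a single container refreshes the shared cache instead of every user container
if [ "${FETCH_NOTEBOOKS}" = "true" ]; then
    /download-notebooks.sh
fi
# then start the personal Jupyter server as usual
exec conda run -n birdy /usr/local/bin/start-notebook.sh "$@"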

Collaborator left a comment

After a great discussion with @dbyrns, the idea of triggering a scheduled task that calls the actual custom image with a "FETCH_NOTEBOOKS" switch came up. We would keep the custom image's specific dependencies inside the Docker image, while making sure the autodeploy updates the notebooks, say at midnight. Then, caching the notebooks on a shared volume would make sure we avoid useless requests. The "FETCH_NOTEBOOKS" switch would ensure that only one fetch is done, instead of having 20 containers fetching the same data at the same time.

@matprov @dbyrns

I like this. So we remove the download-notebooks.sh script call from the Dockerfile CMD but leave the script in the image and call it from a cronjob instead. Yeah, that would work and preserve "the burden of mapping notebooks with the right image is moved to the image curator" from #2 (comment), which I agree with as well.

The cronjob will write to $JUPYTERHUB_USER_DATA_DIR/tutorial-notebooks/[eo,nlp] (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/env.local.example#L232) on disk, which will be volume-mounted read-only to /notebook_dir/tutorial-notebooks inside the Jupyter environment and available to the end user.

Basically, instead of using our cronjob (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/components/scheduler/config.yml.template#L13-L27), which hardcodes our notebooks, you roll your own cronjob for your own notebooks. That's perfect for now. Sorry, my current notebook autodeploy is not pluggable, so you cannot easily add your own notebooks.

If possible, try to use the deploy-data (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/deployment/deploy-data) mechanism, which was meant to "get some files from some git repos and put it somewhere". I was tired of repeatedly re-writing the same thing, so I created that generic script. It uses git pull instead of direct wget or curl, so it is extremely bandwidth-efficient for big downloads (its first use case was to autodeploy xclim and Raven test data to Thredds: https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/env.local.example#L167-L189), basically getting some test data files from some git repos and putting them somewhere for Thredds to see. That is exactly the same use case as getting some notebooks from some git repos and putting them somewhere for Jupyter to see.

Also, you will have to volume-mount either the eo or the nlp sub-folder under $JUPYTERHUB_USER_DATA_DIR/tutorial-notebooks depending on the image, so you'll have to modify https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/config/jupyterhub/jupyterhub_config.py.template#L52-L56 for this. I hope you can find a way so that this association is also handled by "the image curator".
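
To make the cronjob idea concrete, a hypothetical crontab entry could look like this (the schedule, image name and exact mount are assumptions; the host path follows the env.local.example convention above):

# refresh the eo notebooks daily into the host folder that jupyterhub_config.py
# volume-mounts read-only into each user's /notebook_dir/tutorial-notebooks
0 0 * * * root docker run --rm \
    -v $JUPYTERHUB_USER_DATA_DIR/tutorial-notebooks/eo:/notebook_dir/tutorial-notebooks \
    your-org/pavics-jupyter-eo:latest /download-notebooks.sh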

Collaborator left a comment

@tlvu Actually this solution was made without considering autodeploy, since we thought fetching only at image startup would be enough. I'll admit that finding a compromise between standalone images and autodeploy would be wonderful.

@matprov @dbyrns

By the way, I think we can have both standalone and autodeploy together without one breaking the other.

The FETCH_NOTEBOOKS flag, if enabled on docker run of the image, will perform the fetch and then launch the personal Jupyter server! So for our PAVICS stack, we will use the FETCH_NOTEBOOKS flag only in the cronjob, but other users running the image standalone outside the PAVICS stack can enable the flag and get the notebooks, without autodeploy of course.
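
For the standalone case, usage could be as simple as the following (image name hypothetical):

# fetch the tutorial notebooks at startup, then launch the personal Jupyter server
docker run --rm -p 8888:8888 -e FETCH_NOTEBOOKS=true your-org/pavics-jupyter-eo:latest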

Cheap for a quick demo! I am thinking of https://mybinder.org/, and I did configure Binder for our Jupyter env + notebooks; check this out: https://mybinder.org/v2/gh/Ouranosinc/PAVICS-e2e-workflow-tests/master (it is mentioned in the README as well: https://github.com/Ouranosinc/PAVICS-e2e-workflow-tests#launch-jupyter-notebook-server-using-binder).

Collaborator left a comment

By the way, I think we can have both standalone and autodeploy together without one breaking the other.

@ChaamC I think you probably want to ignore that post from me for now and only focus on getting the auto notebook deploy working. I was just getting ahead of myself, as usual. We can incrementally improve it later; just get the basics working first.


# Download notebooks required for the base image and add them to the notebook directory
wget -O - https://github.com/Ouranosinc/pavics-sdi/archive/master.tar.gz | \
tar -xz --wildcards -C $NOTEBOOK_DIR --strip=4 "*/docs/source/notebooks/jupyter_extensions.ipynb"
Collaborator left a comment

The biggest missing feature of my previous implementation is that there is no pluggable/customizable list of notebooks, so that each org's PAVICS deployment can choose what to deploy. A default list should be included for bootstrapping, of course.

I do not see this pluggable feature here either. Or maybe you can add more docs explaining how it could be done? Does each org override download-notebooks.sh with its own implementation?

@tlvu tlvu (Collaborator) commented Apr 6, 2021

Just thought of something, have you tried this solution in a PAVICS stack?

I have a feeling it will not work if you use /notebook_dir/tutorial-notebooks, since that dir will be volume-mounted read-only (https://github.com/bird-house/birdhouse-deploy/blob/20d4f430005d6eb5e5680f05114eeaf936f7c38b/birdhouse/config/jupyterhub/jupyterhub_config.py.template#L52-L56), so the download-notebooks.sh script will not be able to do its job.

Same issue with the matching crim-ca/pavics-jupyter-images#3: the notebooks added directly into the image will also be "overridden" by the volume-mount.

I understand the need to have an independent notebook provisioning mechanism, so maybe you should put it into a different folder instead (other than /notebook_dir/tutorial-notebooks).

@dbyrns dbyrns commented Apr 6, 2021

Hi @tlvu, I will try to explain the rationale behind this PR in a bit more detail.

  1. Custom images are a new concept, and each of them controls which packages are available.
  2. We need a way to provide some notebooks per image, since they will target specific dependencies.
  3. We also want to keep the ability for organizations to choose which notebooks to offer.

The solution we found makes sense because the image contributor knows the available packages, which solves 2). The new complexity of 3) would require a list of notebooks per image in the configuration. So we thought that, because organizations can already choose which images to offer, they no longer have to deal with the notebooks: they can offer an image or not. If they want to add more notebooks, they can offer their own image that inherits from a base image and adds those notebooks. We really tried to make that process easy, so that one can build an image on top of another, keeping its notebooks and adding some more (see the sketch below). This way, the responsibility for the offering stays in the instance configuration, but the burden of mapping notebooks to the right image is moved to the image curator.
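
As an illustration of that inheritance, an org-specific image could be as small as the following (image and folder names are hypothetical):

# build on a base custom image: its packages, notebooks and download script are inherited
FROM some-registry/pavics-jupyter-base:latest
# add the organization's own notebooks on top of the inherited ones
COPY org-notebooks/ /notebook_dir/tutorial-notebooks/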

What we overlooked is the autodeploy feature; I was not aware of it, my bad. I agree that it is a great feature and we absolutely want to keep it. Your concern about downloading data for each user is valid too. I will continue to think about it with my team.

@dbyrns dbyrns left a comment

Let's try to properly handle the auto-update situation before merging this PR.

@matprov matprov (Collaborator) left a comment

Related to autodeploy changes

@ChaamC ChaamC (Collaborator, Author) commented Apr 19, 2021

The PR has been updated with a new solution. Please refer to the other PR for more details.

@matprov @dbyrns @tlvu

@ChaamC ChaamC requested review from matprov, dbyrns and tlvu April 19, 2021 19:50
@dbyrns dbyrns left a comment

I think you should just create a new PR for jq/yq, since the description and most of the commits are now completely out of scope.
I would add a link to bird-house/birdhouse-deploy#150 saying that it solves what was attempted here, and then close this one.

@ChaamC ChaamC closed this Apr 20, 2021
@ChaamC ChaamC mentioned this pull request Apr 20, 2021