FEAT: Jupyter-cache integration #55

AakashGfude · 2020-03-09T07:38:57Z

Simple implementation of integration of jupyter-cache in myst-nb's parser.

Current implementation

gets a local db instance, else creates it (implementation in jupyter-cache)
stages the notebook
executes the notebook
merges the executed notebook with the original one and forwards it for parsing.

Apart from making the present implementation more robust, I feel a few functionalities need to be present in MyST-NB :-

Having the functionality to use the clear_cache function of jupyter_cache. It might be necessary if someone wants to re-execute the notebooks without actually changing the code of any of them
parallel execution of notebooks. Since, the parser function takes one notebook at a time, implementing parallel execution in jupyter-cache level won't be enough, and parallel execution cannot be implemented in parser I think, because it will create a new instance of multiprocessing library for each notebook.

My suggestion for both of the above functionalities is to implement them in a sphinx builder.

Please provide your suggestions @choldgraf @chrisjsewell @mmcky, I will meanwhile continue the implementation to handle as many cases as possible.

AakashGfude · 2020-03-09T12:08:19Z

Also, we should have a proper traceback for debugging execution errors, primarily a report file to store all the tracebacks.
At present, we have an error message indicating, the failed notebook.

chrisjsewell · 2020-03-09T13:07:01Z

Also, we should have a proper traceback for debugging execution errors, primarily a report file to store all the tracebacks.
At present, we have an error message indicating, the failed notebook.

Note that all tracebacks (and eventually failed notebooks) are stored in the cache, see: https://github.com/ExecutableBookProject/jupyter-cache#staging-notebooks-for-execution

AakashGfude · 2020-03-09T13:16:11Z

Also, we should have a proper traceback for debugging execution errors, primarily a report file to store all the tracebacks.
At present, we have an error message indicating, the failed notebook.

Note that all tracebacks (and eventually failed notebooks) are stored in the cache, see: https://github.com/ExecutableBookProject/jupyter-cache#staging-notebooks-for-execution

Great, thanks. So, I should recover the cache information using the pk of the notebook. I will test this out tomorrow morning.

chrisjsewell · 2020-03-09T13:26:39Z

Great, thanks. So, I should recover the cache information using the pk of the notebook. I will test this out tomorrow morning.

Yes, note that it is stored on the staged record, not a cache record, because obviously since the notebook failed, it is not cached and there is no cache record created

chrisjsewell · 2020-03-09T13:39:19Z

parallel execution of notebooks. Since, the parser function takes one notebook at a time, implementing parallel execution in jupyter-cache level won't be enough, and parallel execution cannot be implemented in parser I think, because it will create a new instance of multiprocessing library for each notebook.

I mentioned this in an issue before (can't remember which one!), but I feel that this should be handled around step (3) of the build process below (by adding an event callback), whereby you stage all notebooks, then call execute. Then, in the parse step (7), you will just attempt to retrieve a cached notebook or, on KeyError, the trackback from the staged record.

1. event.config-inited(app,config)
2. event.builder-inited(app)
3. event.env-get-outdated(app, env, added, changed, removed)
4. event.env-before-read-docs(app, env, docnames)

for docname in docnames:
    5.  event.env-purge-doc(app, env, docname)
    if doc changed and not removed:
        6. source-read(app, docname, source)
        7. run parser: text -> docutils.document
        8. apply transforms (by priority): docutils.document -> docutils.document
        9. event.doctree-read(app, doctree)

10. event.env-updated(app, env)
11. event.env-check-consistency(app, env)

for docname in docnames:
    12. apply post-transforms (by priority): docutils.document -> docutils.document
    13. event.doctree-resolved(app, doctree, docname)

14. call builder

15. event.build-finished(app, exception)

chrisjsewell · 2020-03-09T13:45:41Z

FYI, before this is merged, make sure to add some tests, using the notebooks in https://github.com/ExecutableBookProject/MyST-NB/tree/master/tests/notebooks (for an example of a full sphinx integration test, see https://github.com/ExecutableBookProject/MyST-Parser/blob/master/tests/test_sphinx/test_sphinx_builds.py)

choldgraf · 2020-03-09T15:02:05Z

Nice! Thanks for this nice start! a few quick thoughts from me:

I may make some pushes to this branch today to add some extra functionality
I'm a bit hesitant for MyST-NB to do much in the way of fancy execution. I think that as a first step, we should focus on "simple" execution (e.g. add two configuration variables: one for "execute" one for "cache". If the first is triggered, all ipynb files will be executed at build. If the second is triggered, all ipynb files will be executed and cached, and subsequent builds will use the cache. I think we should discuss whether this tool should be used for things like large-scale parallel execution, or whether that should be instead built into jupyter-cache and manually run from the CLI for larger jobs.
Agreed people need to be able to clear the cache...could we assume that users can just delete the .jupyter_cache folder in that case? Is there a way to do this outside of a CLI? (since myst-nb has no CLI...perhaps we can just recommend the jupyter-cache CLI once it solidifies)

Another thing to remember - users of Jupyter Book should need to know nothing about MyST-NB, we could also make some of these improvements (e.g. clearing cache and parallel execution) in the jupyter-book CLI. The MyST-NB repo might be used directly by software developers in their Sphinx builds, so we can make more advanced assumptions about their comfort with coding in general.

choldgraf · 2020-03-09T19:55:07Z

I made a PR to your branch @AakashGfude - I had started some work on this earlier, and that PR is to help give an idea where I was going, and give you some code to work with if it's helpful 👍

AakashGfude · 2020-03-10T00:22:20Z

I may make some pushes to this branch today to add some extra functionality
Thank you so much. I will have a look at the PR.

Agreed people need to be able to clear the cache...could we assume that users can just delete the .jupyter_cache folder in that case? Is there a way to do this outside of a CLI? (since myst-nb has no CLI...perhaps we can just recommend the jupyter-cache CLI once it solidifies)

I assumed the user will have no knowledge about the internals of caching and DB being used. And imagined that the command to clear cache would be done in the background, when something like make clean or make clean-cache is run And, also since .jupyter_cache folder is not visible in the file explorer. But I guess, most if not all the users might be using a text editor anyway so visibility might not be an issue. I just wonder how intuitive it will be to delete the folder.

Another thing to remember - users of Jupyter Book should need to know nothing about MyST-NB, we could also make some of these improvements (e.g. clearing cache and parallel execution) in the jupyter-book CLI. The MyST-NB repo might be used directly by software developers in their Sphinx builds, so we can make more advanced assumptions about their comfort with coding in general.

This is a nice idea. We will need to decide which layers should have what functionalities. I am not entirely clear about the functionalities jupyter-book is supposed to have. missed the discussions. will get a clearer idea.

I'm a bit hesitant for MyST-NB to do much in the way of fancy execution. I think that as a first step, we should focus on "simple" execution (e.g. add two configuration variables: one for "execute" one for "cache". If the first is triggered, all ipynb files will be executed at build. If the second is triggered, all ipynb files will be executed and cached, and subsequent builds will use the cache. I think we should discuss whether this tool should be used for things like large-scale parallel execution, or whether that should be instead built into jupyter-cache and manually run from the CLI for larger jobs.

Like you said above, users of Jupyter Book need not know about MyST-NB and internals. So we should avoid situations in which they might have to use the internal complicated tools?

AakashGfude · 2020-03-10T01:21:30Z

I mentioned this in an issue before (can't remember which one!), but I feel that this should be handled around step (3) of the build process below (by adding an event callback), whereby you stage all notebooks, then call execute. Then, in the parse step (7), you will just attempt to retrieve a cached notebook or, on KeyError, the trackback from the staged record.

I had also imagined in the same lines. @choldgraf what are your thoughts on this?

choldgraf · 2020-03-10T02:30:49Z

yeah that sounds reasonable to me - it'll be hard to tell without a prototype to play around with though, I'm happy to try something out when it's ready

updating cacheing code

…T-NB into jupyter-cahe-integration

AakashGfude · 2020-03-12T12:13:20Z

@choldgraf @chrisjsewell have reorganized the code in the last few commits as a test, according to the discussion with @chrisjsewell above regarding parallel execution, where the notebooks are executed and cached during the env-get-outdated event stage. And in the parser, the cached outputs and original notebooks are merged.

Feel free to scrap this idea, or embrace it.

AakashGfude · 2020-03-12T12:21:14Z

Also @choldgraf , I got this attribute error for genutils after merging with master. Did the merge tamper anything?

PS:- I will remove this comment later to prevent cluttering.

chrisjsewell · 2020-03-12T15:29:42Z

Sorry @AakashGfude with 688e421 you'll need to refactor again 😬

myst_nb/cache.py

setup.py

…T-NB into jupyter-cahe-integration

choldgraf · 2020-03-25T03:32:01Z

Looking better! The cache worked for me this time 🎉

A few quick thoughts:

We probably don't need a message for both each execution and completion, maybe just one to avoid cluttering the messages
When Jupyter Cache isn't set I'm getting an error: WARNING: The config value jupyter_cache' has type str', defaults to bool'.`
Maybe we can collapse the number of configuration keys down by one, by making jupyter_execute_notebooks allow two values: "force" or "auto"

What other kind of functionality should I try to do to break it? :-)

AakashGfude · 2020-03-25T04:18:38Z

We probably don't need a message for both each execution and completion, maybe just one to avoid cluttering the messages

You are right, Ideally, we should have something like:-

the screenshot is from jupinx . I thought of working on that after this PR is merged, as it might require discussions. What do you think?

Maybe we can collapse the number of configuration keys down by one, by making jupyter_execute_notebooks allow two values: "force" or "auto"

I agree there are too many variables. this one sounds good. will do a commit for it

What other kind of functionality should I try to do to break it? :-)
One I think is to check if the reports are getting generated. By having one notebook failing execution with an error.

When Jupyter Cache isn't set I'm getting an error: WARNING: The config value jupyter_cache' has type str', defaults to bool'.`

can you send me the traceback for this? Not able to get it in my system

choldgraf · 2020-03-25T04:48:22Z

Re: the last point, it wasn't an error, just a warning, so there was no trace back. Sphinx will throw a warning if the value of a config key is of a different type from the default value

AakashGfude · 2020-03-25T04:53:31Z

Re: the last point, it wasn't an error, just a warning, so there was no trace back. Sphinx will throw a warning if the value of a config key is of a different type from the default value

This is weird, I am not able to reproduce and not able to see in the code where the error is. Maybe will need your help in that, can you grep that variable and see if its set to a string somewhere in your system?

Also, about this point

Maybe we can collapse the number of configuration keys down by one, by making > jupyter_execute_notebooks allow two values: "force" or "auto"

It won't handle the case where users don't want to execute at all. We can have three values for this variable. The third one being "off" ? I am not good in naming things

choldgraf · 2020-03-25T14:23:03Z

Yeah - "off" seems fine if we don't want the sphinx warning about types...otherwise None or False would work...

AakashGfude · 2020-03-26T01:03:39Z

@choldgraf Have done the necessary changes, and jupyter_execute_notebooks can now have "force", "auto", "off" values. I have also updated the docs.

choldgraf · 2020-03-26T16:21:47Z

Agree re: the jupinx loading bars, I think that style would be best!

A few more thoughts:

When I have jupyter_execute_notebooks set to "force", but jupyter_cache to True, then it seems to keep using the cache. This is a bit confusing as I assume it would force re-execution each time.
What about we collapse the config a bit further. jupyter_execute_notebooks take 4 possible options: auto (default), force (always execute), cache (equiv to jupyter_cache = True), and off. Then, jupyter_cache is always used just if a person wants to point to a different cache location.
Execution exclusion doesn't seem to work for me, I put execution_excludepatterns = ['*glue*'], and docs/use/glue.ipynb was still executed. Does it not support glob-like patterns? (if not, can we support that?)
I noticed that the "Executing notebook" log message doesn't show up if we don't use a jupyter cache - is there a reason not to tell people it's executing if we aren't cacheing? I think we could say "executing" if jupyter_cache = False, and "executing and cacehing" if jupyter_cache = True.

AakashGfude · 2020-03-26T22:41:06Z

What about we collapse the config a bit further. jupyter_execute_notebooks take 4 possible options: auto (default), force (always execute), cache (equiv to jupyter_cache = True), and off. Then, jupyter_cache is always used just if a person wants to point to a different cache location.

So, in this case, auto will use the cache as well right. What will be the difference then with cache? And the person will use the jupyter_cache variable only when he wants to point to a different location?

Execution exclusion doesn't seem to work for me, I put execution_excludepatterns = ['*glue*'], and docs/use/glue.ipynb was still executed. Does it not support glob-like patterns? (if not, can we support that?)

Yes, at present it was just doing a substring check of execution_excludepatterns with the path of files. I will have a look at how to support glob-like patterns.

I noticed that the "Executing notebook" log message doesn't show up if we don't use a jupyter cache - is there a reason not to tell people it's executing if we aren't cacheing? I think we could say "executing" if jupyter_cache = False, and "executing and cacehing" if jupyter_cache = True.

I saw some logs about execution from nbclient as well. But, I just realized that it shows the log only for markdown files. And it executes markdown files as well. I will supress that by configuring it. and give logs as mentioned

…onf variables

AakashGfude · 2020-03-27T08:22:16Z

@choldgraf have updated the functionalities, feel free to test it and let me know your thoughts

choldgraf · 2020-03-27T16:29:58Z

I just gave it another shot, and it worked really nicely 👍 - I tried all the different config options, and it worked really well, great job! I pushed a small extra bit to the documentation and also cleared out the outputs in our ipynb files in the docs so that we can actually put this to the test on a daily bases :-)

One thing that I foresee potentially confusing people in the future: we should provide some kinds of guidance for how to "clear the cache". E.g., do they delete _build? What if they only want to delete the pages to trigger a re-build, but don't want to remove the cache artifacts? That kinda thing...I don't know that it's super actionable right now though, because I'm not sure what the right path forward is, maybe something to keep an eye on as people use it.

I think that this one is nearly ready to go - we should treat this functionality as "late-alpha, early beta" for a while, so there's time for users to give the API a shot and provide feedback, squash bugs, etc. But I think it's ready to go from a user functionality and docs standpoint

@chrisjsewell what do you think?

tests/test_execute.py

chrisjsewell · 2020-03-27T17:15:33Z

Looks good to me 👍

…fude/MyST-NB into jupyter-cahe-integration

AakashGfude · 2020-03-27T22:36:12Z

One thing that I foresee potentially confusing people in the future: we should provide some kinds of guidance for how to "clear the cache". E.g., do they delete _build? What if they only want to delete the pages to trigger a re-build, but don't want to remove the cache artifacts? That kinda thing...I don't know that it's super actionable right now though, because I'm not sure what the right path forward is, maybe something to keep an eye on as people use it.

What I think we can have is have targeted clean methods like make clean to clear everything. make clean-html to clear html outputs while preserving the cache. make clear-cache to clear .jupyter_cache only(We will have to take custom jupyter_cache path provided into account).

deleting the pages from the cache to trigger a re-build without removing the artifacts is something we need to think about. And also where the functionality should be implemented like for example the above make commands can be a part of the cli Makefile.

choldgraf · 2020-03-28T05:25:21Z

fantastic! I'll plan on giving this a merge when I'm back at my computer (prob tomorrow morning)

chrisjsewell · 2020-03-28T05:46:36Z

You could then bump the version to 0.5.0a1 and create a release: https://github.com/ExecutableBookProject/MyST-NB/releases, and it will automatically be published to PyPi as a pre-release during the triggered travis-CI

choldgraf · 2020-03-28T17:47:06Z

alright let's 🚢 it!

choldgraf · 2020-03-28T18:01:05Z

OK I merged, bumped the version on master (#103) , and also added a release: https://github.com/ExecutableBookProject/MyST-NB/releases/tag/0.5.0a1. I bumped master before making the official release on github, so I triggered a travis re-build because I'm assuming that needs to happen once a release already exists

chrisjsewell · 2020-03-28T18:35:19Z

I bumped master before making the official release on github, so I triggered a travis re-build because I'm assuming that needs to happen once a release already exists

Not sure if I understand this, but when you make a release that triggers Travis (or Gihub actions as we may move to soon), so you don't need to manually trigger. But anyway its all sorted 👍

FYI, if ppl didn't know, you now have the pre-release on https://pypi.org/project/myst-nb/#history, whereby this will not be installed doing pip install myst-nb, it will only be installed if you specifically specify pip install myst-nb==0.5.0a1

* added functionality of jupyter-cache * writing tests to cover the all/most cases * adding execution and cacheing documentation * moved fixtures to conftest

AakashGfude added 2 commits March 9, 2020 17:09

added functionality of jupyter-cache

536950c

documenting the execution functions

6fb9daf

AakashGfude changed the title ~~FEAT: Jupyter cahe integration~~ FEAT: Jupyter-cache integration Mar 9, 2020

AakashGfude mentioned this pull request Mar 9, 2020

First-pass thoughts on the CLI etc executablebooks/jupyter-cache#16

Closed

handling errored and excepted notebooks

08a9a63

updating cacheing code

10115b0

AakashGfude and others added 4 commits March 10, 2020 13:34

Merge pull request #1 from ExecutableBookProject/holdgraf_updates

35e36b0

updating cacheing code

first draft to run execution during env-get-outdated event

6d2fb27

keeping pk_list None by default to handle empty case in executor

63a3237

Merge branch 'master' of https://github.com/ExecutableBookProject/MyS…

2e6b3e1

…T-NB into jupyter-cahe-integration

choldgraf mentioned this pull request Mar 12, 2020

adding simple execution #62

Closed

chrisjsewell mentioned this pull request Mar 12, 2020

Error accessing notebook.metadata.language_info.file_extension #63

Closed

chrisjsewell reviewed Mar 13, 2020

View reviewed changes

myst_nb/cache.py Outdated Show resolved Hide resolved

chrisjsewell reviewed Mar 13, 2020

View reviewed changes

setup.py Outdated Show resolved Hide resolved

AakashGfude added 2 commits March 15, 2020 12:19

Merge branch 'master' of https://github.com/ExecutableBookProject/MyS…

04b6e1f

…T-NB into jupyter-cahe-integration

chris S's review commit

7f2d6ca

AakashGfude added 2 commits March 26, 2020 11:52

jupyter_execute_notebooks change and path for jupyter_cache allowed

0422cec

added a test for jupyter_cache path

3544ba9

choldgraf mentioned this pull request Mar 26, 2020

Discussion for beta release executablebooks/meta#48

Closed

AakashGfude added 2 commits March 27, 2020 17:32

refactored code and tests to make nbclient more primary, and adjust c…

5cb5791

…onf variables

updating documentation

ff11ec1

small docs improvements and clearing notebook outputs

7f588c6

chrisjsewell reviewed Mar 27, 2020

View reviewed changes

tests/test_execute.py Outdated Show resolved Hide resolved

AakashGfude added 2 commits March 28, 2020 09:27

moved fixtures to conftest

db6ffff

Merge branch 'jupyter-cahe-integration' of https://github.com/AakashG…

5538173

…fude/MyST-NB into jupyter-cahe-integration

choldgraf merged commit f186cc5 into executablebooks:master Mar 28, 2020

This was referenced Apr 1, 2020

Add the ability to execute notebooks #48

Closed

How would this use jupyter-cache? #32

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FEAT: Jupyter-cache integration #55

FEAT: Jupyter-cache integration #55

AakashGfude commented Mar 9, 2020 •

edited

AakashGfude commented Mar 9, 2020

chrisjsewell commented Mar 9, 2020

AakashGfude commented Mar 9, 2020

chrisjsewell commented Mar 9, 2020

chrisjsewell commented Mar 9, 2020 •

edited

chrisjsewell commented Mar 9, 2020

choldgraf commented Mar 9, 2020 •

edited

choldgraf commented Mar 9, 2020

AakashGfude commented Mar 10, 2020 •

edited

AakashGfude commented Mar 10, 2020

choldgraf commented Mar 10, 2020

AakashGfude commented Mar 12, 2020 •

edited

AakashGfude commented Mar 12, 2020

chrisjsewell commented Mar 12, 2020

choldgraf commented Mar 25, 2020

AakashGfude commented Mar 25, 2020 •

edited

choldgraf commented Mar 25, 2020

AakashGfude commented Mar 25, 2020 •

edited

choldgraf commented Mar 25, 2020

AakashGfude commented Mar 26, 2020

choldgraf commented Mar 26, 2020

AakashGfude commented Mar 26, 2020

AakashGfude commented Mar 27, 2020

choldgraf commented Mar 27, 2020

chrisjsewell commented Mar 27, 2020

AakashGfude commented Mar 27, 2020 •

edited

choldgraf commented Mar 28, 2020

chrisjsewell commented Mar 28, 2020

choldgraf commented Mar 28, 2020

choldgraf commented Mar 28, 2020 •

edited

chrisjsewell commented Mar 28, 2020

FEAT: Jupyter-cache integration #55

FEAT: Jupyter-cache integration #55

Conversation

AakashGfude commented Mar 9, 2020 • edited

AakashGfude commented Mar 9, 2020

chrisjsewell commented Mar 9, 2020

AakashGfude commented Mar 9, 2020

chrisjsewell commented Mar 9, 2020

chrisjsewell commented Mar 9, 2020 • edited

chrisjsewell commented Mar 9, 2020

choldgraf commented Mar 9, 2020 • edited

choldgraf commented Mar 9, 2020

AakashGfude commented Mar 10, 2020 • edited

AakashGfude commented Mar 10, 2020

choldgraf commented Mar 10, 2020

AakashGfude commented Mar 12, 2020 • edited

AakashGfude commented Mar 12, 2020

chrisjsewell commented Mar 12, 2020

choldgraf commented Mar 25, 2020

AakashGfude commented Mar 25, 2020 • edited

choldgraf commented Mar 25, 2020

AakashGfude commented Mar 25, 2020 • edited

choldgraf commented Mar 25, 2020

AakashGfude commented Mar 26, 2020

choldgraf commented Mar 26, 2020

AakashGfude commented Mar 26, 2020

AakashGfude commented Mar 27, 2020

choldgraf commented Mar 27, 2020

chrisjsewell commented Mar 27, 2020

AakashGfude commented Mar 27, 2020 • edited

choldgraf commented Mar 28, 2020

chrisjsewell commented Mar 28, 2020

choldgraf commented Mar 28, 2020

choldgraf commented Mar 28, 2020 • edited

chrisjsewell commented Mar 28, 2020

AakashGfude commented Mar 9, 2020 •

edited

chrisjsewell commented Mar 9, 2020 •

edited

choldgraf commented Mar 9, 2020 •

edited

AakashGfude commented Mar 10, 2020 •

edited

AakashGfude commented Mar 12, 2020 •

edited

AakashGfude commented Mar 25, 2020 •

edited

AakashGfude commented Mar 25, 2020 •

edited

AakashGfude commented Mar 27, 2020 •

edited

choldgraf commented Mar 28, 2020 •

edited