
WIP: Proposed improvement/alternative to jupyter-sphinx #47

Closed
wants to merge 9 commits

Conversation

@chrisjsewell (Member) commented Feb 16, 2020

@choldgraf @akhmerov following discussion in #42 I've created a proposal here for how I would foresee the execution working.

To recap, the Sphinx build phases and Sphinx core events can be summarised as:

1. event.config-inited(app,config)
2. event.builder-inited(app)
3. event.env-get-outdated(app, env, added, changed, removed)
4. event.env-before-read-docs(app, env, docnames)

for docname in docnames:
    5.  event.env-purge-doc(app, env, docname)
    if doc changed and not removed:
        6. source-read(app, docname, source)
        7. run parser: text -> docutils.document
        8. apply transforms (by priority): docutils.document -> docutils.document
        9. event.doctree-read(app, doctree)

10. event.env-updated(app, env)
11. event.env-check-consistency(app, env)

for docname in docnames:
    12. apply post-transforms (by priority): docutils.document -> docutils.document
    13. event.doctree-resolved(app, doctree, docname)

14. call builder

15. event.build-finished(app, exception)

jupyter_sphinx.main contains the extension setup, and utilises this approach:

In (2), set up a JupyterDB that is stored within the doctrees build folder. This is imported from a separate package, jupyter_db, whose sole purpose is to maintain a database of all the kernels' code cells and their outputs. It doesn't know anything about sphinx, hence having it as a separate package. I have written this as an SQL database (with sqlalchemy to provide an OO layer on top), since I think that's best for fast look-up and asynchronous access. But you could essentially write this in any way you see fit, as long as it has the same interface (i.e. class methods).

During (3) any documents (and their related cells, kernels, outputs) that have been changed/removed are removed from JupyterDB.

During (7), directives save kernels and code cells to the JupyterDB, and the JupyterDB returns primary keys (pks) for those elements, which are saved on the docutils elements for later lookup.

During (10), the JupyterDB is passed to an entry-point defined function: run_execution. This function's job is to act 'in-place' on the JupyterDB to populate it with outputs.
Again, it doesn't know anything about sphinx, so it could be external to jupyter-sphinx.
To decide which kernels need re-running, it can check which kernels/cells no longer have outputs, or could even check the outputs' 'last modified' time.

During (12), the primary keys are used to access the outputs and insert them where necessary.
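For concreteness, the sketch below shows one way this wiring could look in a Sphinx setup() function. It is only a sketch: JupyterDB, run_execution, JupyterKernel and JupyterExec are the objects described in this PR (their imports are omitted), while the handler names, the InsertOutputs post-transform and the JupyterDB methods used here are illustrative assumptions, not the actual code.

# Hypothetical sketch of jupyter_sphinx.main wiring the steps above into
# Sphinx core events (the event names and app methods are real Sphinx API;
# JupyterDB, run_execution, the directives and InsertOutputs are assumed).
from sphinx.application import Sphinx

def setup(app: Sphinx):
    app.connect("builder-inited", init_db)           # step (2): create the JupyterDB
    app.connect("env-get-outdated", purge_outdated)  # step (3): drop changed/removed docs
    app.connect("env-updated", execute_cells)        # step (10): populate outputs
    app.add_directive("jupyter-kernel", JupyterKernel)
    app.add_directive("jupyter-exec", JupyterExec)   # step (7): directives record cells
    app.add_post_transform(InsertOutputs)            # step (12): insert outputs by pk
    return {"version": "0.1"}

def init_db(app):
    # store the database alongside the pickled doctrees (assumed constructor)
    app.jupyter_db = JupyterDB(path=app.doctreedir)

def purge_outdated(app, env, added, changed, removed):
    app.jupyter_db.remove_docs(set(changed) | set(removed))  # assumed method
    return []  # no extra documents need re-reading

def execute_cells(app, env):
    # the entry-point defined function acts 'in-place' on the database
    run_execution(app.jupyter_db)
    return []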

@chrisjsewell (Member Author)

FYI this is what the internal structure of the SQL database is:

[Image: SQL database structure diagram]
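For readers without the image, the sketch below is a hypothetical reconstruction of such a schema as sqlalchemy models. Only OrmMimeBundle and its pk/source/mimetype columns appear elsewhere in this thread; the other class, table and column names are guesses, not the actual structure.

# Hypothetical sqlalchemy models approximating the database structure
# (names other than OrmMimeBundle.pk/mimetype/source are illustrative).
from sqlalchemy import Column, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class OrmKernel(Base):
    __tablename__ = "kernel"
    pk = Column(Integer, primary_key=True)
    docname = Column(String, nullable=False)      # document the kernel belongs to
    kernel_name = Column(String, nullable=False)  # e.g. "python3"
    cells = relationship("OrmCodeCell", back_populates="kernel")

class OrmCodeCell(Base):
    __tablename__ = "codecell"
    pk = Column(Integer, primary_key=True)
    kernel_pk = Column(Integer, ForeignKey("kernel.pk"))
    source = Column(Text, nullable=False)
    kernel = relationship("OrmKernel", back_populates="cells")
    outputs = relationship("OrmMimeBundle", back_populates="cell")

class OrmMimeBundle(Base):
    __tablename__ = "mimebundle"
    pk = Column(Integer, primary_key=True)
    cell_pk = Column(Integer, ForeignKey("codecell.pk"))
    mimetype = Column(String, nullable=False)  # e.g. "html", "image/png"
    source = Column(Text, nullable=False)      # the serialised output data
    cell = relationship("OrmCodeCell", back_populates="outputs")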

@akhmerov (Contributor)

This is really cool! After I understand it better, I think it would be great if this could go to jupyter-sphinx, and fix a bunch of issues there!

@akhmerov (Contributor) commented Feb 16, 2020

@chrisjsewell a few questions:

  • In jupyter-sphinx we let the users download the notebooks directly. Isn't that a more natural way to store the inputs and outputs, as opposed to a database? At which step would it negatively influence performance?
  • Can you explain more about the JupyterViewNode: what is its purpose?

@choldgraf (Member)

I'll take a look at this as well, it sounds like quite an interesting proposal!

At first glance, my main concern is with maintainability and using an SQLite database. It may be the best technical solution, but we should weigh it against the likelihood that potential maintainers will have experience with SQLite. I can get a better feel for this after looking through the implementation code. To me the question is always "will contributing to this project be accessible to somebody with only volunteer time, and a PhD's worth of Python experience in a scientific field?"

@chrisjsewell (Member Author) commented Feb 16, 2020

  1. So if you look at JupyterDB, there is a get_notebook() method that will return an nbformat notebook.
  2. JupyterDB is written to be ‘backend agnostic’, so you should be able to switch out the SQL backend with anything you want.
  3. The 'natural' reason to use an SQL backend is that cells and kernels (and mimebundles) utilise integer primary keys to reference them, which are unique across all documents/notebooks (and documents and notebooks are not necessarily one and the same). At the 'post transform' stage you need to be able to return a cell's outputs by its pk (or a synonymous name string). Using a folder of notebooks as a backend, for example, you would have to work out how to cache file reads, so you don't have to re-read a notebook every time you want to access a cell. You would then have to keep an internal database of pks to cells/kernels.
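To make that concrete, here is a sketch of the kind of backend-agnostic interface implied above. Only get_notebook() is actually named in this thread; the other method names and signatures are assumptions about what such an interface might look like.

# Hypothetical JupyterDB interface; any backend (SQL or otherwise)
# providing these class methods would do.
from typing import List, Set

import nbformat

class JupyterDB:
    """Stores kernels, code cells and their outputs, keyed by integer pks."""

    def add_kernel(self, docname: str, kernel_name: str) -> int:
        """Register a kernel for a document and return its primary key."""
        raise NotImplementedError

    def add_code_cell(self, kernel_pk: int, source: str) -> int:
        """Register a code cell under a kernel and return its primary key."""
        raise NotImplementedError

    def get_outputs(self, cell_pk: int) -> List[dict]:
        """Return the stored output mime bundles for a cell, by its pk."""
        raise NotImplementedError

    def get_notebook(self, kernel_pk: int) -> nbformat.NotebookNode:
        """Reassemble one kernel's cells and outputs as an nbformat notebook."""
        raise NotImplementedError

    def remove_docs(self, docnames: Set[str]) -> None:
        """Drop documents and their kernels/cells/outputs (used on rebuilds)."""
        raise NotImplementedError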

@chrisjsewell (Member Author) commented Feb 16, 2020

To me the question is always "will contributing to this project be accessible to somebody with only volunteer time, and a PhD's worth of Python experience in a scientific field?"

Well I think sqlalchemy makes it relatively easy to work with SQL databases, and hides a lot of the complexity. I only learnt it in January and didn't find it too traumatising. But have a look at the code and see what you think.

@chrisjsewell (Member Author)

Can you explain more about the JupyterViewNode: what is its purpose?

The idea of this is that there is no inherent constraint that the outputs of a code cell must have one defined visualisation. Along the lines of the classic Model-View pattern for data visualisation, ideally the data and the views would be split. Granted, for simple use cases the user may want to reduce the syntax and only use the single jupyter-execute directive for both. But to have greater control of the output (e.g. controlling whether to render latex as math), this is a better approach. This also opens up possibilities like having inline role views of the output, and achieving things like #46.

@akhmerov (Contributor)

Thanks for the explanation. I agree that storing all outputs, and referring to them from the JupyterView node, is much better than what we have in jupyter-sphinx right now, at the very least to better support multiple writers.

How would the user control rendering of a single output? By including a separate .. jupyter-view?

I would like to suggest sticking to the constraint that the cell outputs only occur after the cell. This choice was the result of a lengthy discussion when we were working on jupyter-sphinx, and the main motivation was robustness. When working on documentation, it is usually easy to keep in your memory what happens within a single kernel, and to manipulate its outputs explicitly. Allowing outputs to be included elsewhere may lead to subtle errors, because the effects of changes become much more nonlocal. Similarly, I propose to make the inline role also execute a small cell, rather than refer to some code. I expect that inline executable code would mainly yield text or latex output, and would be relatively short. This approach (inline code rather than inline references) is also used by Rmd.

@akhmerov (Contributor) commented Feb 16, 2020

Also I'd like to provide more background about the jupyter-sphinx design: a principle I found really useful was to critically examine the amount of configurability we allow. For example, I see that whether errors are tolerated is controlled on a per-kernel basis in your implementation, whereas jupyter-sphinx only has a global setting or a per-cell setting. This is a result of considering the usage scenarios outlined in jupyter/jupyter-sphinx#73.

Basically, since adding features is much easier than removing them, we opted for implementing a minimal set that we consider definitely useful, and to wait for requests with implementing anything else.


# Document Elements
app.add_directive("jupyter-kernel", JupyterKernel)
app.add_directive("jupyter-exec", JupyterExec)
Contributor

Another question: is there a reason why you use jupyter-exec, and not jupyter-execute? I personally don't have a preference either way, but if possible I'd like to avoid changing conventions. Or is there another package that introduces a jupyter-exec directive?

Member Author

No, it was absolutely an arbitrary choice, and it can be changed to jupyter-execute 👍

@chrisjsewell (Member Author)

Also I'd like to provide more background about the jupyter-sphinx design

Thanks 👍, I will certainly take this into consideration. Actually, the only reason I added the :allow-errors option was because I wanted to check that errors were being correctly rendered in my test case: https://github.com/ExecutableBookProject/myst_parser/blob/jupyter/tests/jupyter_sphinx/sourcedirs/basic/contents.rst

@chrisjsewell (Member Author)

How would the user control rendering of a single output? By including a separate .. jupyter-view?

Yeh that's the idea:

.. jupyter-exec::
    :timeout: 20
    :show-code:

    print("x")
    a = 1
    a

.. jupyter-view::

Although as I mentioned before, for simple use cases this may not be ideal: having to put a .. jupyter-view:: after every code block. So maybe there are some basic options for rendering on the jupyter-execute, but if you want finer output control you use a jupyter-view. Something to think about...

@akhmerov (Contributor)

So maybe there are some basic options for rendering on the jupyter-execute, but if you want finer output control you use a jupyter-view.

jupyter-sphinx now defines some basic options like :hide-output:. Wouldn't enabling more fine-grained control over the representation be a worthy extension of the interface once there's a specific need for it?

@chrisjsewell (Member Author)

jupyter-sphinx now defines some basic options like :hide-output:. Wouldn't enabling more fine-grained control over the representation be a worthy extension of the interface once there's a specific need for it?

Well, you still at least need to make sure that the code/process is structured in such a way that it can be extended like this. Also, to my mind there are already some potential needs for it: adding things like captions and labels to outputs, allowing for complementary latex/PDF output, etc.

@akhmerov (Contributor)

Ah, but I have nothing against the view node or the code structure. Rather, I think the view directive could be omitted at first, since it is user-facing and complicates the interface. Instead, I imagine the view node would be inserted during execution.

@akhmerov (Contributor)

Also: what are output captions and labels? Is that similar to latex figure captions and labels? If that's the case, wouldn't it be better specified outside of the jupyter-specific directives?

@chrisjsewell (Member Author)

Ah, but I have nothing against the view node or the code structure. Rather, I think the view directive could be omitted at first, since it is user-facing and complicates the interface. Instead, I imagine the view node would be inserted during execution.

Yeh no problem, I wouldn't be completely against that lol

@chrisjsewell (Member Author)

Also: what are output captions and labels? Is that similar to latex figure captions and labels? If that's the case, wouldn't it be better specified outside of the jupyter-specific directives?

Well it would be like the figure directive: https://github.com/docutils-mirror/docutils/blob/e88c5fb08d5cdfa8b4ac1020dd6f7177778d5990/docutils/parsers/rst/directives/images.py#L101

You mean wrapping the jupyter-execute? That would make it very difficult to achieve round-trip conversion between the text representation and the notebook.

@akhmerov (Contributor)

I think a round-trip conversion would be nearly impossible to achieve regardless of that, due to the notebook format's limitations. Neither the v2 nor the v2.1 component listings in executablebooks/meta#21 include round-tripping. See also the discussions in #27. Come to think of it, even following this PR, a text representation with more than one kernel cannot round-trip to a notebook.

@chrisjsewell (Member Author) commented Feb 16, 2020

Well it should be round-trippable for a basic subset of the text format. Yes, you won't be able to do a 1:1 mapping for multiple kernels, but generally if you're just using them for separate sections you can split those sections into multiple documents (it would be nice in Sphinx if multiple source documents could end up as a single HTML page).

For single kernel pages particularly, it might be nice (and facilitate this round trip) if you could specify the kernel in the front matter. In RST it would look like this:

:kernel_name: name

In MyST:

---
kernel_name: name
---

This would be available (after parsing the whole document) in env.metadata[docname]["kernel_name"]. The only drawback is that this would necessitate back-populating code cells with the kernel name in an additional transform after the initial parse (step (8)).
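A minimal sketch of what that back-population could look like is below. It is hooked on the doctree-read event (step (9)) rather than a transform, since env.metadata is, at least for RST field lists, only filled once the docinfo has been collected; JupyterExecNode is an assumed node class, not the actual code.

# Hypothetical back-population of the kernel name from the front matter
# onto code-cell nodes that did not specify a kernel themselves.
def propagate_kernel_name(app, doctree):
    kernel_name = app.env.metadata[app.env.docname].get("kernel_name")
    if not kernel_name:
        return
    for node in doctree.traverse(JupyterExecNode):  # assumed node class
        node.setdefault("kernel_name", kernel_name)

# registered in setup() with:
# app.connect("doctree-read", propagate_kernel_name)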

@akhmerov (Contributor) commented Feb 16, 2020

Well it should be round-trippable for a basic subset of the text format.

Sure, but why would figure captions and labels belong to this basic subset?

My concern here is that this way a directive for code execution also takes responsibility for something unrelated to execution, which is also provided by another directive.

EDIT: kernel in the frontmatter sounds like a great idea.

@chrisjsewell (Member Author)

My concern here is that this way a directive for code execution also takes responsibility for something unrelated to execution, which is also provided by another directive.

Well surely this was my whole point with the jupyter-view directive: at present the jupyter-execute directive is not only responsible for code execution, it also takes responsibility for the output rendering.

This is not already provided by another directive; the figure directive only works if you have an image uri, which is not the way that jupyter-execute works.

@akhmerov (Contributor)

@chrisjsewell great point. Since caption itself may include arbitrary rst content, it also seems hard to implement a wrapper that would create figures.

Continuing with the usage of jupyter-view: do you envision that it'd have a reference to the jupyter-execute that generates the view? Or would it only be correct to include the view immediately after the execute?

Also: how would this interact with thebelab? Since its users aren't the code devs, they will be confused if the input code is separated in the document from the result it produces (also, thebelab only inserts the output right after the input). Should the code appear in the figure body after thebelab is activated? What should happen to the original code? The same questions apply when there are multiple references to the execute call, or when the execute and view are separated in the document. Currently in jupyter-sphinx, when thebelab is activated the code only moves if it was initially rendered below the output, and that's not too far.

Further questions: do we expect that the users will need to tweak the rendering of the mime bundle on a case by case basis, or would a global configuration suffice?

mime: OrmMimeBundle = session.query(OrmMimeBundle).filter_by(pk=pk).one()
source = mime.source
if mime.mimetype == "html":
    source = dedent(source)
Contributor

Is this always a safe operation?

Member Author

What do you mean safe? It’s not actually executing the source?

@akhmerov (Contributor) commented Feb 17, 2020

Sorry, that's a noob question: is dedenting HTML guaranteed not to change how it looks?
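For illustration (assuming the dedent here is textwrap.dedent), it is not guaranteed: whitespace inside a <pre> element is significant, so stripping common leading indentation can change how a fragment renders. A minimal example:

from textwrap import dedent

# The common 4-space indent is stripped from every line, so the content of
# the <pre> block goes from 8 to 4 leading spaces and renders differently.
html = (
    "    <pre>\n"
    "        x = 1\n"
    "    </pre>\n"
)
print(dedent(html))
# <pre>
#     x = 1
# </pre>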

@choldgraf (Member)

What's the status of this PR? Is it superseded by the existence of jupyter-cache?

@chrisjsewell (Member Author)

What's the status of this PR? Is it superseded by the existence of jupyter-cache?

Eventually it will be, but I'll leave it open for now since it has some code that differs from, or adds to, what is currently in jupyter-cache, which I may need at a later stage.

@chrisjsewell (Member Author)

Now being implemented in jupyter-cache

@chrisjsewell deleted the jupyter branch July 20, 2020 14:16