Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Myst-to-NB converter #82

Closed
wants to merge 7 commits into from
Closed

Add Myst-to-NB converter #82

wants to merge 7 commits into from

Conversation

chrisjsewell
Copy link
Member

@chrisjsewell chrisjsewell commented Mar 15, 2020

This PR introduces a function for converting text documents, written in the myst format, to a notebook. It could be used in the parser and in #55 to convert these documents to notebooks before executing/reading. See the test input/output for an example.

I would envisage this could be utilised by also adding myst to https://github.com/ExecutableBookProject/MyST-NB/blob/4c3f36b884086878dc0af5dfcfd735b98ebbbacf/myst_nb/parser.py#L27
This would denote files with the .myst extension as ones that should be parsed as notebooks, as opposed to normal .md files, then in https://github.com/ExecutableBookProject/MyST-NB/blob/4c3f36b884086878dc0af5dfcfd735b98ebbbacf/myst_nb/parser.py#L35
we would check the documents extension and apply this function, instead of nbf.reads, before proceeding with the rest of the process.

This addresses to a large extent #12

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 15, 2020

Or we could have a parser subclass:

import nbformat as nbf
from myst_nb.parser import NotebookParser
from myst_nb.convert import myst_to_nb


class MystNbParser(NotebookParser):
    supported = ("mystnb",)

    def parse(self, inputstring, document):
        ntbk = myst_to_nb(inputstring)
        return super().parse(nbf.writes(ntbk), document)

(it would be good here to allow NotebookParser.parse to optionally take NotebookNode directly, so we don't have to do a spurious round trip conversion)

@chrisjsewell
Copy link
Member Author

FYI there is an edge case that should probably be raised with sphinx: if you have multiple documents with the same name but different extensions; sphinx will not complain but will read the same one multiple times, e.g. if you had

a.md
a.rst
a.ipynb
a.myst

Only a.rst would be converted to a.html

@choldgraf
Copy link
Member

How do you see this usecase being used relative to writing in a .md file? Or more specifically - do you imagine users commonly writing three different kinds of documents: markdown, myst-notebook-markdown, and notebooks-with-myst-inside?

@chrisjsewell
Copy link
Member Author

Yes, I think these are three distinct document types. Although note, users could override this behaviour with source_suffix

source_suffix = {
    '.md': 'myst',
}

Which would make all .md be read "as notebooks"

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 15, 2020

Note this is kind of an alternative to using the jupyter-sphinx directives; the problem with them being:

(a) at present you end up with different formatting/css relative to NotebookParser, but mainly
(b) how can they be tied in to #55; where we want to identify all the notebooks to execute, before any parsing has been done?

with the approach outlined in this PR, that is relatively trivial, and means we don't have to have any code (other than this function) that is different for notebooks and text documents; they'll be treated identically.

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 15, 2020

Perhaps the default extension for these files could be .mystnb (or Nmd would be analogous to Rmd), and this is also what jupytext would convert to/from

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 15, 2020

In 5d50424 I have added the full round-trip conversion (well obviously sans code outputs). nb_to_myst could probably be added directly to jupytext, but myst_to_nb uses myst-parser so they probably wouldn't want that as a dependency and would want to re-write.

@choldgraf
Copy link
Member

Actually, this comment: mwouts/jupytext#447 (comment) makes me think that marc would be happy for us to make myst a requirement for the ntbk <--> myst markdown sync. When people want to use pandoc markdown he just calls pandoc, and myst-parser is a more lightweight dependency than pandoc.

Maybe @AakashGfude could give a look at @chrisjsewell 's code and see how tough it'd be to use that as the basis of a PR to jupytext?

@choldgraf
Copy link
Member

just to confirm - if the jupytext PR gets merged, then that will supercede this one right?

@choldgraf
Copy link
Member

(and, just another thought in case it seems relevant in the other PR, we could also keep the merging logic in this repository and have Jupytext use a function that myst_nb provides, or something like this...that might give us more iterative control over the code without needing to wait for upstream reviews in jupytext)

@chrisjsewell
Copy link
Member Author

just to confirm - if the jupytext PR gets merged, then that will supercede this one right?

It would yes

(and, just another thought in case it seems relevant in the other PR, we could also keep the merging logic in this repository and have Jupytext use a function that myst_nb provides, or something like this...that might give us more iterative control over the code without needing to wait for upstream reviews in jupytext)

The problem with this is that all the testing is in jupytext. So any changes to the function might break these tests, but you wouldn't know until you tried to use the function in jupytext

@choldgraf
Copy link
Member

@chrisjsewell makes sense - and to be clear I prefer that this code is an upstream contribution to jupytext, I'm just a bit worried if we need to make some rapid changes to the codebase etc.

@choldgraf
Copy link
Member

choldgraf commented Mar 16, 2020

also let's discuss the {execute} vs {nb-code} thing here, so it doesn't clutter the other PR (since Marc has no context for that conversation).

I added {execute} as a synonym for {jupyter-execute} only as a stopgap in this repository until we could get a proper "myst-based notebook format" ready to go. I think that asking users to write {execute} is more intuitive than asking them to write {nb-cell}, so my preference is to use that pattern, rather than to only use {execute} as a short-hand for {jupyter-execute}

@choldgraf
Copy link
Member

(happy to hop on a phone call to discuss this if that's a faster way to resolve this. I think that the "notebook format in myst" should match exactly what we have in the issue in this repository, unless the group decides otherwise, since that's what everybody agreed on the last time we discussed)

@chrisjsewell
Copy link
Member Author

I think that asking users to write {execute} is more intuitive than asking them to write {nb-code}

I wouldn't say I have any strong preference either way TBH.

Firstly though, I wanted to clarify the distinction, and note that if we do use execute, then we should remove https://github.com/ExecutableBookProject/MyST-NB/blob/4c3f36b884086878dc0af5dfcfd735b98ebbbacf/myst_nb/__init__.py#L99

Secondly, from a different perspective, if you're thinking of the file in terms of a notebook, isn't it more intuitive to use nb-code which implies directly that this is a notebook code cell?

Thirdly, {raw} is already a sphinx directive, so we would have to use a different name.

@chrisjsewell
Copy link
Member Author

Also, it does feel a bit of a "fudge" that these directives aren't technically directives (i.e. that can be directly used by sphinx), but I don't see any better trade-off, to achieve a full round-trip capability?

@choldgraf
Copy link
Member

choldgraf commented Mar 16, 2020

Firstly though, I wanted to clarify the distinction, and note that if we do use execute, then we should remove

Yep, that was my intention all along. I just added the add_directive bit to give us something to work with for development purposes. If we're not going to use {execute} as the way to represent code cells in MyST, then I'll probably remove this line anyway because I think it's confusing to have two words for the same directive split across two packages

Secondly, from a different perspective, if you're thinking of the file in terms of a notebook, isn't it more intuitive to use nb-code which implies directly that this is a notebook code cell?

Thirdly, {raw} is already a sphinx directive, so we would have to use a different name.

I think these are both reasonable points - but the point is that we need to decide as a group what the syntax should be, we can't choose different names just at the time of a PR without getting feedback from others. I think these are reasonable options, and you make a good point about clobbering things like {raw}, but especially for things like user-facing API of the core myst-markdown format, others should have an opportunity to weigh in. We need to treat #12 as the canonical source of truth for "representing a notebook in myst markdown"

Another option is to treat this as a stop-gap for the sake of making progress, let marc know that the names may change but we're happy to make a PR to do this if it happens, then update this issue to have further discussion about how the proposed format should change.

WDYT?

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 16, 2020

Yep, that was my intention all along. I just added the add_directive bit to give us something to work with for development purposes.

👍

but especially for things like user-facing API of the core myst-markdown format, others should have an opportunity to weigh in

Oh yeh absolutely. I have actually written these functions to take the directive names as variables:
https://github.com/ExecutableBookProject/jupytext/blob/8999b1d93ee4938ddbc36fd749c02639b5d9b588/jupytext/myst.py#L98-L100
with the original intention in this PR of leaving that decision until later.
But then for the jupytext PR obviously they have to be solidified

Another option is to treat this as a stop-gap for the sake of making progress, let marc know that the names may change but we're happy to make a PR to do this if it happens, then update this issue to have further discussion about how the proposed format should change.

yes I'd be happy with this route

@chrisjsewell
Copy link
Member Author

then update this issue to have further discussion about how the proposed format should change.

The good thing about having now made this initial jupytext implementation, is that we have a whole suite of output .mystnb/.mnb files, for everyone to inspect and get a feel for how they are looking:

https://github.com/ExecutableBookProject/jupytext/tree/mystnb/tests/notebooks/mirror/ipynb_to_myst

@choldgraf
Copy link
Member

@mmcky and @jstac would be great to hear your thoughts

@jstac
Copy link
Member

jstac commented Mar 16, 2020

The syntax here looks perfectly fine to my eyes: https://github.com/ExecutableBookProject/jupytext/blob/mystnb/tests/notebooks/mirror/ipynb_to_myst/Notebook%20with%20metadata%20and%20long%20cells.mystnb

I don't have strong opinions on this point.

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 17, 2020

authors shouldn't have to think in notebooks.

Mostly I am trying to not deviate as much as possible from RMarkdown's patterns of {r}and {python}...ideally users would be able to just use that exactly in order to define executable code, since it's such a common pattern already. I'm just not sure if that would be possible without colliding with other things in Sphinx, which is why I was thinking {nb:python} etc.

The common pattern here is jupyter notebooks, and how they run code, not however RMarkdown runs code; we, and jupyter-sphinx, and sphinxcontrib-jupyter are all literally converting the blocks to notebooks and running them. I think we should absolutely be saying to authors "think of the code blocks running in terms of notebooks", that is the simplest way for them to understand the execution: one kernel per document, that dictates the language/executable that all of the cells in that notebook are run with, with all cells running consecutively.

This is why {r} or {python} or {nb:python} makes no sense to me, because there is no guarantee that that is the language that the code cells will actually be run with. You are just going to have multiple names for exactly the same underlying directive. We would also have to have one of these directive names for at least every one of the languages supported by jupytext:

_JUPYTER_LANGUAGES = ['R', 'bash', 'sh', 'python', 'python2', 'python3', 'javascript', 'js', 'perl',
                      'html', 'latex', 'markdown', 'pypy', 'ruby', 'script', 'svg',
                      'matlab', 'octave', 'idl', 'robotframework', 'spark', 'sql']

Lastly by having {r}, {python}, ... you are suggesting to the user that they can do something like:

```{python}
python code
```
```{r}
r code
```

and have that execute, which is just not the case, unless we completely re-write myst-nb and scrap notebook round-tripping and jupyer-cache.

@jstac
Copy link
Member

jstac commented Mar 17, 2020

I agree. Jupyter notebooks is the right mental model for authors, one kernel per notebook. I imagine most people will start off authoring in ipynb anyway, and then gradually start to use text based source as they seek extra features.

@mmcky
Copy link
Member

mmcky commented Mar 17, 2020

Jupyter notebooks is the right mental model for authors, one kernel per notebook.

I agree that thinking in terms of notebooks is a good way to think about how the blocks are executed but I don't think the syntax should be tied to notebooks. It is ok if I get outvoted on this but I guess my point is when I am writing a myst text file I wouldn't think to naturally write {nb-code} for an executable code block as I am writing a document not a notebook.

@chrisjsewell
Copy link
Member Author

when I am writing a myst text file I wouldn't think to naturally write {nb-code}

But the converse is, if I had converted from a notebook then seeing {nb-code} would be pretty intuitive

@choldgraf
Copy link
Member

choldgraf commented Mar 17, 2020

Personally, I agree with you both in terms of {nb-code} or {execute} being better names than {python} or {r}. I'm just thinking in terms of the behemoth in this space, which is RMarkdown, and our opportunity to bring some of those folks onto this stack by making it as painless as possible.

I'd be fine with either {nb-code} or {execute}/{exec}....hmmm

what about {cell} or {code-cell}? Those are not quite "notebook-specific" but they convey the same idea. If you're coming from a notebook you'll understand what they mean, but if you aren't then they still make sense?

why do we not like {execute} anymore, again? a matter of {nb-code} being more intuitive for notebook users?

@chrisjsewell
Copy link
Member Author

. I'm just thinking in terms of the behemoth in this space, which is RMarkdown, and our opportunity to bring some of those folks onto this stack by making it as painless as possible.

Yeh I absolutely take your point, its just trying to get the balance right. Remember that, since jupytext now round-trips both RMarkdown and MyST-NB, we do have a ready-made conversion tool from RMarkdown 😄

{code-cell} would be fine by me, not cell because we would also have {raw-cell}

@mmcky
Copy link
Member

mmcky commented Mar 17, 2020

I think {code-cell} would make the markup syntax clearer.

@chrisjsewell
Copy link
Member Author

See mwouts/jupytext@d37f73a

@choldgraf
Copy link
Member

thanks for the quick discussion and turnaround on this one all, much appreciated!

it definitely does make me more responsive when I'm not supposed to leave the house 😆

@jstac
Copy link
Member

jstac commented Mar 17, 2020

I hear what you're saying about Rmarkdown, @choldgraf , although perhaps it's possible to overstate the number of would-be authors who are coming from that environment. I imagine that the great majority will be notebook users hoping to soup up the presentation of their results, while knowing little about Rmarkdown.

(Like me 😁)

I suspect most (but not all) Rmarkdown users are R users, who have good tools already in terms of EBP substitutes.

Anyways, +1 for {code-cell}.

@choldgraf
Copy link
Member

choldgraf commented Mar 17, 2020

Hey all - another decision-point has come up in @chrisjsewell's PR to jupytext:

Should the extension coming out of jupytext be

  1. .md
  2. .mnb
  3. .myst.md
  4. .myst

My preference as you can guess is for plain old .md, I think it will confuse casual users (AKA, most of our users) if there is anything other than .md in their file extensions. But, I am open to alternative points. Chris made the good point that having a myst-specific extension would make it much easier to know when a markdown file is "myst markdown" for things like parsing, translation, etc. That said, the more we make people do anything different from what they're doing, the more people will not try out this stack in the first place.

@jstac
Copy link
Member

jstac commented Mar 17, 2020

Looks like you got your preference: mwouts/jupytext@62351f4

@chrisjsewell
Copy link
Member Author

Looks like you got your preference: mwouts/jupytext@62351f4

Yeh I can't be bothered to argue about it lol. It does mean that I'll probably put the LSP on the back-burner though. Because, if there is no way to distinguish myst from md, then it doesn't really work.

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 17, 2020

Also these files won't work directly with the myst-parser extension (because it won't know what to do with code-cell and raw-cell). You need to add a separate parser to myst-nb that calls jupytext.myst.matches_mystnb, to work out if the markdown file should be read as pure markdown or converted to a notebook. (it should just need to be a small subclass of NotebookParser)

@AakashGfude will also need to use it in some manner in #55 to work out which markdown files to convert/execute/cache

@jstac
Copy link
Member

jstac commented Mar 17, 2020

Yeh I can't be bothered to argue about it lol. It does mean that I'll probably put the LSP on the back-burner though. Because, if there is no way to distinguish myst from md, then it doesn't really work.

In that case I can definitely be bothered to argue about it.

In terms of the user interface, this is hugely important. I can tell you this as someone who has been authoring in jupinx for a long time. The lack of editor support for the source format is absolutely the first thing I would change if I could.

I think user confusion between md and mnb is likely to be minor. In fact, as a user rather than a developer, I would strongly prefer mnb because this is a one-to-one mapping with a notebook. It is, from the very start, a markdown representation of a notebook. This is exactly the mental model that we want users to have in mind. So mnb is perfect.

@jstac
Copy link
Member

jstac commented Mar 17, 2020

Sorry I'm late to the party on this --- clearly I didn't understand all the implications --- but should we put a hold on mwouts/jupytext#458 until this is decided?

@choldgraf
Copy link
Member

Given what @chrisjsewell mentioned I think I'd be fine if we used .myst or .mnb. We can always revisit this later if it seems like a big problem. Would it be enough for our usecases?

I am also happy to hop on a call to discuss this if that's helpful to hash out the details.

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Mar 17, 2020

Sorry I'm late to the party on this --- clearly I didn't understand all the implications --- but should we put a hold on mwouts/jupytext#458 until this is decided?

No, that's probably not necessary. Note that, with the current PR state, if you have jupytext installed and this at the top of your file:

---
jupytext:
  text_representation:
    format_name: myst
    format_version: '0.7'
    jupytext_version: 1.4.0+dev
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

Then any of the following extensions will open directly as notebooks (and changes saved back to them):

image

@chrisjsewell
Copy link
Member Author

I think theres a way forward on this, so mwouts/jupytext#458 can definitely be merged. Then @choldgraf can decide if you want to work out how to make the default pairing (if going from ipynb -> text based) be to .md, rather than default now which is .mnb (see mwouts/jupytext#458 (comment))

@choldgraf
Copy link
Member

Let's keep it mnb for now, and see how things go.

@chrisjsewell
Copy link
Member Author

Let's keep it mnb for now, and see how things go.

Yeh cool cool, if you mention that to mwouts in the PR then, and hopefully we can get a jupytext release with this code in soon, to utilise with #82 (comment)

@choldgraf
Copy link
Member

done 👍

@chrisjsewell
Copy link
Member Author

Cheers 😁

@jstac
Copy link
Member

jstac commented Mar 17, 2020

Thanks all!!

@mmcky
Copy link
Member

mmcky commented Mar 17, 2020

Nice work! Can I just say it's great working in a team that is open to debate and dialogue on these issues. The project will be better off for it.

@chrisjsewell
Copy link
Member Author

MyST format is now merged into jupytext 🎉 https://github.com/mwouts/jupytext/releases/tag/1.4.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants