# Jupyter Notebooks to markdown and html with Pandoc

For several months now, the universal [document converter pandoc](https://pandoc.org/) has
had [support for Jupyter Notebooks](https://pandoc.org/MANUAL.html#creating-jupyter-notebooks-with-pandoc). This means that with a single call,
you can convert `.ipynb` files to any of the output formats that Pandoc
supports (and vice-versa!). This post is a quick exploration of what this
looks like.

**Note that for this post, we're using Pandoc version 2.7.3**. Also, some of what's below is hard
to interpret without actually opening the files that are created by Pandoc. For the sake
of this blog post, I'm going to stick with the raw text output here, though you can expand the
outputs if you wish, I recommend copy/pasting some of these commands on your own if you'd like to try.

In [1]:
from subprocess import run as sbrun
from subprocess import PIPE, CalledProcessError
from pathlib import Path
from IPython.display import HTML, Markdown

# A helper function to capture errors and outputs
def run(cmd, *args, **kwargs):
    try:
        out = sbrun(cmd.split(), stderr=PIPE, stdout=PIPE, check=True, *args, **kwargs)
        out = out.stdout.decode()
        if len(out) > 1:
            print(out)
    except CalledProcessError as e:
        print(e.stderr.decode())

# Our base notebook

First off, let's take a look at our base notebook. We'll convert this document
to both Markdown and HTML using Pandoc.

The notebook will be fairly minimal
in order to make it easier to inspect its contents. It has a collection
of markdown with mixed content, as well as code cells with various outputs.

{download}`See this link <pandoc_ipynb/inputs/notebooks.ipynb>` for the notebook we'll use.

# `.ipynb` to markdown

Let's try converting this notebook to markdown. This should preserve as much
information as possible about the input Jupyter notebook. This should include
all markdown cells, cell metadata, and outputs with code cells.

## A few pandoc options

Here are a few pandoc options that are relevant to our use-case:

* `--resource-path` defines the path where Pandoc will look for resources that are linked in the notebook.
  This allows us to discover images etc that are in a different folder from where we are invocing `pandoc`.
* `--extract-media` is a path where images and other media will be *extracted* at conversion time. Any links
  to images etc should point to files at this path in the output format.
* `-s` (or `--standalone`) tells Pandoc that the output should be a "standalone" format. This does different
  things depending on the output, such as adding a header if converting to HTML.
* `-o` the output file, and implicitly the output file type (e.g., markdown)
* `-t` the *type* of output file if we want to override the default (e.g., GitHub-flavored markdown vs. Pandoc markdown).

## Converting to GitHub-flavored markdown

Let's start by converting to GitHub-flavored markdown. By not specifying an output file
with `-o`, we'll cause Pandoc to print the result to the screen, which we'll display here.

In [10]:
# ipynb -> gfmd
run(f'pandoc pandoc_ipynb/inputs/notebooks.ipynb --resource-path=inputs -s --extract-media=outputs/images -t gfm')

<div class="cell markdown">

# Here's a demo notebook

This is a demo notebook to play around with the pandoc ipynb support

## Markdown

As it is markdown, you can embed images, HTML, etc into your posts\!

![](outputs/images/ca17e56d65946db885db7f8f50a9605a6a94e6a7.jpg)

Here's one \(inline_{math}\) and

\[
math^{blocks}
\]

``` python
def my_functino():
    mystring = "you can also include python cells"
    return mystring
```

</div>

<div class="cell markdown" data-tags="[&quot;heresatag&quot;]">

# Code cells

## Matplotlib output with metadata

The below code cell has some metadata attached to it. It also outputs a
figure. Both should be included in the output format.

</div>

<div class="cell code" data-execution_count="7" data-slideshow="{&quot;slide_type&quot;:&quot;subslide&quot;}" data-tags="[&quot;mytag&quot;,&quot;parameters&quot;]">

``` python
from matplotlib import rcParams, cycler
import matplotlib.pyplot as plt
import numpy as np
plt.ion()

data = np.random.rand(2, 1

Note that cells are divided by hard-coded `<div>`s, and cell-level metadata (such as tags)
are encoded within the HTML (e.g. `data-tags`). Also note that we haven't gotten the bibliography
to render, probably because we didn't enable the `citeproc` processor on pandoc (we'll try that later).
Finally, note that there's no notebook-level metadata in this output because GFM doesn't support
a YAML header.

## To pandoc-flavored markdown

In [11]:
# ipynb -> pandoc md
run(f'pandoc pandoc_ipynb/inputs/notebooks.ipynb --resource-path=inputs -s --extract-media=outputs/images')

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <title>notebooks</title>
  <style>
      code{white-space: pre-wrap;}
      span.smallcaps{font-variant: small-caps;}
      span.underline{text-decoration: underline;}
      div.column{display: inline-block; vertical-align: top; width: 50%;}
  </style>
  <style>
code.sourceCode > span { display: inline-block; line-height: 1.25; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode { white-space: pre; position: relative; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
code.sourceCode { white-space: pre-wrap; }
code.sourceCode > span { tex

Now we've got something a little bit cleaner without all the hard-coded HTML. The `:::` fences
are how Pandoc-flavored markdown denote different divs, and cell-level metadata is encoded
similar to how GFM worked.

# `.ipynb` to HTML

Next let's try converting `.ipynb` to HTML. This should let us view the notebook as a web-page
as well as include all of the extra metadata inside the HTML elements. We'll start with
a vanilla HTML conversion. Note that the only thing we had to do was change the output
file extension to `.html` and Pandoc inferred the output type for us:

In [12]:
# ipynb -> HTML
run(f'pandoc pandoc_ipynb/inputs/notebooks.ipynb --resource-path=inputs -s --extract-media=outputs/images')

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <title>notebooks</title>
  <style>
      code{white-space: pre-wrap;}
      span.smallcaps{font-variant: small-caps;}
      span.underline{text-decoration: underline;}
      div.column{display: inline-block; vertical-align: top; width: 50%;}
  </style>
  <style>
code.sourceCode > span { display: inline-block; line-height: 1.25; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode { white-space: pre; position: relative; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
code.sourceCode { white-space: pre-wrap; }
code.sourceCode > span { tex

This time our math rendered properly, along with everything else except for the
bibliography. Let's get that working now.

We've included a bibliography with our input file. With this (and using the
[citeproc citation style](https://pandoc.org/demo/example19/Extension-citations.html), we can use `pandoc-citeproc` to automatically render a
bibliography within each page. To do so, we've used the following extra options:

* `--bibliography` specifies the path to a BibTex file
* `-f ipynb+citations` tells Pandoc that our *input* format has citations in it. Before, the `ipynb` was
  inferred from the input extension. Now we've made it explicit as well.

In [13]:
# ipynb -> HTML with citations
run(f'pandoc pandoc_ipynb/inputs/notebooks.ipynb -f ipynb+citations --bibliography pandoc_ipynb/inputsreferences.bib --resource-path=inputs -s --extract-media=outputs/images')

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <title>notebooks</title>
  <style>
      code{white-space: pre-wrap;}
      span.smallcaps{font-variant: small-caps;}
      span.underline{text-decoration: underline;}
      div.column{display: inline-block; vertical-align: top; width: 50%;}
  </style>
  <style>
code.sourceCode > span { display: inline-block; line-height: 1.25; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode { white-space: pre; position: relative; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
code.sourceCode { white-space: pre-wrap; }
code.sourceCode > span { tex

Now we've got citations at the bottom of the page, and in-line references interspersed
in the text. Pretty cool!

# Wrapping up

It seems like we can get pretty far with converting `.ipynb` files into
various flavors of markdown or HTML. My guess is that things will get a bit
trickier if we tried to do this with more complex cell outputs or metdata,
but it's a good start. Using Pandoc also means that it would be relatively
straightforward to convert notebooks into **latex**, **pdf**, or even **Microsoft Word**
format. I'll try to dig into this more in the future.

