# Intro 

This notebook aims to show how to integrate `git` with Jupyter notebooks in a reproducible manner. It demonstrates how `nbstripout` and `nbdime` help solve two of the main problems with `git` integration.

# The problem 

The problem with Jupyter notebooks is that they are not plain source files: each one is a JSON document with a fairly complex structure that contains both the code and its output.
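
To make that structure concrete, here is a minimal sketch of a notebook with a single executed code cell, written by hand as a Python dictionary. The field names follow version 4 of the notebook format; the cell content is a made-up example, not taken from any notebook used here.

```python
import json

# A hand-written, minimal notebook (nbformat 4) with one executed code cell.
# The cell content is illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print('hello')"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hello\n"]}
            ],
        }
    ],
}

# This is roughly what ends up in the .ipynb file, and hence what git sees.
raw_text = json.dumps(notebook, indent=1)
print(raw_text)

# The code and its output live side by side in the same cell.
cell = notebook["cells"][0]
print("source:", cell["source"])
print("outputs:", cell["outputs"])
```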

We usually don't see this structure while working on a notebook, but if we try to read one as a text file (for instance by checking the raw version of a notebook on GitHub) or try to use `git` on it, we are almost lost.

This is problematic for version control, first because we cannot effectively check what changes from commit to commit, but also because we don't need `git` to monitor *everything* inside our notebooks: if we change a mundane detail, say from `plt.scatter([1,2,3],[1,4,9],c='red')` to `plt.scatter([1,2,3],[1,4,9],c='blue')`, we would like to know which parameter we changed, not which pixels of the figure differ from the previous version.

So we have two issues:
1. Stripping unnecessary details (i.e. cell outputs) from our commits.
2. Interpreting notebook git diffs in a better way.
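
As a rough illustration of the second issue, the sketch below compares only the code of two notebook versions and ignores their outputs, using a colour change like the one described above. The helpers `code_lines` and `make_notebook` and the two in-memory notebooks are hypothetical, and this is not how `nbdime` works internally.

```python
import difflib

def code_lines(notebook):
    """Collect the source lines of all code cells, ignoring outputs."""
    lines = []
    for cell in notebook["cells"]:
        if cell["cell_type"] == "code":
            lines.extend("".join(cell["source"]).splitlines())
    return lines

def make_notebook(source, output_text):
    """Build a tiny notebook dict with one code cell (hypothetical helper)."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "source": [source],
                "outputs": [
                    {"output_type": "stream", "name": "stdout", "text": [output_text]}
                ],
            }
        ]
    }

old = make_notebook("plt.scatter([1,2,3],[1,4,9],c='red')", "run 1")
new = make_notebook("plt.scatter([1,2,3],[1,4,9],c='blue')", "run 2")

# Diff the code only; the differing outputs never enter the comparison.
diff = list(
    difflib.unified_diff(code_lines(old), code_lines(new), "old", "new", lineterm="")
)
print("\n".join(diff))
```

The diff contains the one line of code that changed and nothing about the outputs, which is the kind of summary we would like `git` to give us.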

# The setup 

To see the problem we're trying to solve, let's use a test notebook. We will create a notebook which, once executed, produces some random output that `git` will register as a change.

Execute the following two cells to create the notebook in your folder:

In [1]:
%%writefile test.py
# <codecell>
from matplotlib.pyplot import subplots
from numpy import random,arange
from time import localtime, strftime

# <codecell>
# print a timestamp
print("Last time the notebook has been executed: {}".format(strftime("%a, %d %b %Y %H:%M:%S", localtime())))

# <codecell>
# print a random list
print("A random list: {}".format(random.randint(1,100,5)))

# <codecell>
# print a figure
fig, ax = subplots(figsize=(5,5))
x_points = random.randint(1,100,20)
exponent_1 = random.choice(arange(-1,3.5,0.5))
exponent_2 = random.choice(arange(-1,3.5,0.5))
ax.scatter(x_points, x_points**exponent_1, label=exponent_1, marker='+')
ax.scatter(x_points, x_points**exponent_2, label=exponent_2, alpha=0.5)
ax.legend();
ax.set_title('A randomized figure:');

Writing test.py


In [2]:
# taken from 
# https://stackoverflow.com/questions/23292242/converting-to-not-from-ipython-notebook-format
from nbformat import v3, v4
with open("test.py") as fpin:
    text = fpin.read()
nbook = v3.reads_py(text)
nbook = v4.upgrade(nbook)  # Upgrade v3 to v4
jsonform = v4.writes(nbook) + "\n"
with open("test_notebook.ipynb", "w") as fpout:
    fpout.write(jsonform)

##  Run test notebook and see the diff 

Executing the content of the notebook will show that even if we don't touch anything in the code, `git` will register changes and require us to commit or discard them. 

We will use IPython's `!` shell escape to execute the notebook from the command line and then call `git`.

- More on how to execute a notebook from the command line can be found in the [official documentation](https://nbconvert.readthedocs.io/en/latest/execute_api.html). These two [SO](https://stackoverflow.com/questions/35471894/can-i-run-jupyter-notebook-cells-in-commandline) [questions](https://stackoverflow.com/questions/35545402/how-to-run-an-ipynb-jupyter-notebook-from-terminal) may help too.
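
Outside a notebook, where the `!` prefix is not available, the same command can be launched from Python. This is a minimal sketch, assuming `jupyter` is on the `PATH`; the helper names `nbconvert_command` and `execute_notebook` are made up for this example, and the flags are the ones used in the cell below.

```python
import subprocess

def nbconvert_command(notebook_path):
    """Return the command that executes a notebook in place (hypothetical helper)."""
    return [
        "jupyter", "nbconvert",
        "--execute",          # run all cells
        "--to", "notebook",   # keep the .ipynb format
        "--inplace",          # overwrite the input file with the executed version
        notebook_path,
    ]

def execute_notebook(notebook_path):
    """Run the notebook and raise CalledProcessError if nbconvert fails."""
    subprocess.run(nbconvert_command(notebook_path), check=True)

# Only build and show the command here; call execute_notebook(...) to run it.
command = nbconvert_command("test_notebook.ipynb")
print(" ".join(command))
```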

In [None]:
# execute the notebook
!jupyter nbconvert --execute --to notebook --inplace test_notebook

In [None]:
# show changes are registered by git
!git status

In [None]:
# add and commit
!git add test_notebook.ipynb
!git commit -m 'Run the notebook, get new results.' --author="author <name.surname@mail.org>"

In [None]:
# check log to see the content of changes
!git log -p -1 

We can see that cell output changes are registered by git and that the image change is gibberish (for us).
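
The image diff is unreadable because a figure is stored in the notebook as base64-encoded, compressed image data. The sketch below mimics this with the standard library: the byte strings are stand-ins for pixel data, not real PNG files, and `zlib` is used because PNG relies on the same DEFLATE compression. The function name `encode_like_image` is an assumption made for this example.

```python
import base64
import zlib

def encode_like_image(pixels):
    """Compress and base64-encode raw bytes, loosely mimicking an embedded PNG."""
    return base64.b64encode(zlib.compress(pixels)).decode("ascii")

# Two fake "images" that differ in a single byte.
pixels_a = bytes(range(256)) * 16
modified = bytearray(pixels_a)
modified[100] ^= 0xFF
pixels_b = bytes(modified)

text_a = encode_like_image(pixels_a)
text_b = encode_like_image(pixels_b)

# Count the positions at which the two encoded strings differ.
differing = sum(a != b for a, b in zip(text_a, text_b)) + abs(len(text_a) - len(text_b))
print(f"{differing} of {max(len(text_a), len(text_b))} characters differ")
```

A line-based diff cannot relate such a change in the encoded text back to anything visible in the figure.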

# Chosen solutions

## Solution 1: remove output with `nbstripout` 

`nbstripout` [(documentation here)](https://github.com/kynan/nbstripout) is a tool that automatically removes the cell outputs in our notebooks. It has the following features:
- it can be run from the command line
- it is set up per repository (hence can be enabled selectively on certain projects)
- it allows customization (see the documentation for more)
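
Conceptually, removing the outputs amounts to something like the sketch below. It is only an illustration on in-memory notebooks, not the actual `nbstripout` implementation, which handles more cases and options. The names `strip_outputs` and `executed_notebook` and the sample outputs are assumptions made for this example.

```python
import copy
import json

def strip_outputs(notebook):
    """Return a copy of the notebook with outputs and execution counts cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped

def executed_notebook(output_text, count):
    """Build a one-cell notebook as it might look after a run (hypothetical helper)."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {
                "cell_type": "code",
                "execution_count": count,
                "metadata": {},
                "source": ["print(random.randint(1, 100, 5))"],
                "outputs": [
                    {"output_type": "stream", "name": "stdout", "text": [output_text]}
                ],
            }
        ],
    }

# Two runs of the same code produce different (here: made-up) outputs...
run_1 = executed_notebook("[12 87  3 45 60]\n", 1)
run_2 = executed_notebook("[ 9 31 77  2 58]\n", 2)
print(json.dumps(run_1) == json.dumps(run_2))  # False

# ...but once stripped, the two versions serialize to the same text.
print(json.dumps(strip_outputs(run_1)) == json.dumps(strip_outputs(run_2)))  # True
```

Since the stripped versions are identical, a version control system that only ever sees stripped notebooks has nothing to report when just the outputs change.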

### Installation and setup

Follow the installation instructions in the readme. In my case it was sufficient to run the following from the command line:

```bash
# install
$ conda install -c conda-forge nbstripout
$ nbstripout --install

# check installation
$ nbstripout --status
nbstripout is installed in repository etc etc
```

You can also check the installation from inside the notebook by running `!nbstripout --status`.

### Test 

#### Remove output 

First let's remove the output and commit the "empty" notebook:

In [None]:
!nbstripout test_notebook.ipynb

In [None]:
# add and commit
!git add test_notebook.ipynb
!git commit -m 'Run the notebook, get new results, strip new results.' --author="author <name.surname@mail.org>"

#### Check effect of stripping output 

Then let's rerun the notebook (hence generating new output), strip the output again and see what git tells us:

In [None]:
# execute the notebook
!jupyter nbconvert --execute --to notebook --inplace test_notebook

In [None]:
# strip
!nbstripout test_notebook.ipynb

In [None]:
# check whether git registers any changes
!git status

We can see that by stripping out our randomly generated output we are left with only the code. Since the code has not changed since the previous commit, git tells us that there are no changes to commit for our notebook.

#### Make actual changes

Now let's try to add new code to the notebook to see the actual power of `nbstripout`.

## Solution 2: Analyze diffs with `nbdime` 

# Other solutions 