# Intro 

This notebook aims to show, in a reproducible manner, how to integrate `git` with Jupyter notebooks. It demonstrates explicitly how `nbstripout` and `nbdime` work to solve two of the main issues with `git` integration. It's a step-by-step guide, so do not expect a quick summary.

# The problem 

The problem with version-controlling Jupyter notebooks is that they are not plain text files (as `.py` scripts are) but have a complex JSON structure that contains both the code and its output.

We usually don't see this structure while working on a notebook, but if we try to read one as a text file (for instance by checking the raw version of a notebook on GitHub) or try to use `git` on it, we are almost lost.
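
To make the structure concrete, here is a minimal sketch of what a one-cell notebook looks like on disk. The dictionary is a hand-written simplification of the notebook JSON layout (the cell content is made up for illustration), not a file produced by Jupyter:

```python
import json

# Hand-written, simplified stand-in for a one-cell notebook (illustrative only).
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            # The output of the cell is stored right next to its source code.
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["hello\n"]}
            ],
            "source": ["print('hello')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

# This is roughly the text that git sees when it tracks the file.
raw_text = json.dumps(notebook, indent=1, sort_keys=True)
print(raw_text)
```

Even for a single `print` call, most of the text is structure and output rather than code.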

This is problematic for version control, first because we are unable to check effectively what is happening from commit to commit, but also because we don't need `git` to monitor *everything* inside our notebooks: if we change mundane details, say from `plt.scatter(x=[1,2,3],y=[1,4,9],c='red')` to `plt.scatter(x=[1,2,3],y=[1,4,9],c='blue')`, we would like to know which parameter we've changed but not which pixel on our screen is different from the previous version. 

So we have two issues to solve:
1. Stripping the unnecessary details (i.e. cell outputs) of our notebooks out of each commit.
2. Interpreting notebook git diffs in a better way.

# The setup 

To see the problem we're trying to solve, let's use a test notebook. Let's create a notebook which, once executed, produces random output that git will normally flag as a change.

Execute the following two cells to create the notebook in your folder:

In [1]:
%%writefile test.py
# <codecell>
from matplotlib.pyplot import subplots,scatter
from numpy import random,arange
from time import localtime, strftime

# <codecell>
# print a timestamp
print("Last time the notebook has been executed: {}".format(strftime("%a, %d %b %Y %H:%M:%S", localtime())))

# <codecell>
# print a random list
print("A random list: {}".format(random.randint(1,100,5)))

# <codecell>
# print a figure
fig, ax = subplots(figsize=(5,5))
x_points = random.randint(1,100,20)
exponent_1 = random.choice(arange(-1,3.5,0.5))
exponent_2 = random.choice(arange(-1,3.5,0.5))
ax.scatter(x_points, x_points**exponent_1, label=exponent_1, marker='+')
ax.scatter(x_points, x_points**exponent_2, label=exponent_2, alpha=0.5)
ax.legend();
ax.set_title('A randomized figure:');

Writing test.py


In [2]:
# taken from 
# https://stackoverflow.com/questions/23292242/converting-to-not-from-ipython-notebook-format
from nbformat import v3, v4
with open("test.py") as fpin:
    text = fpin.read()
nbook = v3.reads_py(text)
nbook = v4.upgrade(nbook)  # Upgrade v3 to v4
jsonform = v4.writes(nbook) + "\n"
with open("test_notebook.ipynb", "w") as fpout:
    fpout.write(jsonform)
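
The conversion above relies on the `# <codecell>` markers in `test.py` to decide where each cell starts. As a rough illustration of that idea only (this is a simplified sketch, not the code `nbformat` actually runs, and `split_codecells` is a hypothetical helper name), the splitting step could look like this:

```python
def split_codecells(text, marker="# <codecell>"):
    """Split script text into a list of cell sources at each marker line."""
    cells, current = [], None
    for line in text.splitlines():
        if line.strip() == marker:
            # A marker closes the previous cell (if any) and opens a new one.
            if current is not None:
                cells.append("\n".join(current).strip())
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        cells.append("\n".join(current).strip())
    # Drop cells that contain nothing but whitespace.
    return [cell for cell in cells if cell]


sample = "# <codecell>\nimport os\n\n# <codecell>\nprint(1)\n"
print(split_codecells(sample))
```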

##  Run test notebook and see the diff 

Executing the content of the notebook will show that even if we don't touch anything in the code, `git` will register changes and require us to commit or discard them. 

### Run once

We proceed as if we were creating the notebook from scratch and start tracking it with git after executing its content:

In [3]:
# start monitoring the test notebook and commit it as it is
!git init
!jupyter nbconvert --execute --to notebook --inplace test_notebook
!git add test_notebook.ipynb
!git commit -m 'Created test notebook' --author="author <name.surname@mail.org>"

Initialized empty Git repository in /home/gibbone/Desktop/github_tutorial/git_on_jupyter_testing/.git/
[NbConvertApp] Converting notebook test_notebook.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 11317 bytes to test_notebook.ipynb
[master (root-commit) 7f5169c] Created test notebook
 Author: author <name.surname@mail.org>
 1 file changed, 90 insertions(+)
 create mode 100644 test_notebook.ipynb


### Run again and check the difference

We want to check what happens with git when the output (but not the code) of our notebook changes, hence we run it again.

In [5]:
# execute the notebook
!jupyter nbconvert --execute --to notebook --inplace test_notebook

[NbConvertApp] Converting notebook test_notebook.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 13213 bytes to test_notebook.ipynb


In [6]:
# show changes are registered by git
!git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   test_notebook.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	.ipynb_checkpoints/
	How to integrate Git on Jupyter Notebooks.ipynb
	README.md
	test.py

no changes added to commit (use "git add" and/or "git commit -a")


We see that git finds some changes in the output and requires us to commit, so we do it:

In [7]:
# add and commit
!git add test_notebook.ipynb
!git commit -m 'Ran the notebook, got new output. Had to commit.' --author="author <name.surname@mail.org>"

[master 9bdddda] Ran the notebook, got new output. Had to commit.
 Author: author <name.surname@mail.org>
 1 file changed, 90 insertions(+), 90 deletions(-)
 rewrite test_notebook.ipynb (82%)


Finally we want to check the diff of our commit:

In [8]:
# check log to see the content of changes
!git log -p -1

commit 9bddddab00b25a9c9272906499d5eca1fc966ed2
Author: author <name.surname@mail.org>
Date:   Mon Jul 30 19:35:30 2018 +0200

    Ran the notebook, got new output. Had to commit.

diff --git a/test_notebook.ipynb b/test_notebook.ipynb
index 4f471e4..0135ca0 100644
--- a/test_notebook.ipynb
+++ b/test_notebook.ipynb
@@ -24,7 +24,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Last time the notebook has been executed: Mon, 30 Jul 2018 19:28:33\n"
+      "Last time the notebook has been executed: Mon, 30 Jul 2018 19:34:01\n"
      ]
     }
    ],
@@ -44,7 +44,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "A random list: [99  9 73 92 93]\n"
+      "A random list: [29 86 92 25 63]\n"
      ]
     }
    ],
@@ -62,7 +62,7 @@
    "outputs": [

+      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAVkAAAE/CAYAAADsX7CcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3X2UHFW97vHvL5kknZCXyRsEZkaSmAghOYI45A09hxcxBAlhueCCeiBqXLiOGJHjWgLXs0A9esW7vKCiolwR4VwkIHpJwoFweXOtq4GEyRVwyACJBJkJCUxeZhJCJuTld/+oPVIz9Mz0zGR3T3c/n7V6VdeuXVW7UsmT3buqq83dERGROAYVugEiIqVMISsiEpFCVkQkIoWsiEhEClkRkYgUsiIiESlkZcAxs8+a2R8jbfstM5t6hLf5BzP7QhfLzMzuMLNdZrbOzD5qZi8dyf3LwFZR6AbIwGFmfwBOBia5+/4CNycKdx+Z511+BDgHqHb3vaHshDy3QQpIPVkBwMwmAx8FHLigF+uZmenvUdeOB15NBWw0ZjY49j6k9/SPQ9pdDjwN/BpY0l3F8PH4u2b2J+BtYKqZfc7MGsxsj5m9YmZfTNU/w8yazOxrZvammW01s8+llo83s5VmttvM1gHv77S/+Wb2jJm1hun8Tm35jpmtCUMBq8L27g7beyb8B9Je381smpkdF+q3v942M0/V+3w4nl1m9oiZHZ9ado6ZvRja8xPAuvhzWgr8EpgX9vGt9j+LVJ1TzezP4c/tt2Z2r5l9Jyx7z7BJe/vD+1+b2a1m9pCZ7QXONLNhZvYDM3vNzN4ws5+b2fDuzqdE5u566QWwCfgS8GHgAHBMN3X/ALwGzCQZchoCfIIkHA34J5LwPTXUPwM4CHw71D0vLB8bli8H7gOOAmYBW4A/hmXjgF3AZWFfnwrz41Nt2R

We can see that cell output changes are registered by git and that the image change is gibberish (for us).
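
One reason the image part of the diff is unreadable is that the notebook stores the picture as a single base64 string, and a line-based diff can only report that whole line as replaced. The sketch below illustrates this with made-up bytes standing in for an image (a real PNG is compressed binary data, so in practice far more of the string changes):

```python
import base64
import difflib

# Hypothetical stand-in for image bytes, not a real PNG.
image_a = bytes(range(256)) * 8
image_b = bytearray(image_a)
image_b[1000] ^= 0xFF  # flip a single byte


def as_json_line(raw):
    """Render bytes as one base64 string on one line, as in the diff above."""
    encoded = base64.b64encode(bytes(raw)).decode("ascii")
    return '"image/png": "{}"'.format(encoded)


diff = list(
    difflib.unified_diff([as_json_line(image_a)], [as_json_line(image_b)], lineterm="")
)
removed = [line for line in diff if line.startswith("-") and not line.startswith("---")]
added = [line for line in diff if line.startswith("+") and not line.startswith("+++")]

# One byte changed, yet the diff consists of one very long removed line
# and one very long added line.
print(len(removed), len(added), len(added[0]))
```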

# Solution 1: remove output with `nbstripout` 

The first of our problems was that git was tracking too much of our notebook, so we would like it to monitor only the changes in code. `nbstripout` [(documentation here)](https://github.com/kynan/nbstripout) is a tool that automatically removes the cell outputs in our notebooks.

It has the following features:
- Can be run from command line
- It's specific to the folder: you have to install it in each folder (repository) you're tracking, so it can be applied selectively to certain projects only.
- Allows customization: for instance we can choose for some notebooks not to have their output stripped. See the documentation for more.
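
To give an idea of what "removing the cell outputs" means at the JSON level, here is a minimal sketch. It is not how `nbstripout` is implemented; `strip_outputs` is a hypothetical helper that works on a notebook already loaded as a Python dictionary:

```python
import copy


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Made-up two-cell notebook used only for this illustration.
executed = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["42\n"]}],
            "source": ["print(6 * 7)"],
        },
        {"cell_type": "markdown", "source": ["# A title"]},
    ]
}

clean = strip_outputs(executed)
print(clean["cells"][0])
```

The stripped code cell has the same shape as the new cell in the diff near the end of the next section (`"execution_count": null`, `"outputs": []`).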

## Installation and setup

Follow installation instructions from the documentation. In my case it was sufficient to run from command line the following:

```bash
# install module
$ conda install -c conda-forge nbstripout

# initiate program inside the folder
$ nbstripout --install

# check installation
$ nbstripout --status
nbstripout is installed in repository etc etc
```

## Demonstration 

### Reset notebook

First let's install nbstripout in the folder and check it:

In [13]:
!nbstripout --install

In [14]:
!nbstripout --status

nbstripout is installed in repository /home/gibbone/Desktop/github_tutorial/git_on_jupyter_testing

Filter:
  clean = "/home/gibbone/anaconda3/envs/env_2.7/bin/python2.7" "/home/gibbone/anaconda3/envs/env_2.7/lib/python2.7/site-packages/nbstripout.pyc"
  smudge = cat
  required = true
  diff= nbstripout -t

Attributes:
  *.ipynb: filter: nbstripout

Diff Attributes:
  *.ipynb: diff: ipynb


Then let's remove the output and commit the "empty" notebook so that we're back to the initial state:

In [16]:
# strip
!nbstripout test_notebook.ipynb

# add and commit
!git add test_notebook.ipynb
!git commit -m 'Ran the notebook, got new results, stripped them.' --author="author <name.surname@mail.org>"

[master f553535] Ran the notebook, got new results, stripped them.
 Author: author <name.surname@mail.org>
 1 file changed, 55 insertions(+), 90 deletions(-)
 rewrite test_notebook.ipynb (91%)


### Check effect of stripping output 

Finally let's rerun the notebook as we did at the beginning, generating the problematic output cells again, remove them with nbstripout, and see what git tells us:

In [17]:
# execute the notebook
!jupyter nbconvert --execute --to notebook --inplace test_notebook

# strip
!nbstripout test_notebook.ipynb

# show changes are registered by git
!git status

[NbConvertApp] Converting notebook test_notebook.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 12353 bytes to test_notebook.ipynb
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	.ipynb_checkpoints/
	How to integrate Git on Jupyter Notebooks.ipynb
	README.md
	test.py

nothing added to commit but untracked files present (use "git add" to track)


We can see that by stripping out our randomly generated output we are left only with the code. Since it has not changed from the previous commit git tells us that there are no changes to commit for our notebook.
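
The same reasoning can be expressed on the notebook JSON directly: two executions of the same notebook differ only in their outputs, so comparing cell sources alone reports no change. A minimal sketch, with made-up cell contents and a hypothetical helper name:

```python
def code_fingerprint(notebook):
    """Summarize a notebook dict by cell type and source, ignoring outputs."""
    return [
        (cell.get("cell_type"), "".join(cell.get("source", [])))
        for cell in notebook.get("cells", [])
    ]


def make_run(output_text):
    """Build a one-cell notebook dict whose output is the given text."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 1,
                "outputs": [{"output_type": "stream", "text": [output_text]}],
                "source": ["print(random.randint(1, 100, 5))"],
            }
        ]
    }


first_run = make_run("[99  9 73 92 93]\n")
second_run = make_run("[29 86 92 25 63]\n")

print(first_run == second_run)  # the raw notebooks differ
print(code_fingerprint(first_run) == code_fingerprint(second_run))  # the code does not
```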

### Make code changes to see if they're registered

Now let's try to add new code to the test notebook to see the actual power of nbstripout. We add a new code cell at the bottom of the file then we see what git tells us:

In [18]:
%%writefile test.py
# <codecell>
from matplotlib.pyplot import subplots,scatter
from numpy import random,arange
from time import localtime, strftime

# <codecell>
# print a timestamp
print("Last time the notebook has been executed: {}".format(strftime("%a, %d %b %Y %H:%M:%S", localtime())))

# <codecell>
# print a random list
print("A random list: {}".format(random.randint(1,100,5)))

# <codecell>
# print a figure
fig, ax = subplots(figsize=(5,5))
x_points = random.randint(1,100,20)
exponent_1 = random.choice(arange(-1,3.5,0.5))
exponent_2 = random.choice(arange(-1,3.5,0.5))
ax.scatter(x_points, x_points**exponent_1, label=exponent_1, marker='+')
ax.scatter(x_points, x_points**exponent_2, label=exponent_2, alpha=0.5)
ax.legend();
ax.set_title('A randomized figure:');

# <codecell>
# new code <--------------
print("This some new code I've added to test nbstripout")

Overwriting test.py


In [19]:
# taken from 
# https://stackoverflow.com/questions/23292242/converting-to-not-from-ipython-notebook-format
from nbformat import v3, v4
with open("test.py") as fpin:
    text = fpin.read()
nbook = v3.reads_py(text)
nbook = v4.upgrade(nbook)  # Upgrade v3 to v4
jsonform = v4.writes(nbook) + "\n"
with open("test_notebook.ipynb", "w") as fpout:
    fpout.write(jsonform)

In [20]:
!git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   test_notebook.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	.ipynb_checkpoints/
	How to integrate Git on Jupyter Notebooks.ipynb
	README.md
	test.py

no changes added to commit (use "git add" and/or "git commit -a")


We've added some code, so it's normal that git asks us to commit the changes. But at this point we can't tell whether the change is code or simply output, so we execute the notebook again and pass it through nbstripout before asking again. If git still reports a change, we know that it is only code:

In [21]:
# execute the notebook
!jupyter nbconvert --execute --to notebook --inplace test_notebook

# strip
!nbstripout test_notebook.ipynb

# show changes are registered by git
!git status

[NbConvertApp] Converting notebook test_notebook.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 12419 bytes to test_notebook.ipynb
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   test_notebook.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	.ipynb_checkpoints/
	How to integrate Git on Jupyter Notebooks.ipynb
	README.md
	test.py

no changes added to commit (use "git add" and/or "git commit -a")


It is code! So let's commit and see the diff:

In [22]:
# add and commit
!git add test_notebook.ipynb
!git commit -m "'Ran the notebook, got new code. Committing it's mandatory now." --author="author <name.surname@mail.org>"

# check log to see the content of changes
!git log -p -1 

[master e1bf32b] 'Ran the notebook, got new code. Committing it's mandatory now.
 Author: author <name.surname@mail.org>
 1 file changed, 10 insertions(+)
commit e1bf32bacf9a7aba7b1769a8f1e3bf9c7e2518d8
Author: author <name.surname@mail.org>
Date:   Mon Jul 30 20:30:24 2018 +0200

    'Ran the notebook, got new code. Committing it's mandatory now.

diff --git a/test_notebook.ipynb b/test_notebook.ipynb
index 0a0edde..1db6f4f 100644
--- a/test_notebook.ipynb
+++ b/test_notebook.ipynb
@@ -47,6 +47,16 @@
     "ax.legend();\n",
     "ax.set_title('A randomized figure:');"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# new code\n",
+    "print(\"This some new code I've added to test nbstripout\")"
+

Now we can see only the change in the code in the diff, exactly what we needed.
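
The execute, strip, check cycle used throughout this section can be wrapped in a small helper. This is only a sketch: `build_commands` and `run_all` are hypothetical names, the commands are the ones typed in the cells above (with `--short` added to `git status` for compact output), and by default nothing is executed, the commands are only printed:

```python
import subprocess


def build_commands(notebook_path):
    """Return the command lines for one execute/strip/status cycle."""
    return [
        ["jupyter", "nbconvert", "--execute", "--to", "notebook", "--inplace", notebook_path],
        ["nbstripout", notebook_path],
        ["git", "status", "--short", notebook_path],
    ]


def run_all(commands, dry_run=True):
    """Print the commands, or run them one by one when dry_run is False."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the cycle as soon as one step fails.
            subprocess.run(command, check=True)


commands = build_commands("test_notebook.ipynb")
run_all(commands)  # dry run: only prints the three commands
```

Calling `run_all(commands, dry_run=False)` would actually run the tools, which requires `jupyter`, `nbstripout` and `git` to be available in the current environment.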

# Solution 2: Analyze diffs with `nbdime` 

Even after solving the output problem with `nbstripout` we're left with diffs that are somewhat convoluted to interpret (cf. how a one-line change is registered above). Moreover, sometimes it's fine for git to store the changes in output as well, since they contain valuable information, but we would like a better tool to check differences in output and a clearer way to proceed with merges.

Enter `nbdime` [(documentation here)](http://nbdime.readthedocs.io/en/stable/). `nbdime` is a tool with several features designed to help preview, compare and merge notebooks:

- `nbdiff` compares notebooks in a terminal-friendly way
- `nbmerge` performs a three-way merge of notebooks with automatic conflict resolution
- `nbdiff-web` shows you a rich rendered diff of notebooks
- `nbmerge-web` gives you a web-based three-way merge tool for notebooks
- `nbshow` presents a single notebook in a terminal-friendly way
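
To see why a notebook-aware diff reads better than a diff of the raw JSON, here is a toy sketch that compares notebooks cell by cell using only the cell sources. It is not `nbdime`'s algorithm, just an illustration of the idea with the standard library; `cell_sources` and `diff_cells` are hypothetical helpers:

```python
import difflib


def cell_sources(notebook):
    """Return the source text of every cell in a notebook dict."""
    return ["".join(cell.get("source", [])) for cell in notebook.get("cells", [])]


def diff_cells(old_notebook, new_notebook):
    """List removed ('-') and added ('+') cells between two notebook dicts."""
    old, new = cell_sources(old_notebook), cell_sources(new_notebook)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changes.extend(("-", source) for source in old[i1:i2])
        changes.extend(("+", source) for source in new[j1:j2])
    return changes


# Made-up notebooks: the second one has one extra cell at the end.
before = {"cells": [{"source": ["import os"]}, {"source": ["print(1)"]}]}
after = {"cells": before["cells"] + [{"source": ["# new code\n", "print(2)"]}]}

for sign, source in diff_cells(before, after):
    print(sign, repr(source))
```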

## Installation and setup

As before, installing `nbdime` is straightforward when following the instructions. Like `nbstripout`, it must be activated separately in each folder, so you can choose whether to use it for a particular project. In my case it was enough to type:

```bash
# install module
$ pip install --upgrade nbdime

# initiate program inside the folder
$ nbdime config-git --enable
```

## Demonstration 

In [29]:
# activate it in the actual folder of this notebook
!nbdime config-git --enable

First let's use the command line tool to check the diff between the last two commits. (Note: here I'm using the hashes of *my* commits; check your own log to get yours, otherwise the command will fail):

In [36]:
!nbdiff f553535ec6c3d6fb85ef598bc30c52836a6fdd3f e1bf32bacf9a7aba7b1769a8f1e3bf9c7e2518d8

nbdiff test_notebook.ipynb (f553535ec6c3d6fb85ef598bc30c52836a6fdd3f) test_notebook.ipynb (e1bf32bacf9a7aba7b1769a8f1e3bf9c7e2518d8)
--- test_notebook.ipynb (f553535ec6c3d6fb85ef598bc30c52836a6fdd3f)  (no timestamp)
+++ test_notebook.ipynb (e1bf32bacf9a7aba7b1769a8f1e3bf9c7e2518d8)  (no timestamp)
## inserted before /cells/4:
+  code cell:
+    source:
+      # new code
+      print("This some new code I've added to test nbstripout")

We can see that `nbdime` recognizes the content of the notebook and gives us a clearer diff. Now let's use its web visualization tool to visualize the diffs:

In [34]:
!nbdiff-web e1bf32bacf9a7aba7b1769a8f1e3bf9c7e2518d8 f553535ec6c3d6fb85ef598bc30c52836a6fdd3f

[I nbdimeserver:374] Listening on 127.0.0.1, port 63572
[W webutil:18] No web browser found: could not locate runnable browser.
[I webutil:29] URL: http://127.0.0.1:63572/difftool
[W handlers:443] Blocking request with non-local 'Host' 127.0.0.1 (127.0.0.1:63572). If the notebook should be accessible at that name, set NotebookApp.allow_remote_access to disable the check.
[W log:48] 403 GET /difftool (127.0.0.1) 6.04ms referer=http://localhost:8887/notebooks/github_tutorial/git_on_jupyter_testing/How%20to%20integrate%20Git%20on%20Jupyter%20Notebooks.ipynb
[W handlers:443] Blocking request with non-local 'Host' 127.0.0.1 (127.0.0.1:63572). If the notebook should be accessible at that name, set NotebookApp.allow_remote_access to disable the check.
[W log:48] 403 GET /difftool (127.0.0.1) 6.19ms referer=None
^C
Traceback (most recent call last):
  File "/home/gibbone/anaconda3/envs/env_2.7/bin/nbdiff-web", line 11, in <module>
    sys.exit(main())
  File "/home/gibbone/anaconda3/envs/env_2