# Documenting Code

The learning goals for _documenting and versioning code_ include:

* Describe Jupyter notebooks and their benefits
* Explain how version control (i.e., git) works

After the exercise, you will be able to:

* Create a Jupyter notebook for one of your projects
* (Briefly) explain your workflow in Markdown in that Jupyter notebook
* Check in that notebook to a version control system

# Why Document?

## tl;dr 

Reproducibility (by me and by others) and Transparency

* Nosek, B.A. et al. (2015) [Promoting an open research culture](http://dx.doi.org/10.1126/science.aab2374). _Science_, __348__, 1422–1425.
* Wilson, G. et al. (2014) [Best practices for scientific computing](http://dx.doi.org/10.1371/journal.pbio.1001745). _PLoS Biol._, __12__, e1001745.



# Writing Documentation

### What to Document

What should you document? Wilson and colleagues provide some useful advice:

![Write programs for people from Wilson et al. (2014)](../img/1-writing.png)

![Documenting from Wilson et al. (2014)](../img/7-documenting.png)

# Tools for Documenting Effectively

## Jupyter Notebooks

[Jupyter Notebook](https://jupyter.org/) is an interactive, browser-based application that lets you store and run code, stories or descriptions of code, and visualizations in the same place. I'm using a Jupyter notebook for this presentation. I also use notebooks in my research, which falls under the general heading of computational social science. A lot of what I do is analogous to the work you do. Here's a very high-level list of my tasks:

1. get a bunch of data that's automatically generated by a sensor or process, 
1. manipulate or munge the data into the format I need, 
1. generate computational and statistical models, 
1. run those models, 
1. analyze the output, and 
1. describe my results to others.

### Replication and Reproducibility

In the course of a normal day, I switch among these tasks in different projects (i.e., data collection on one project and model building on another). Code I wrote two hours ago might as well have been written by a stranger. I've thought about too many other things to remember what I was working on, how far I'd gotten in that particular task, and what decisions I'd made along the way. Enter notebooks. They allow me to store my thought process, my work, and its output all in the same place. I don't have to remember where I put a particular file or how my table was produced because the line of code that gets the data is in the same place as the line of code that generates the table. In between, notebooks let me add human-readable comments (no # marks or weird colors or fonts) by using Markdown.

[Private example](https://gitlab.si.umich.edu/posm/mpsa-2019/blob/master/notebooks/models/create/16.0-ams-all-models-EVALUATE.ipynb): results section of [MPSA paper](https://deepblue.lib.umich.edu/handle/2027.42/148323)

![MPSA paper notebook](../img/private_example.png)

The private example is a notebook one of my students created that analyzes the results from some topic models we built. The notebook generates the figures and regression tables that went into the ```Results``` section of our paper. When drafting the paper, we were able to copy directly from her markdown in that document to the results section -- she explained in the notebook what figure she was generating, and the visible code makes it easy to see what data is included in the figure. You can also see places where she left herself notes about what work to return to (before ```In [17]:```).

Again, enter notebooks. Jupyter notebooks are _playable_, which means I can issue a ```Run``` command (by pushing a button, using a keyboard shortcut, or choosing it from a menu) and watch the notebook unfold.

![Jupyter Toolbar](../img/jupyter_notebook_toolbar.jpg)

[Public example](https://github.com/tapilab/icwsm-2018-hostility/blob/master/Replication.ipynb): replication notebook of [ICWSM paper](https://aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/view/17875/0)

![ICWSM replication notebook](../img/icwsm.png)

In the public example, you can see a replication notebook one of my collaborators created for a paper we published last year. You can play this notebook with our data and retrain the models we created, regenerate the tables in our paper, and inspect the data at the same points we did. You won't see many explicit comments (i.e., #'ed lines) or markdown cells in that notebook because storing the generation, analysis, and output together minimizes the need for documentation. This code is nearly _self-documenting_ because it uses elements such as variable names and structure to guide readers through the work. It also makes output easy to find: you don't have to point to carefully named ```.tex``` or ```.png``` files somewhere else because the tables and figures are right there.

### Transparency

The notebook approach makes documentation easier, meaning we are each more likely to do it. It also improves the transparency of our work by revealing the code we used to manipulate and analyze our data. Understanding the code requires some expertise with whatever language is employed (my examples were both in Python, but Julia, R, and 90 other languages are supported). But I think we can agree that the notebooks contain far more detail in a more usable structure than a standard methods section or even a static appendix. Version control systems, which Cristina will talk about shortly, also render Jupyter notebooks in the browser so that they can be read without running or exporting them.

## ENCODE Example

For example, we talked about [ENCODE](https://www.encodeproject.org/) on Monday, and they use Jupyter to provide documentation about how to use data. Let's take a look: 

[Exploring ENCODE data from EC2 with Jupyter notebook](https://github.com/ENCODE-DCC/encode-data-usage-examples/blob/master/mount_s3_bucket_and_run_jupyter_on_ec2.ipynb)


Here's a snippet from the example notebook:

![Filtering ENCODE manifest](../img/encode-filter.png)

In this pair of Jupyter cells, we can see

1. the rationale for the code
1. the code itself
1. a sample of the resulting data

Together, these are a good example of embedding documentation, documenting reasons, and making code style and formatting consistent. Jupyter helps with the embedding. Item (1) explains why the filter is being applied, and in (2) we see consistent tabs, spacing, quotation styles, and variable naming.
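
To make those three practices concrete, here is a minimal sketch of the same pattern in plain Python. It is an illustration only, not the code from the ENCODE notebook: the manifest rows, the column names (`file_format`, `assembly`), and the `filter_manifest` helper are all hypothetical.

```python
# Hypothetical manifest rows; the real ENCODE manifest has different columns.
manifest = [
    {"accession": "FILE001", "file_format": "bam", "assembly": "GRCh38"},
    {"accession": "FILE002", "file_format": "fastq", "assembly": ""},
    {"accession": "FILE003", "file_format": "bam", "assembly": "hg19"},
]


def filter_manifest(rows, file_format, assembly):
    """Return the rows that match both the file format and the assembly."""
    return [
        row
        for row in rows
        if row["file_format"] == file_format and row["assembly"] == assembly
    ]


# Why (the documented reason): we assume later steps need aligned reads on a
# single assembly, so we keep only BAM files mapped to GRCh38.
grch38_bam_files = filter_manifest(manifest, file_format="bam", assembly="GRCh38")

# In a notebook, this output would appear directly under the code cell.
print(grch38_bam_files)
```

The comment above the filter states the reason for the step, while the names and consistent formatting describe what the step does.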

## How Notebooks Work

The ENCODE example demonstrates many of the practices related to using Python in Jupyter, especially

* Combining Markdown, external images, and code in one place
* Writing self-documenting code
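
As a rough illustration of what self-documenting code can look like, the sketch below writes the same filter twice. The function names and the sensor readings are made up for this example and do not come from any of the notebooks shown here.

```python
# Harder to follow: the names say nothing about the data or the intent.
def f(d, t):
    return [x for x in d if x[1] > t]


# Closer to self-documenting: names and structure carry the explanation.
def select_readings_above_threshold(readings, threshold):
    """Keep (sensor_id, value) pairs whose value exceeds the threshold."""
    return [
        (sensor_id, value)
        for sensor_id, value in readings
        if value > threshold
    ]


# Hypothetical sample data.
sensor_readings = [("a", 0.2), ("b", 0.9), ("c", 0.5)]
high_readings = select_readings_above_threshold(sensor_readings, threshold=0.4)
print(high_readings)
```

Both functions return the same result; only the second tells a reader what that result means without extra comments.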


## Cell Types

You've seen the colors and text change on my screen as I walk through this notebook. Nearly all of the _cells_ in my notebook contain [Markdown](https://daringfireball.net/projects/markdown/), a lightweight markup language that enables you to [render text in notebooks](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html). Generally, Markdown renders text in a sans-serif font and code in a ```fixed-width, serif font```. By using both Markdown and code cells, you can embed documentation and output directly in a notebook.

You can set the type for each cell using the type dropdown:

![Type dropdown](../img/cell_types.png)
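
Behind the dropdown, a notebook file (`.ipynb`) is a JSON document, and each cell records its type in a `cell_type` field. The sketch below builds a minimal notebook structure by hand, assuming the version 4 notebook format; the cell ids and cell contents are invented for illustration, and in practice Jupyter writes this file for you.

```python
import json

# A minimal notebook in the version 4 layout: one Markdown cell, one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "id": "intro",
            "metadata": {},
            "source": ["# Results\n", "This cell explains the figure below."],
        },
        {
            "cell_type": "code",
            "id": "compute",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["total = 1 + 1\n", "total"],
        },
    ],
}

# The type chosen in the dropdown is what ends up in each cell's "cell_type".
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)

# Serialized as JSON, this is roughly what an .ipynb file contains.
notebook_json = json.dumps(notebook, indent=1)
```

Because the file is plain text, version control systems can store it and track changes to it like any other source file.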

From the ENCODE example, here's a code cell:

![Encode code cell](../img/encode-code.png)

Jupyter can run other types of code as well. Here's an example from ENCODE of Jupyter calling a shell to get some information about files in a folder:

![Encode ls cell](../img/encode-ls.png)
 

In each of those examples from the ENCODE notebook, the cell has already been run. If the code in a cell produces output, then the output will appear under the cell in the notebook. Here that means that the histogram produced by the code appears after the cell is run, and the output of the ```ls``` command appears after that cell is run. In this presentation, running a cell almost always renders the Markdown, as in this example. 

# Next Steps

I've just talked about why we produce documentation (for reproducibility and transparency) and shown you some of the basics of Jupyter notebooks. Next, Cristina is going to talk about version control, and Armand will talk about testing your code. Then we'll take a quick break and come back to do an exercise that combines documenting, versioning, and testing.