# 1. This is my house

### Environment reproducibility for Python

## 1.1 The [watermark](https://github.com/rasbt/watermark) extension

Tell everyone when your notebook was run, and with which packages. This is especially useful for nbviewer, blog posts, and other media where you are not sharing the notebook as executable code.

In [16]:
# if you don't have the watermark extension installed:
%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py
    
# once it is installed, you'll just need this in future notebooks:
%load_ext watermark

Installed watermark.py. To use it, type:
  %load_ext watermark
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark


In [18]:
%watermark -a "Peter Bull" -d -v -p numpy,pandas -g

Peter Bull 2016-03-17 

CPython 2.7.10
IPython 4.1.2

numpy 1.10.4
pandas 0.17.1
Git hash: 70983d51efd458b305a6326b89f5bf6437add450

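Where the extension is not available, much of the same information can be printed with the standard library. This is a minimal sketch, not part of watermark: `environment_stamp` is a hypothetical helper, and it assumes each module of interest exposes a `__version__` attribute (it falls back to "unknown" otherwise).

```python
import datetime
import json
import platform


def environment_stamp(modules=()):
    """Return lines describing the run date, interpreter, and module versions.

    Hypothetical helper for illustration; not part of the watermark extension.
    """
    lines = [
        "Run date: {}".format(datetime.date.today().isoformat()),
        "{} {}".format(platform.python_implementation(),
                       platform.python_version()),
    ]
    for module in modules:
        # Not every module defines __version__, so fall back to "unknown".
        version = getattr(module, "__version__", "unknown")
        lines.append("{} {}".format(module.__name__, version))
    return lines


# json is only a stand-in; in a notebook you would pass numpy, pandas, etc.
print("\n".join(environment_stamp(modules=[json])))
```

Unlike `%watermark -g`, this sketch does not record the Git hash.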

## 1.2 Laying the foundation

[`virtualenv`](https://virtualenv.pypa.io/en/latest/installation.html) and [`virtualenvwrapper`](http://virtualenvwrapper.readthedocs.org/en/latest/#) give you a new foundation.

 - Start from "scratch" on each project
 - Choose Python 2 or 3 as appropriate
 - Packages are cached locally, so no need to wait for download/compile on every new env
 
Installation is as easy as:
 - `pip install virtualenv`
 - `pip install virtualenvwrapper`
 - Add the following lines to `~/.bashrc`:
 
------

```
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Devel
source /usr/local/bin/virtualenvwrapper.sh
```

-----


To create a virtual environment:

 - `mkvirtualenv <name>`
 
To work in a particular virtual environment:

 - `workon <name>`
 
To leave a virtual environment:

 - `deactivate`
 
 
**`#lifehack`: create a new virtual environment for every project you work on**


## 1.3 The `pip` [requirements.txt](https://pip.readthedocs.org/en/1.1/requirements.html) file

Track your "minimum reproducible environment" in a `requirements.txt` file.

**`#lifehack`: never again run `pip install <package>`. Instead, update `requirements.txt` and run `pip install -r requirements.txt`**

In [12]:
!head -n 15 ../requirements.txt

engarde==0.3.1
notebook==4.1.0
ipython==4.1.2
jupyter==1.0.0
numpy==1.10.4
pandas==0.17.1
seaborn==0.7.0
matplotlib==1.5.1
q==2.6

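Because every line pins an exact version with `==`, the file is also easy to read programmatically, for example to compare the pins against what is installed. A minimal sketch, assuming only `name==version` lines; `parse_pins` is a hypothetical helper, and real requirements files allow more syntax than this handles.

```python
def parse_pins(requirements_text):
    """Parse `name==version` lines into a dict; other lines are skipped."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


example = """
numpy==1.10.4
pandas==0.17.1
# a comment line
q==2.6
"""
pins = parse_pins(example)
print(sorted(pins.items()))
```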

# 2. The Life-Changing Magic of Tidying Up

## 2.1 Consistent project structure means

 - relative paths work
 - other collaborators know what to expect
 - order of scripts is self-documenting

In [21]:
! tree ..

..
├── LICENSE
├── README.md
├── data
│   └── water-pumps.csv
├── notebooks
│   ├── data-science-is-software-talk.ipynb
│   └── edit-run-repeat.ipynb
├── requirements.txt
├── slides
│   ├── Data\ Science\ is\ Software\ -\ Lightning.pptx
│   └── Data\ Science\ is\ Software\ -\ Slides.pptx
└── src
    └── features

5 directories, 8 files

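With this layout, code can build every path from the project root instead of hard-coding locations. A minimal sketch, assuming the code runs from the `notebooks/` directory as this notebook does; `PROJECT_ROOT` and `project_path` are hypothetical names.

```python
import os

# Assumption: the working directory is notebooks/, one level below the root.
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), os.pardir))


def project_path(*parts):
    """Join path components onto the project root."""
    return os.path.join(PROJECT_ROOT, *parts)


data_file = project_path("data", "water-pumps.csv")
print(data_file)
```

A collaborator who clones the repository anywhere on disk can then call `pd.read_csv(project_path("data", "water-pumps.csv"))` without editing the path.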

# 3. Edit-run-repeat: how to stop the cycle of pain

The goal: don't edit, execute and verify any more. It's a fine way to start a project, but it doesn't scale as code runs longer and gets more complex.

### Debugging, refactoring, testing

 - Start with repeated code
 - Write functions - test with asserts
 - Refactor to modules - test with `unittest` 
 - Special testing tools for data science (`numpy.testing`, `engarde`)

## 3.1 No more docs-guessing

In [27]:
import pandas as pd

In [24]:
df = pd.read_csv("../data/water-pumps.csv")
df.head(1)

## STEP: Try adding parameter index=0

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


In [25]:
pd.read_csv?

In [None]:
df = pd.read_csv("../data/water-pumps.csv",
                 index_col=0,
                 parse_dates=["date_recorded"])
df.head(1)

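The `?` suffix is IPython syntax. In plain Python the same docstring is available through `help()` or `inspect.getdoc`. A minimal sketch, using `json.dumps` only as a stand-in for `pd.read_csv` so that the example has no third-party dependencies:

```python
import inspect
import json

# json.dumps stands in for pd.read_csv here; any callable works the same way.
doc = inspect.getdoc(json.dumps)

# Show the first line of the docstring, which is usually a one-line summary.
first_line = doc.splitlines()[0]
print(first_line)
```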
## 3.2 No more copy pasta

Don't repeat yourself.

In [28]:
import seaborn as sns
import matplotlib.pyplot as plt

In [26]:
plot_data = df['construction_year']
plot_data = plot_data[plot_data != 0]
sns.kdeplot(plot_data, bw=0.1)
plt.show()

plot_data = df['longitude']
plot_data = plot_data[plot_data != 0]
sns.kdeplot(plot_data, bw=0.1)
plt.show()

plot_data = df['amount_tsh']
plot_data = plot_data[plot_data != 0]
sns.kdeplot(plot_data, bw=0.1)
plt.show()

## STEP: Paste for 'latitude'

In [29]:
def kde_plot(dataframe, variable, upper=0.0, lower=0.0, bw=0.1):
    plot_data = dataframe[variable]
    plot_data = plot_data[(plot_data > lower) & (plot_data < upper)]
    sns.kdeplot(plot_data, bw=bw)
    plt.show()

In [30]:
kde_plot(df, 'construction_year', upper=2016)
kde_plot(df, 'longitude', upper=42)

In [None]:
kde_plot(df, 'amount_tsh', lower=20000, upper=400000)

## 3.3 No more guess-and-check

Interrupt execution with:
 - `%debug` magic: drops you out into pdb in IPython
 - `import q;q.d()`: drops you into pdb, even outside of IPython
 
Interrupt execution on an exception with the `%pdb` magic. Use [pdb](https://docs.python.org/2/library/pdb.html), the Python debugger, to debug inside a notebook. Key commands for `pdb` are:

 - `p`: Evaluate and print Python code
 
 
 - `w`: Where in the stack trace am I?
 - `u`: Go up a frame in the stack trace.
 - `d`: Go down a frame in the stack trace.
 
 
 - `c`: Continue execution
 - `q`: Stop execution

In [31]:
kde_plot(df, 'date_recorded')

ValueError: `dataset` input should have multiple elements.

In [32]:
def kde_plot_debug(dataframe, variable, upper=0.0, lower=0.0, bw=0.1):
    plot_data = dataframe[variable]
    plot_data = plot_data[(plot_data > lower) & (plot_data < upper)]
    
    %debug
    
    sns.kdeplot(plot_data, bw=bw)
    plt.show()
    
kde_plot_debug(df, 'date_recorded')

In [33]:
# "1" turns pdb on, "0" turns pdb off
%pdb 1

kde_plot(df, 'date_recorded')

Automatic pdb calling has been turned ON


ValueError: `dataset` input should have multiple elements.

> /Users/bull/Envs/dsis/lib/python2.7/site-packages/scipy/stats/kde.py(168)__init__()
    166         self.dataset = atleast_2d(dataset)
    167         if not self.dataset.size > 1:
--> 168             raise ValueError("`dataset` input should have multiple elements.")
    169 
    170         self.d, self.n = self.dataset.shape

ipdb> c


In [34]:
# turn off debugger
%pdb 0

Automatic pdb calling has been turned OFF


## 3.4 No more "Restart & Run All"

`assert` is the poor man's unit test: it stops execution if the condition is `False` and continues silently if it is `True`.

In [37]:
import numpy as np

In [38]:
def gimme_the_mean(series):
    return np.mean(series)

assert gimme_the_mean([0.0]*10) == 0.0

## 3.5 No more copy-pasta between notebooks 

Refactor shared functions into a module that each notebook imports.

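As a sketch of what that could look like, the function from 3.4 might live in a file such as `src/features/stats.py`. The file name is hypothetical, and the body below is a dependency-free stand-in for the `np.mean` version, with an added check for empty input.

```python
# Hypothetical contents of src/features/stats.py


def gimme_the_mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("cannot take the mean of an empty sequence")
    return sum(values) / float(len(values))


if __name__ == "__main__":
    # Quick smoke check when the file is run directly.
    assert gimme_the_mean([0.0] * 10) == 0.0
    print("ok")
```

Assuming `src/features/` also contains an `__init__.py`, a notebook in `notebooks/` could then run `import sys; sys.path.append("../src")` followed by `from features.stats import gimme_the_mean`, instead of pasting the definition into every notebook.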
## 3.6 No more letting other people (including future you) break your things

Test the code, so that later changes that break it are caught.

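A minimal `unittest` sketch for a function like `gimme_the_mean`. The function is redefined here in a dependency-free form so the example is self-contained; in a real project the test file would import it from the module instead. The test names are illustrative.

```python
import unittest


def gimme_the_mean(values):
    """Dependency-free stand-in for the np.mean version used in 3.4."""
    values = list(values)
    if not values:
        raise ValueError("cannot take the mean of an empty sequence")
    return sum(values) / float(len(values))


class GimmeTheMeanTest(unittest.TestCase):
    def test_all_zeros(self):
        self.assertEqual(gimme_the_mean([0.0] * 10), 0.0)

    def test_known_values(self):
        self.assertAlmostEqual(gimme_the_mean([1.0, 2.0, 3.0, 4.0]), 2.5)

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            gimme_the_mean([])


# Build and run the suite explicitly so this also works inside a notebook,
# where unittest.main() would parse sys.argv and call sys.exit().
suite = unittest.TestLoader().loadTestsFromTestCase(GimmeTheMeanTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```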
## 3.7 Specialty testing for data science

In [39]:
data = np.random.normal(0.0, 1.0, 1000000)
assert gimme_the_mean(data) == 0.0

AssertionError: 

In [41]:
np.testing.assert_almost_equal(gimme_the_mean(data),
                               0.0,
                               decimal=1)
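
Exact equality fails above because the mean of a finite random sample is almost never exactly 0.0; it scatters around zero with a standard error of sigma / sqrt(n). For the 1,000,000 unit-variance draws above that is about 0.001, so a tolerance of one decimal place is comfortably loose. A minimal standard-library sketch of choosing the tolerance from the sample size; the factor of 5 and the smaller sample size are illustrative assumptions.

```python
import math
import random

random.seed(0)  # fixed seed so the check is repeatable

n = 100000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
sample_mean = sum(samples) / n

# For unit-variance data the standard error of the mean is 1 / sqrt(n).
standard_error = 1.0 / math.sqrt(n)

# Allow five standard errors of slack around the expected mean of 0.0.
tolerance = 5 * standard_error

assert abs(sample_mean - 0.0) < tolerance
```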