%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import cv2
import os
import urllib
import datetime
import re
import time
import matplotlib as mpl
colors = ['#0055A7', '#2C3E4F', '#26C5ED', '#00cc66', 
          '#D34100', '#FF9700', '#091D32']
mpl_update = {'font.size':16, 'xtick.labelsize':14,
              'ytick.labelsize':14, 'figure.figsize':[12.0,8.0],
              'axes.labelsize':20, 'axes.labelcolor':'#677385',
              'axes.titlesize':20, 'lines.color':'#0055A7',
              'lines.linewidth':3, 'text.color':'#677385'}
mpl.rcParams.update(mpl_update)
from IPython import display

<img src="figures/svds.png" alt="SVDS" width="100" align="right">

# Catching trains: Iterative model development with Jupyter Notebook

### Data Day Seattle 2016

<h3>chloe@svds.com | @chloemawer <br />
Silicon Valley Data Science | @SVDataScience </h3>

* DS
* Mountain View, CA 
* 3 years
* Projects with clients from lots of industries
* Also R&D to develop new skills and give back to the community

# Agenda

<img src="figures/agenda.png" alt="Window" align="left" width="700">

* Caltrain is a train from SF to SJ
* Will tell about project
* Some work on a proof of concept
* Why we usually use Jupyter notebooks for such projects
* Best practices for data c
Caltrain, for those of you who don't know, is a train that runs from San Francisco down to San Jose and beyond. Today, I'm going to tell you a bit about our Caltrain project and some work that I did on a proof of concept, and then discuss why we usually use Jupyter Notebooks for such data science projects. Then I will focus on some best practices that allow this medium to be fully used not only by the data scientists developing the work, but also by other data scientists who consume it and by managers and stakeholders who want to make decisions from it. These practices focus on two major aspects: communication and reproducibility. Lastly, I will walk you through a Jupyter Notebook that was the product of my proof of concept work, where you'll get to see a notebook in action as well as learn a little bit about motion detection.

<img src="figures/caltrain_header.jpg" alt="CaltrainHeader" width="960" height="200">
## The Caltrain obsession


* ~50,000 weekly including many of our company
* Long history - early meeting took place outside at happy hour (aka rush hour)
* First office
* Annoyed with lack of adequate predictions 

<center><img src="figures/caltrain_tweet.jpg" alt="Tweet" width="425" style="horizontal-align:middle"></center>

* Warn Caltrain not client 
* But appreciate our work

<center><img src="figures/appdemo.gif" alt="AppScreenShots" width="350"></center>

* Want to predict Caltrain delays 
* No adequate data

<br />
<br />
<img src="figures/caltrain-sign.jpg" alt="Window" align="right" width="400">
<br />

# Did the train really leave?

* Only data is that which shows up on the sign 
* Have to infer, disappearance means departure

<br />
<br />
<img src="figures/train_in_window.jpg" alt="Window" align="right" width="400">
<br />
<br />
<br />
# Did the train really leave?
<br />

* Take advantage of location
* Maybe we can use video to detect train and direction
* If possible, set up Raspberry Pi in real-time 
* First step, POC to prove we should put resources into making it deployable
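
As a rough illustration of the idea of detecting a train and its direction from video, here is a minimal frame-differencing sketch. It is not the method used in the project: the function names, the threshold, and the synthetic frames (plain nested lists standing in for grayscale images) are all assumptions made for the example.

```python
def changed_pixels(prev_frame, frame, threshold=30):
    """Return (row, col) positions whose brightness changed by more than threshold."""
    return [
        (r, c)
        for r, (prev_row, row) in enumerate(zip(prev_frame, frame))
        for c, (p, q) in enumerate(zip(prev_row, row))
        if abs(q - p) > threshold
    ]


def mean_column(pixels):
    """Average column index of the changed pixels, or None if nothing moved."""
    if not pixels:
        return None
    return sum(c for _, c in pixels) / len(pixels)


def direction(frames, threshold=30):
    """Guess horizontal direction of motion from how the changed region drifts."""
    centers = []
    for prev_frame, frame in zip(frames, frames[1:]):
        center = mean_column(changed_pixels(prev_frame, frame, threshold))
        if center is not None:
            centers.append(center)
    if len(centers) < 2 or centers[-1] == centers[0]:
        return "none"
    return "right" if centers[-1] > centers[0] else "left"


def make_frame(position, width=8, height=3):
    """Synthetic frame (hypothetical data): dark background, one bright column."""
    return [[255 if c == position else 0 for c in range(width)] for _ in range(height)]


# A bright column moving from column 1 to column 4 across four frames.
frames = [make_frame(p) for p in (1, 2, 3, 4)]
print(direction(frames))
```

With real video, the frames would come from a capture library and would need noise handling; the sketch only shows the core comparison between consecutive frames.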

# Proof of concept (POC)
<br />
<video width="640" height="480" controls>
  <source src="video/orig.mp4" type="video/mp4">
</video>

* A while ago, set out to develop a PoC to see if this would be worth setting up

# Proof of concept
* Fast, iterative development
* Record of prior work
* Easy QA/feedback
* Communication of results for decision making 
* Base of work for team to iterate from
* Educational material 
* Easy repurposing 

* What I want in developing a PoC like this or those for clients is

<center><img src="figures/jupyterplus.svg" alt="Jupyter" width="450"></center>


<img src="figures/jupyter.png" alt="Jupyter" width="250" align='right'>
<br />
# Jupyter notebook 

* Born in 2011 as the IPython Notebook
* Browser-based interactive web app for creating documents with
    * Live code
    * Explanatory text
    * Visualizations
    * Equations
* Three cell types
    * Code
    * Markdown and LaTeX
    * Straight HTML


<img src="figures/jupyter.png" alt="Jupyter" width="250" align='right'>
<br />
# Jupyter notebook 

* In 2014, moved to Jupyter project 
* Supports over 40 programming languages including:
    * Python
    * R
    * Julia
    * Scala

# Why use the Jupyter Notebook? 
* Aligns with how we think 
* Documentation, code, figures, and results combined
* Immediate feedback during development 
* HTML, PDF, Markdown, and slide output options
* Reduces the content creation necessary for communication to other parties

# Jupyter notebook for everyone

For these reasons, the Jupyter notebook is an effective development medium not only for the data scientist developing the work, but also for the others they work with, as well as the stakeholders and managers who consume that work.

# Me, me, me

* Quickly iterate
* Easily recall where certain results occurred
* Streamline process
* Reproduce

If properly used, Jupyter notebooks can help data scientists:
* Maximize productivity
* Increase learnings
* Reduce documentation

# Others like me

* QA
* Learn
* Reproduce
* Reuse 

For other data scientists: 
* Serve as learning materials
* Allow easier QA 
* Provide starting work for new projects 

# Not me

* Make decisions
* Assess value 

For managers: 
* Serve as a medium to make decisions from 
* Create a knowledge base that will lead to 
    * Onboarding and training materials
    * Faster analysis 
    * Higher quality work

# Basic requirements
* Clear communication
* Reproducibility

# To enable:
* Streamlined scientific process
* Reusability
* Decision making
* QA 

# Communication

# Communicating to future me
* Why did you do that? 
* What will you want to remember later on?
* What didn't work? Why not? 
* What did I learn here? 
* What are my next steps? 

# Communicating to other data scientists
* What were the limitations of this analysis? 
* What assumptions were made? 
* What choices were made and why? 

# Communicating to non-data scientists
* Who? 
* When?
* Why is this important? 
* What should the audience know when communicating this work to others?

# Template and automate

In `.bashrc` file: 

`newnb () { cp /path/to/template/template.ipynb "$(date +"%Y-%m-%d")-initials-$1.ipynb"; }`

With `initials` replaced by `cmm`, running `newnb data-day-seattle-demo` results in a file: `2016-07-23-cmm-data-day-seattle-demo.ipynb`
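
The same naming convention can be sketched in Python for anyone who prefers not to use a shell function. The helper names `notebook_name` and `new_notebook` are hypothetical, and the template path is whatever you pass in; this is one possible sketch, not part of the original setup.

```python
import shutil
from datetime import date
from pathlib import Path


def notebook_name(name, initials, today=None):
    """Build a file name of the form YYYY-MM-DD-initials-name.ipynb."""
    today = today or date.today()
    return f"{today:%Y-%m-%d}-{initials}-{name}.ipynb"


def new_notebook(name, initials, template, dest_dir="."):
    """Copy the template notebook to a dated file; refuse to overwrite."""
    target = Path(dest_dir) / notebook_name(name, initials)
    if target.exists():
        raise FileExistsError(target)
    shutil.copy(template, target)
    return target


# The date is fixed here only so the example is repeatable.
print(notebook_name("data-day-seattle-demo", "cmm", date(2016, 7, 23)))
```

Putting the date first means the notebooks sort chronologically in a directory listing.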

# Call out boxes

<br />
<img src="figures/alert-boxes.png" alt="CaltrainHeader" width="960" height="200">

Mention ability to process HTML and Markdown

# ToC2 Extension

<br />
<img src="figures/toc.png" alt="CaltrainHeader" width="960" height="200">

What are extensions? 

# Hiding code

# Keep everything connected

<br />
<img src="figures/related-nbs.png" alt="CaltrainHeader" width="560">

# Knowing when to stop

# Reproducibility

# Environment
* `requirements.txt`
* Docker container
* List packages

<img src="figures/packages.png" alt="CaltrainHeader" width="960" height="200">
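
One way to list packages inside the notebook itself is sketched below using the standard library's `importlib.metadata`. The helper name `package_versions` and the package names passed to it are assumptions for the example; the slide's figure may show a different approach.

```python
import sys
from importlib import metadata


def package_versions(names):
    """Map each package name to its installed version, plus the Python version."""
    versions = {"python": sys.version.split()[0]}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the listing is always complete.
            versions[name] = "not installed"
    return versions


# Example package names; substitute the ones your analysis imports.
for package, version in package_versions(["numpy", "pandas", "matplotlib"]).items():
    print(f"{package}: {version}")
```

Printing this near the top of a notebook leaves a record of the environment alongside the results, which complements a `requirements.txt` or a Docker container.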

# The Data

<br />
<img src="figures/data-input.png" alt="CaltrainHeader" width="960" height="200">

# Analysis is never linear, but code is

* Code won't know that you ran all of this and then added something up above. 
* Don't add things willy-nilly above where you are.
* Consider inputs and outputs of each stage of your analysis. 
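
A minimal sketch of what considering inputs and outputs can look like in code: each stage is a function that takes its inputs as arguments and returns a new output, so running the cells from top to bottom always gives the same result. The stage names and the delay values are made up for illustration.

```python
def load(raw_lines):
    """Stage 1: input is raw text lines, output is a list of delay minutes."""
    return [float(line) for line in raw_lines if line.strip()]


def clean(delays, max_delay=120.0):
    """Stage 2: drop impossible values; returns a new list, input is unchanged."""
    return [d for d in delays if 0 <= d <= max_delay]


def summarize(delays):
    """Stage 3: input is cleaned delays, output is a small summary dict."""
    mean = sum(delays) / len(delays) if delays else None
    return {"count": len(delays), "mean": mean}


# Hypothetical raw input; each stage's output gets its own name.
raw = ["3", "0", "", "-5", "12", "999"]
delays = load(raw)
cleaned = clean(delays)
summary = summarize(cleaned)
print(summary)
```

Because no stage overwrites an earlier variable, adding a step later cannot silently change what an earlier cell already computed.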

# Think before you add any code above

# Don't delete

# Scratch Pad Extension
<img src="figures/scratchpad.gif" alt="Scratchpad" width="600" align='left'>


# ExecuteTime Extension

<center><img src="figures/executetime.png" alt="End" width="960"></center>


# Pay attention to inputs and outputs

<img src="figures/inout1.png" alt="End" width="160" align="right">
<img src="figures/inout2.png" alt="End" width="160" align="left">


# Pay attention to inputs and outputs
<center><img src="figures/dag.png" alt="End" width="360" align="center"></center>
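
To make the dependency graph idea concrete, here is a small sketch that writes the stages of an analysis down as a directed acyclic graph and derives a valid run order from it. The stage names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each key depends on the stages in its value set.
dependencies = {
    "load_video": set(),
    "detect_motion": {"load_video"},
    "estimate_direction": {"detect_motion"},
    "plot_results": {"detect_motion", "estimate_direction"},
}


def downstream(stage, dependencies):
    """Stages that must be re-run when `stage` changes (itself excluded)."""
    affected = set()
    frontier = [stage]
    while frontier:
        current = frontier.pop()
        for name, deps in dependencies.items():
            if current in deps and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


# An order in which every stage runs after the stages it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
print(sorted(downstream("detect_motion", dependencies)))
```

Writing the graph down shows which cells need re-running after a change, which is hard to see from cell order alone.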


# Lastly... know when to get out of the notebook

# DEMO

# Conclusions

<center><h4>To view SVDS speakers or to receive a copy <br /> of our slides, go to: www.svds.com/DDSea2016 </h4></center>
<center><img src="figures/thankyou.png" alt="End" width="700"></center>
<center><h3>chloe@svds.com | @chloemawer  </h3></center>
