# Data Science is Software
From: SciPy Con 2016

**Instructors:** Peter Bull and Isaac Slavitt, co-founders of DrivenData

This tutorial provides vital developer lifehacks for the working Jupyter data scientist. You’ll learn the basics of writing good software, which will prepare you to be a valuable contributor to your company’s wider engineering organization. You’ll learn the following topics:
   * How to effectively structure your project, using the `cookiecutter-datascience` package
   * How to **set up a virtual environment**, allowing you to isolate the project you’re working on from your other projects
   * How to **use a Linux tool called “make” to automate parts of your project**
   * How to **write your code so it’s reproducible**, meaning you can come back to a project six months later and easily figure out everything you’ve done
   * How to **modularize your code into packages** so you don’t end up writing the same things repeatedly


In [2]:
from tqdm import tqdm
from time import sleep

* If we wrap whatever we're iterating over in `tqdm()`, a progress bar is printed automatically while the for loop runs (this also works in Jupyter):

In [3]:
for i in tqdm(range(1000)):
    sleep(0.01)

100%|██████████| 1000/1000 [00:11<00:00, 85.70it/s]


#### AGENDA
* project structure
* environments and reproducibility
* coding for reusability
* testing
* collaboration

#### Intro
* **DrivenData** runs machine learning competitions for non-profits, NGOs and government (social-impact organizations)
* "There's a little bit of overhead to some of these best practices, but the investment is definitely going to be worth it."
* [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/)

## Folders:

<img src='data/ds_software1.png' width="600" height="300" align="center"/>

### Data:

   #### Raw
   * Untouched, original data
   * Example: data straight from client or database

   #### External 
   * Data that comes from other places
   * Example: U.S. Census Data

   #### Interim
   * Example: generated data that might not be used in the final analysis but may be used by other pieces of the analysis

   #### Processed
   * Processed data used for final model
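
The data layout above can be sketched in a few lines of Python. This is only an illustration of the folder structure, not how the Cookiecutter template itself creates it; the function name `create_data_dirs` and the `.gitkeep` placeholder files are assumptions made for this example.

```python
from pathlib import Path
import tempfile

# Subfolder names follow the data layout described above.
DATA_SUBDIRS = ("raw", "external", "interim", "processed")


def create_data_dirs(project_root):
    """Create data/<subdir> folders under project_root and return their paths."""
    created = []
    for name in DATA_SUBDIRS:
        path = Path(project_root) / "data" / name
        path.mkdir(parents=True, exist_ok=True)
        # Assumption: an empty .gitkeep file so version control can track
        # the otherwise empty folder.
        (path / ".gitkeep").touch()
        created.append(path)
    return created


if __name__ == "__main__":
    # Use a throwaway directory so the sketch does not touch a real project.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_data_dirs(tmp):
            print(path.relative_to(tmp))
```

Keeping `raw` separate from `processed` makes it harder to overwrite the original data by accident.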
   
<img src='data/ds_software2.png' width="600" height="300" align="center"/>

***
***
#### References
* Data dictionaries
* manuals
* any explanatory materials

#### Requirements.txt
* Where we codify all the dependencies our code needs in order to run

#### src
* **Source directory**
* If we're writing code or refactoring code from our notebooks, this is where that would live

* "Master notebook pattern" is fine for some projects
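
As a rough illustration of moving notebook code into `src`, here is the kind of small, reusable function that might live there. The function name and the module path mentioned in the comment are hypothetical, chosen only for this sketch.

```python
# Hypothetical contents of a module such as src/features/clean.py
def normalize_column_names(names):
    """Return lowercase, underscore-separated versions of the given column names."""
    cleaned = []
    for name in names:
        # Trim the ends, lowercase, split on any whitespace, rejoin with underscores.
        cleaned.append("_".join(name.strip().lower().split()))
    return cleaned


if __name__ == "__main__":
    print(normalize_column_names(["  Customer ID", "Signup   Date "]))
```

Once the function lives in a module, every notebook in the project can import it instead of carrying its own copy.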

<img src='data/ds_software3.png' width="600" height="300" align="center"/>

<img src='data/ds_software4.png' width="600" height="300" align="center"/>

<img src='data/ds_software5.png' width="600" height="300" align="center"/>

* Now we have turned this data process into a directed acyclic graph (DAG)
* Several tools let us declare what the tasks are and make sure they run in a specific order:
    * "make"
    * Airflow
    * Luigi
    * joblib
    * snakemake
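
To make the DAG idea concrete, here is a minimal sketch that uses the standard library's `graphlib` (Python 3.9+) rather than any of the tools listed above. The task names and the `run_pipeline` helper are hypothetical; real workflow tools add features such as skipping tasks whose outputs are already up to date.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks that must finish first.
# The task names are hypothetical.
PIPELINE = {
    "clean_raw_data": set(),
    "build_features": {"clean_raw_data"},
    "train_model": {"build_features"},
    "make_report": {"train_model", "build_features"},
}


def run_pipeline(dependencies, actions):
    """Run each task's action once, only after its dependencies have run."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        actions[task]()
    return order


if __name__ == "__main__":
    log = []
    # Stand-in actions that only record that the task ran.
    actions = {name: (lambda name=name: log.append(name)) for name in PIPELINE}
    print(run_pipeline(PIPELINE, actions))
```

If the dependencies contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is why the graph has to be acyclic.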

<img src='data/ds_software6.png' width="600" height="300" align="center"/>

* Let notebooks be what notebooks are good at, which is: exploration, experimentation, writing & sharing ideas
* Notebooks are not good at doing things the same way every time

#### MREs (Minimal Reproducible Environments)
* Give anyone using your project, including yourself in the future, the absolute minimum needed to reproduce your code and results.
* You don't want to include every package in the whole world in every project that you have; that's going to be a huge headache for anyone who's installing things
    * Say a package has complicated dependencies that you have to compile. If your project doesn't actually use it, it shouldn't be part of the dependencies for the project you're working on
    
<img src='data/ds_software7.png' width="600" height="300" align="center"/>

#### Tools that help you to manage a reproducible environment
* **watermark**
* **conda env**
* **dotenv**
* **pip requirements.txt**

#### Options for more complex environments
* **docker**
* **vagrant**

#### Watermark extension
* "bare minimum" way to help people reproduce your environment
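
The watermark extension is an IPython magic, so it only runs inside a notebook or IPython session. As a rough stand-in that works in plain Python, the sketch below reports the interpreter version and the versions of selected packages using only the standard library. The `environment_report` function is hypothetical and is not part of watermark.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Return lines describing the Python version and selected package versions."""
    lines = [f"Python {platform.python_version()} ({sys.platform})"]
    for name in packages:
        try:
            lines.append(f"{name} {metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing the whole report.
            lines.append(f"{name} not installed")
    return lines


if __name__ == "__main__":
    print("\n".join(environment_report(["numpy", "pandas", "tqdm"])))
```

Printing a report like this at the top of a notebook records which versions produced the results.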

* Highly recommended: name your environment the same as your project root

* `conda env list` lists your environments, with a star marking the current one
* `cmd + k` on a Mac clears the terminal, like `cls` on Windows

* The next thing to think about: what is the minimal reproducible environment for someone else who's using this project?
* You want an explicit declaration of all the packages your project needs

* In Python projects, the convention is to declare dependencies in a file called `requirements.txt`
    * Explicitly list each package name and, optionally, a version constraint using `>=`, `==`, or `<=`
* Never run `pip install` followed by a package name
    * **Always** add the package name to `requirements.txt` and then run `pip install -r requirements.txt`
    * This way you can always keep track of what packages you've installed in a particular environment
    * `pip freeze` will dump out everything that is currently in your environment
    * You can pipe `pip freeze` into a file 
        * `pip freeze > new-requirements.txt`
        * Only downside: this pins every requirement to an exact version (`==`)
        * This approach does not work for packages installed with `conda install`
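
To show what the lines in a `requirements.txt` look like, here is a small parser for the simple forms mentioned above (a bare name, or a name with `>=`, `==`, or `<=` and a version). It is a sketch under that assumption only: real requirements files allow more syntax, and the `packaging` library's `Requirement` class is the robust way to parse them. The package versions in the usage example are made up.

```python
import re

# Matches lines such as "pandas", "numpy>=1.10", or "tqdm==4.7.0".
# Only the three operators mentioned above are handled.
REQUIREMENT_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z0-9._-]+)\s*"
    r"(?:(?P<op>>=|==|<=)\s*(?P<version>[A-Za-z0-9.*+!_-]+))?$"
)


def parse_requirements(text):
    """Parse simple requirement lines into (name, operator, version) tuples."""
    parsed = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = REQUIREMENT_PATTERN.match(line)
        if match is None:
            raise ValueError(f"Unsupported requirement line: {raw_line!r}")
        parsed.append((match["name"], match["op"], match["version"]))
    return parsed


if __name__ == "__main__":
    example = """
    # hypothetical requirements.txt contents
    pandas
    numpy>=1.10
    tqdm==4.7.0  # progress bars
    """
    for requirement in parse_requirements(example):
        print(requirement)
```

A bare name leaves the operator and version as `None`, which corresponds to "any version" when installing.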

<img src='data/efficient26.png' width="600" height="300" align="center"/>