#Building Better Data Science Projects, One Codeblock at a Time



#The Golden Pipeline

Lots of people are thinking about pipelines these days:

- https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
- https://github.com/dssg/UPSG
- https://github.com/dssg/eights
- https://github.com/allenai/pipeline
- https://github.com/ropensci/rrrpkg
- https://github.com/dssg/feature_gen_pipeline


#Pipelines are a great idea.

It’s true that ultimately most data science projects can be crisply defined as 
data ingestion -> feature extraction -> training -> ensembling -> validation

This makes it sound as though doing a data science project is just a matter of building the pipeline.

#What makes building a data science project so hard?
You are splitting your time between telling a computational narrative and building production software.*

*For more on computational narrative, read the first part of the Project Jupyter blog post: http://blog.jupyter.org/2015/07/07/project-jupyter-computational-narratives-as-the-engine-of-collaborative-data-science/

That means you often have to go from writing code interactively for exploration and discovery, to writing code for reproducible and rapidly repeatable tasks, then back to interactive code for discovery.

On top of that, many of you are just getting a handle on many of these tools.

This is very challenging. It can cause your repos to get out of hand. 
- https://github.com/dssg/nfp2

#The hard truth: data science projects are more than the pipeline

A full project will contain
- documentation and background materials
- presentations and reports
- bash scripts and configuration files
- exploratory scripts and figuring-shit-out spaghetti code
- main pipeline library
- tests
- helper functions and utilities
- web applications
- devops junk

##Example
https://github.com/dssg/WorldBank2015

You are in the awkward, gangly phase of your project, facing the growing pains of moving from a wonderful, wide-eyed exploratory childhood to a much more structured and predictable adulthood.

This is about helping with that difficult transition. This is your project’s coming-of-age novel: from exploratory scripts to reliable, modular, reusable code.


#Goals

##Increase Productivity.
- I want to help you make changes to your code that will make you and your team more productive during the last four

##Increase Reusability
- I want to help you make changes and start using best practices that will increase the likelihood that someone, anyone, can make sense of what you’ve done after the fellowship is over, and may even want to reuse your code. This is a high bar. 

#Who is this Aimed at?
- beginner, intermediate, pros
- going to try and hit on some tips for everyone
- we’ll start with the basics and go from there

# How is it structured?

##Big Tip!
- some explanatory text
##Examples
##Hands-on applications and Q&A (for the big stuff)


#Tips on Repo Organization

#Describe your repo!
- this reminds you what you’re doing and lets other people know what to do

##Examples
https://github.com/dssg

#Reorganization can make your life easier
- suggested files in root
     - README.md
     - .gitignore
     - license.txt
     - global_config
     - [setup.py]
     - [requirements.txt]
     - [travis]
- suggested directories in root
     - pipeline_module/app
     - scripts/notebooks
     - db
     - docs
     - reports
     - test


#Tips on root files

#Separate config files that serve two purposes
- Config files serve two purposes: they make things safer by hiding sensitive information, and they make things faster and more reproducible by consolidating information in one place. Keep the two kinds separate.
- I prefer a global configuration file called settings.py, plus JSON or YAML files for model configuration
- put settings.py in your .gitignore
- put a settings.py.example in your repo’s root directory
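
As a rough sketch of that split: the first part below stands in for a `settings.py.example`, and the second for a model configuration file. Every name and value here is a hypothetical placeholder, and the JSON is inlined as a string only so the example runs on its own.

```python
import json

# --- settings.py.example (committed to the repo) --------------------
# Copy this file to settings.py (which is listed in .gitignore) and fill
# in real values. All names and values here are placeholders.
DB_HOST = "localhost"
DB_USER = "your_username"
DB_PASSWORD = "change_me"

# --- model configuration (safe to commit) ---------------------------
# In a project this would live in its own file, for example a
# hypothetical model_config.json; it is inlined here for the sketch.
MODEL_CONFIG_JSON = """
{
    "model": "random_forest",
    "n_estimators": 100,
    "features": ["age", "income"]
}
"""


def load_model_config(text):
    """Parse a JSON model configuration string into a dict."""
    return json.loads(text)


model_config = load_model_config(MODEL_CONFIG_JSON)
print(model_config["model"], model_config["n_estimators"])
```

The secrets never enter version control, while the model configuration is tracked alongside the code that uses it.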

#Tips on refactoring

#Each journey starts with a single script
- The journey of a code snippet
https://www.youtube.com/watch?v=tyeJ55o3El0
     - exploratory scripts
     - functional scripts
     - modules
     - Classes and Objects

#Everyone should get to modules
- modules make everyone’s life easier
- they organize your functions
- they simplify the coding

#Everything doesn’t have to be OO
- OO programming is powerful.
- It can also be unnecessarily confusing to beginners. Don’t feel pressure to make it object oriented just because the cool kids are doing it. Only do it if it’s going to make you more productive, which, if you’re relatively new to programming, it probably won’t.

#Basics on transitioning your code from interactive scripts to functions

#Think functionally
- every function takes something or things and returns a thing.
- the most important thing to keep track of is what a function receives and what it returns
- almost all of the functions in your pipeline will fall into three buckets
     - takes 
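
One way to picture "receives and returns", as a sketch only: each stage below takes the previous stage's output and hands something to the next. The function names and toy data are invented for illustration.

```python
def load_rows():
    """Receive nothing; return a list of raw records (toy data)."""
    return [
        {"height_cm": 180, "weight_kg": 81},
        {"height_cm": 160, "weight_kg": 64},
    ]


def add_bmi(rows):
    """Receive a list of records; return new records with a 'bmi' field."""
    return [
        dict(row, bmi=row["weight_kg"] / (row["height_cm"] / 100) ** 2)
        for row in rows
    ]


def mean_bmi(rows):
    """Receive records that have a 'bmi' field; return a single float."""
    return sum(row["bmi"] for row in rows) / len(rows)


# Each function's output is exactly what the next one expects as input.
result = mean_bmi(add_bmi(load_rows()))
print(round(result, 1))
```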

#Use Docstrings, your teammates will thank you. The World will thank you.
- docstrings help you and others know what a function does and how to use it
- put the docstring directly below the definition line
- `'''This is a docstring'''`
- say what the function does, what it receives, and what it returns

##Examples
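
A short sketch of a docstring in that shape. `count_missing` is a made-up function, not taken from any DSSG repo, and the "Receives / Returns" layout is just one reasonable convention.

```python
def count_missing(rows, column):
    """Count the records in which a column is missing or None.

    Receives:
        rows: list of dicts, one per record.
        column: name of the key to check in each record.

    Returns:
        int, the number of records where `column` is absent or None.
    """
    return sum(1 for row in rows if row.get(column) is None)


# In an interactive session, help(count_missing) displays the docstring.
print(count_missing([{"age": 31}, {"age": None}, {}], "age"))
```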

#Give default values when defining your functions
- this will mean things run the way you want, even when you forget
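
A minimal sketch of default values, using a hypothetical `split_indices` helper. The 80/20 split and the seed of 0 are arbitrary choices made for the illustration.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into (train, test) lists.

    With the defaults, the split is 80/20 and identical on every run.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# Forgetting the optional arguments still gives a sensible result.
train, test = split_indices(10)

# Override a default only when you need something different.
train_big, test_big = split_indices(10, test_fraction=0.5)

print(len(train), len(test), len(train_big), len(test_big))
```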

##Action: Find a function in your code you can add a docstring to and do it now


#When you call a function, use named arguments 
- your team will thank you, you will thank you
- we’ll cover keyword arguments in a bit
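
A small sketch of the difference, with a made-up `clip` function:

```python
def clip(value, lower, upper):
    """Return value limited to the range [lower, upper]."""
    return max(lower, min(value, upper))


# Positional: the reader has to remember the parameter order.
a = clip(15, 0, 10)

# Named: the call documents itself, and the order no longer matters.
b = clip(value=15, lower=0, upper=10)
c = clip(upper=10, value=15, lower=0)

print(a, b, c)
```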

##Action: Find a function call and name the arguments


#Tips on going from functions in a single file to modules


#Organize your functions by functional grouping in pipeline
- Basic structure
     - ETL
     - features
     - models
     - evaluation
     - utils
     - tests
- start with files that group functions
- then move to directories that group files

##Example
https://github.com/dssg/cincinnati2015
https://github.com/dssg/Australian_Conservation_Foundation/blob/master/Models/Clustering/clustering_template.ipynb

##Example
https://github.com/dssg/Australian_Conservation_Foundation/tree/master/Models



#Making your grouping of files into a module is not hard
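- add an empty `__init__.py` file to the directory
- bring in all the functions from a file by putting
`from directory_name import file_name`
at the top of your script

Now you’ve got all the functions, available as `file_name.function_name`!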

##Example
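
A self-contained sketch of the idea. It builds a throwaway package in a temporary directory so it can run anywhere; `my_pipeline`, `features`, and `double` are hypothetical names, and in a real project the files would simply sit in your repo.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway project directory so the sketch is self-contained.
project_root = tempfile.mkdtemp()
package_dir = os.path.join(project_root, "my_pipeline")
os.mkdir(package_dir)

# An empty __init__.py marks the directory as a package.
open(os.path.join(package_dir, "__init__.py"), "w").close()

# One file of grouped functions inside the package.
with open(os.path.join(package_dir, "features.py"), "w") as handle:
    handle.write("def double(x):\n    return 2 * x\n")

# Make the project root importable, as it is when you run from that directory.
sys.path.insert(0, project_root)
importlib.invalidate_caches()

from my_pipeline import features  # imports the file as a module

# Functions are reached through the module name.
print(features.double(21))
```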



#Tips on going from groupings of functions to classes

#Go OO if you know how

#Protips on Functions

#Functions in python are first class citizens. They are objects, so you can do a lot with them
- assign functions to variables
- define functions inside of other functions
- functions can be passed as parameters to other functions
- functions can return other functions
- inner functions have access to enclosing scope (closure)
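
The points above can be sketched in a few lines; all function names are invented for the illustration.

```python
def square(x):
    return x * x


# Assign a function to a variable.
f = square


# Pass a function as a parameter to another function.
def apply_twice(func, value):
    return func(func(value))


# Define a function inside another function and return it. The inner
# function keeps access to `factor` from the enclosing scope (a closure).
def make_multiplier(factor):
    def multiply(x):
        return x * factor
    return multiply


triple = make_multiplier(3)
print(f(4), apply_twice(square, 3), triple(5))
```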

#Decorators let you change things without changing things!
- decorators extend the behavior of functions that we don’t want to modify
- decorators “wrap” functions, so that you can define a function that takes a function as an argument, generates a new function that augments the work of the original function, and returns the augmented function for use.

##Example
https://github.com/dssg/education-college/blob/master/code/modeling/featurepipeline/abstractboundedfeature.py
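
For a smaller, standalone illustration (not related to the repository linked above), here is a sketch of a decorator that counts calls. `count_calls` and `add` are made-up names; `functools.wraps` is the standard-library helper that preserves the wrapped function's name and docstring.

```python
import functools


def count_calls(func):
    """Decorator that records how many times `func` has been called."""
    @functools.wraps(func)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def add(a, b):
    """Return the sum of a and b."""
    return a + b


add(1, 2)
add(3, 4)
print(add.calls, add.__name__)
```

The body of `add` is untouched; the counting behavior lives entirely in the wrapper.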



#Remember `**kwargs` lets you pass any named arguments and collects them into a dict

##Example

    def my_function(**kwargs):
        # kwargs is a dict holding every keyword argument passed in
        print(kwargs)

    my_function(a=12, b="abc")  # prints {'a': 12, 'b': 'abc'}



##Resources
http://thecodeship.com/patterns/guide-to-python-function-decorators/


# Think functional when performance matters
- Main reason: for a program to use many processors at once, efficiently and reliably, its most compute-intensive units of code should behave like pure functions
- Simple way to get started in python: map, filter, and reduce (in Python 3, reduce lives in functools)
- map: get everything back, differently
     - pass a function and an iterable, get a transformed iterable back (many to many)
- filter: get fewer things back
     - pass a function that returns a boolean value and an iterable, get a subset back
- reduce: get one thing back
     - pass a function and an iterable, get a single value back
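
A minimal sketch of the three operations on a toy list. It is written for Python 3, where `map` and `filter` return lazy iterators (hence the `list(...)` calls) and `reduce` is imported from `functools`.

```python
from functools import reduce  # reduce is not a builtin in Python 3


def square(n):
    return n * n


def is_even(n):
    return n % 2 == 0


def add(total, n):
    return total + n


numbers = [1, 2, 3, 4, 5]

squares = list(map(square, numbers))     # many in, many out
evens = list(filter(is_even, numbers))   # many in, fewer out
total = reduce(add, numbers, 0)          # many in, one out

print(squares, evens, total)
```

None of the three helper functions touches anything outside its arguments, which is what makes this style easy to spread across processors.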


#Using anonymous functions is compact, but can be confusing
- necessary once you start
- a must for cluster jobs (Spark)
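
To show the trade-off, here is the same sort written with a named function and with an anonymous one. The records are toy data.

```python
records = [("carol", 35), ("alice", 52), ("bob", 41)]


# Named function: more lines, but it has a name and can carry a docstring.
def age_of(record):
    return record[1]


by_age_named = sorted(records, key=age_of)

# Anonymous function: compact, defined right where it is used.
by_age_lambda = sorted(records, key=lambda record: record[1])

print(by_age_named == by_age_lambda)
print([name for name, _ in by_age_lambda])
```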

##Example

Draw a grid with data types across the top and operations on those data types down the left side. If you slice the grid vertically, you're doing OO; if you slice the grid horizontally, you're doing FP



#Protips on Python


#Iterators
- python generally doesn’t need indexes
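
A brief sketch of looping without index bookkeeping, using the built-ins `enumerate` and `zip` on toy lists:

```python
names = ["ada", "grace", "alan"]
scores = [90, 85, 77]

# Iterate over the items themselves; no range(len(...)) needed.
for name in names:
    print(name)

# When the position is needed, enumerate supplies it.
numbered = []
for position, name in enumerate(names, start=1):
    numbered.append((position, name))

# zip walks several sequences in step.
paired = []
for name, score in zip(names, scores):
    paired.append((name, score))

print(numbered)
print(paired)
```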


#Named tuples
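
A small sketch of `collections.namedtuple`, which gives tuple fields readable names. The `Prediction` type and its fields are made up for the illustration.

```python
from collections import namedtuple

# A lightweight record type with named fields.
Prediction = namedtuple("Prediction", ["record_id", "label", "score"])

p = Prediction(record_id=7, label="high_risk", score=0.91)

# Fields are read by name instead of by position.
print(p.label, p.score)

# The object still unpacks like a plain tuple.
record_id, label, score = p
```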

#List comprehension
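
A short sketch comparing a loop with the equivalent list comprehension, on toy data:

```python
values = [3, -1, 4, -1, 5, -9]

# Loop version: build the list step by step.
positive_squares_loop = []
for v in values:
    if v > 0:
        positive_squares_loop.append(v * v)

# Equivalent list comprehension: transform and filter in one expression.
positive_squares = [v * v for v in values if v > 0]

print(positive_squares)
```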

#Protips on Testing your Code

#Real programmers write unit tests
- I prefer doctest or py.test

Illustrative Example:
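
A minimal doctest sketch, using a made-up temperature conversion function. The `>>>` lines in the docstring act as both documentation and tests.

```python
import doctest


def fahrenheit_to_celsius(degrees_f):
    """Convert a temperature from Fahrenheit to Celsius.

    >>> fahrenheit_to_celsius(32)
    0.0
    >>> fahrenheit_to_celsius(212)
    100.0
    """
    return (degrees_f - 32) * 5 / 9


if __name__ == "__main__":
    # Runs every >>> example in this file and reports any mismatch.
    results = doctest.testmod()
    print(results)
```

The same check can be run without the `__main__` block via `python -m doctest your_file.py`. With py.test, the equivalent would be a function whose name starts with `test_` and that contains plain `assert` statements.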

DSSG Example:

#Go all the way on testing with Travis CI
- automatically runs your tests for you and tells you what is passing and failing
- free for all open source projects

#Resources
Engineering practices for data science
http://blog.kaggle.com/2012/10/04/engineering-practices-in-data-science/


Notes on projects:

Australia
https://github.com/dssg/Australian_Conservation_Foundation

Babies
https://github.com/dssg/babies

Feeding America
https://github.com/dssg/Feeding_America

Police
https://github.com/dssg/police

Infonavit
https://github.com/dssg/infonavit