#Introduction to Working with Data in Python: Cleaning and Munging

###Goals

- Become familiar with basic tools and methods for data munging and cleaning in Python

###Tasks

- Learn a little history of Python & iPython
- Start an iPython Notebook server, create a notebook, and navigate around the notebook
- Load data from a csv into a pandas dataframe 
- Write a function that changes column names to snake_case
- Remove missing values
- Fill missing values with an interpolation
 

#Background on Python, iPython, and Pandas

[Python](https://www.python.org/) is a high-level general purpose programming language named after a [British comedy troupe](https://www.youtube.com/user/MontyPython), created by a [Dutch benevolent dictator](http://en.wikipedia.org/wiki/Guido_van_Rossum), and maintained by an international group of friendly but opinionated Python enthusiasts (try `import this`).

It's popular for data science because it's powerful, fast, plays well with others, runs everywhere, is easy to learn, highly readable, and open. Because it's general purpose, it can be used for full-stack development. It also has a growing list of useful libraries for scientific programming, data manipulation, and data analysis (NumPy, SciPy, Pandas, Scikit-Learn, Statsmodels, Matplotlib, PyBrain, etc.).

[iPython](http://ipython.org/) is an enhanced, interactive Python interpreter started as a grad school project by [Fernando Perez](http://fperez.org/). iPython (now Jupyter) notebooks let you run code in several languages (Python, R, Julia, etc.) alongside Markdown and LaTeX in your browser to create rich, portable, and shareable code documents.

[Pandas](http://pandas.pydata.org/) is a library created by [Wes McKinney](http://blog.wesmckinney.com/) that introduces the R-like dataframe object to Python and makes working with data in Python a lot easier. It's also a lot more efficient than the R dataframe and pretty much makes Python superior to R in every imaginable way (except for ggplot2).

##Getting started with iPython Notebooks

To start up an iPython notebook server, simply navigate to the directory where you want the notebooks to be saved and run the command

```
ipython notebook
```

A browser should open with a notebook navigator. Click the "New" button and select "Python 2".

A beautiful blank notebook should open in a new tab.

Name the notebook by clicking on "Untitled" at the top of the page.

Notebooks are sequences of cells. Cells can be markdown, code, or raw text. Change the first cell to markdown and briefly describe what you are going to do in the notebook.

##Getting started with Pandas

We start by importing the libraries we're going to use: `pandas` and `numpy`.

In [81]:
import pandas as pd
import numpy as np

##Loading data into a DataFrame

The pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes. It's basically a spreadsheet you can program and it's an incredibly useful Python object for data analysis. 

You can load data into a dataframe using Pandas' excellent `read_*` functions.

Pro tip: Jupyter will pull up the docstring for a command if you append a question mark to it.

In [52]:
pd.read_csv?

In [53]:
df = pd.read_csv('/Users/matthewgee/Building_Violations_sample_50000.csv')
#don't forget to check out tab completion in iPython!
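
If you don't have the violations file at hand, the same loading step can be tried on a tiny in-memory CSV. This is only a sketch: the three rows below are made-up stand-ins, and `csv_text` and `sample` are hypothetical names, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the violations file; the values are made up.
csv_text = """ID,VIOLATION DATE,VIOLATION STATUS,LATITUDE
1,05/21/2015,OPEN,41.73
2,05/20/2015,COMPLIED,41.78
3,01/01/2006,OPEN,
"""

# read_csv accepts a path or a file-like object, so StringIO works for a demo.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # first two rows
print(sample.shape)    # (number of rows, number of columns)
```

The empty last field in the third row is read as a missing value (`NaN`), which is how gaps in the real file show up too.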


##Viewing your dataframe

Just like we did in the command line, you can use `head` and `tail` to get a view of your data.

In [54]:
df.head()
df.tail(3)

Unnamed: 0.1,Unnamed: 0,ID,VIOLATION LAST MODIFIED DATE,VIOLATION DATE,VIOLATION CODE,VIOLATION STATUS,VIOLATION STATUS DATE,VIOLATION DESCRIPTION,VIOLATION LOCATION,VIOLATION INSPECTOR COMMENTS,...,INSPECTION STATUS,INSPECTION WAIVED,INSPECTION CATEGORY,DEPARTMENT BUREAU,ADDRESS,PROPERTY GROUP,SSA,LATITUDE,LONGITUDE,LOCATION
49997,49997,1430847,10/14/2008,01/01/2006,CN031013,COMPLIED,07/10/2008,"FIRE EXTNGSHR REQ, RESDNTL",,ALL ELEVATION MISSING FIRE EXTINGUISHER,...,FAILED,N,PERIODIC,CONSERVATION,6501 S LOWE AVE,19693,,41.776074,-87.640625,"(41.7760739361563,-87.64062455203374)"
49998,49998,1734149,09/16/2008,01/01/2006,CN135016,COMPLIED,09/05/2008,MICE/RODENTS,,MICE ON PREMISES,...,FAILED,N,PERIODIC,CONSERVATION,4836 S INDIANA AVE,18261,,41.806381,-87.621283,"(41.80638088214982,-87.62128295874425)"
49999,49999,1834762,01/12/2007,01/01/2006,CN190019,COMPLIED,01/11/2007,ARRANGE PREMISE INSPECTION,,NO ENTRY TO INTERIOR TO VERIFY OCCUPANCY AND D...,...,FAILED,N,PERIODIC,CONSERVATION,200 N KOSTNER AVE,1531,,41.88331,-87.735709,"(41.88330951242491,-87.73570901056128)"


We can get a sense for the size and shape of the data using `shape`

In [55]:
df.shape

(50000, 23)

Get a sense for the type of each column using `dtypes`

In [56]:
df.dtypes

Unnamed: 0                        int64
ID                                int64
VIOLATION LAST MODIFIED DATE     object
VIOLATION DATE                   object
VIOLATION CODE                   object
VIOLATION STATUS                 object
VIOLATION STATUS DATE            object
VIOLATION DESCRIPTION            object
VIOLATION LOCATION               object
VIOLATION INSPECTOR COMMENTS     object
VIOLATION ORDINANCE              object
INSPECTOR ID                     object
INSPECTION NUMBER                 int64
INSPECTION STATUS                object
INSPECTION WAIVED                object
INSPECTION CATEGORY              object
DEPARTMENT BUREAU                object
ADDRESS                          object
PROPERTY GROUP                    int64
SSA                              object
LATITUDE                        float64
LONGITUDE                       float64
LOCATION                         object
dtype: object

`Unnamed: 0` is a useless column. Let's get rid of it.

In [57]:
del df['Unnamed: 0']
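
As an alternative to `del`, the `drop` method returns a new frame and leaves the original untouched, which can be handy while experimenting. A minimal sketch on a made-up frame (`demo` and `trimmed` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame with a leftover index column, mimicking the one above.
demo = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "ID": [10, 11, 12]})

# drop returns a new frame; `demo` itself still has both columns.
trimmed = demo.drop(columns=["Unnamed: 0"])

print(list(trimmed.columns))
print(list(demo.columns))
```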

##Cleaning up column names

Notice that the column names have spaces. 

In [58]:
df.columns

Index([u'ID', u'VIOLATION LAST MODIFIED DATE', u'VIOLATION DATE', u'VIOLATION CODE', u'VIOLATION STATUS', u'VIOLATION STATUS DATE', u'VIOLATION DESCRIPTION', u'VIOLATION LOCATION', u'VIOLATION INSPECTOR COMMENTS', u'VIOLATION ORDINANCE', u'INSPECTOR ID', u'INSPECTION NUMBER', u'INSPECTION STATUS', u'INSPECTION WAIVED', u'INSPECTION CATEGORY', u'DEPARTMENT BUREAU', u'ADDRESS', u'PROPERTY GROUP', u'SSA', u'LATITUDE', u'LONGITUDE', u'LOCATION'], dtype='object')

That's a bummer because columns without spaces in their names allow us to take a shortcut in selecting columns. 

Instead of referencing columns like this

```
df['VIOLATION STATUS DATE']
```

we would love to be able to use tab completion and reference the columns like this

```
df.violation_status_date
```

Let's fix it, and learn a little about defining Python functions, regular expressions, and list comprehensions in the process.

In [59]:
import re

def spaces_to_snake(column_name):
    """
    converts a string that has spaces into snake_case
    Example:
        print(spaces_to_snake("KENNY BROUGHT HIS WIFE"))
        > kenny_brought_his_wife
    To see how to apply this to camel case, see:
        http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-camel-case
    """
    s = re.sub(r"\s+", '_', column_name)
    return s.lower()

df.columns = [spaces_to_snake(col) for col in df.columns]

#Note: a much more elegant and pythonic way to do this would be to use the rename method and lambda syntax (e.g. df.rename(columns=lambda x: x.strip())), but we'll cover this later.
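
The `rename` approach mentioned in the note could look roughly like this. It is a sketch on an empty, made-up frame (`demo` and `renamed` are hypothetical names), and it also strips stray whitespace around the names before replacing the inner spaces.

```python
import re
import pandas as pd

# Hypothetical frame with only column names; no data is needed for the demo.
demo = pd.DataFrame(columns=["VIOLATION STATUS DATE", " INSPECTOR ID ", "SSA"])

# rename applies the function to every column label and returns a new frame.
renamed = demo.rename(
    columns=lambda name: re.sub(r"\s+", "_", name.strip()).lower()
)

print(list(renamed.columns))
# ['violation_status_date', 'inspector_id', 'ssa']
```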


In [60]:
df.columns

Index([u'id', u'violation_last_modified_date', u'violation_date', u'violation_code', u'violation_status', u'violation_status_date', u'violation_description', u'violation_location', u'violation_inspector_comments', u'violation_ordinance', u'inspector_id', u'inspection_number', u'inspection_status', u'inspection_waived', u'inspection_category', u'department_bureau', u'address', u'property_group', u'ssa', u'latitude', u'longitude', u'location'], dtype='object')

Great. Now let's learn how to reference columns

In [61]:
df.violation_date.head() #don't forget to try tab completion!

0    05/21/2015
1    05/21/2015
2    05/21/2015
3    05/21/2015
4    05/21/2015
Name: violation_date, dtype: object

##Converting to datetime

Pandas has some fantastic methods for time series data. To use them, we need to convert the date columns (currently strings) to datetimes.

In [62]:
df.violation_date = pd.to_datetime(df.violation_date)
df.violation_date.head()

0   2015-05-21
1   2015-05-21
2   2015-05-21
3   2015-05-21
4   2015-05-21
Name: violation_date, dtype: datetime64[ns]

Try writing code that will change all the date columns to datetimes with as few lines as possible.
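
One possible approach to the exercise, shown on a made-up frame so it does not touch `df`: pick out the columns whose names end in `_date` and convert each one. The naming rule and the `%m/%d/%Y` format are assumptions based on the column names and sample values shown above.

```python
import pandas as pd

# Hypothetical miniature of the violations data.
demo = pd.DataFrame({
    "violation_date": ["05/21/2015", "05/20/2015"],
    "violation_status_date": ["07/10/2008", None],
    "violation_code": ["CN031013", "CN135016"],
})

# Assumption: every date column has a name ending in "_date".
date_cols = [col for col in demo.columns if col.endswith("_date")]

for col in date_cols:
    # Missing entries become NaT (the datetime equivalent of NaN).
    demo[col] = pd.to_datetime(demo[col], format="%m/%d/%Y")

print(demo.dtypes)
```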

##Exploring Data

Let's get a better sense of what these fields look like.

Let's start by describing the entire dataset using the `describe` method.

In [63]:
df.describe()

Unnamed: 0,id,inspection_number,property_group,latitude,longitude
count,50000.0,50000.0,50000.0,49952.0,49952.0
mean,3300031.32964,6195288.98632,197729.74142,41.845539,-87.673442
std,1088123.083684,4296079.998463,183132.260836,0.087437,0.057008
min,742158.0,375113.0,1001.0,41.644712,-87.914436
25%,2400057.5,2021217.5,20600.75,41.77144,-87.714271
50%,3421152.0,2813561.0,142605.0,41.854001,-87.670709
75%,4223617.75,10630848.75,363929.25,41.912843,-87.634551
max,5064815.0,11597625.0,663759.0,42.022645,-87.525898


By default, `describe` only summarizes numerical columns. For categorical columns, we can use `value_counts`.

In [64]:
pd.value_counts(df.inspection_status)

FAILED    38371
PASSED     6426
CLOSED     5199
HOLD          4
dtype: int64
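
The same counts are also available as a method on the Series itself, which is the form more commonly seen in newer pandas code. A small sketch on made-up values (`status` is a hypothetical Series), also showing the `dropna` and `normalize` options:

```python
import pandas as pd

# Hypothetical status column with one missing entry.
status = pd.Series(["FAILED", "FAILED", "PASSED", None, "FAILED"])

counts = status.value_counts()                    # missing values are skipped
with_missing = status.value_counts(dropna=False)  # missing values get their own row
shares = status.value_counts(normalize=True)      # fractions instead of counts

print(counts)
print(with_missing)
print(shares)
```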

Let's see if there are missing values.

In [65]:
pd.value_counts(df.violation_inspector_comments.isnull())

False    44705
True      5295
dtype: int64

What if we wanted to fill or drop the missing values? We can use `fillna` and `dropna`

In [66]:
df.violation_inspector_comments = df.violation_inspector_comments.fillna('No Comment')

In [67]:
pd.value_counts(df.violation_inspector_comments.isnull())

False    50000
dtype: int64
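
The task list also mentions removing missing values and filling them by interpolation. Both can be sketched on a small made-up frame (`demo`, `dropped`, and `filled` are hypothetical names). Interpolating latitudes across unrelated rows would not be meaningful on the real data; the column is used here only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both a numeric and a text column.
demo = pd.DataFrame({
    "latitude": [41.0, np.nan, 43.0, 44.0],
    "comment": ["ok", None, "mice", None],
})

# Keep only rows that have a latitude; gaps in other columns are ignored.
dropped = demo.dropna(subset=["latitude"])

# Linear interpolation fills the gap from the neighbouring values.
filled = demo["latitude"].interpolate()

print(len(dropped))
print(filled.tolist())
# [41.0, 42.0, 43.0, 44.0]
```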

##Selecting and Subsetting Data

Let's say we just wanted to work with the rows where the inspector left no comment. We can subset using conditional logic.

In [70]:
df[df.violation_inspector_comments=='No Comment'].head()

Unnamed: 0,id,violation_last_modified_date,violation_date,violation_code,violation_status,violation_status_date,violation_description,violation_location,violation_inspector_comments,violation_ordinance,...,inspection_status,inspection_waived,inspection_category,department_bureau,address,property_group,ssa,latitude,longitude,location
13,5063459,05/20/2015,2015-05-20,CN193019,OPEN,,REPAIR/WRECK DANGER RESID PREM,,No Comment,Repair or wreck dangerous and vacant residenti...,...,CLOSED,N,COMPLAINT,DEMOLITION,3051 S BROAD ST,285348,,41.838454,-87.660979,"(41.83845438362799,-87.6609787384341)"
15,5063885,05/21/2015,2015-05-20,BR1001,OPEN,,OWNER OR LICENSED CONTRACTOR,,No Comment,The code violations listed below must be corre...,...,CLOSED,N,PERIODIC,BOILER,6448 S TRIPP AVE,387870,,41.775328,-87.728922,"(41.775328312329904,-87.72892174006991)"
16,5063912,05/21/2015,2015-05-20,PL151137,OPEN,,OPEN,,No Comment,,...,FAILED,N,PERMIT,PLUMBING,3835 W CERMAK RD,280823,,41.851485,-87.721159,"(41.85148492680666,-87.72115941535648)"
18,5063320,05/20/2015,2015-05-20,CN193019,OPEN,,REPAIR/WRECK DANGER RESID PREM,,No Comment,Repair or wreck dangerous and vacant residenti...,...,CLOSED,N,COMPLAINT,DEMOLITION,11025 S ESMOND ST,516079,,41.693018,-87.667286,"(41.69301757344688,-87.66728567428113)"
20,5064410,05/20/2015,2015-05-20,CN193039,OPEN,,POST OWNER NAME OF VACNT BLDG,,No Comment,"Post conspicuously name, address, and telephon...",...,CLOSED,N,COMPLAINT,DEMOLITION,6928 S KIMBARK AVE,393870,,41.769012,-87.594164,"(41.76901189738516,-87.59416364991507)"
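
Conditions can also be combined. A minimal sketch on made-up rows, assuming we want open violations from a chosen set of bureaus (`demo`, `is_open`, `in_bureaus`, and `subset` are hypothetical names):

```python
import pandas as pd

# Hypothetical miniature of the violations data.
demo = pd.DataFrame({
    "violation_status": ["OPEN", "COMPLIED", "OPEN", "OPEN"],
    "department_bureau": ["DEMOLITION", "CONSERVATION", "BOILER", "DEMOLITION"],
})

# Each comparison yields a boolean Series with one entry per row.
is_open = demo.violation_status == "OPEN"
in_bureaus = demo.department_bureau.isin(["DEMOLITION", "PLUMBING"])

# Combine masks with & (and), | (or), ~ (not); Python's `and`/`or` will not work here.
subset = demo[is_open & in_bureaus]

print(subset)
```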


We can also subset by index, as in R.

You can slice a Series by range using the [] operator

In [76]:
df.violation_status[3:10]

3    OPEN
4    OPEN
5    OPEN
6    OPEN
7    OPEN
8    OPEN
9    OPEN
Name: violation_status, dtype: object

Or by location in the DataFrame using the `.iloc` indexer

In [80]:
df.iloc[3:10,1:4]

Unnamed: 0,violation_last_modified_date,violation_date,violation_code
3,05/21/2015,2015-05-21,CN196029
4,05/21/2015,2015-05-21,CN104015
5,05/21/2015,2015-05-21,CN190019
6,05/21/2015,2015-05-21,CN015062
7,05/21/2015,2015-05-21,CN065034
8,05/21/2015,2015-05-21,CN196029
9,05/21/2015,2015-05-21,CN190019


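The `.ix` indexer supported lookup by label as well as by integer position, but it was deprecated in pandas 0.20 and removed in pandas 1.0. Use `.loc` for label-based lookup instead.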
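
The difference between label-based and position-based lookup is easiest to see when the index labels are not 0, 1, 2. A small sketch with `.loc` and `.iloc` on a made-up frame (`demo`, `by_label`, and `by_position` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
demo = pd.DataFrame(
    {
        "violation_code": ["CN196029", "CN104015", "CN190019"],
        "violation_status": ["OPEN", "OPEN", "COMPLIED"],
    },
    index=[10, 20, 30],
)

by_label = demo.loc[10:20, "violation_code"]  # label slice: both ends included
by_position = demo.iloc[0:2, 0]               # position slice: end excluded

print(by_label.tolist())
print(by_position.tolist())
```

Both selections return the first two codes, but for different reasons: `.loc` stops at the label `20` inclusive, while `.iloc` stops before position `2`.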

##Applying functions

Often we want to apply a function to an entire column to create a new column. You can do this by using the `apply` method. Let's say we wanted to take the natural log of the latitude.

In [86]:
df['log_lat'] = df.latitude.apply(np.log)
df[['latitude','log_lat']].head()

Unnamed: 0,latitude,log_lat
0,41.733089,3.731294
1,41.777088,3.732348
2,41.937995,3.736192
3,41.937995,3.736192
4,41.79147,3.732692


Let's say we wanted to anonymize location by adding noise to the latitude. We can do this using `apply` with `lambda` syntax. Here `np.random.rand()` draws a fresh value uniformly from [0, 1) for each row.

In [88]:
df['new_lat'] = df.latitude.apply(lambda x: x + np.random.rand())


In [89]:
df[['latitude','new_lat','log_lat']].head()

Unnamed: 0,latitude,new_lat,log_lat
0,41.733089,41.820477,3.731294
1,41.777088,42.345738,3.732348
2,41.937995,42.259092,3.736192
3,41.937995,42.865434,3.736192
4,41.79147,42.657788,3.732692
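
For simple arithmetic like this, NumPy functions can usually be applied to the whole column at once, which tends to be faster than calling a Python function row by row. A sketch on three latitudes copied from the output above; the fixed seed is an assumption added only to make the noise reproducible.

```python
import numpy as np
import pandas as pd

# Three latitudes taken from the output above.
demo = pd.DataFrame({"latitude": [41.733089, 41.777088, 41.937995]})

# NumPy ufuncs accept a whole Series, so no apply is needed.
demo["log_lat"] = np.log(demo.latitude)

# Draw one uniform [0, 1) value per row in a single call.
rng = np.random.RandomState(0)  # assumed seed, for reproducibility only
demo["new_lat"] = demo.latitude + rng.rand(len(demo))

print(demo)
```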


##Groupby

Often we want to examine differences among groups based on categorical values. For this, `groupby` is incredibly valuable.

In [96]:
df.groupby("department_bureau").mean()

Unnamed: 0_level_0,id,inspection_number,property_group,latitude,longitude,log_lat,new_lat
department_bureau,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
BOILER,3187754,6534207,250911,41.841682,-87.670488,3.733891,42.351961
CONSERVATION,3182004,5599306,182792,41.846195,-87.673469,3.733999,42.344833
CONSTRUCTION EQUIPMENT,3843828,9226801,296070,41.928653,-87.684458,3.735969,42.457416
DEMOLITION,4230535,10161867,303974,41.807409,-87.669775,3.733072,42.310347
ELECTRICAL,3148057,6082962,226475,41.839267,-87.670183,3.733833,42.338321
ELEVATOR,4030558,8923590,99430,41.900869,-87.659214,3.735305,42.408477
IRON,3440557,8065067,87547,41.906648,-87.671767,3.735444,42.41261
NEW CONSTRUCTION,2984252,5171424,229395,41.863878,-87.683687,3.734421,42.376014
PLUMBING,3162547,5747402,249322,41.817108,-87.66369,3.733304,42.307793
REFRIGERATION,3187633,5082480,200453,41.874878,-87.679022,3.734684,42.373713
