#Agenda

- Define the problem and the approach
- <p style="color: red">Data basics: loading data, looking at your data, basic commands</p>
- Handling missing values
- Intro to scikit-learn
- Grouping and aggregating data
- Feature selection
- Fitting and evaluating a model
- Deploying your work

##In this notebook you will

- Learn how to load data into Python
- Learn the basics of working with data in `pandas`
- Clean and manage your data
- Wrangle missing data

##Reading from a file

In [None]:
import pandas as pd
import pylab as pl
import numpy as np
import re

In [None]:
np.sum

We're going to use the <code>read_csv</code> function in pandas.

In [None]:
?pd.read_csv

In [None]:
! head -n 2 ./data/credit-training.csv

In [None]:
df = pd.read_csv("./data/credit-training.csv")

##What is <code>df</code>?
Our data is represented by a DataFrame. You can think of a DataFrame as a giant spreadsheet that you can program. It's a collection of Series (or columns) that share a common set of commands, which makes managing data in Python super easy.
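
To make the spreadsheet analogy concrete, here is a minimal sketch on a small made-up frame. The name `toy` and its values are invented for illustration and are not taken from the credit data. It shows that each column is a `Series` and that all columns share one row index.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [45, 40, 38],
    "monthly_income": [9120.0, 2600.0, None],
})

# pulling out one column gives back a Series
age = toy["age"]
print(type(age).__name__)

# every column shares the same row index as the frame
print(age.index.equals(toy.index))

# (rows, columns), like the dimensions of a spreadsheet
print(toy.shape)
```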

##Handling Missing Values
One of the most frustrating parts of data science can be handling null or missing data. pandas has a lot of built-in features that make it super easy to handle missing data. The first thing we need to do is determine which fields have missing data. To do that we're going to use `pd.melt`.

###[Long vs. Wide Data](http://en.wikipedia.org/wiki/Wide_and_narrow_data)
Depending on the problem you're solving, you may need to switch your data between wide and long format.

Wide data is probably what you think of when the word "spreadsheet" comes to mind. We're talking about data in which each row represents one data point and each value sits in its own column. This is well suited for things like modeling and producing summary statistics.

I find that `long` format is often best for doing the same task against multiple variables: things like plotting distributions of each variable, making frequency tables, or, in our case, determining what portion of a dataframe's variables are null.

###pd.melt()
For converting data from `wide` to `long` format.
```
>>> df
A B C
a 1 2
b 3 4
c 5 6

>>> pd.melt(df, id_vars=['A'], value_vars=['B'])
A variable value
a B        1
b B        3
c B        5
```
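
The illustration above can be reproduced as a runnable sketch. The frame `wide` below is built by hand to match the table shown, and the names `long_one` and `long_all` are chosen just for this example. The second call leaves out `value_vars`, so every non-id column is stacked.

```python
import pandas as pd

# hand-built copy of the small table shown above
wide = pd.DataFrame({"A": ["a", "b", "c"], "B": [1, 3, 5], "C": [2, 4, 6]})

# keep A as the identifier and stack only column B
long_one = pd.melt(wide, id_vars=["A"], value_vars=["B"])
print(long_one)

# without value_vars, both B and C are stacked on top of each other
long_all = pd.melt(wide, id_vars=["A"])
print(long_all)
```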

In [None]:
?pd.melt

In [None]:
# By not specifying id_vars, we're going to melt EVERYTHING
df_lng = pd.melt(df)
# now our data is a series of (key, value) rows. 
#think of when you've done this in Excel so that you can
#create a pivot table 
df_lng.head()

In [None]:
null_variables = df_lng.value.isnull()
null_variables.sum()

In [None]:
# crosstab creates a frequency table between 2 variables
# it's going to automatically enumerate the possibilities between
# the two Series and show you a count of occurrences 
#in each possible bucket
pd.crosstab(df_lng.variable, null_variables)

In [None]:
# let's abstract that code into a function so we can easily 
# recalculate it
def print_null_freq(df):
    """
    For a given DataFrame, calculates how many values of
    each variable are null and returns the resulting frequency table.
    """
    df_lng = pd.melt(df)
    null_variables = df_lng.value.isnull()
    return pd.crosstab(df_lng.variable, null_variables)
print_null_freq(df)

####Use pd.melt to create a data frame in the following format:
```
        serious_dlqin2yrs    variable     value
0                       1         age        45
1                       0         age        40
2                       0         age        38
3                       0         age        30
4                       0         age        49
...                   ...         ...       ...
299998                  1  debt_ratio     0.423
299999                  0  debt_ratio    0.8923
```
Only include values for `age` and `debt_ratio`.

In [None]:
melted = pd.melt(..., id_vars=[...], value_vars=[...])

# both checks should print True once the melt call above is filled in
print(len(melted) == 300000)
print(melted.variable.unique() == np.array(['age', 'debt_ratio']))

###Filling NA's

In [None]:
s = pd.Series([1, 2, None, 4])
s

In [None]:
s.fillna(3)

In [None]:
s.ffill()

In [None]:
s.bfill()

In [None]:
s.fillna(s.mean())

If you look at `df` you can see that there are 2 columns which don't have a full 150,000 values: `monthly_income` and `number_of_dependents`. In order to incorporate these variables into our analysis, we need to specify how to treat these missing values.

For `number_of_dependents` let's keep things simple and intuitive. If someone didn't specify how many dependents they have, let's assume it's because they don't have any to begin with.

Taking a look at `monthly_income`, we see that it's a bit more complicated than `number_of_dependents`. We have a few options for replacing missing data. We could set it to the mean or median of the dataset, but this might skew our distribution. We could also set it to 0, but this might not be right either. Instead we're going to use a technique called imputation. We'll go into this more after we take a look at `scikit-learn`.
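
Here is a small sketch of the simple options just described, using a made-up frame rather than the real `df`. The values and the names `toy`, `income_mean` and `income_median` are invented for illustration. It is not the imputation approach used later with `scikit-learn`, only a way to see how the mean and median fills differ.

```python
import pandas as pd

# toy data, invented for illustration; the real columns live in df
toy = pd.DataFrame({
    "number_of_dependents": [2.0, None, 0.0, 1.0],
    "monthly_income": [9120.0, None, 3042.0, 2600.0],
})

# missing dependents are assumed to mean "no dependents"
toy["number_of_dependents"] = toy["number_of_dependents"].fillna(0)

# two simple candidates for income, kept as separate Series for comparison
income_mean = toy["monthly_income"].fillna(toy["monthly_income"].mean())
income_median = toy["monthly_income"].fillna(toy["monthly_income"].median())

print(toy["number_of_dependents"].tolist())
print(income_mean.tolist())
print(income_median.tolist())
```

With a single large income in the sample, the mean fill lands well above the median fill, which is the kind of skew mentioned above.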

In [None]:
df['DebtRatio']
df.DebtRatio

###head(n=5)

In [None]:
df.head()

In [None]:
df.head(1)

In [None]:
df.SeriousDlqin2yrs.head()

###tail(n=5)

In [None]:
df.tail()

In [None]:
df.RevolvingUtilizationOfUnsecuredLines.tail()

###describe(percentile_width=50)

In [None]:
df.describe()

In [None]:
df.age.describe(percentile_width=25)
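
As far as I know, more recent pandas releases no longer accept `percentile_width` and use a `percentiles` argument instead. The sketch below shows what should be the equivalent calls on a made-up Series (`ages` is invented for illustration), assuming a width of 50 meant the 25th to 75th percentiles and a width of 25 meant the 37.5th to 62.5th.

```python
import pandas as pd

# toy values, invented for illustration
ages = pd.Series([21, 30, 38, 45, 49, 62, 70])

# rough equivalent of percentile_width=50: 25th, 50th and 75th percentiles
default_summary = ages.describe(percentiles=[0.25, 0.5, 0.75])

# rough equivalent of percentile_width=25: 37.5th, 50th and 62.5th percentiles
narrow_summary = ages.describe(percentiles=[0.375, 0.5, 0.625])

print(default_summary)
print(narrow_summary)
```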

###unique() and nunique()

In [None]:
df.NumberOfDependents.unique()

In [None]:
df.NumberOfDependents.nunique()

###pd.value_counts(values_to_count)

In [None]:
def camel_to_snake(column_name):
    """
    converts a string that is camelCase into snake_case
    Example:
        print(camel_to_snake("javaLovesCamelCase"))
        > java_loves_camel_case
    See Also:
        http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-camel-case
    """
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', column_name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

In [None]:
camel_to_snake("javaLovesCamelCase")

In [None]:
df.columns = [camel_to_snake(col) for col in df.columns]
df.columns.tolist()

##Slicing and Indexing Data
pandas (like R) uses a system of boolean indexing. What this means is that when selecting particular rows or columns in your dataset...
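
As a minimal sketch of boolean indexing, the example below builds a small made-up frame (`toy` and its values are invented for illustration), turns a condition into a True/False mask, and uses the mask to select rows.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [45, 40, 38, 30],
    "serious_dlqin2yrs": [1, 0, 0, 0],
})

# comparing a column to a value gives a Series of True/False, one per row
mask = toy["age"] > 38
print(mask.tolist())

# passing that mask back to the frame keeps only the rows marked True
older = toy[mask]

# conditions combine with & (and) and | (or); each needs its own parentheses
older_delinquent = toy[(toy["age"] > 38) & (toy["serious_dlqin2yrs"] == 1)]
print(len(older), len(older_delinquent))
```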

###Grabbing columns

In [None]:
df['monthly_income'].head()
df.monthly_income.head()

In [None]:
df[['monthly_income', 'serious_dlqin2yrs']].head()

In [None]:
columns_i_want = ['monthly_income', 'serious_dlqin2yrs']
df[columns_i_want].head()

##Adding Columns

In [None]:
df.newcolumn = 1
# attribute assignment does not create a column, so this lookup raises a KeyError
df['newcolumn']

In [None]:
df['one'] = 1
df.one.head()

###Removing a column
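
One possible sketch of removing a column, shown on a made-up frame so the real `df` is left alone. The names `toy` and `without_one` are invented for illustration. It contrasts `drop`, which returns a new frame, with `del`, which changes the frame in place.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({"age": [45, 40], "one": [1, 1]})

# drop returns a new frame and leaves the original untouched
without_one = toy.drop("one", axis=1)
print(toy.columns.tolist())

# del removes the column from the frame in place
del toy["one"]
print(without_one.columns.tolist(), toy.columns.tolist())
```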

In [None]:
pd.value_counts(df.NumberOfDependents)
df.NumberOfDependents.value_counts()

In [None]:
pd.value_counts(df.NumberOfDependents, ascending=True)

In [None]:
pd.value_counts(df.NumberOfDependents, sort=False)

In [None]:
#chain value_counts together with head() to give you the top 3
pd.value_counts(df.NumberOfDependents).head(3)

In [None]:
pd.value_counts(df.NumberOfDependents).plot(kind='bar')

##pd.crosstab(rows, cols)

In [None]:
pd.crosstab(df.NumberOfTimes90DaysLate, df.SeriousDlqin2yrs)

####Use `pd.crosstab` to make a table that contains customers' ages in the left-hand column and the number of dependents they have across the top

In [None]:
pd.crosstab(df.age, df.NumberOfDependents)

##Basic Cleanliness

Let's fix the formatting of the column names. I personally like snake_case (and so does Python). I found [this handy function](http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-camel-case) on Stack Overflow for converting camelCase to snake_case.

Now we can apply the `camel_to_snake` function to each column name.

In [None]:
%load https://gist.github.com/glamp/6529725/raw/e38ffd2fc4cb840be21098486ffe5df991946736/camel_to_snake.py