# Week 2 - Let's get classifying

In this session, we will get some classifiers running. If time permits, we will also work through a slightly more complex example.

## Step 0 - Library and imports, as always
We make the essential imports here. Further imports are done later, where they are needed.

In [1]:
import sys
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random
import matplotlib.pyplot as plt
import seaborn as sns


#Ignore Warnings - save some confusion
import warnings
warnings.filterwarnings('ignore')

# Show all columns when printing DataFrames
pd.set_option('display.max_columns', None)

# Add input as import path
sys.path.insert(0,'../input')

# Plot style
plt.style.use('fivethirtyeight')

# Import the data from the dataset
train_data = pd.read_csv('../input/train.csv',index_col='id')
test_data = pd.read_csv('../input/test.csv',index_col='id')


## Step 1 - Getting our input ready
Oops, we spoke too soon. Before we can think of classifying, we have to make sure our inputs are the way we would like them.

### MiniStep 1 - Feature Transforms
Here we take existing features and simplify them. We are not yet thinking about imputing missing values, or about normalizing or quantifying the features.

In [2]:
def simplify_fares(df):
    df.fare = df.fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.fare, bins, labels=group_names)
    df.fare = categories
    return df

def simplify_cabins(df):
    df.cabin = df.cabin.fillna('N')
    df.cabin = df.cabin.apply(lambda x: x[0])
    return df


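def format_name(df):
    # Split each name on spaces: the first token goes to lname, the second to nameprefix
    df['lname'] = df.name.apply(lambda x: x.split(' ')[0])
    # fillna returns a new Series, so the result has to be assigned back
    df['lname'] = df['lname'].fillna(' ')
    df['nameprefix'] = df.name.apply(lambda x: x.split(' ')[1])
    df['nameprefix'] = df['nameprefix'].fillna(' ')
    return df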


def drop_features(df):
    return df.drop(['ticket', 'name', 'embarked', 'home.dest'], axis=1)

def transform_features(df):
    df = simplify_fares(df)
    df = simplify_cabins(df)
    df = format_name(df)
    df = drop_features(df)
    return df


train_data = transform_features(train_data)
test_data  = transform_features(test_data)
print(train_data)

      pclass  survived     sex   age  sibsp  parch        fare cabin boat  \
id                                                                          
277        1         1  female   NaN      1      0  4_quartile     B    6   
562        2         1  female  30.0      0      0  2_quartile     N   13   
111        1         1  female  24.0      3      2  4_quartile     C   10   
930        3         0    male   NaN      1      0  1_quartile     N  NaN   
841        3         0  female  17.0      0      0  1_quartile     N  NaN   
585        2         0    male  27.0      1      0  3_quartile     N  NaN   
609        3         0    male  26.0      0      0  2_quartile     N  NaN   
540        2         1  female   2.0      1      1  3_quartile     N   11   
1075       3         0    male  23.0      0      0  2_quartile     N  NaN   
390        2         0    male  17.0      0      0  4_quartile     N  NaN   
921        3         0    male   NaN      0      0  1_quartile     N    A   
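
To see what the binning in `simplify_fares` does in isolation, here is a minimal sketch on a handful of made-up fares. The `toy_fares` values are hypothetical and only illustrate the behaviour; the bin edges and labels are copied from the cell above.

```python
import pandas as pd

# Hypothetical toy fares; NaN stands in for a missing value.
toy_fares = pd.Series([7.25, 71.28, float('nan'), 13.0, 26.55])

# Same bin edges and labels as in simplify_fares.
bins = (-1, 0, 8, 15, 31, 1000)
group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']

# Missing fares become -0.5, which falls into the (-1, 0] bin, i.e. 'Unknown'.
toy_binned = pd.cut(toy_fares.fillna(-0.5), bins, labels=group_names)
print(toy_binned.tolist())
```

Since `pd.cut` closes each interval on the right by default, a fare of exactly 0 also ends up in `'Unknown'`.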

### MiniStep 2 - Feature Encoding 
Our next step is to convert existing features that hold text (categorical values) into numbers. 
**Categorical -> Numerical**
In our previous sessions, we saw how to do this manually. 
Here, we will use the awesome scikit-learn library. 
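
Before the full cell, here is a small sketch of what `LabelEncoder` does on its own. The `toy_train` and `toy_test` lists are made up for illustration; they are not taken from the dataset.

```python
from sklearn import preprocessing

# Hypothetical toy column, split into "train" and "test" parts.
toy_train = ['male', 'female', 'female']
toy_test = ['female', 'male', 'male']

le = preprocessing.LabelEncoder()
# Fit on train and test together so that every label is known to the encoder.
le.fit(toy_train + toy_test)

# Classes are stored in sorted order; each label maps to its position.
print(list(le.classes_))
toy_encoded = le.transform(toy_train)
print(toy_encoded.tolist())
```

Fitting on the combined values matters because `transform` raises a `ValueError` for any label the encoder has not seen during `fit`.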

In [3]:
from sklearn import preprocessing
def encode_features(df_train, df_test):
    features = ['fare', 'cabin', 'sex', 'lname', 'nameprefix', 'boat']
    # Fit on train and test together so that every label is known to the encoder
    df_combined = pd.concat([df_train[features], df_test[features]])

    for feature in features:
        print(feature)
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

train_data['boat'] = train_data['boat'].astype('str')
test_data['boat'] = test_data['boat'].astype('str')
train_data, test_data = encode_features(train_data, test_data)
print(train_data)


fare
cabin
sex
lname
nameprefix
boat
      pclass  survived  sex   age  sibsp  parch  fare  cabin  boat   body  \
id                                                                          
277        1         1    0   NaN      1      0     3      1    17    NaN   
562        2         1    0  30.0      0      0     1      7     4    NaN   
111        1         1    0  24.0      3      2     3      2     1    NaN   
930        3         0    1   NaN      1      0     0      7    27    NaN   
841        3         0    0  17.0      0      0     0      7    27    NaN   
585        2         0    1  27.0      1      0     2      7    27  293.0   
609        3         0    1  26.0      0      0     1      7    27  103.0   
540        2         1    0   2.0      1      1     2      7     2    NaN   
1075       3         0    1  23.0      0      0     1      7    27    NaN   
390        2         0    1  17.0      0      0     3      7    27    NaN   
921        3         0    1   NaN      

### MiniStep 3 - Missing Data!

Here, we use scikit-learn's imputer to fill in missing values directly. By default, it replaces each missing value with the mean of its column.
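
As a quick illustration of mean imputation, here is a sketch on a made-up age column. It assumes a scikit-learn release (0.20 or later) that provides `sklearn.impute.SimpleImputer`; the `toy_ages` values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with two missing ages (one column, five rows).
toy_ages = np.array([[30.0], [np.nan], [24.0], [np.nan], [18.0]])

# strategy='mean' replaces each NaN with the mean of the observed values.
imputer = SimpleImputer(strategy='mean')
toy_filled = imputer.fit_transform(toy_ages)

# The observed values are 30, 24 and 18, so both gaps are filled with 24.0.
print(toy_filled.ravel().tolist())
```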

In [4]:
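# SimpleImputer replaces preprocessing.Imputer, which newer scikit-learn releases removed
from sklearn.impute import SimpleImputer

def fill_missing_data(df_train, df_test):
    features = ['age', 'body', 'boat']
    df_combined = pd.concat([df_train[features], df_test[features]])
    # The default strategy replaces each NaN with the mean of its column
    df_imputer = SimpleImputer()
    df_imputer.fit(df_combined)
    df_train[features] = df_imputer.transform(df_train[features])
    df_test[features] = df_imputer.transform(df_test[features])
    return df_train, df_test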

train_data,test_data = fill_missing_data(train_data,test_data)
print(train_data)

      pclass  survived  sex        age  sibsp  parch  fare  cabin  boat  \
id                                                                        
277        1         1    0  29.881135      1      0     3      1  17.0   
562        2         1    0  30.000000      0      0     1      7   4.0   
111        1         1    0  24.000000      3      2     3      2   1.0   
930        3         0    1  29.881135      1      0     0      7  27.0   
841        3         0    0  17.000000      0      0     0      7  27.0   
585        2         0    1  27.000000      1      0     2      7  27.0   
609        3         0    1  26.000000      0      0     1      7  27.0   
540        2         1    0   2.000000      1      1     2      7   2.0   
1075       3         0    1  23.000000      0      0     1      7  27.0   
390        2         0    1  17.000000      0      0     3      7  27.0   
921        3         0    1  29.881135      0      0     0      7  22.0   
339        2         1   

### MiniStep 4  - Almost There!

Get X (the feature vectors, i.e. the actual training data) and Y (the training labels). Once we have them, we will also scale the features.
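
To make the scaling step concrete, here is a minimal sketch of `StandardScaler` on made-up numbers. It standardizes each column to zero mean and unit variance; the `toy_*` arrays are hypothetical, and the scaler is fitted on train and test together to mirror the approach used in this session.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy features: two columns on very different scales.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0]])
toy_test = np.array([[3.0, 30.0]])

scaler = preprocessing.StandardScaler()
# Fit on the combined rows, then transform each part separately.
scaler.fit(np.vstack([toy_train, toy_test]))
toy_train_scaled = scaler.transform(toy_train)
toy_test_scaled = scaler.transform(toy_test)

# Over the combined rows, every column now has mean ~0 and standard deviation ~1.
toy_all_scaled = np.vstack([toy_train_scaled, toy_test_scaled])
print(toy_all_scaled.mean(axis=0))
print(toy_all_scaled.std(axis=0))
```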

In [5]:
def get_X_Y_pair(df):
    features = df.columns.values
    x_features = [f for f in features if f!='survived']
    return df[x_features], df['survived']

def scale_data(df_train, df_test):
    # Fit the scaler on train and test together, then transform each separately
    df_combine = pd.concat([df_train, df_test])
    scaler = preprocessing.StandardScaler()
    scaler.fit(df_combine)
    return scaler.transform(df_train), scaler.transform(df_test)

x_train, y_train = get_X_Y_pair(train_data)
x_test, y_test = get_X_Y_pair(test_data)

# From here on, the features are NumPy arrays, not pandas DataFrames
x_train, x_test = scale_data(x_train,x_test)
print(type(x_train), type(x_test))

(<type 'numpy.ndarray'>, <type 'numpy.ndarray'>)


Now we are done arranging the data. Time to run some sample classifiers and see how they work out.


## Step 2 - Feed the classifier


### MiniStep 2.1 - Setting up a meta-classifier
Again, scikit-learn gives us a great advantage here: its classifiers come with a beautiful, uniform interface.

```
# Let's assume our classifier instance is clf
clf = ClassifierName(ClassifierParams=...)  # Defaults are usually good enough in most cases
clf.fit(traindata, trainlabels)  # Training
test_answers = clf.predict(testdata)  # Testing
```
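
As a concrete instance of the interface above, here is a minimal sketch on made-up data. The choice of `DecisionTreeClassifier` and the `toy_*` names are assumptions for illustration only; any other scikit-learn classifier could be swapped in without changing the `fit` and `predict` calls.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one feature, small values are class 0, large values class 1.
toy_x_train = [[0.0], [1.0], [2.0], [3.0]]
toy_y_train = [0, 0, 1, 1]
toy_x_test = [[-1.0], [4.0]]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(toy_x_train, toy_y_train)      # Training
toy_answers = clf.predict(toy_x_test)  # Testing
print(toy_answers.tolist())
```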

Since it's so beautiful, let's create a meta function that trains on any training data, tests on the given test data, computes the accuracy on the Titanic dataset, and returns it.