# Week 2 - Let's get classifying

In this session, we will try to get some classifiers running. If time permits, we will do a bit more complex example as well

## Step 0 - Library and imports, as always
We will make the essential imports. Further imports will be done later

In [23]:
import sys
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random
import matplotlib.pyplot as plt
import seaborn as sns
import sys


#Ignore Warnings - save some confusion
import warnings
warnings.filterwarnings('ignore')

#Pandas more columns
pd.options.display.max_columns = None
pd.set_option('display.max_columns', None)

# Add input as import path
sys.path.insert(0,'../input')

# Plot style
plt.style.use('fivethirtyeight')

# Import the data from the dataset
train_data = pd.read_csv('../input/train.csv',index_col='id')
test_data = pd.read_csv('../input/test.csv',index_col='id')

train_data.sort_index()

Unnamed: 0_level_0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
5,1,1,"Anderson, Mr. Harry",male,48.0000,0,0,19952,26.5500,E12,S,3,,"New York, NY"
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0000,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
9,1,0,"Artagaveytia, Mr. Ramon",male,71.0000,0,0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
12,1,1,"Aubart, Mme. Leontine Pauline",female,24.0000,0,0,PC 17477,69.3000,B35,C,9,,"Paris, France"
13,1,1,"Barber, Miss. Ellen ""Nellie""",female,26.0000,0,0,19877,78.8500,,S,6,,
15,1,0,"Baumann, Mr. John D",male,,0,0,PC 17318,25.9250,,S,,,"New York, NY"
16,1,0,"Baxter, Mr. Quigg Edmond",male,24.0000,0,1,PC 17558,247.5208,B58 B60,C,,,"Montreal, PQ"
18,1,1,"Bazzani, Miss. Albina",female,32.0000,0,0,11813,76.2917,D15,C,8,,


## Step 1 - Getting our input ready
Oops, we spoke too soon. Before we can think of classifying, we got to make sure our inputs are the way we would like them

### MiniStep 1 - Feature Transforms
What we would like to be doing here is to take existing features, and simplify them. Not still thinking about inputting missing features, or to normalize/quantify them.

Questions:
* Q1: What does the simplify_fares() method?
* Q2: Why we drop some features?

In [24]:
def simplify_fares(df):
    df.fare = df.fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.fare, bins, labels=group_names)
    df.fare = categories
    return df

def simplify_cabins(df):
    df.cabin = df.cabin.fillna('N')
    df.cabin = df.cabin.apply(lambda x: x[0])
    return df


def format_name(df):
    df['lname'] = df.name.apply(lambda x: x.split(' ')[0])
    df['lname'].fillna(' ')
    df['nameprefix'] = df.name.apply(lambda x: x.split(' ')[1])
    df['nameprefix'].fillna(' ')
    return df


def drop_features(df):
    return df.drop(['ticket', 'name', 'embarked', 'home.dest', 'body', 'boat'], axis=1)

def transform_features(df):
    df = simplify_fares(df)
    df = simplify_cabins(df)
    df = format_name(df)
    df = drop_features(df)
    return df


train_data = transform_features(train_data)
test_data  = transform_features(test_data)
print(train_data[0:10])

      pclass  survived     sex   age  sibsp  parch        fare cabin  \
id                                                                     
277        1         1  female   NaN      1      0  4_quartile     B   
562        2         1  female  30.0      0      0  2_quartile     N   
111        1         1  female  24.0      3      2  4_quartile     C   
930        3         0    male   NaN      1      0  1_quartile     N   
841        3         0  female  17.0      0      0  1_quartile     N   
585        2         0    male  27.0      1      0  3_quartile     N   
609        3         0    male  26.0      0      0  2_quartile     N   
540        2         1  female   2.0      1      1  3_quartile     N   
1075       3         0    male  23.0      0      0  2_quartile     N   
390        2         0    male  17.0      0      0  4_quartile     N   

          lname nameprefix  
id                          
277    Spencer,       Mrs.  
562    Slayter,      Miss.  
111    Fortune,    

### MiniStep 2 - Feature Encoding 
Our next step is to take existing features which possibly have text to convert them into numbers. 
**Categorical->Numerical**
In our previous sessions, we saw how to do it manually. 
Here, we will use the awesome scikit-learn library

Questions:
* Q1: Why we need to encode our features?
* Q2: All the classifiers needs encoding?

In [25]:
from sklearn import preprocessing
def encode_features(df_train, df_test):
    features = ['fare', 'cabin', 'sex', 'lname', 'nameprefix']
    df_combined = pd.concat([df_train[features], df_test[features]])

    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

train_data, test_data = encode_features(train_data, test_data)
print(train_data[0:10])


      pclass  survived  sex   age  sibsp  parch  fare  cabin  lname  \
id                                                                    
277        1         1    0   NaN      1      0     3      1    755   
562        2         1    0  30.0      0      0     1      7    741   
111        1         1    0  24.0      3      2     3      2    257   
930        3         0    1   NaN      1      0     0      7    406   
841        3         0    0  17.0      0      0     0      7    303   
585        2         0    1  27.0      1      0     2      7    830   
609        3         0    1  26.0      0      0     1      7      7   
540        2         1    0   2.0      1      1     2      7    652   
1075       3         0    1  23.0      0      0     1      7    591   
390        2         0    1  17.0      0      0     3      7    198   

      nameprefix  
id                
277           20  
562           16  
111           16  
930           19  
841           16  
585           

### MiniStep 3 - Missing Data!

Here, we use sklearn's imputer to directly fill in missing data

ToDo:
* Find the different strategies that are avaible in the Imputer class.

Questions:
* Q1: Which is the strategy that get best results in terms of accuracy?

In [26]:
def fill_missing_data(df_train,df_test):
    features = ['age']
    df_combined = pd.concat([df_train[features], df_test[features]])
    df_imputer = preprocessing.Imputer()
    df_imputer.fit(df_combined[features])
    df_train[features] = df_imputer.transform(df_train[features])
    df_test[features] = df_imputer.transform(df_test[features])
    return df_train, df_test

train_data,test_data = fill_missing_data(train_data,test_data)
print(train_data)

      pclass  survived  sex        age  sibsp  parch  fare  cabin  lname  \
id                                                                         
277        1         1    0  29.881135      1      0     3      1    755   
562        2         1    0  30.000000      0      0     1      7    741   
111        1         1    0  24.000000      3      2     3      2    257   
930        3         0    1  29.881135      1      0     0      7    406   
841        3         0    0  17.000000      0      0     0      7    303   
585        2         0    1  27.000000      1      0     2      7    830   
609        3         0    1  26.000000      0      0     1      7      7   
540        2         1    0   2.000000      1      1     2      7    652   
1075       3         0    1  23.000000      0      0     1      7    591   
390        2         0    1  17.000000      0      0     3      7    198   
921        3         0    1  29.881135      0      0     0      7    399   
339        2

### MiniStep 4  - Almost There!

Get X and Y. Once we get them, we will also scale them.

Questions:
* Q1: What means X and Y variables?
* Q2: Which dimensions has each variable?

In [27]:
def get_X_Y_pair(df):
    features = df.columns.values
    x_features = [f for f in features if f!='survived']
    return df[x_features], df['survived']

def scale_data(df_train, df_test):
    df_combine = pd.concat([df_train, df_test])
    features = df_train.columns.values
    scaler = preprocessing.StandardScaler()
    scaler.fit(df_combine)
    return scaler.transform(df_train), scaler.transform(df_test)

x_train, y_train = get_X_Y_pair(train_data)
x_test, y_test = get_X_Y_pair(test_data)

#not pandas after this
x_train, x_test = scale_data(x_train,x_test)
print(type(x_train), type(x_test))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


Now, we are done arranging the data. Time To run some sample classifiers and see how they work out


## Step 2 - Feed the classifier


### MiniStep 2.1 - Setting up a meta-classifier
Again, using scikit-learn, we have a great advantage here. Scikit-learn's classifiers come with a beautiful interface

```
Let's assume our classifier instance is clf
clf = ClassifierName(ClassifierParams=...) #Defaults are usually good enough in most cases
clf.fit (traindata, trainlabels) # Training
test_answers = clf.predict(testdata) # Testing

```

since it's so beautiful, lets create a meta function that would train on any training data, test on some given testing data, calculate the accuracy for titanic dataset and return it

ToDo:
* Search in the sklearn how to represent the decision trees using graphviz.

Questions:
* Q1: Which are the parameters that can improve the results (Perceptron)
* Q2: What is and how is computed the result of the feature_importances method (DecisionTreeClassifier)
* Q3: SVM try to modify the C parameter with values like 10^-3, 1, 10, 100, 10^3 (SVM)
* Q4: Ra

In [28]:
##Perceptron
from utils import accuracy_score_numpy
def train_test_accuracy(clf,x_train, y_train,x_test):
    clf.fit(x_train, y_train)
    y_test = clf.predict(x_test)
    perc = accuracy_score_numpy(y_test)
    return perc

from sklearn.linear_model import Perceptron
percep = Perceptron()
acc = train_test_accuracy(percep, x_train, y_train,x_test)
print("Accuracy for perceptron is  ", acc)


Accuracy for perceptron is   0.712977099237


In [29]:
##Decision Tree
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
acc = train_test_accuracy(dtree, x_train,y_train,x_test)
print("Accuracy for dtree is  ", acc)

Accuracy for dtree is   0.748091603053


In [30]:
##SVM
from sklearn.svm import LinearSVC
svm = LinearSVC()
acc = train_test_accuracy(dtree, x_train, y_train, x_test)
print("Accuracy for svm is  ", acc)


Accuracy for svm is   0.734351145038


In [35]:
##Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
acc = train_test_accuracy(rf, x_train, y_train, x_test)
print("Accuracy for RF is  ", acc)

##Random Forest - More trees
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=5000,n_jobs=2)
acc = train_test_accuracy(rf, x_train, y_train, x_test)
print("Accuracy for RF is  ", acc)

Accuracy for RF is   0.777099236641
Accuracy for RF is   0.786259541985
