# A classification example from soup to nuts

* Meet the problem
* Read in the data
* Inspect the data
* Prepare our dataset
* Fill in missing data
* Handle categoric data
* Train a classifier
* Engineer new features
* Train with engineered features

# Meet the problem - who survives the Titanic?

![](https://kaggle2.blob.core.windows.net/competitions/kaggle/3136/logos/front_page.png)

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
```                

# Read in the data

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv('train.csv', index_col='PassengerId')

# Inspect the data

## Look at sample rows

In [2]:
data.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Look at summary information

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


# Prepare our dataset

In [4]:
y = data['Survived']
y.head()

PassengerId
1    0
2    1
3    1
4    1
5    0
Name: Survived, dtype: int64

In [5]:
X = pd.DataFrame(data[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']])
X.head()

Unnamed: 0_level_0,Pclass,Age,SibSp,Parch,Fare
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,3,22.0,1,0,7.25
2,1,38.0,1,0,71.2833
3,3,26.0,0,0,7.925
4,1,35.0,1,0,53.1
5,3,35.0,0,0,8.05


# Explore the data a bit

# Fill in missing data

In [6]:
X.isnull()[0:10]

Unnamed: 0_level_0,Pclass,Age,SibSp,Parch,Fare
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,True,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False
10,False,False,False,False,False


In [7]:
X.isnull()[0:10].sum()

Pclass    0
Age       1
SibSp     0
Parch     0
Fare      0
dtype: int64

In [8]:
X.isnull().sum()

Pclass      0
Age       177
SibSp       0
Parch       0
Fare        0
dtype: int64

In [9]:
data['Age'][0:10]

PassengerId
1     22.0
2     38.0
3     26.0
4     35.0
5     35.0
6      NaN
7     54.0
8      2.0
9     27.0
10    14.0
Name: Age, dtype: float64

In [10]:
data['Age'].median()

28.0

In [11]:
data['Age'].isnull()[0:10]

PassengerId
1     False
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
Name: Age, dtype: bool

In [12]:
X.loc[ data['Age'].isnull(), 'Age' ] = data['Age'].median()

In [13]:
X.isnull().sum()

Pclass    0
Age       0
SibSp     0
Parch     0
Fare      0
dtype: int64

# Handle categorical values

In [14]:
data.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Convert to binary

In [15]:
X.loc[:, 'female'] = np.where(data['Sex'] == 'female', 1, 0)
X.head()

Unnamed: 0_level_0,Pclass,Age,SibSp,Parch,Fare,female
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,3,22.0,1,0,7.25,0
2,1,38.0,1,0,71.2833,1
3,3,26.0,0,0,7.925,1
4,1,35.0,1,0,53.1,1
5,3,35.0,0,0,8.05,0


## Fill in more missing values

In [16]:
data['Embarked'].isnull().sum()

2

In [17]:
data['Embarked'].mode()

0    S
dtype: object

In [18]:
data['Embarked'].mode()[0]

'S'

In [19]:
data.loc[data['Embarked'].isnull(), 'Embarked'] = data['Embarked'].mode()[0]

In [20]:
data['Embarked'].isnull().sum()

0

## One Hot encode

In [21]:
data['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [22]:
X.loc[:, 'Embarked_S'] = np.where(data['Embarked'] == 'S', 1, 0)
X.loc[:, 'Embarked_C'] = np.where(data['Embarked'] == 'C', 1, 0)
X.loc[:, 'Embarked_Q'] = np.where(data['Embarked'] == 'Q', 1, 0)
X.head()

Unnamed: 0_level_0,Pclass,Age,SibSp,Parch,Fare,female,Embarked_S,Embarked_C,Embarked_Q
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,3,22.0,1,0,7.25,0,1,0,0
2,1,38.0,1,0,71.2833,1,0,1,0
3,3,26.0,0,0,7.925,1,1,0,0
4,1,35.0,1,0,53.1,1,1,0,0
5,3,35.0,0,0,8.05,0,1,0,0


In [23]:
data.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Train a classifier

## Consider the null hypothesis first

In [24]:
print 'Died:    %d' % (y == 0).sum()
print 'Surived: %d' % (y == 1).sum()

Died:    549
Surived: 342


In [25]:
549.0 / (549 + 342)

0.6161616161616161

## Pick a classifier

![](http://scikit-learn.org/stable/_static/ml_map.png)

In [26]:
from sklearn.svm import LinearSVC

![](svc.png)

## Evaluate the classifier

### Using Cross Validation

![](https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png)

In [27]:
clf = LinearSVC(random_state=0)

In [28]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=10)

In [29]:
scores

array([ 0.73333333,  0.7       ,  0.38202247,  0.74157303,  0.71910112,
        0.78651685,  0.78651685,  0.75280899,  0.70786517,  0.68181818])

In [30]:
scores.mean()

0.69915560095335372

# Engineer new features

In [31]:
data['Name'].head()

PassengerId
1                              Braund, Mr. Owen Harris
2    Cumings, Mrs. John Bradley (Florence Briggs Th...
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
Name: Name, dtype: object

In [32]:
title = data['Name'].map(lambda x: \
                  (x.split(',')[1].split(' ')[2] if ' the ' in x \
                    else x.split(',')[1].split(' ')[1]) if (',' in x) \
                  else x.split(' ')[1]) 
title.head()

PassengerId
1      Mr.
2     Mrs.
3    Miss.
4     Mrs.
5      Mr.
Name: Name, dtype: object

In [33]:
title.unique()

array(['Mr.', 'Mrs.', 'Miss.', 'Master.', 'Don.', 'Rev.', 'Dr.', 'Mme.',
       'Ms.', 'Major.', 'Lady.', 'Sir.', 'Mlle.', 'Col.', 'Capt.',
       'Countess.', 'Jonkheer.'], dtype=object)

In [34]:
X.loc[:, 'Title_Mr'] = title.isin(['Mr.']).astype(int)
X.loc[:, 'Title_Mrs'] = title.isin(['Mrs.', 'Mlle.']).astype(int)
X.loc[:, 'Title_Miss'] = title.isin(['Miss.', 'Mme', 'Ms.']).astype(int)
X.loc[:, 'Title_Master'] = title.isin(['Master.']).astype(int)

In [35]:
X.loc[:, 'title_Other'] = ((X['Title_Mr'] + X['Title_Mrs'] + X['Title_Miss'] + X['Title_Master']) == 0).astype(int)

In [36]:
X.head()

Unnamed: 0_level_0,Pclass,Age,SibSp,Parch,Fare,female,Embarked_S,Embarked_C,Embarked_Q,Title_Mr,Title_Mrs,Title_Miss,Title_Master,title_Other
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1,3,22.0,1,0,7.25,0,1,0,0,1,0,0,0,0
2,1,38.0,1,0,71.2833,1,0,1,0,0,1,0,0,0
3,3,26.0,0,0,7.925,1,1,0,0,0,0,1,0,0
4,1,35.0,1,0,53.1,1,1,0,0,0,1,0,0,0
5,3,35.0,0,0,8.05,0,1,0,0,1,0,0,0,0


# Train with engineered features

In [37]:
clf = LinearSVC(random_state=0)
clf.fit(X, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
     verbose=0)

In [38]:
clf = LinearSVC(random_state=0)

scores = cross_val_score(clf, X, y, cv=10)

print scores
print scores.mean()

[ 0.68888889  0.67777778  0.70786517  0.78651685  0.78651685  0.7752809
  0.68539326  0.71910112  0.86516854  0.72727273]
0.741978209057


This improves on `0.69915560095335372`!

# Where do you go from here?

* Visualizing data ([Matplotlib](http://matplotlib.org/) and [Seaborn](http://seaborn.pydata.org/))
* [Dimensionality reduction](http://scikit-learn.org/stable/modules/decomposition.html#decompositions) (e.g. [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection) and [Principal Component Analysis](http://scikit-learn.org/stable/modules/decomposition.html#pca))
* More feature engineering
* Compare with other classifiers (see [algorithm cheat sheet](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html))
* Hyper-parameter tuning (e.g. [Grid Search](http://scikit-learn.org/stable/modules/grid_search.html#grid-search))
* Ensemble models

# Resources

* [Pandas](pandas.pydata.org)
* [NumPy](http://www.numpy.org/)
* [Scikit-learn](scikit-learn.org) - 