# Titanic: Machine Learning from Disaster

Description of the challenge [here](https://www.kaggle.com/c/titanic).

Our data is in .csv format. Each row represents a passenger on the Titanic and holds some information about them. Let's take a look at the columns:

 - PassengerId -- A numerical id assigned to each passenger.
 - Survived -- Whether the passenger survived (1), or didn't (0). We'll be making predictions for this column.
 - Pclass -- The class the passenger was in -- first class (1), second class (2), or third class (3).
 - Name -- The name of the passenger.
 - Sex -- The gender of the passenger -- male or female.
 - Age -- The age of the passenger. Can be fractional.
 - SibSp -- The number of siblings and spouses the passenger had on board.
 - Parch -- The number of parents and children the passenger had on board.
 - Ticket -- The ticket number of the passenger.
 - Fare -- How much the passenger paid for the ticket.
 - Cabin -- Which cabin the passenger was in.
 - Embarked -- Where the passenger boarded the Titanic.
 
A good first step is to think logically about the columns and what we're trying to predict. Which variables might plausibly affect the value of Survived? (Reading more about the Titanic might help here.)

**Exercise.** Discuss which variables are more likely to have had an impact on the survival odds of the passengers. 

## Looking at the data

Using your recent knowledge of pandas, load the training data into a DataFrame and perform some basic exploratory operations on it to get acquainted with its features.

In [7]:
import pandas as pd
data = pd.read_csv('data/titanic/train.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [10]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [17]:
data[pd.isnull(data['Cabin'])][:5]


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
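
One way to start checking intuitions about which variables matter is to compare survival rates across groups. The sketch below is only an illustration: it rebuilds the first five rows shown above as a tiny frame (the name `sample` is an assumption), so the rates it prints say nothing about the full dataset. On the real data, the same `groupby` calls can be applied to `data`.

```python
import pandas as pd

# The first five rows of the training data as displayed above (subset of columns).
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass':   [3, 1, 3, 1, 3],
    'Sex':      ['male', 'female', 'female', 'female', 'male'],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = sample.groupby('Sex')['Survived'].mean()
rate_by_class = sample.groupby('Pclass')['Survived'].mean()
print(rate_by_sex)
print(rate_by_class)
```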


## Handling missing data

You might have noticed that some records have empty fields, so we need to handle missing data. There are many strategies for this, but a simple one is to fill in the missing values of a column with the median of that column. (Question: why the median and not the mean?)

Implement a solution to handle missing data on the Titanic dataframe.
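
As a hint for the question above, here is a minimal sketch on a toy column. The values are hypothetical, chosen so that one entry is far larger than the rest. It shows how a single extreme value shifts the mean but leaves the median untouched.

```python
import pandas as pd

# Hypothetical toy column: four small values, one extreme value, one missing.
fares = pd.Series([7.25, 7.925, 8.05, 8.4583, 512.0, None])

mean_fare = fares.mean()      # pulled upward by the single large value
median_fare = fares.median()  # middle value, not affected by the outlier

# Fill the missing entry with the median.
filled = fares.fillna(median_fare)
print(mean_fare, median_fare)
```

Both `mean()` and `median()` skip missing values by default, so they can be computed before any filling is done.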

In [21]:
data[~pd.isnull(data['Cabin'])][:5]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


Cabin is missing for 687 of the 891 rows, so we drop the column rather than try to fill it in.

In [42]:
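# Drop Cabin; the membership check makes the cell safe to re-run.
if 'Cabin' in data.columns:
    data = data.drop('Cabin', axis=1)
# Take the median of Age directly; data.median() can raise on non-numeric columns in recent pandas.
med_age = data['Age'].median()
print(med_age)
data['Age'] = data['Age'].fillna(med_age)
# Fill the two missing ports of embarkation with 'C'.
data['Embarked'] = data['Embarked'].fillna('C')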

28.0


In [44]:
data.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

## Non-numeric columns

Several of our columns are non-numeric, which is a problem when it comes time to make predictions -- we can't feed non-numeric columns into a machine learning algorithm and expect it to make sense of them.

We have to either exclude our non-numeric columns when we train our algorithm (Name, Sex, Embarked, and Ticket; Cabin was already dropped above), or find a way to convert them to numeric columns.

Decide which non-numeric columns to keep for the analysis, based on your domain knowledge of the problem. Then, turn them into numeric values (categorical variables).

In [51]:
for x in ['Name', 'Ticket', 'Cabin']:
    if x in data.columns:
        data = data.drop(x,axis=1)
data.head()
    

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,male,22.0,1,0,7.25,S
1,2,1,1,female,38.0,1,0,71.2833,C
2,3,1,3,female,26.0,0,0,7.925,S
3,4,1,1,female,35.0,1,0,53.1,S
4,5,0,3,male,35.0,0,0,8.05,S


In [56]:
print(data['Embarked'].unique())
print(data['Sex'].unique())


['S' 'C' 'Q']
['male' 'female']


In [74]:
# Encode Sex as an integer: male -> 1, female -> 0.
data['SexC'] = (data['Sex'] == 'male').astype(int)

data.head()
    

     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch      Fare  \
0              1         0       3    male  22.0      1      0    7.2500   
1              2         1       1  female  38.0      1      0   71.2833   
2              3         1       3  female  26.0      0      0    7.9250   
3              4         1       1  female  35.0      1      0   53.1000   
4              5         0       3    male  35.0      0      0    8.0500   
5              6         0       3    male  28.0      0      0    8.4583   
6              7         0       1    male  54.0      0      0   51.8625   
7              8         0       3    male   2.0      3      1   21.0750   
8              9         1       3  female  27.0      0      2   11.1333   
9             10         1       2  female  14.0      1      0   30.0708   
10            11         1       3  female   4.0      1      1   16.7000   
11            12         1       1  female  58.0      0      0   26.5500   

In [75]:
# Encode the port of embarkation as an integer code.
embarked_codes = {'S': 0, 'C': 1, 'Q': 2}
data['EmbarkedC'] = data['Embarked'].map(embarked_codes)

data.head()
    

     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch      Fare  \
0              1         0       3    male  22.0      1      0    7.2500   
1              2         1       1  female  38.0      1      0   71.2833   
2              3         1       3  female  26.0      0      0    7.9250   
3              4         1       1  female  35.0      1      0   53.1000   
4              5         0       3    male  35.0      0      0    8.0500   
5              6         0       3    male  28.0      0      0    8.4583   
6              7         0       1    male  54.0      0      0   51.8625   
7              8         0       3    male   2.0      3      1   21.0750   
8              9         1       3  female  27.0      0      2   11.1333   
9             10         1       2  female  14.0      1      0   30.0708   
10            11         1       3  female   4.0      1      1   16.7000   
11            12         1       1  female  58.0      0      0   26.5500   

In [77]:
if 'Sex' in data.columns:
    data = data.drop('Sex',axis=1)
if 'Embarked' in data.columns:
    data = data.drop('Embarked',axis=1)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,SexC,EmbarkedC
0,1,0,3,22.0,1,0,7.25,1,0
1,2,1,1,38.0,1,0,71.2833,0,1
2,3,1,3,26.0,0,0,7.925,0,0
3,4,1,1,35.0,1,0,53.1,0,0
4,5,0,3,35.0,0,0,8.05,1,0
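
The integer codes used for `EmbarkedC` above imply an order (S < C < Q) that some models, linear ones in particular, will treat as meaningful. A possible alternative, sketched here on a toy frame with made-up rows (the names `toy` and `encoded` are assumptions), is one-hot encoding with `pd.get_dummies`, which creates one 0/1 column per category.

```python
import pandas as pd

# Toy frame with the same categorical columns as the Titanic data.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# One 0/1 indicator column per category, with no implied ordering between ports.
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], dtype=int)
print(encoded.columns.tolist())
```

Whether this helps depends on the algorithm: tree-based models are usually less sensitive to the choice of encoding than linear ones.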


## Cross validation

We want to train the algorithm on different data than we make predictions on. This is critical if we want to avoid overfitting. Overfitting is what happens when a model fits itself to "noise" rather than signal. Every dataset has its own quirks that don't exist in the full population. For example, if I asked you to predict the top speed of a car from its horsepower and other characteristics, and gave you a dataset that happened to contain mostly cars with very high top speeds, you would create a model that overstated speed. The way to find out whether your model is doing this is to evaluate its performance on data it wasn't trained on.

Every machine learning algorithm can overfit, although some are much less prone to it. If you evaluate your algorithm on the same dataset that you train it on, it's impossible to know if it's performing well because it overfit itself to the noise, or if it actually is a good algorithm.

Luckily, cross validation is a simple way to detect overfitting. To cross validate, you split your data into some number of parts (or "folds"). Let's use 3 as an example. You then do this:

 - Combine the first two parts, train a model, make predictions on the third.
 - Combine the first and third parts, train a model, make predictions on the second.
 - Combine the second and third parts, train a model, make predictions on the first.

This way, we generate predictions for the whole dataset without ever evaluating accuracy on the same data the model was trained on.
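
The three-fold scheme can be sketched with scikit-learn's `KFold` on nine toy samples (the array `X` is a placeholder, not Titanic data). Without shuffling, the folds are contiguous blocks, and the held-out block moves from first to last, so the order differs from the list above but the idea is the same.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(9, 1)  # nine toy samples, one feature each

kf = KFold(n_splits=3)  # no shuffling, so folds are contiguous blocks

# Each split yields the row indices to train on and the indices to predict on.
splits = [(train_idx.tolist(), test_idx.tolist())
          for train_idx, test_idx in kf.split(X)]
for train_idx, test_idx in splits:
    print('train:', train_idx, 'test:', test_idx)
```

Every sample appears in exactly one test fold, which is what lets us collect predictions for the whole dataset.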

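Familiarize yourself with cross-validation in scikit-learn by reading about [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html). In current scikit-learn releases it lives in `sklearn.model_selection`; the older `sklearn.cross_validation` module has been removed.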

## Applying scikit-learn ML algorithms to the Titanic dataset

 - Follow the scikit-learn API and apply the ML algorithm of your choice to a subset of the Titanic dataset.
 - Using the wrappers described [here](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), compute cross-validated metrics for said algorithm.
 - If your algorithm takes various parameters as input, consider tuning them to improve performance.
 - Repeat with various algorithms and compare their performance.
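
As a starting point for the steps above, here is a minimal sketch using `cross_val_score` with logistic regression. So that it runs without the CSV file, it uses a synthetic stand-in for the cleaned frame: the values and the survival rule are made up, and only the column names match the notebook. On the real data you would pass `data[predictors]` and `data['Survived']` instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned Titanic frame (hypothetical values).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),
    'Age': rng.uniform(1, 70, n),
    'Fare': rng.uniform(5, 100, n),
    'SexC': rng.integers(0, 2, n),
})
# Made-up rule: women (SexC == 0) survive, with 10% of labels flipped as noise.
noise = rng.random(n) < 0.1
toy['Survived'] = ((toy['SexC'] == 0) ^ noise).astype(int)

predictors = ['Pclass', 'Age', 'Fare', 'SexC']
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold; the mean summarises the model.
scores = cross_val_score(model, toy[predictors], toy['Survived'], cv=3)
print(scores, scores.mean())
```

Swapping `LogisticRegression` for another estimator, or changing its parameters, and comparing the mean scores is one simple way to carry out the last two steps.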

## Generating new features

If your data is rich, sometimes new features can be extracted from it that can better inform your machine learning algorithm. You can try generating new features from the name of the passengers, the members of their families present, etc.
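
Two commonly tried candidates are sketched below on three rows copied from the training data shown earlier: a family size and a title extracted from the name. The column names `FamilySize` and `Title` are assumptions for illustration. Note that `Name` was dropped in an earlier cell, so on the real frame this would have to run before that step.

```python
import pandas as pd

# Three rows copied from the training data shown earlier.
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Palsson, Master. Gosta Leonard'],
    'SibSp': [1, 0, 3],
    'Parch': [0, 0, 1],
})

# Family size: siblings/spouses + parents/children + the passenger.
sample['FamilySize'] = sample['SibSp'] + sample['Parch'] + 1

# Title: the text between the comma and the first period in the name.
sample['Title'] = (sample['Name']
                   .str.extract(r',\s*([^.]+)\.', expand=False)
                   .str.strip())
print(sample[['FamilySize', 'Title']])
```

The title would still need converting to a numeric column before it can be fed to a model.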

## Finding the best features

Feature engineering is often one of the most important parts of a machine learning task, and there are many more features we could calculate. But we also need a way to figure out which features are the best.

One way to do this is to use univariate feature selection. This essentially goes column by column and scores how strongly each column is related to what we're trying to predict (Survived).

As usual, scikit-learn has a function that will help us with feature selection, [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html). This selects the best features from the data, and allows us to specify how many it selects.
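
A minimal sketch of `SelectKBest` with the `f_classif` scoring function follows. It again uses synthetic data with a made-up survival rule (only the column names come from the notebook), so the scores illustrate the mechanics rather than anything about the real passengers.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in (hypothetical values): only SexC is related to the target.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'SexC': rng.integers(0, 2, n),
    'Age': rng.uniform(1, 70, n),
    'PassengerId': np.arange(1, n + 1),
})
# Made-up rule: women (SexC == 0) survive, with 10% of labels flipped as noise.
noise = rng.random(n) < 0.1
toy['Survived'] = ((toy['SexC'] == 0) ^ noise).astype(int)

predictors = ['SexC', 'Age', 'PassengerId']
selector = SelectKBest(score_func=f_classif, k=1)  # keep the single best column
selector.fit(toy[predictors], toy['Survived'])

# Higher score means a stronger univariate relationship with the target.
scores = pd.Series(selector.scores_, index=predictors)
selected = [p for p, keep in zip(predictors, selector.get_support()) if keep]
print(scores)
print(selected)
```

Because each column is scored on its own, this approach cannot see features that only matter in combination with others.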