# Titanic Survival Prediction

This is a short lesson on data cleaning and classification prediction, using Kaggle's 
introductory competition, [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic).

In [1]:
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, \
    RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
import re

# Importing The Data
We're going to be using Pandas to work with our data, so we'll simply import our two 
csv files directly into Pandas DataFrames:

In [2]:
import pandas as pd
import numpy as np

train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

To simplify our data processing, we can combine these two data frames into a single
data frame. Since the test data frame is missing the 'Survived' column, we'll fill 
this in with `np.nan`, so we can remember which rows are test data and which are 
training data later on.

In [3]:
test_df['Survived'] = np.nan
df = train_df.append(test_df, sort=False)

With that out of the way, let's start looking at the actual contents of our data and 
put it into a form that'll best suited for training our classifier.

# Clean-up Time 

Now that we've loaded our data, we need to clean it up. Let's take a look at what we're 
dealing with first:

In [4]:
df.dtypes

PassengerId      int64
Survived       float64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [5]:
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
170,1062,,3,"Lithman, Mr. Simon",male,,0,0,S.O./P.P. 251,7.55,,S
561,562,0.0,3,"Sivic, Mr. Husein",male,40.0,0,0,349251,7.8958,,S
300,1192,,3,"Olsson, Mr. Oscar Wilhelm",male,32.0,0,0,347079,7.775,,S
124,125,0.0,1,"White, Mr. Percival Wayland",male,54.0,0,1,35281,77.2875,D26,S
753,754,0.0,3,"Jonkoff, Mr. Lalio",male,23.0,0,0,349204,7.8958,,S


If we want to look for linear or binary relationships between 
our data and the survived value of each passenger, we need to convert this human-readable 
data into something that's more easy for a machine to understand: numbers!

Let's go through this column-by-column, explaining the process for each new case as 
we come across it.

# Sex
The 'Sex' column is a great place to start because of its simplicity. There are only 
two cases to worry about, 'male' and 'female', and there isn't any missing data. So all 
we need to do is map these strings to a binary case for a classifier algorithm to understand.

Let's use the mapping `{0: 'female', 1: 'male'}`:

In [6]:
mapping = {
    'female': 0,
    'male': 1
}
df['Sex'] = df['Sex'].apply(lambda x: mapping[x])
df['Sex'][:3]

0    1
1    0
2    0
Name: Sex, dtype: int64

With just one step, we have our first binary column! Simple enough. Now onto the next one...

# Age
The 'Age' column is already in a nice numeric form, but let's take a look into how well 
it actually corresponds to whether a passenger survived in its current state.

In [7]:
pd.isna(df['Age']).sum()

263

Looks like we have some NA values! Considering that age is a nice continuous value, let's just assign
the median age to each missing value.

In [8]:
med_age = np.median(df['Age'][~pd.isna(df['Age'])])
df['Age'] = df['Age'].fillna(med_age)


Now that we have only real values, let's look at the correspondence between age and survival:

In [9]:
df[['Survived', 'Age']].corr()

Unnamed: 0,Survived,Age
Survived,1.0,-0.06491
Age,-0.06491,1.0


Quick refresher on correspondence: this value ranges between -1 and +1. 

A correspondence of 1 means two sets are strongly linearly 
correlated. Ideally, with a correspondence of 1, you could apply a simple formula of the form 
`y = mx + b` to map from one set to the other, with `m` being positive.

A correspondence of -1 means the two sets are strongly linearly correlated as well, just ini the negative 
direction. You can still apply the mapping `y = mx + b`, but now you'd have a negative `m`.

A correspondence of 0 means that the two sets you're looking at have no similarity. There exists no mapping 
from one set to the other.

Back to our data! Seeing that the correspondence between age and survival is nearly zero, we need to 
see if there's anything we can do to actually compare the two columns! One popular method 
of identifying nonlinear correspondence is by binning, or making bins of ages, so we have a bin filled with 
our 0-10 year old passengers, our 11-20 year old passengers, and so on and so forth.

In [10]:
bin_space = 10
bin_walls = range(0, 100, bin_space)
bin_intervals = [pd.Interval(low, low + bin_space) for low in bin_walls]
bins = pd.IntervalIndex(bin_intervals)

age_bin_mapping = {}
for i, b in enumerate(bins):
    age_bin_mapping[b] = i

age_bins = pd.cut(df['Age'], bins=bins)
df['Age_bin'] = age_bins.apply(lambda x: age_bin_mapping[x]).astype(int)

df['Age_bin'][:10]

0    2
1    3
2    2
3    3
4    3
5    2
6    5
7    0
8    2
9    1
Name: Age_bin, dtype: int64

Now we have all of our passengers assigned to bins! Let's see what the correspondence is for this...

In [11]:
df[['Survived', 'Age_bin']].corr()

Unnamed: 0,Survived,Age_bin
Survived,1.0,-0.051406
Age_bin,-0.051406,1.0


As expected, the age bins linearly correspond with survival just as little as the raw age did.
There might be a strong correspondence hiding here still, so let's look at the survival rate for each bin:

In [12]:
for bin_ind in age_bin_mapping.values():
    df_bin = df[df['Age_bin'] == bin_ind]
    if len(df_bin) == 0:
        continue
    print("Bin {} survival rate: {:.2g}".format(
        bin_ind, 
        len(df_bin[df_bin['Survived'] == 1]) / len(df_bin)
    ))
    

Bin 0 survival rate: 0.44
Bin 1 survival rate: 0.27
Bin 2 survival rate: 0.22
Bin 3 survival rate: 0.33
Bin 4 survival rate: 0.25
Bin 5 survival rate: 0.27
Bin 6 survival rate: 0.15
Bin 7 survival rate: 0.17


Looking at this breakdown, there's definitely some sort of trend. Bin 0 had a much higher 
rate of survival than the others, while bins 6 and 7 had much lower rates of survival. In 
order to extract this in a more machine-readable way, we're going to expand this categorical 
 column into a set of columns, one for each bin, with the value `1` if the row belongs to 
 the bin and `0` otherwise.

In [13]:
age_bins = pd.get_dummies(df['Age_bin']).add_prefix('Age_bin_')
df = df.join(age_bins)

In [14]:
df[:1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Embarked,Age_bin,Age_bin_0,Age_bin_1,Age_bin_2,Age_bin_3,Age_bin_4,Age_bin_5,Age_bin_6,Age_bin_7
0,1,0.0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,...,S,2,0,0,1,0,0,0,0,0


Now we have a DataFrame with 8 extra columns, one for each age bin that we had created earlier.
Since we've coded the age column into this new form, we can drop the old columns so we 
don't try to train on different representations of the same data.

In [15]:
df = df.drop(columns=['Age', 'Age_bin'])

# Pclass (Ticket class)
Now that we've finished cleaning up the age data, let's move on to `Pclass`. First, we should look at its
correspondence to see if it's already good or not.

In [16]:
df[['Survived', 'Pclass']].corr()

Unnamed: 0,Survived,Pclass
Survived,1.0,-0.309447
Pclass,-0.309447,1.0


A correspondence of -0.3 is actually fairly strong, compared to our previous examinations
of the age bins. We can break this value down further to see if there are any individual 
ticket classes that show strong individual correspondences with the survival rate.

In [23]:
p_classes = sorted(df['Pclass'].unique())
for p_class in p_classes:
    df_class = df[df['Pclass'] == p_class]
    survival = len(df_class[df_class['Survived'] == 1]) / len(df_class)
    print("Passenger class {}: {:.2g}".format(p_class, survival))


Passenger class 1: 0.36
Passenger class 2: 0.28
Passenger class 3: 0.16


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_bin_0,Age_bin_1,Age_bin_2,Age_bin_3,Age_bin_4,Age_bin_5,Age_bin_6,Age_bin_7


With these survival rates, it looks like there is a pretty nice linear relationship in the 
data already, so we'll leave the Pclass column as-is.

And just to be safe, we should check to make sure that there aren't any missing data
from this column.

In [27]:
len(df[pd.isna(df['Pclass'])])

0

No missing values, good! Now we can move on to the next column to analyze.

# SibSP (Siblings and spouses)

# Parch (Parents and children)


In [17]:
def get_title(name):
	title_search = re.search(' ([A-Za-z]+)\.', name)
	# If the title exists, extract and return it.
	if title_search:
		return title_search.group(1)
	return ""

def categorize_titles(df):
    df['Title'] = df['Name'].apply(get_title)
    df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col',\
 	'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    df['Title'] = df['Title'].replace('Mlle', 'Miss')
    df['Title'] = df['Title'].replace('Ms', 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')
    df['Title'] = df['Title'].fillna('Rare')
    return df

def map_col_to_ids(df, col_name, col_id_name):
    temp_df = pd.DataFrame({col_name: df[col_name].unique(), 
                            col_id_name:range(len(df[col_name].unique()))})
    df = df.merge(temp_df, on=col_name, how='left')
    return df

def explode_col(df, col):
    values = df[col].unique()
    for val in values:
        df[col+"_"+str(val)] = df[col].apply(lambda x: 1 if x == val else 0)
    return df

def clean(df):    
    # Survived
    if 'Survived' in df.columns:
        df['Survived'] = df['Survived'].astype(int)
    
    # Age
    df['Age'] = col_as_numeric(df['Age'], float)
    
    # Age bins
    bin_cuts = [0, 11, 23, 34, 45, 57, 68, 100]
    df['Age_bin'] = pd.cut(df['Age'], bin_cuts).apply(lambda x: str(x))
    actual_bins = df['Age_bin'].unique()
    for age_bin in actual_bins[:-1]:
        if age_bin == np.nan:
            continue
        df[age_bin] = df['Age_bin'].apply(lambda x: 1 if x == age_bin else 0)
    _ = df.pop('Age')
    _ = df.pop('Age_bin')

    # Sex
    df = map_col_to_ids(df, 'Sex', 'Sex_id')
    df = explode_col(df, 'Sex_id')
    _ = df.pop('Sex')
    _ = df.pop('Sex_id')
    
    
    # SibSp - # of siblings / spouses aboard 
    df['SibSp'] = col_as_numeric(df['SibSp'], int, 0)
    
    # Parch - # of parents / children aboard
    df['Parch'] = col_as_numeric(df['Parch'], int, 0)
    
    # Total relatives
    df['Relatives'] = df['SibSp'] + df['Parch']
    
    # Is alone
    df['IsAlone'] = df['Relatives'].apply(lambda x: 1 if x == 0 else 0)
    
    df = map_col_to_ids(df, 'Relatives', 'Relatives_id')
    df = explode_col(df, 'Relatives_id')
    _ = df.pop('Relatives')
    _ = df.pop('Relatives_id')
    
    # Names and titles
    df = categorize_titles(df)
    df = map_col_to_ids(df, 'Title', 'Title_id')
    df = explode_col(df, 'Title_id')
    _ = df.pop('Title')
    _ = df.pop('Title_id')
    
    # # Fare
    df['Fare'] = col_as_numeric(df['Fare'], float)
    bin_cuts = [0, 15, 25, 35, 45, 60, 200, 600]
    df['Fare_bin'] = pd.cut(df['Fare'], bin_cuts).apply(lambda x: str(x))
    df = map_col_to_ids(df, 'Fare_bin', 'Fare_id')
    df = explode_col(df, 'Fare_id')
    _ = df.pop('Fare')
    _ = df.pop('Fare_bin')
    _ = df.pop('Fare_id')
     
    # Pclass
    df['Pclass'] = col_as_numeric(df['Pclass'], int)
    df = map_col_to_ids(df, 'Pclass', "Pclass_id")
    df = explode_col(df, 'Pclass_id')
    _ = df.pop('Pclass')
    _ = df.pop('Pclass_id')
    
    # Cabin - by letter class
    df['CabinClass'] = df['Cabin'].apply(lambda x: x if len(x) == 0 else x[0])
    df = map_col_to_ids(df, 'CabinClass', "CabinClass_id")
    df = explode_col(df, 'CabinClass_id')
    _ = df.pop('CabinClass')
    _ = df.pop('CabinClass_id')
    
    # Embarked
    df = map_col_to_ids(df, 'Embarked', "Embarked_id")
    df = explode_col(df, 'Embarked_id')
    _ = df.pop('Embarked')
    _ = df.pop('Embarked_id')
        
    return df

df = clean(train_df)
t_df = clean(test_df)

NameError: name 'col_as_numeric' is not defined

In [None]:
for col in df.columns:
    if col not in t_df.columns:
        t_df[col] = 0

In [None]:
# Check correspondences to find any over 0.5 with survival 
corr = df.corr().abs()

plt.matshow(df.corr().abs())
cb = plt.colorbar()
plt.title('Correlation Matrix')
plt.show()

In [None]:
num_cols = df.select_dtypes('number').columns
df_numeric = df[num_cols]
df_numeric = df_numeric.dropna()
survived = df_numeric.pop('Survived')

X = df_numeric.to_numpy()
y = survived.to_numpy()

test_p = 0.9
t = int(test_p * len(y))
X_test = X[:t]
X_train = X[t:]
y_test = y[:t]
y_train = y[t:]

t_num_cols = t_df.select_dtypes('number').columns
t_df_numeric = t_df[t_num_cols]
t_df_numeric = t_df_numeric.dropna()
t_survived = t_df_numeric.pop('Survived')

X_final = t_df_numeric.to_numpy()

In [None]:
clf = svm.SVC(gamma='scale', kernel='linear', random_state=0, decision_function_shape='ovo')

In [None]:
lin_clf = svm.LinearSVC(random_state=0)

In [None]:
ada_clf = AdaBoostClassifier(base_estimator=RandomForestClassifier(n_estimators=10, random_state=0), 
                             n_estimators=100, random_state=0)

In [None]:
gr_clf = GradientBoostingClassifier(n_estimators=100, random_state=0)

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0)

In [None]:
erf_clf = ExtraTreesClassifier(n_estimators=100, random_state=0)

In [None]:
eclf = VotingClassifier(estimators=[('Vanilla SVC', clf),
                                    ('Linear SVC', lin_clf),
                                    ('AdaBoost, random forest', ada_clf), 
                                    ('Gradient Boosting, random forest', gr_clf),
                                    ('Random Forest', rf_clf),
                                    ('Extra Trees', erf_clf),
                                    ], voting='hard')
for clft in eclf.estimators:
    print("\n"+clft[0])
    clf = clft[1]
    clf.fit(X_train, y_train)
    print("- train: {}\n- test: {}".format(clf.score(X_train, y_train), 
                                           clf.score(X_test, y_test)))
eclf.fit(X_train, y_train)
print("\n====\nVoting Classifier:\n- train: {}\n- test: {}".format(eclf.score(X_train, y_train), 
                                                           eclf.score(X_test, y_test)))

In [None]:
y_final = eclf.predict(X_final)
df_final = pd.DataFrame({
    'PassengerId': t_df['PassengerId'],
    'Survived': y_final
})

df_final.to_csv("data/pred.csv", index=False)