Here is my solution to the well-known Titanic task from Kaggle: some passengers survived the disaster and some did not, and the task is to predict who survived from characteristics such as sex, age and so on.

I didn't aim for the highest possible score; instead, I followed the general steps for solving a machine learning task.

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score

In [2]:
# read data
df_train = pd.read_csv('titanic_train.csv')
df_test = pd.read_csv('titanic_test.csv')

Let's explore the data.

In [3]:
df_train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


The Cabin column is missing in 687 of 891 rows (about 77%), so it's better to remove the whole column.

In [5]:
# remove Cabin column
df_train = df_train.drop(['Cabin'], axis=1)
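
The share of missing values per column can also be computed directly instead of being read off `info()`. A minimal sketch on a made-up toy frame (the `toy` data is hypothetical, not the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_train.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing values.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Run on the real training set before the drop, the same expression would rank Cabin first.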

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB


Age is likely to be a useful predictor, so I'm going to keep it. There are different strategies for dealing with missing values; here I replace them with the median. To get a more representative estimate, I compute the median over both the train and test data.

In [7]:
median_age = pd.concat([df_train['Age'], df_test['Age']], axis=0).median()

In [8]:
median_age

28.0

In [9]:
df_train['Age'] = df_train['Age'].fillna(median_age)
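
Pooling train and test to compute the median is one option. Another common option, not the one used in this notebook, is to learn the fill value from the training rows only and reuse it on the test rows. A hedged sketch with scikit-learn's `SimpleImputer` on made-up values (`age_train` and `age_test` are hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy Age columns (2-D, as scikit-learn expects).
age_train = np.array([[22.0], [np.nan], [26.0], [35.0], [40.0]])
age_test = np.array([[np.nan], [30.0]])

imputer = SimpleImputer(strategy='median')
age_train_filled = imputer.fit_transform(age_train)  # median learned from train only
age_test_filled = imputer.transform(age_test)        # same median reused for test

print(imputer.statistics_)
```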

Now let's make sure that there are no missing values in the Age column anymore.

In [10]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB


There are still two missing values in the Embarked column. These values are categorical, so a median is not defined; instead, I fill them with the most frequent value. Let's look at the value frequencies.

In [11]:
pd.concat([df_train['Embarked'], df_test['Embarked']], axis=0).value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

The clear leader is 'S', so it will be used for filling.

In [12]:
df_train['Embarked'] = df_train['Embarked'].fillna('S')
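
The most frequent value can also be picked programmatically rather than hard-coded. A small sketch on a made-up column (the `embarked` series is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for Embarked.
embarked = pd.Series(['S', 'C', np.nan, 'S', 'Q', np.nan])

most_frequent = embarked.mode()[0]                # mode() ignores NaN by default
embarked_filled = embarked.fillna(most_frequent)  # fill the gaps with that value

print(most_frequent)
```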

In [13]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB


Next, let's create dummy variables for the two categorical columns: Sex and Embarked.

In [14]:
df_train = pd.get_dummies(df_train, columns=['Sex', 'Embarked'])

In [15]:
df_train.head()

,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,0,1,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,1,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,1,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,1,0,0,0,1
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,0,1,0,0,1


Now it's time to choose the features used for training. For each dummy-encoded column we keep all indicator columns but one, because the dropped one is fully determined by the rest. So for Sex we use Sex_male, and for Embarked we use Embarked_Q and Embarked_C.

In [16]:
X_train = df_train[['Age', 'Pclass', 'Sex_male', 'Fare', 'Embarked_Q', 'Embarked_C', 'SibSp', 'Parch']].values
y_train = df_train['Survived'].values
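
Dropping one indicator per category can also be done at encoding time with the `drop_first` argument of `pd.get_dummies`. A minimal sketch on a made-up frame (the `toy` data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

# drop_first=True removes the first level (in sorted order) of each encoded column.
dummies = pd.get_dummies(toy, columns=['Sex', 'Embarked'], drop_first=True)
print(list(dummies.columns))
```

Note that `drop_first` removes Embarked_C, whereas the cell above leaves out Embarked_S. Either choice of reference category is valid; it only changes which level the remaining indicators are compared against.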

Now that we have all the values for X, it's time to scale them.

In [17]:
from sklearn.preprocessing import StandardScaler

In [18]:
X_train = StandardScaler().fit_transform(X_train)
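
If predictions on the test set are needed later, the test features have to be scaled with the statistics learned from the training set. A sketch of that pattern on made-up numbers, assuming the fitted scaler is kept in a variable (`X_tr`, `X_te` and `scaler` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: two columns (say, Age and Fare).
X_tr = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])
X_te = np.array([[30.0, 10.0]])

scaler = StandardScaler().fit(X_tr)   # learn per-column mean and std from train
X_tr_scaled = scaler.transform(X_tr)  # each column now has mean 0 and std 1
X_te_scaled = scaler.transform(X_te)  # test rows reuse the train statistics

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```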

For the learning step, let's try several techniques and pick the winner. I will use the following:

* Logistic Regression
* Stochastic Gradient Descent
* Decision Tree
* Gaussian Naive Bayes
* Gradient Boosting
* Random Forest
* Support Vector Machines
* k Nearest Neighbours


I will try several parameter sets for almost all of the models. Fortunately, Python has libraries that make this rather easy.

In [19]:
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn import tree, svm
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [20]:
def train(model, params, train_data_X, train_data_y):
    # Grid-search params for model with 5-fold cross-validation
    # and return the best mean score with the parameters that produced it.
    grid = GridSearchCV(model, params, cv=5, n_jobs=-1)
    grid.fit(train_data_X, train_data_y)

    return grid.best_score_, grid.best_params_
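
For classifiers, `GridSearchCV` scores each parameter combination by mean cross-validated accuracy unless another `scoring` is given. One variation, not used in this notebook, is to put the scaler inside a `Pipeline` so that it is refit on the training part of every fold. A hedged sketch on synthetic data (`X_demo`, `y_demo` and the step names are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaling happens inside each CV fold because it is a pipeline step.
pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
grid = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_demo, y_demo)

print(grid.best_score_, grid.best_params_)
```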

In [25]:
models = [
    ['Logistic Regression', LogisticRegression(), 
     {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}],
    ['Stochastic Gradient Descent', SGDClassifier(max_iter=5, tol=None), 
     {'alpha': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}],
    ['Decision Tree', tree.DecisionTreeClassifier(), {}],
    ['Gaussian Naive Bayes', GaussianNB(), {}],
    ['Gradient Boosting', XGBClassifier(), 
     {"n_estimators": [120, 150, 180, 200, 230, 250, 270, 300, 330], 
        "max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10], 
        "learning_rate": [0.01, 0.02, 0.05, 0.06, 0.07]}],
    ['Random Forest', RandomForestClassifier(),  
        { 'n_estimators': [10, 20, 50, 100, 200, 300, 400, 500],
          'min_samples_split': [5, 8, 10, 12],
          'min_samples_leaf': [2, 3, 4, 5, 6, 7, 8],
          'max_features': ['sqrt', 'log2']  # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
        }
    ],
    ['Support Vector Machines', svm.SVC(), 
     {'kernel': ['rbf', 'sigmoid', 'linear', 'poly'], 
        'C': [0.001, 0.01, 0.1, 1, 10, 100]}
    ],
    ['k Nearest Neighbours', KNeighborsClassifier(), {
        'n_neighbors': range(1, 20), 
        'leaf_size': range(10, 81, 10)}
    ],
]
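
Grid search fits every combination of the listed values, so the cost grows multiplicatively. A short sketch that counts the candidates for the Gradient Boosting grid above with `ParameterGrid` (the variable names are only for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Gradient Boosting grid above.
xgb_grid = {
    'n_estimators': [120, 150, 180, 200, 230, 250, 270, 300, 330],
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.01, 0.02, 0.05, 0.06, 0.07],
}

n_candidates = len(ParameterGrid(xgb_grid))  # 9 * 9 * 5 combinations
n_fits = n_candidates * 5                    # one fit per fold with cv=5

print(n_candidates, n_fits)
```

This is 405 candidates and 2025 fits, which is why the next cell can take a while to run.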

In [26]:
for model in models:
    best_score, best_params = train(model=model[1], params=model[2], train_data_X=X_train, train_data_y=y_train)
    print(model[0], best_score, best_params)

Logistic Regression 0.787878787879 {'C': 1000}
Stochastic Gradient Descent 0.799102132435 {'alpha': 0.01}
Decision Tree 0.778900112233 {}
Gaussian Naive Bayes 0.775533108866 {}
Gradient Boosting 0.836139169473 {'n_estimators': 180, 'learning_rate': 0.02, 'max_depth': 8}
Random Forest 0.837261503928 {'n_estimators': 500, 'min_samples_split': 5, 'max_features': 'log2', 'min_samples_leaf': 2}
Support Vector Machines 0.828282828283 {'C': 1, 'kernel': 'rbf'}
k Nearest Neighbours 0.819304152637 {'n_neighbors': 14, 'leaf_size': 20}


So the winner is Random Forest with a cross-validated accuracy of 0.837. Still, Gradient Boosting is very close at 0.836.
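
A possible next step, not part of this notebook, is to refit the winning model with the parameters reported above and predict on new rows. A rough sketch on synthetic data (`X_demo`, `y_demo` and `X_new` are hypothetical; real use would need the test set preprocessed the same way as the training set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_new = X_demo[:5]  # pretend these are unseen passengers

# Best parameters as printed by the grid search above.
forest = RandomForestClassifier(
    n_estimators=500,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='log2',
    random_state=0,
)
forest.fit(X_demo, y_demo)

predictions = forest.predict(X_new)  # one 0/1 label per row
print(predictions)
```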