# Restoring order and filling up missing values in Titanic kaggle dataset

First of all, it is convenient to have all the libraries in one place that we need further

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

## Titanic dataset taken from kaggle

https://www.kaggle.com/c/titanic/data

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

In [2]:
train_data = pd.read_csv('train.csv')
X_train = train_data.drop(columns='Survived')
y_train = train_data.Survived

In [3]:
X_train.head(3)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [4]:
X_train.isna().any()

PassengerId    False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In the train dataset, one can see that `Age`, `Cabin` and `Embarked` features have missing values.

In [5]:
X_test = pd.read_csv('test.csv')

In [6]:
X_test.isnull().any()

PassengerId    False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare            True
Cabin           True
Embarked       False
dtype: bool

In the test dataset, one can see that `Age`, `Fare` and `Cabin` features have missing values.

It is convenient to combine both datasets into one, but to retain the index separating them.

**Combining features from train and test datasets** 

In [7]:
X_total = pd.concat([X_train, X_test], axis=0, ignore_index=True, sort=False)
idx = X_train.shape[0]  #index from which test data starts

Let's check that everything is fine with dataframes dimensions.

In [8]:
print (X_total.shape[0] == X_train.shape[0] + X_test.shape[0])
print (X_total.shape[1] == X_train.shape[1])
print (X_total.shape[1] == X_train.shape[1])

True
True
True


In [9]:
X_total.isnull().any()

PassengerId    False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare            True
Cabin           True
Embarked        True
dtype: bool

We see that in the combined dataset the missing values are contained in `Age`, `Fare`, `Cabin` and `Embarked` columns. This is in agreement with our earlier findings. However, we are still not ready to start filling up the missing values. Firstly, let's perform a feature engineering for columns `Name`, `Ticket`, `Cabin` and `Embarked`. We are going to convert them into numerical features.

In [10]:
X_total.head(1)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


We start investigating dataset from `PassengerId` column. It is completely useless for us since it gives an information about an order. So, we drop it.

In [11]:
X_total.drop(columns='PassengerId', inplace=True)

Next goes `Pclass` feature. Let's leave it as that for now.

`Name` itself cannot affect our predictions. However, we can extract an information which may be important for us: the `Title` of a person. Depending on a `Title`, people had differrent locations on Titanic correlated with time needed to leave Titanic.

Let's find unique titles we have in our dataset.

In [12]:
X_total['Title'] = \
X_total['Name'].apply(lambda name: name.split(',')[1].split('.')[0].strip())

In [13]:
print (X_total.Title.unique())

['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
 'Sir' 'Mlle' 'Col' 'Capt' 'the Countess' 'Jonkheer' 'Dona']


Some of titles are not really unique. For example, `Mrs` and  `Mme` really mean the same.

In [14]:
title_dictionary = {'Mr': 'Mr',
                    'Mrs':'Mrs',
                    'Miss': 'Mrs',
                    'Master': 'Officer',
                    'Don': 'Royalty',
                    'Rev': 'Officer',
                    'Dr': 'Officer',
                    'Mme': 'Mrs',
                    'Ms': 'Mrs',
                    'Major':'Officer',
                    'Lady': 'Mrs',
                    'Sir': 'Mr',
                    'Mlle': 'Mrs',
                    'Col': 'Officer',
                    'Capt': 'Officer',
                    'the Countess': 'Royalty',
                    'Jonkheer': 'Royalty'}

X_total['Title'] = X_total['Title'].map(title_dictionary)

Let's check whether we have any person without a title

In [15]:
X_total[X_total.isnull().Title == True]

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
1305,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C,


We have found one person and we see that this a female. We may link her to 'Mrs'. 

In [16]:
X_total['Title'].replace(np.nan, 'Mrs', inplace=True)
X_total.drop(columns=['Name'], inplace=True)

Let's check one more time and there is not a person without a title.

In [17]:
X_total[X_total.isnull().Title == True]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title


Next we deal with `Sex` feature. Since there are only two options in our database: male or female, we perform a mapping to 1 and 0, respectively. 

In [18]:
X_total['Sex'] = X_total['Sex'].apply(lambda x: 1 if x == 'male' else 0)

We disregard `Ticket` column in our analysis.

In [19]:
X_total.drop(columns=['Ticket'], inplace=True)

This is what our database looks like for now

In [20]:
X_total.head(2)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
0,3,1,22.0,1,0,7.25,,S,Mr
1,1,0,38.0,1,0,71.2833,C85,C,Mrs


In [21]:
X_total.isnull().any()

Pclass      False
Sex         False
Age          True
SibSp       False
Parch       False
Fare         True
Cabin        True
Embarked     True
Title       False
dtype: bool

Missing values: 

`Age`  - ready for filling up

`Fare` - ready for filling up

`Cabin` - not ready for filling up

`Embarked` - not ready for filling up

Cabin and Embarked features are not ready for being filled up yet because they are categorical variables which require some additional preprocessing.

** Cabin and Embarked features **

First of all, we replace `NaN` values to categorical 'None'.

In [22]:
X_total['Embarked'].replace(np.nan, 'None', inplace=True)
X_total['Cabin'].replace(np.nan, 'None', inplace=True)

Let's see what we have in `Embarked` column

In [23]:
X_total.Embarked.value_counts()

S       914
C       270
Q       123
None      2
Name: Embarked, dtype: int64

S - is the dominant value in `Embarked` feature. Besides, we have only 2 missing values. Both of the latter may be substituted by S.

In [24]:
X_total.Embarked.fillna('S', inplace=True)

Now let's explore `Cabin` column

In [25]:
X_total.Cabin.value_counts()[:3]

None           1014
C23 C25 C27       6
G6                5
Name: Cabin, dtype: int64

In [26]:
X_total.shape[0]

1309

We have 1014 missing values in a `Cabin` column, while it has 1309 values in total. It is reasonable to drop the whole column.

In [27]:
X_total.drop(columns=['Cabin'], inplace=True)

In [28]:
X_total.isnull().any()

Pclass      False
Sex         False
Age          True
SibSp       False
Parch       False
Fare         True
Embarked    False
Title       False
dtype: bool

We are ready to start filling up the missing values by using different approaches. We will use K Nearest Neighbors (KNN) algorithm. Recall that `idx` variable is responsible for index starting from which we deal with test dataset. 

**Заменим оставшиеся пропуски (features: Age, Fare) с помощью KNN**

In [29]:
from fancyimpute import KNN 
#installing by: !conda install -c brittainhard fancyimpute --yes

Using TensorFlow backend.


In [30]:
X_total.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
Pclass      1309 non-null int64
Sex         1309 non-null int64
Age         1046 non-null float64
SibSp       1309 non-null int64
Parch       1309 non-null int64
Fare        1308 non-null float64
Embarked    1309 non-null object
Title       1309 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 81.9+ KB


In this section, we are going to fill in the missing values by using K Nearest Neighbors imputations. The latter weights samples using the mean squared difference on features for which two rows both have observed data

In [31]:
cols = ['Age','Fare']
data_knn3 = pd.DataFrame(KNN(3).fit_transform(X_total[cols]))
data_knn3.columns = cols

Imputing row 1/1309 with 0 missing, elapsed time: 0.210
Imputing row 101/1309 with 0 missing, elapsed time: 0.212
Imputing row 201/1309 with 0 missing, elapsed time: 0.213
Imputing row 301/1309 with 1 missing, elapsed time: 0.214
Imputing row 401/1309 with 0 missing, elapsed time: 0.215
Imputing row 501/1309 with 0 missing, elapsed time: 0.216
Imputing row 601/1309 with 0 missing, elapsed time: 0.217
Imputing row 701/1309 with 0 missing, elapsed time: 0.218
Imputing row 801/1309 with 0 missing, elapsed time: 0.219
Imputing row 901/1309 with 0 missing, elapsed time: 0.220
Imputing row 1001/1309 with 0 missing, elapsed time: 0.221
Imputing row 1101/1309 with 0 missing, elapsed time: 0.222
Imputing row 1201/1309 with 0 missing, elapsed time: 0.223
Imputing row 1301/1309 with 0 missing, elapsed time: 0.224


In [32]:
data_knn3.isna().any()

Age     False
Fare    False
dtype: bool

In [33]:
data_knn3.head(1)

Unnamed: 0,Age,Fare
0,22.0,7.25


** Get dummies of `Title`, `Pclass`, `Embarked` features**

Here we finish preparing our database for making predictions. Firsly, we start with processing categorical features such as `Title`, `Pclass`, `Embarked`. This can be done with `Pandas` attribute: `get_dummies`.

In [34]:
title = pd.get_dummies(X_total['Title'], prefix ='title')
pclass = pd.get_dummies(X_total['Pclass'], prefix = 'pclass')
embarked = pd.get_dummies(X_total['Embarked'], prefix='embarked')
other_cols = ['Sex','SibSp','Parch']

Now we need to concatenate categorical features with numerical ones. We store the result in `df_knn3` variable. 

In [35]:
df_knn3 = pd.concat([data_knn3, title, pclass, \
                     embarked, X_total[other_cols]], axis=1, sort=None)

Let's divide our dataframes into `train` and `test` parts by using `idx` variable containing index value before which train data should come.

In [36]:
train_df_knn3 = df_knn3[:idx] 
test_df_knn3 = df_knn3[idx:]

In [37]:
df_knn3.head(2)

Unnamed: 0,Age,Fare,title_Mr,title_Mrs,title_Officer,title_Royalty,pclass_1,pclass_2,pclass_3,embarked_C,embarked_None,embarked_Q,embarked_S,Sex,SibSp,Parch
0,22.0,7.25,1,0,0,0,0,0,1,0,0,0,1,1,1,0
1,38.0,71.2833,0,1,0,0,1,0,0,1,0,0,0,0,1,0


In [38]:
df_knn3[['Age','Fare']].head(2)

Unnamed: 0,Age,Fare
0,22.0,7.25
1,38.0,71.2833


We see that `Age` and `Fare` have a broad spectrum of values. This may negatively affect our predictions. The usual technique for getting rid of such a program is to scale the values for each feature so that their mean is 0 and standard deviation is 1.

In [39]:
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df_knn3[['Age','Fare']].values)
test_scaled = scaler.transform(test_df_knn3[['Age','Fare']].values)

In [40]:
train_df_knn3.drop(columns=['Age','Fare'], inplace=True)
test_df_knn3.drop(columns=['Age','Fare'], inplace=True)

In [41]:
X_train3 = np.concatenate((train_df_knn3.values, train_scaled), axis=1)
X_test3 = np.concatenate((test_df_knn3.values, test_scaled), axis=1)

`X_train3` and `X_test3` contain our feature matrices of train and test data, respectively.

## Use XGBoost

Let's use XGBoost algorithm, without setting parameters, to make a simple kaggle submission.

In [42]:
from sklearn.ensemble import GradientBoostingClassifier

In [43]:
xgb = GradientBoostingClassifier()
xgb.fit(X_train3, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [44]:
y_pred3 = xgb.predict(X_test3)

In [45]:
def write_to_submission_file(predicted_labels, out_file, train_num=891,
                    target='Survived', index_label="PassengerId"):
    # turn predictions into data frame and save as csv file
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(train_num + 1,
                                                  train_num + 1 +
                                                  predicted_labels.shape[0]),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [46]:
write_to_submission_file(y_pred3, out_file='submission.csv')

With such a submission we obtain 0.77033 score. It places us to the middle of the raiting, which is not bad accounting for that our goal was to fill in the existing gaps, not to obtain a high score.

In order to obtain higher score, one should create new features by combining existing ones, `family size` is one of possible options: `X_total['FamilySize'] = X_total.Parch + X_total.SibSp + 1`, and also by setting up the parameters of predicting algorithm.