In [7]:
# Let's load in the data we want
import pandas as pd

fr_train = pd.read_csv('../input/train.csv')
print(fr_train.shape)
fr_train.head()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


So right off the bat we can see quite a few missing values in the 'Cabin' variable. It's always good to find and clean up missing or clearly incorrect values when you're working with a dataset, so let's do that first.

In [2]:
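def naSummary(df):
    # Print, for every column, how many values are missing and what share of rows that is
    nrow, ncol = df.shape
    na_count = df.isnull().sum()
    # Note: this is a fraction between 0 and 1, even though the column is labelled 'NA %'
    na_pc = na_count.divide(nrow)
    print(pd.DataFrame({'NA Count': na_count, 'NA %': na_pc}))

naSummary(fr_train)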

                 NA %  NA Count
PassengerId  0.000000         0
Survived     0.000000         0
Pclass       0.000000         0
Name         0.000000         0
Sex          0.000000         0
Age          0.198653       177
SibSp        0.000000         0
Parch        0.000000         0
Ticket       0.000000         0
Fare         0.000000         0
Cabin        0.771044       687
Embarked     0.002245         2


Looks like we're missing data in three columns: Age, Cabin and Embarked. Fortunately we're only missing about 20% of the Age data (177 values) and practically none in Embarked (2 values). On the other hand, we're missing a whopping 687 values for Cabin, or, as we can see from the 'NA %' column (a fraction of all rows), 77% of our entire dataset!

Since we're only missing a measly two values for 'Embarked', let's try to rectify those. First let's find the relevant passengers.

In [3]:
# Show the rows with the missing 'Embarked' values
fr_train[fr_train['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


Looks like they share a ticket number. My original suspicion was that this was erroneous data, but it turns out the Titanic offered group tickets. Typically a shared ticket points to a family, as we can see in the simple example below.

In [4]:
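# Rows whose ticket number has already appeared earlier in the data
# (not displayed: a notebook cell only shows its last expression)
fr_train[fr_train.duplicated(subset='Ticket')]
# One such shared ticket, as an example
fr_train[fr_train['Ticket'] == '349909']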

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
24,25,0,3,"Palsson, Miss. Torborg Danira",female,8.0,3,1,349909,21.075,,S
374,375,0,3,"Palsson, Miss. Stina Viola",female,3.0,3,1,349909,21.075,,S
567,568,0,3,"Palsson, Mrs. Nils (Alma Cornelia Berglund)",female,29.0,0,4,349909,21.075,,S


So if we're lucky there might be other people on the same ticket with the Embarked value already set, and seeing as they shared a ticket number I think it would be safe to assume they embarked from the same port. I don't really expect this to be the case, but it's a simple check, so why not!

In [5]:
fr_train[fr_train['Ticket'] == '113572']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


Turns out we're not so lucky.
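The ticket-mate check can also be done for every missing row at once instead of one ticket at a time. Here is a rough sketch on made-up data; the `toy` frame and the helper `ticket_mate_ports` are hypothetical names introduced for illustration, not part of the original notebook:

```python
import pandas as pd

# Made-up data standing in for fr_train; None marks a missing port
toy = pd.DataFrame({
    'Ticket':   ['A1', 'A1', 'B2', 'B2', 'C3'],
    'Embarked': ['S',  None, None, None, 'C'],
})

def ticket_mate_ports(df):
    """For each row missing 'Embarked', list the ports known for others on the same ticket."""
    # Ports recorded per ticket, ignoring rows where the port is missing
    known = df.dropna(subset=['Embarked']).groupby('Ticket')['Embarked'].unique()
    missing = df[df['Embarked'].isnull()]
    # An empty list means nobody on that ticket has a recorded port
    return {idx: list(known.get(ticket, [])) for idx, ticket in missing['Ticket'].items()}

ports = ticket_mate_ports(toy)
print(ports)
```

In the toy data, row 1 can borrow a port from its ticket-mate, while rows 2 and 3 share a ticket on which nobody has a port, which mirrors the situation we just ran into.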

So we still need to assign them an Embarked value. Now you could do this in a more statistical manner, formally called 'item imputation' (filling in a single missing field of a record), but why overcomplicate things? They both survived, and there's a plethora of information on Titanic survivors available. A quick Google search of their names yields the following information, courtesy of Encyclopedia Titanica.

> Mrs Stone boarded the Titanic in Southampton on 10 April 1912 and was travelling in first class with her maid Amelie Icard.

Easy!

In [6]:
# Set the port for the two passengers found above (row labels 61 and 829)
fr_train.loc[[61, 829], 'Embarked'] = 'S'

# Verify by using our naSummary
naSummary(fr_train)

                 NA %  NA Count
PassengerId  0.000000         0
Survived     0.000000         0
Pclass       0.000000         0
Name         0.000000         0
Sex          0.000000         0
Age          0.198653       177
SibSp        0.000000         0
Parch        0.000000         0
Ticket       0.000000         0
Fare         0.000000         0
Cabin        0.771044       687
Embarked     0.000000         0


So there go our missing Embarked values. In the interest of common sense, we should realise that with only 2 missing values out of 891 we could have done almost anything with them and statistically it wouldn't have made a difference. Since Embarked is categorical, simple mode imputation (replacing the missing values with the most common value in that column) would have been more than suitable, and it actually would have given us the same result in this case.

Regardless, let's move on to filling the missing Age values.
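For a categorical column such as Embarked, the simplest "fill with a typical value" approach is to use the most frequent category. The following is a minimal sketch on made-up data (the `toy` frame is hypothetical and not the Titanic file):

```python
import pandas as pd

# Made-up ports for illustration only; None marks a missing value
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', None, 'Q', 'S', None]})

# mode() ignores missing values and returns the most frequent value(s);
# take the first one in case of ties
most_common = toy['Embarked'].mode()[0]

# Replace only the missing entries with that value
filled = toy['Embarked'].fillna(most_common)

print(most_common)
print(filled.isnull().sum())
```

With only a handful of gaps this is usually harmless, but for a column with many missing values it would inflate the most common category.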

* Write about my script magics here*

In [12]:
fr_train_plus = pd.read_csv('../input/train_plus.csv')
print(fr_train_plus.shape)
fr_train_plus.head()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [13]:
naSummary(fr_train_plus)

                 NA %  NA Count
PassengerId  0.000000         0
Survived     0.000000         0
Pclass       0.000000         0
Name         0.000000         0
Sex          0.000000         0
Age          0.000000         0
SibSp        0.000000         0
Parch        0.000000         0
Ticket       0.000000         0
Fare         0.000000         0
Cabin        0.771044       687
Embarked     0.002245         2


Excellent! No more missing Age values. Let's quickly double-check that there's no bug in our scripts that overwrote any of the original Age data.

In [22]:
checkv = fr_train['Age'] != fr_train_plus['Age']
ogv = fr_train['Age'].isnull()

(checkv == ogv).all()

True

Awesome! The "checkv" vector is a series of bools which are True wherever the original data and my new data differ. Because a missing value never compares equal to anything, that includes every row where the original Age was missing. The "ogv" vector is a series of bools which are True wherever there's a missing value in the original data. This code confirms that both vectors are identical, meaning my script only replaced originally missing data and all the original data is intact.

Note, though, that loading train_plus.csv has brought back our two missing Embarked values (see the summary above), so we'll have to fix those again!
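To see why this comparison works, and what a failure would look like, here is a small self-contained sketch with made-up ages (the series below are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

original = pd.Series([22.0, np.nan, 26.0, np.nan])
imputed = pd.Series([22.0, 30.0, 26.0, 28.5])   # only the gaps were filled

# NaN != anything evaluates to True, so missing originals count as "changed"
changed = original != imputed
was_missing = original.isnull()
ok = bool((changed == was_missing).all())

# Simulate a buggy script that also overwrote a real value
tampered = imputed.copy()
tampered[0] = 99.0
bad = bool(((original != tampered) == was_missing).all())

print(ok, bad)
```

One limitation worth keeping in mind: this check would also pass if some gaps were left unfilled, since NaN != NaN is True as well. That case is covered separately by the NA summary.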

In [24]:
# Re-apply the Embarked fix to the newly loaded frame
fr_train_plus.loc[[61, 829], 'Embarked'] = 'S'
naSummary(fr_train_plus)

                 NA %  NA Count
PassengerId  0.000000         0
Survived     0.000000         0
Pclass       0.000000         0
Name         0.000000         0
Sex          0.000000         0
Age          0.000000         0
SibSp        0.000000         0
Parch        0.000000         0
Ticket       0.000000         0
Fare         0.000000         0
Cabin        0.771044       687
Embarked     0.000000         0
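
An alternative that avoids redoing the Embarked fix would be to copy only the filled-in Age values into the frame we had already cleaned. This is a sketch on made-up frames (`cleaned` and `with_ages` are hypothetical stand-ins), and it assumes both frames have the same rows under the same index, which I haven't verified for train_plus.csv:

```python
import numpy as np
import pandas as pd

# Stand-in for the frame where Embarked is already fixed but Age has gaps
cleaned = pd.DataFrame({'Age': [22.0, np.nan, 26.0], 'Embarked': ['S', 'S', 'C']})
# Stand-in for the frame with filled ages but the Embarked gap back again
with_ages = pd.DataFrame({'Age': [22.0, 31.0, 26.0], 'Embarked': ['S', None, 'C']})

merged = cleaned.copy()
# fillna with a Series aligns on the index and only touches missing entries,
# so existing ages and the Embarked column are left alone
merged['Age'] = merged['Age'].fillna(with_ages['Age'])

print(merged)
```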
