# Analysing Titanic data
This notebook uses the Titanic passenger data available on Kaggle.
We'll read the data and try to predict the survival of the Titanic's passengers from it.


First, we want a general look at the data. Reading the data into a pandas DataFrame and evaluating the variable name in a cell is enough for a first look!


In [10]:
import pandas as pd
titanic = pd.read_csv('train.csv')
titanic_test = pd.read_csv('test.csv')

# Taking a look at the train data


In [33]:
# To "take a look" at a DataFrame in a notebook, just evaluate the variable name in a cell
# titanic
titanic.head(5)  # let's see the first five rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22,1,0,A/5 21171,7.25,,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38,1,0,PC 17599,71.2833,C85,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26,0,0,STON/O2. 3101282,7.925,,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35,1,0,113803,53.1,C123,0
4,5,0,3,"Allen, Mr. William Henry",0,35,0,0,373450,8.05,,0


In [38]:
titanic.tail(15)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
876,877,0,3,"Gustafsson, Mr. Alfred Ossian",0,20,0,0,7534,9.8458,,0
877,878,0,3,"Petroff, Mr. Nedelio",0,19,0,0,349212,7.8958,,0
878,879,0,3,"Laleff, Mr. Kristo",0,28,0,0,349217,7.8958,,0
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",1,56,0,1,11767,83.1583,C50,1
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",1,25,0,1,230433,26.0,,0
881,882,0,3,"Markun, Mr. Johann",0,33,0,0,349257,7.8958,,0
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",1,22,0,0,7552,10.5167,,0
883,884,0,2,"Banfield, Mr. Frederick James",0,28,0,0,C.A./SOTON 34068,10.5,,0
884,885,0,3,"Sutehall, Mr. Henry Jr",0,25,0,0,SOTON/OQ 392076,7.05,,0
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",1,39,0,5,382652,29.125,,2


describe() summarises only the numerical columns. This makes it clear which columns are not numerical, so we can identify the ones to convert.
We can also check whether all the values are defined by comparing each count with the number of rows above.

In [12]:
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Only the numerical data is shown in the above table.  The table has more columns than that!
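
To see which columns describe() leaves out and where values are missing, one option is to inspect the dtypes and count the nulls per column. The sketch below uses a small hand-made frame; the name `toy` and its values are illustrative assumptions, not rows loaded from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Columns that describe() summarises by default (numeric dtypes only)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Columns that describe() leaves out by default
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
# Number of missing values per column
missing = toy.isnull().sum()

print(numeric_cols, other_cols)
print(missing)
```

On the real data, the same calls would be made on `titanic` instead of `toy`.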

In [13]:
titanic_test.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

To make more columns usable for the analysis, we'll convert e.g. 'Embarked' to a numerical format below. We also see immediately that we have 891 passenger records, but only 714 values for age. We'll need to fix that as well.

# ... and test data

In [31]:
titanic_test.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,2
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,0
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,2
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,0


We notice that the test data does not contain the 'Survived' column! Hence, we cannot use it to evaluate the model during development.

# Asking some questions about the data

In [14]:
# How many survived? We only check the training data, since we do not have any "Survived" labels for the test data
print(titanic['Survived'].value_counts())

0    549
1    342
Name: Survived, dtype: int64


In [15]:
#how many were in first class?
len(titanic.loc[titanic['Pclass']== 1])

216

# Filling not available values
We have seen that several data values are missing. What can we do with the missing values?
We can:
* ignore the rows/columns with the missing data, but this can shrink our data set considerably!
* fill them with some default values
* adapt the algorithms that we are using so that they are actually capable of handling missing values


In the first stage, let's fill the missing values with some defaults.
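
As a rough illustration of the first two options, the sketch below compares dropping incomplete rows with filling in defaults. The `toy` frame and its values are assumptions made up for this example, not the Titanic data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age and one missing port
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill with defaults (median for numbers, most frequent value for categories)
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(len(toy), len(dropped), len(filled))
```

Dropping rows loses half of this tiny frame, while filling keeps every row at the cost of introducing guessed values.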

In [16]:
print(titanic["Embarked"].unique())  # let's check which values "Embarked" has
titanic["Embarked"].value_counts()  # value_counts() works like a quick histogram

['S' 'C' 'Q' nan]


S    644
C    168
Q     77
Name: Embarked, dtype: int64

unique() shows that we have NaN values in 'Embarked'. We need to fill them, and we choose to do so with the majority class: S

In [17]:
titanic['Embarked'] = titanic['Embarked'].fillna('S') #let's fill the not available values with the most frequent value

In [18]:
titanic_test['Embarked'].unique() #in titanic test, all the embarked fields are filled

array(['Q', 'S', 'C'], dtype=object)

In [19]:
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic_test['Age'] = titanic_test['Age'].fillna(titanic['Age'].median())

# Converting categorical values to numerical values

In [20]:
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2

In [21]:
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0  # .loc is indexed as .loc[row_selector, column]
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

In [22]:
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1

titanic_test['Embarked'] = titanic_test['Embarked'].fillna('S')  # fillna() returns a new Series, so assign it back
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

titanic_test['Age'] = titanic_test['Age'].fillna(titanic['Age'].median())
titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())

In [36]:
# Check that all the values are defined. "Embarked" is still not listed, most likely because
# the .loc assignments above left the column with object dtype, which describe() skips by default.
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.361582,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,13.019697,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,22.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,35.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
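
One plausible reason why 'Embarked' does not show up in describe() is that the column still has object dtype even though it now holds integers. The sketch below reproduces that situation on a hypothetical toy frame (the object dtype is set explicitly here as an assumption) and shows two possible ways around it: casting with astype(int), or encoding the original strings with map().

```python
import pandas as pd

# Hypothetical toy frame: 'Embarked' holds integer codes but is stored as object dtype
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Embarked": pd.Series([0, 1, 0], dtype=object),
})

# describe() skips the object column when numeric columns are present
before = toy.describe().columns.tolist()

# Casting to an integer dtype makes the column numeric
toy["Embarked"] = toy["Embarked"].astype(int)
after = toy.describe().columns.tolist()

# Alternative for the original string codes: map() yields an integer column directly
ports = pd.Series(["S", "C", "Q", "S"])
codes = ports.map({"S": 0, "C": 1, "Q": 2})

print(before, after, codes.dtype)
```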


In [174]:
titanic.shape[0]  # the size of the first dimension (rows); len(titanic) gives the same result

891

In [175]:
titanic.shape[1]  # the size of the second dimension, i.e. the number of columns

12

In [176]:
len(titanic.columns)

12

# Different models - quick look

In [None]:
import sklearn.model_selection as cv  # sklearn.cross_validation was removed in scikit-learn 0.20

# We choose the columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Our actual test set does not contain target values, since it is a competition test set!
# For that reason we will set aside some of the training data as "our" (not Kaggle's!) test data
X_train, X_test, y_train, y_test = cv.train_test_split(titanic[predictors], titanic['Survived'], test_size=0.2, random_state=0)


# Linear regression


In [137]:
import sklearn.linear_model as lm
import sklearn.model_selection as cv  # replaces sklearn.cross_validation, removed in scikit-learn 0.20


linreg_model = lm.LinearRegression()  # using linear regression
linreg_model = linreg_model.fit(X_train, y_train)
pred = linreg_model.predict(X_test)  # continuous values, so we threshold them at 0.5
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
accuracy = sum(y_test == pred)/y_test.shape[0]
print('The accuracy = ', accuracy)

The accuracy =  0.793296089385
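
The manual accuracy computation used here can be cross-checked against scikit-learn's accuracy_score. A minimal sketch, assuming made-up regression outputs in `raw` and made-up labels in `y_true` (both hypothetical, not taken from the Titanic split):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1])           # hypothetical true labels
raw = np.array([0.2, 0.7, 0.4, 0.1, 0.9])    # hypothetical regression outputs

labels = (raw > 0.5).astype(int)             # threshold at 0.5, as in the cell above
manual = (labels == y_true).mean()           # fraction of matching labels
library = accuracy_score(y_true, labels)     # the same quantity via scikit-learn

print(manual, library)
```

Both values agree, so either form could be used in the cells of this notebook.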


# Logistic regression

In [136]:
import sklearn.linear_model as lm

logreg_model = lm.LogisticRegression(random_state=1)
logreg_model = logreg_model.fit(X_train, y_train)
pred = logreg_model.predict(X_test)  # predict() returns class labels (0 or 1), so no thresholding is needed; predict_proba() gives probabilities
accuracy = sum(y_test == pred)/y_test.shape[0]
print('The accuracy = ', accuracy)


The accuracy =  0.798882681564
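
For scikit-learn classifiers, predict() returns class labels, while predict_proba() returns class probabilities. The sketch below illustrates the difference on a hypothetical one-feature data set (the arrays `X` and `y` are assumptions made up for this example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data set with the classes split around x = 2.5
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

labels = model.predict(X)               # hard class labels
proba = model.predict_proba(X)[:, 1]    # probability of class 1

# For a binary problem, thresholding the probabilities at 0.5 reproduces predict()
from_proba = (proba > 0.5).astype(int)

print(labels)
print(proba.round(2))
```

Working with the probabilities is mainly useful when a threshold other than 0.5 is wanted.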


# Decision tree

In [177]:
from sklearn import tree
dTree = tree.DecisionTreeClassifier(max_depth=6, min_samples_leaf=5)
dTree = dTree.fit(X_train, y_train)
pred = dTree.predict(X_test) # predictions are 0 and 1s right away

predTrain = dTree.predict(X_train)
accuracy = sum(y_train == predTrain)/y_train.shape[0]
print ('accuracy on train=', accuracy)

accuracy = sum(y_test == pred)/y_test.shape[0]
print('accuracy on test= ', accuracy)
#help (tree.DecisionTreeClassifier)


accuracy on train= 0.858146067416
accuracy on test=  0.837988826816


# Ada boost

In [174]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=30)
clf = clf.fit(X_train, y_train)
pred = clf.predict(X_test) 


accuracy = sum(y_test == pred)/y_test.shape[0]
print('The accuracy = ', accuracy)

The accuracy =  0.832402234637


# Random forest

In [84]:
from sklearn.ensemble import RandomForestClassifier
# Initialize our algorithm with explicitly chosen parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
rForest = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
rForest.fit(X_train, y_train)
pred = rForest.predict(X_test)  # class labels (0 or 1), so no thresholding is needed
accuracy = sum(y_test == pred)/y_test.shape[0]
print('The accuracy = ', accuracy)


The accuracy =  0.849162011173


# Using cross-validation

## Using cross_val_score wrapper
The cross_val_score wrapper makes cross-validation easy. That's the way to go!

In [88]:
import sklearn.model_selection as cv  # sklearn.cross_validation was removed in scikit-learn 0.20

In [90]:
# Compute the score for each of the cross-validation folds.
scores = cv.cross_val_score(linreg_model, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
# Note: for a regressor such as LinearRegression, cross_val_score uses the estimator's default score,
# which is R^2 and not accuracy. That is why we get 0.37 here, but 0.78 when we manually do KFold.

0.374682056691
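
The low value above is most likely not an accuracy at all: for a regressor, cross_val_score reports R^2 by default. To get a thresholded accuracy, a custom scorer can be passed instead. A minimal sketch on synthetic data, where the data and the helper `thresholded_accuracy` are hypothetical and introduced only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data: a binary target driven by the first feature
rng = np.random.RandomState(0)
X = rng.normal(size=(90, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=90) > 0).astype(int)

def thresholded_accuracy(estimator, X, y):
    """Scorer: cut the regression output at 0.5, then measure accuracy."""
    labels = (estimator.predict(X) > 0.5).astype(int)
    return (labels == y).mean()

model = LinearRegression()
default_scores = cross_val_score(model, X, y, cv=3)                  # R^2 per fold
r2_scores = cross_val_score(model, X, y, cv=3, scoring="r2")         # the same, requested explicitly
acc_scores = cross_val_score(model, X, y, cv=3, scoring=thresholded_accuracy)

print(default_scores.mean(), acc_scores.mean())
```

The default and the explicit "r2" scores coincide, while the custom scorer gives values that are comparable to the accuracies reported for the classifiers.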


In [91]:
scores = cv.cross_val_score(logreg_model, titanic[predictors], titanic['Survived'], cv=3)
print(scores.mean())

0.787878787879


In [115]:
scores = cv.cross_val_score(rForest, titanic[predictors], titanic['Survived'], cv=3)
print(scores.mean())

0.820426487093


## Using KFold
We can also generate the k-fold splits ourselves and then fit and predict manually.


In [124]:
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
import pandas as pd

# Generate cross validation folds for the titanic dataset.  split() yields the row indices corresponding to train and test.
# Without shuffling, the folds are consecutive blocks of rows, so we get the same splits every time we run this.
kf = KFold(n_splits=3)

predictions = []
scores = []
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:]) #take only the "train" rows, and all the columns
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    linreg_model.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = linreg_model.predict(titanic[predictors].iloc[test,:])
    test_predictions[test_predictions >= 0.5] = 1
    test_predictions[test_predictions < 0.5] = 0
    test_target = titanic['Survived'].iloc[test]
    correct = sum(test_predictions == test_target)
    score = correct/float(test_target.shape[0])
    predictions.append(test_predictions)
    scores.append(score)
pd.DataFrame(scores).mean()



0    0.783389
dtype: float64

# Improving the score
Things to do in order to improve the score:
- use a different algorithm
- tune the parameters of the algorithm (e.g., for random forests, the number of trees etc.)
- use different features
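
Parameter tuning, for instance, could be automated with scikit-learn's GridSearchCV, which cross-validates every combination in a parameter grid. The sketch below runs on synthetic data; the grid values and the data are assumptions chosen for illustration, not tuned settings for the Titanic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for titanic[predictors] and titanic['Survived']
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Candidate values to try (illustrative only)
param_grid = {
    "n_estimators": [10, 30],
    "min_samples_leaf": [1, 5],
}

# Evaluate every combination with 3-fold cross-validation, scored by accuracy
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```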

# Adding new features

In [287]:
titanic['familySize'] = titanic['Parch'] + titanic['SibSp']
titanic['nameLen'] = titanic['Name'].str.len()

In [288]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'familySize',
       'nameLen'],
      dtype='object')