# 'Titanic' challenge from Kaggle using logistic regression

For details, see my other notebook (using linear regression).

### Steps
* clean up data: transform relevant data to numerical, replace missing values by the median of the training set (numerical) or the most frequent value (binary, string)
* maybe do some plotting of the data to get a feel for it
* set up algorithm for logistic regression using sklearn
* check how well the algorithm works, compute accuracy, precision and recall this time
* make predictions for test data set and create a submission file

### Questions:
* Why does it almost not matter which parameters I use?
* How to interpret fitting coefficients?
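
On the second question, a minimal sketch on synthetic data (not the Titanic set): in logistic regression each coefficient is the change in the log-odds of the positive class per unit increase of that feature, so `exp(coefficient)` is the corresponding odds ratio. The data and the names `X`, `y`, `model` and `odds_ratios` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): feature 0 raises the
# odds of the positive class, feature 1 lowers them
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
logits = 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(X, y)

# Each coefficient is the change in log-odds per unit increase of a feature;
# exponentiating turns it into an odds ratio (>1 raises, <1 lowers the odds)
odds_ratios = np.exp(model.coef_[0])
print(model.coef_[0], odds_ratios)
```

On the Titanic data the same reading would apply to `LR.coef_` after fitting, with the caveat that coefficients of features on different scales (e.g. 'Fare' vs. 'Sex') are not directly comparable in size.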


### Import libraries/modules:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%matplotlib notebook

### Import data sets and make a copy for further treatment:

In [2]:
titanic_train_original = pd.read_csv('train.csv')
titanic_test_original = pd.read_csv('test.csv')
titanic_train = titanic_train_original.copy()
titanic_test = titanic_test_original.copy()

### Clean up data:

In [3]:
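# encode string columns as numbers (assign the result back instead of
# calling an inplace method on a column, which newer pandas versions reject)
titanic_train['Sex'] = titanic_train['Sex'].map({'male': 0, 'female': 1})
titanic_test['Sex'] = titanic_test['Sex'].map({'male': 0, 'female': 1})
titanic_train['Embarked'] = titanic_train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
titanic_test['Embarked'] = titanic_test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# replace missing values in numerical columns with the median of the training column
age_median = titanic_train['Age'].median()
fare_median = titanic_train['Fare'].median()
titanic_train['Age'] = titanic_train['Age'].fillna(age_median)
titanic_test['Age'] = titanic_test['Age'].fillna(age_median)
titanic_test['Fare'] = titanic_test['Fare'].fillna(fare_median)

# replace missing values of train['Embarked'] with the most frequent value ('S', encoded as 0)
titanic_train['Embarked'] = titanic_train['Embarked'].fillna(0)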
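
The clean-up strategy can be checked on a toy example. This is only a sketch: the frames `toy_train` and `toy_test` and their values are made up, and stand in for the real CSV files. The point is that the median and the most frequent value are taken from the training frame only and then applied to both frames.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train.csv / test.csv (values are made up)
toy_train = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                          'Age': [20.0, np.nan, 40.0],
                          'Embarked': ['S', None, 'S']})
toy_test = pd.DataFrame({'Sex': ['female', 'male'],
                         'Age': [np.nan, 50.0],
                         'Embarked': ['C', 'Q']})

# Statistics are taken from the training frame only
age_median = toy_train['Age'].median()            # median of 20 and 40
embarked_mode = toy_train['Embarked'].mode()[0]   # most frequent port

for frame in (toy_train, toy_test):
    frame['Sex'] = frame['Sex'].map({'male': 0, 'female': 1})
    frame['Age'] = frame['Age'].fillna(age_median)
    frame['Embarked'] = (frame['Embarked'].fillna(embarked_mode)
                         .map({'S': 0, 'C': 1, 'Q': 2}))

print(toy_train)
print(toy_test)
```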

### Logistic regression using sklearn:

Documentation for logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Documentation for cross-validation: http://scikit-learn.org/stable/modules/cross_validation.html

Documentation for 'cross_val_score': https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

Like in my other notebook on the 'Titanic' challenge, I will use 'sklearn' to do the regression and cross validation. Instead of manually indexing the data using kFolds, I will use the 'cross_val_score' helper of 'sklearn' this time. With it there is not much left to do, just:
* initiate the logistic regression algorithm
* pass the algorithm ('estimator'), input and target data to 'cross_val_score', and state how many folds should be tested
* automatically obtain a score, which for a classifier is the accuracy by default. Other metrics are available via the 'scoring' parameter (scoring='accuracy', ='precision' or ='recall'); here I simply call 'cross_val_score' once per metric. All of them compare the predicted class labels with the true labels: the 'predict' method of logistic regression returns 0 or 1 (the class with the higher predicted probability), so no confidence interval is involved.
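
As for computing several metrics in one go, a minimal sketch using `cross_validate` from `sklearn.model_selection`, which accepts a list of scorer names. The data here is synthetic (generated with `make_classification`) and only stands in for the Titanic predictors and target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the Titanic predictors and target (an assumption)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# One call, three metrics; results come back as a dict of per-fold arrays
scores = cross_validate(LogisticRegression(), X, y, cv=3,
                        scoring=['accuracy', 'precision', 'recall'])

for metric in ('accuracy', 'precision', 'recall'):
    print(metric, scores['test_' + metric].mean())
```

The returned dict also contains 'fit_time' and 'score_time' entries, so the metric values are looked up under the 'test_' prefix.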

In [42]:
# Import the logistic regression class
from sklearn.linear_model import LogisticRegression
# Sklearn also has a helper that makes it easy to do cross validation
# (the old 'sklearn.cross_validation' module was removed in version 0.20)
from sklearn.model_selection import cross_val_score

# define which columns of 'titanic_train' should be used for fit
predictors = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']

# initiate logistic regression algorithm
LR = LogisticRegression()

# Compute accuracy, precision and recall for 3 cross validation folds
accuracy = cross_val_score(LR,titanic_train[predictors],titanic_train['Survived'],cv=3,scoring='accuracy')
precision = cross_val_score(LR,titanic_train[predictors],titanic_train['Survived'],cv=3,scoring='precision')
recall = cross_val_score(LR,titanic_train[predictors],titanic_train['Survived'],cv=3,scoring='recall')

# average each metric over the folds
accuracy = accuracy.mean()
precision = precision.mean()
recall = recall.mean()

print(accuracy,precision,recall)

0.787878787879 0.750337544903 0.675438596491


'Accuracy' measures the fraction of correct predictions, 'precision' the fraction of our positive predictions that were actually positive, and 'recall' the fraction of actual positives the model identified. All three are computed from the predicted class labels (0 or 1), not from a continuous outcome. The numbers do not look too bad, I would say.
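
To make the three definitions concrete, here is a small sketch that computes them by hand from the counts of true/false positives and negatives. The label arrays `y_true` and `y_pred` are made up for illustration and have nothing to do with the cross-validation result above.

```python
import numpy as np

# Made-up labels: 1 = survived, 0 = did not survive
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 0, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)  # correct predictions / all predictions
precision = tp / (tp + fp)          # correct positives / predicted positives
recall = tp / (tp + fn)             # correct positives / actual positives

print(accuracy, precision, recall)
```

Here 2 of the 3 positive predictions are right (precision 2/3), but only 2 of the 4 actual positives are found (recall 0.5).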

### Apply model to test data set

Now use the whole training set to obtain a logistic fit. Then apply it to predict the survival of the 'test' data set.

In [46]:
LR.fit(titanic_train[predictors],titanic_train['Survived'])
predictions = LR.predict(titanic_test[predictors])

# 'predict' already returns class labels (0 or 1), so no thresholding is needed here.
# The underlying probabilities are available via 'LR.predict_proba' if required.
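
The reason there are no intermediate values can be illustrated with a short sketch on synthetic data (generated with `make_classification`, not the Titanic set): `predict` returns class labels, while `predict_proba` returns the probabilities from which those labels are derived.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

labels = model.predict(X)               # class labels, only 0 and 1
proba = model.predict_proba(X)[:, 1]    # probability of class 1

# Thresholding the probabilities at 0.5 reproduces the labels
manual = (proba > 0.5).astype(int)
print(np.unique(labels), np.array_equal(labels, manual))
```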

### Generate submission file

Should be '.csv' and only contain passenger ID and 'Survived'.

In [47]:
# combine ID and prediction into one labelled data frame
submission = pd.DataFrame({'PassengerId': titanic_test['PassengerId'],
                           'Survived': predictions})

# export without the index, so the file only contains the two required columns
submission.to_csv('Titanic_submission.csv', index=False)