# Tutorial for the 'Titanic' challenge of Kaggle

Following the instructions given at: https://www.dataquest.io/mission/74/getting-started-with-kaggle

### Import stuff:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%matplotlib notebook

### Import data set and get a feel for it:

In [2]:
titanic_df = pd.read_csv('train.csv')
print(titanic_df.head())
print(titanic_df.describe())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                             Heikkinen, Miss. Laina  female   26      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1   
4                           Allen, Mr. William Henry    male   35      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

'count' gives the number of non-missing values (i.e. not null, NA or NaN). The 'Age' column apparently has some missing values. Since we want to delete neither the corresponding rows (more data makes for better training) nor the whole column ('Age' is possibly important for the analysis), the data needs to be cleaned up.

A common way of doing this is to write the median of the column into the missing entries. Use the '.fillna()' method to replace missing values with the median, the latter being computed with '.median()'.

In [3]:
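# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may act on a temporary copy in recent pandas versions
titanic_df["Age"] = titanic_df["Age"].fillna(titanic_df["Age"].median())
print(titanic_df.describe())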

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  891.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.361582    0.523008   
std     257.353842    0.486592    0.836071   13.019697    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   22.000000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   35.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
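
To see the effect of median imputation in isolation, here is a minimal sketch on a made-up column. The series `ages` and its values are assumptions for illustration only, not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy 'Age' column with two missing entries (hypothetical values)
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])

# count() only counts the non-missing entries
n_before = ages.count()

# median() skips NaN by default, so it is computed from 22, 26 and 35
median_age = ages.median()

# fillna() returns a new series with the gaps replaced by the median
filled = ages.fillna(median_age)
n_after = filled.count()

print(n_before, median_age, n_after)
```

After filling, 'count' equals the full length of the column, which is what the '.describe()' output above shows for 'Age'.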


'.describe()' does not evaluate all of the columns, since some of them are not numeric. To use the non-numeric columns in the algorithms below, they have to be transformed into numeric values. We will do this for the 'Sex' and 'Embarked' columns. 'Cabin' and 'Name' are possibly hard to evaluate without further specialised knowledge, so we will ignore them for the moment.

Replace all 'male' entries with '0' and all 'female' entries with '1'. To select the 'Sex' entries of all rows where the value is 'male', use: 'titanic_df.loc[titanic_df["Sex"] == "male", "Sex"]'

In [4]:
titanic_df.loc[titanic_df["Sex"] == "male", "Sex"] = 0
titanic_df.loc[titanic_df["Sex"] == "female", "Sex"] = 1
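
The '.loc' selection relies on a boolean mask. The following sketch shows the mask on a made-up frame, together with '.map()' as one possible alternative that converts both labels in a single step. The frame `toy` and the column name `Sex_num` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy frame with a 'Sex' column (hypothetical values)
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# The comparison yields one True/False entry per row
is_male = toy["Sex"] == "male"
print(is_male.tolist())

# Alternative to two separate .loc assignments: map both labels at once
toy["Sex_num"] = toy["Sex"].map({"male": 0, "female": 1})
print(toy["Sex_num"].tolist())
```

One possible advantage of '.map()' is that the new column gets an integer dtype directly, whereas assigning numbers into a string column leaves it with dtype 'object'.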

Next, replace the values of the 'Embarked' column with numbers. The column contains the values 'S', 'C' and 'Q' plus a few missing entries. The missing entries are first filled with 'S'; then 'S', 'C' and 'Q' are replaced with 0, 1 and 2, respectively.

In [5]:
print(titanic_df['Embarked'].unique())
# Assign the result back rather than using fillna(inplace=True) on a column selection
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
titanic_df.loc[titanic_df['Embarked'] == 'S', 'Embarked'] = 0
titanic_df.loc[titanic_df['Embarked'] == 'C', 'Embarked'] = 1
titanic_df.loc[titanic_df['Embarked'] == 'Q', 'Embarked'] = 2

['S' 'C' 'Q' nan]


### Linear regression and cross validation

**Linear regression** essentially means - if I understand correctly - that predictions are made with a linear function of the input parameters. For a single predictor $x_i$, the 'Survived' value could be predicted as $y_i = \alpha \cdot x_i + \beta$; with several predictors, each one gets its own coefficient.
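
As a minimal sketch of what fitting such a linear function means, the following uses NumPy's least-squares solver on made-up numbers. The arrays `x` and `y` are assumptions for illustration, not Titanic data, and this is not necessarily how scikit-learn solves the problem internally.

```python
import numpy as np

# Toy data (hypothetical): y follows 2*x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones so that the intercept beta is fitted too
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution for the coefficients [alpha, beta]
(alpha, beta), *_ = np.linalg.lstsq(A, y, rcond=None)
print(alpha, beta)
```

Because the toy data lie exactly on a line, the fit recovers the slope and intercept used to generate them; with real data the fit only minimises the squared error.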

**Cross validation** means that the algorithm is trained on one part of the data set (i.e. a linear fit is performed), and another part is used to test how good the prediction is. With several folds, every part is used as the test set once. This is done to detect 'overfitting', which could arise from accidentally fitting to some particular properties of the training data set.
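
The partitioning idea can be sketched without any machine learning library. The snippet below splits a made-up set of row indices into three consecutive folds and uses each fold as the test set once; `n_rows` and `n_folds` are hypothetical values chosen for illustration.

```python
import numpy as np

n_rows = 9    # hypothetical number of rows
n_folds = 3   # hypothetical number of folds

# Split the row indices into consecutive, roughly equal blocks
indices = np.arange(n_rows)
folds = np.array_split(indices, n_folds)

splits = []
for i, test_idx in enumerate(folds):
    # All folds except the i-th one form the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, test_idx))
    print("train:", train_idx, "test:", test_idx)
```

Every row ends up in exactly one test fold, so each row gets a prediction from a model that never saw it during training.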

Here we use the 'scikit-learn' library to partition the data set for cross validation and to make predictions using linear regression.

In [6]:
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
# (KFold lives in sklearn.model_selection; the old sklearn.cross_validation module was removed)
from sklearn.model_selection import KFold

In [7]:
# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  kf.split() returns the row indices corresponding to train and test.
# Without shuffling, the folds are consecutive blocks of rows, so we get the same splits every time we run this.
kf = KFold(n_splits=3)

predictions = []
for train, test in kf.split(titanic_df):
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = titanic_df[predictors].iloc[train,:]
    # The target we're using to train the algorithm.
    train_target = titanic_df["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic_df[predictors].iloc[test,:])
    predictions.append(test_predictions)

# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

### Evaluating error



We'll first need to define an error metric, so we can figure out how accurate our model is. From the Kaggle competition description, the error metric is the percentage of correct predictions. The metric basically involves counting the values in 'predictions' that are exactly the same as their counterparts in 'titanic_df["Survived"]', and then dividing by the total number of passengers. Before we can do this, the 3 sets of predictions need to be combined into one column (done above).

In [8]:
# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0

# Fraction of predictions that match the true outcome (np.mean avoids integer division)
accuracy = np.mean(np.array(titanic_df['Survived']) == predictions)
print(accuracy)
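
To make the metric concrete, here is a small sketch on made-up numbers. The arrays `raw` and `truth` are hypothetical and only illustrate the thresholding and the comparison.

```python
import numpy as np

raw = np.array([0.9, 0.2, 0.6, 0.4])   # hypothetical regression outputs
truth = np.array([1, 0, 0, 0])         # hypothetical true outcomes

# Threshold at 0.5 to obtain the two possible outcomes 1 and 0
labels = (raw > 0.5).astype(int)

# Fraction of entries where prediction and truth agree
acc = np.mean(labels == truth)
print(acc)
```

Three of the four toy predictions match, so the sketch yields an accuracy of 0.75.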

### Logistic regression

Logistic regression squeezes the prediction values between '0' and '1', so that they can be read as probabilities.
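
The squeezing is done by the logistic (sigmoid) function. The following is a minimal sketch of that function on made-up inputs; the helper name `logistic` and the array `z` are assumptions for illustration and not part of the scikit-learn API.

```python
import numpy as np

def logistic(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear outputs: strongly negative, zero, strongly positive
z = np.array([-5.0, 0.0, 5.0])
p = logistic(z)
print(p)
```

An input of zero maps to exactly 0.5, which is why 0.5 is the natural threshold between the two outcomes.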

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_val_score(alg, titanic_df[predictors], titanic_df["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

### Use algorithm on full data set

First clean up test data set:

In [11]:
titanic_test = pd.read_csv("test.csv")
# Fill missing ages with the median of the training set, as was done for the training data
titanic_test['Age'] = titanic_test['Age'].fillna(titanic_df['Age'].median())
titanic_test.loc[titanic_test['Sex'] == 'male', 'Sex'] = 0
titanic_test.loc[titanic_test['Sex'] == 'female', 'Sex'] = 1
titanic_test['Embarked'] = titanic_test['Embarked'].fillna('S')
titanic_test.loc[titanic_test['Embarked'] == 'S', 'Embarked'] = 0
titanic_test.loc[titanic_test['Embarked'] == 'C', 'Embarked'] = 1
titanic_test.loc[titanic_test['Embarked'] == 'Q', 'Embarked'] = 2
# The test set has missing 'Fare' values; fill them with the test set median
titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())