**Method 2 of one-hot-encoding**

In [None]:
Pclass_lb = preprocessing.LabelBinarizer()
Pclass_one_hot_encoded = Pclass_lb.fit_transform(data.Pclass.values)

dfOneHot_Encoded = pd.DataFrame(Pclass_one_hot_encoded, 
columns = ["Pclass_"+str(int(i+1)) for i in range(Pclass_one_hot_encoded.shape[1])],
index=data.index
) # we now construct a dataframe out of this one-hot-encoded data

data = pd.concat([data, dfOneHot_Encoded], axis=1)
# we now add our one-hot-encoded Pclass features

In [None]:
data[0:5]

Now that we have finished preprocessing our data, we can extract our feature and target variables to be used to train, validate and test our learning algorithm.

In [None]:
y = data['Survived']
X = data.drop(['Survived'], axis=1) # pass axis by keyword; the positional form is no longer accepted by recent pandas

In [None]:
X.drop(labels=['Sex', 'Pclass', 'Embarked', 'Embarked_numeric'], axis=1, inplace=True) # we drop the features that we have one-hot-encoded but not removed yet (we left these in to check the encoding had worked correctly)

In [None]:
X[0:5]

## [Train, Validation, Test] splitting

Now let's split our pre-processed data into a training set, cross-validation set and test set with a 60/20/20 split.

In [None]:
X_temp, X_test, y_temp, y_test = model_selection.train_test_split(X, y, test_size=0.2)

X_train, X_cv, y_train, y_cv = model_selection.train_test_split(X_temp, y_temp, train_size=0.75)

In [None]:
print("Data split is as follows:")
print("-------------------------")
print("train: {} \ncross-validation: {} \ntest: {}".format(X_train.shape[0], X_cv.shape[0], X_test.shape[0]))
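The two-step split above gives 60/20/20 because the second split takes 75% of the 80% that remains after the test set is held back. A minimal sketch of that arithmetic follows; `split_fractions` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
from fractions import Fraction

def split_fractions(test_size, train_size_of_remainder):
    """Return the (train, cross-validation, test) fractions of the full data set."""
    test = Fraction(test_size)
    remainder = 1 - test                                   # what is left after the first split
    train = remainder * Fraction(train_size_of_remainder)  # share of the remainder used for training
    cv = remainder - train                                 # the rest of the remainder
    return train, cv, test

# the same values that are passed to train_test_split above
train_frac, cv_frac, test_frac = split_fractions("0.2", "0.75")
print(train_frac, cv_frac, test_frac)
```

The exact row counts printed by the notebook can differ by one or so from these fractions because the number of rows has to be rounded to an integer.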

## First let's just predict everyone dies, with no learning algorithm, to see how that goes

In [None]:
def predict(X):
    # baseline: predict 0 (did not survive) for every row of the X that is passed in
    return np.zeros(X.shape[0])

In [None]:
prediction = predict(X_test)

In [None]:
100*sum(prediction == y_test)/y_test.shape[0]

This achieves an accuracy of ~57%, which doesn't seem that bad. But if we look at the truth table of our target variable against our prediction, we can see what is really going on: we are just predicting 0 (did not survive) every time.

In [None]:
TruePositives = sum((prediction == y_test) & (y_test == 1))
TrueNegatives = sum((prediction == y_test) & (y_test == 0))
FalsePositives = sum((prediction != y_test) & (prediction == 1))
FalseNegatives = sum((prediction != y_test) & (prediction == 0))

In [None]:
table = tt.Texttable()
table.add_rows([
                ["", "", "Real Value", ""],
                ["", "", "Positive", "Negative"],
                ["Prediction", "Positive", TruePositives, FalsePositives],
                ["", "Negative", FalseNegatives, TrueNegatives],
                ])
print(table.draw() + "\n")

In [None]:
PredictedPositives = TruePositives + FalsePositives
Precision = TruePositives/PredictedPositives

ActualPositives = TruePositives + FalseNegatives
Recall = TruePositives/ActualPositives

print("Precision: {}".format(Precision))

print("Recall: {}".format(Recall))

F1score = 2*Precision*Recall/(Precision+Recall)

print("F1score: {}".format(F1score))

It is much more obvious from these metrics that predicting everyone dies is a bad way to predict survival.
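To make the point concrete, here is a small pure-Python sketch of the same truth-table counts and metrics applied to an "everyone dies" prediction. The labels are made up for illustration, and returning 0.0 when a metric's denominator is zero is a convention assumed here, not something taken from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives/negatives for binary labels (1 = survived)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1; each falls back to 0.0 if its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

y_true = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]  # made-up labels: 4 survivors, 6 deaths
y_pred = [0] * len(y_true)               # baseline: predict that everyone dies

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, precision, recall, f1)
```

The accuracy simply equals the fraction of non-survivors in the data, while recall and F1 are zero because no survivor is ever identified.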

## First let's attempt a logistic regression
#### i.e. Not using any additional higher order features

Now we predict on the Kaggle data and submit to get a score.

In [None]:
prediction_kaggle_data = svm_model.predict(data_kaggle_test)

In [None]:
with open('prediction_submission_SVM_Linear_Kernel.csv', 'w') as file:
    print('PassengerId,Survived', file=file)
    for i, id_ in enumerate(data_kaggle_test.index):
        print('{},{}'.format(id_, prediction_kaggle_data[i]), file=file)


Got ~76% accuracy on Kaggle.

### Learning curve for Linear Kernel SVM with best C (0.03)

In [None]:
m_array = np.round(np.linspace(20, X_train.shape[0], 20)).astype(int)
train_acc_array = np.zeros(len(m_array))
cv_acc_array = np.zeros(len(m_array))

for i, m in enumerate(m_array):
    print(i, end=', ')
    svm_model_iter = svm.SVC(C=bestC, kernel=kernel)
    # we now fit to the training data
    svm_model_iter.fit(X_train.head(m), y_train.head(m)) # training on the first m training data examples
    train_accuracy = svm_model_iter.score(X_train, y_train)
    train_acc_array[i] = train_accuracy
    cv_accuracy = svm_model_iter.score(X_cv, y_cv)
    cv_acc_array[i] = cv_accuracy


In [None]:
fig, ax = plt.subplots()
ax.plot(m_array, 1-train_acc_array, label='train')
ax.plot(m_array, 1-cv_acc_array, label='cross-validation')
ax.legend()
ax.set_xlabel('m (size of training data)')
ax.set_ylabel('error (1-accuracy)')

### We will now try an SVM with a Gaussian Kernel


### We will use the Gaussian radial-basis function (RBF kernel)



In [None]:
C = 1 # start with the penalty parameter equal to 1 as we don't know what value it should take yet: larger C -> lower bias, higher variance; smaller C -> higher bias, lower variance. Since we don't have that much data, a lower-variance model (smaller C) is probably best to avoid overfitting

kernel = 'rbf' # we'll use the radial basis function (Gaussian) kernel to see how this performs

gamma = 'auto' # a second hyperparameter to tune for the Gaussian kernel - it sets the inverse width of the Gaussian function used ('auto' uses 1/n_features)

svm_model = svm.SVC(C=C, kernel=kernel, gamma=gamma)

In [None]:
svm_model.fit(X_train, y_train)

In [None]:
svm_model.score(X_cv, y_cv)

Not very good performance. Since we now have 2 hyperparameters to tune (C and gamma), we will make use of scikit-learn's grid search, which lets you provide a range of values for each parameter and searches over the grid for the best combination of hyperparameters.
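The idea behind a grid search can be sketched in a few lines of plain Python: try every combination of the candidate values and keep the one with the best validation score. Both `grid_search` and `toy_validation_score` are hypothetical names used for illustration; in this notebook the score function would instead fit an `svm.SVC` with the given `C` and `gamma` on the training data and return its accuracy on the cross-validation data.

```python
import math
from itertools import product

def grid_search(score_fn, param_grid):
    """Evaluate score_fn on every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_score(C, gamma):
    # hypothetical stand-in for "fit the model and score it on the validation set";
    # constructed so that it peaks at C=1, gamma=0.1
    return 1.0 - 0.1 * abs(math.log10(C)) - 0.1 * abs(math.log10(gamma) + 1)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
best_params, best_score = grid_search(toy_validation_score, param_grid)
print(best_params, best_score)
```

Spacing the candidate values logarithmically, as in the grid above, is a common choice because both hyperparameters can plausibly vary over several orders of magnitude.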

In [None]:
TruePositives = sum((prediction_cv == y_cv) & (y_cv == 1))
TrueNegatives = sum((prediction_cv == y_cv) & (y_cv == 0))
FalsePositives = sum((prediction_cv != y_cv) & (prediction_cv == 1))
FalseNegatives = sum((prediction_cv != y_cv) & (prediction_cv == 0))

Looking at our truth table, this looks much better than what we got by just predicting that everyone dies.

In [None]:
PredictedPositives = TruePositives + FalsePositives
Precision = TruePositives/PredictedPositives

ActualPositives = TruePositives + FalseNegatives
Recall = TruePositives/ActualPositives

print("Precision: {:.3f}".format(Precision))

print("Recall: {:.3f}".format(Recall))

F1score = 2*Precision*Recall/(Precision+Recall)

print("F1score: {:.3f}".format(F1score))

And we now get a decent precision and recall, although recall is worse, giving us a decent F1score.

### Learning Curve for Logistic Regression

Let's look at the learning curve for this algorithm:

In [None]:
m_array = np.round(np.linspace(20, X_train.shape[0], 100)).astype(int)
train_acc_array = []
cv_acc_array = []

for m in m_array:
    lr_model_iter = linear_model.LogisticRegression(fit_intercept=True)
    # we now fit to the training data
    lr_model_iter.fit(X_train.head(m), y_train.head(m)) # training on the first m training data examples
    train_accuracy = lr_model_iter.score(X_train, y_train)
    train_acc_array.append(train_accuracy)
    cv_accuracy = lr_model_iter.score(X_cv, y_cv)
    cv_acc_array.append(cv_accuracy)

train_acc_array = np.array(train_acc_array)
cv_acc_array = np.array(cv_acc_array)

In [None]:
accuracy = lr_model.score(X_test, y_test)

print("accuracy on testing data: {:0.2f}".format(accuracy*100))

Now that we have a model accuracy of ~80% on our testing data, which the classifier has never seen before, we will apply it to the actual test data to submit to Kaggle.

## Loading Test Data to Predict On for Kaggle Submission

## Loading data

In [None]:
data = pd.read_csv('data/train.csv', index_col='PassengerId')

## Preprocessing

### removing unhelpful columns

First we need to preprocess the data to remove columns that are unique to each passenger, like the name and ticket number, which may not be useful in predicting the target variable (i.e. did the passenger survive).

Here we choose to remove the columns:
- Name
- Ticket
- Cabin

In [None]:
data.index.name = None # let's also remove the index name, which shows up as an extra header row, to make things easier later (recent pandas does not allow `del` on this attribute)

In [None]:
data = data.drop(labels=['Name', 'Ticket', 'Cabin'], axis=1) # dropping name, ticket and Cabin columns

### Dealing with NaN values

Now we need to treat NaN values somehow, as they can break our learning algorithm.

In [None]:
num_dropped_removing_Nans = len(data) - len(data.dropna()) # calculate number of entries we would drop if we dropped all entries containing NaN
percent_dropped_removing_Nans = 100*(len(data) - len(data.dropna()))/len(data)

print(num_dropped_removing_Nans) # number of rows we would lose by dropping every row that contains a NaN
print("Percent dropped by removing NaNs: {:.2f}%".format(percent_dropped_removing_Nans))

Removing entries with NaNs is not a good option as it leads to the loss of almost 20% of the data; we should find another way to deal with these NaN entries.

Let's take a look at some of our NaN entries:

In [None]:
data_temp = data[data.isnull().any(axis=1)] # get any row that has a NaN in it
data_temp[0:5]

Our options are to replace NaNs with:
- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- A mean, median or mode value for the column.
- A value estimated by another predictive model.

I have tested the first 3 methods and found using the mean value to produce the best performance.
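As a minimal sketch of the mean-replacement option, the following pure-Python code computes the mean of the observed values and uses it to fill the gaps. The ages are made up, and `fit_mean` and `fill_missing` are hypothetical helpers that only mimic what a mean imputer does. Keeping the two steps separate makes it possible to reuse a mean learned from the training data on new data.

```python
import math

def fit_mean(values):
    """Mean of the values that are not NaN."""
    observed = [v for v in values if not math.isnan(v)]
    return sum(observed) / len(observed)

def fill_missing(values, fill_value):
    """Copy of values with every NaN replaced by fill_value."""
    return [fill_value if math.isnan(v) else v for v in values]

nan = float("nan")
ages_train = [22.0, nan, 26.0, 35.0, nan, 54.0]  # made-up training ages
ages_new = [nan, 40.0]                           # made-up unseen ages

mean_age = fit_mean(ages_train)                   # learned from the training data only
ages_train_filled = fill_missing(ages_train, mean_age)
ages_new_filled = fill_missing(ages_new, mean_age)  # reuse the training mean
print(mean_age, ages_train_filled, ages_new_filled)
```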

In [None]:
data = data[data.Embarked.notnull()] # remove data where embarked is null as we can't calculate a numerical value for this with the below methods

#### Here we are using scikit-learn's imputer to fill in missing values from the other data; in practice, the method we are using is to calculate the mean of that column and replace NaN values with the column mean

In [None]:
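from sklearn.impute import SimpleImputer # preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it

imputer = SimpleImputer(strategy="mean") # replaces NaNs with the column mean
data_nans_replaced = data.copy()
data_nans_replaced['Age'] = imputer.fit_transform(data_nans_replaced['Age'].values.reshape(-1, 1)).ravel() # a pandas Series has no reshape, so we go via the underlying array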

And let's look at what the NaNs were replaced with:

In [None]:
data_temp = data_nans_replaced[data.isnull().any(axis=1)] # get rows of the replaced data where the rows used to contain NaNs
data_temp[0:5]

Replacing these in this way may cause some issues, but we will go forward with this method for now and see how it affects the performance of our learning algorithm later by comparing it with some of the other methods to replace NaN values.

In [None]:
data = data_nans_replaced # replace data with data where we have replaced NaNs with mean values

Let's check we really got rid of all the NaNs:

In [None]:
index_best = np.argmax(validation_score_array)
bestC = C_array[index_best]
print('C that performed best on validation data was C={}'.format(bestC))
print("train score: {:0.3f}, validation score: {:0.3f}".format(train_score_array[index_best], validation_score_array[index_best]))

In [None]:
svm_model = svm.SVC(C=bestC, kernel=kernel)
svm_model.fit(X_train, y_train)

In [None]:
svm_model.score(X_test, y_test)

We get ~80 percent accuracy on the test data we held back from our training and validation set.

In [None]:
data_kaggle_test = pd.read_csv('data/test.csv', index_col='PassengerId')
data_kaggle_test.index.name = None # let's also remove the index name, which shows up as an extra header row, to make things easier later (recent pandas does not allow `del` on this attribute)

In [None]:
data_kaggle_test[0:5]

### Preprocessing the data as we did for the training data

In [None]:
data_kaggle_test = data_kaggle_test.drop(labels=['Name', 'Ticket', 'Cabin'], axis=1) # dropping name, ticket and Cabin columns

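data_kaggle_test['Age'] = imputer.transform(data_kaggle_test['Age'].values.reshape(-1, 1)).ravel() # reuse the imputer fitted on the training ages rather than refitting it on the test data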

# ----- one-hot encoding sex ----- 

data_kaggle_test['Sex_numeric'] = le_sex.transform(data_kaggle_test.Sex) # transform the data from labels to numeric


# ----- one-hot encoding Embarked ----- 

data_kaggle_test['Embarked_numeric'] = le_Embarked.transform(data_kaggle_test.Embarked) # transform the data from labels to numeric

encoded_column_vector = data_kaggle_test.Embarked_numeric.values.reshape(-1,1) # gets numeric embarked data and reshapes it to a column vector

Embarked_one_hot_encoded = enc_Embarked.transform(encoded_column_vector).toarray() # transforms the data to one-hot-encoded data

dfOneHot_Encoded = pd.DataFrame(Embarked_one_hot_encoded, 
columns = ["Embarked_"+str(le_Embarked.classes_[i]) for i in range(Embarked_one_hot_encoded.shape[1])], # classes_[i] is the label that was encoded as i
index=data_kaggle_test.index
) # we now construct a dataframe out of this one-hot-encoded data

data_kaggle_test = pd.concat([data_kaggle_test, dfOneHot_Encoded], axis=1)
# we now add our one-hot-encoded Embarked features

# ----- one-hot encoding Pclass ----- 

Pclass_lb = preprocessing.LabelBinarizer()
Pclass_one_hot_encoded = Pclass_lb.fit_transform(data_kaggle_test.Pclass.values)

dfOneHot_Encoded = pd.DataFrame(Pclass_one_hot_encoded, 
columns = ["Pclass_"+str(int(i+1)) for i in range(Pclass_one_hot_encoded.shape[1])],
index=data_kaggle_test.index
) # we now construct a dataframe out of this one-hot-encoded data

data_kaggle_test = pd.concat([data_kaggle_test, dfOneHot_Encoded], axis=1)
# we now add our one-hot-encoded Pclass features

data_kaggle_test.drop(labels=['Sex', 'Pclass', 'Embarked', 'Embarked_numeric'], axis=1, inplace=True) # we drop the features that we have one-hot-encoded but not removed yet (we left these in to check the encoding had worked correctly)

In [None]:
inds = np.nonzero(pd.isnull(data_kaggle_test).any(axis=1).values)[0] # positions of the rows that still contain a NaN

In [None]:
data_kaggle_test[inds[0]:inds[0]+1]

We didn't have a training example where the Fare was NaN so we now fit an imputer to the training data in order to impute a mean training value to use here.

Pclass is already numeric and takes values 1 -> 3

In [None]:
data.Pclass.unique()

Now that Pclass and Embarked take numeric integer values we can now apply the one-hot-encoding to generate binary features.

**Method 1 of one-hot-encoding**

In [None]:
enc_Embarked = preprocessing.OneHotEncoder()
encoded_column_vector = data.Embarked_numeric.values.reshape(-1,1) # gets numeric embarked data and reshapes it to a column vector

Embarked_one_hot_encoded = enc_Embarked.fit_transform(encoded_column_vector).toarray() # we now apply a fit and a transform step to the data simultaneously, which fits the one-hot-encoder and transforms the data to one-hot-encoded data

dfOneHot_Encoded = pd.DataFrame(Embarked_one_hot_encoded,
columns = ["Embarked_"+str(le_Embarked.classes_[i]) for i in range(Embarked_one_hot_encoded.shape[1])], # classes_[i] is the label that was encoded as i
index=data.index
) # we now construct a dataframe out of this one-hot-encoded data

In [None]:
data = pd.concat([data, dfOneHot_Encoded], axis=1)
# we now add our one-hot-encoded Embarked features

In [None]:
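from sklearn.impute import SimpleImputer # preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it

imputer_fare = SimpleImputer(strategy="mean")
imputer_fare.fit(data['Fare'].values.reshape(-1, 1)) # fit on the training Fare column (not Age) so that we impute the mean training fare
data_kaggle_test['Fare'] = imputer_fare.transform(data_kaggle_test['Fare'].values.reshape(-1, 1)).ravel()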

In [None]:
data_kaggle_test

### We now predict on our test data and write the predictions to a CSV file to submit to Kaggle

In [None]:
num_dropped_removing_Nans = len(data) - len(data.dropna())
percent_dropped_removing_Nans = 100*(len(data) - len(data.dropna()))/len(data)

print(num_dropped_removing_Nans) # should now be 0, as no rows containing NaNs should remain
print("Percent dropped by removing NaNs: {:.2f}%".format(percent_dropped_removing_Nans))

We did!

### Encoding categorical features

We now need to encode the categorical features into several binary features. This is necessary because of the way the algorithm interprets numbers. If we have a categorical feature that takes the values 0, 1, 2, 3, 4, the algorithm treats them as ordered quantities (e.g. 4 > 3) even though they are arbitrary encodings, because ultimately it is calculating weights/parameters to be multiplied by these feature variables to give a term which enters into the linear model. One common way to deal with this is one-hot-encoding: where a feature N takes the values 0, 1, 2, for example, we generate 3 features which each take the binary values 0 or 1. An example is shown below.

We have the original feature data:

| Entry        | N          |
| ------------ |:----------:|
| 0            | 1          | 
| 1            | 2          |
| 2            | 0          |
| 3            | 1          |
| 4            | 2          |
| 5            | 0          |

Which when encoded becomes:

| Entry        | N==0       | N==1       | N==2       |
| ------------ |:----------:|:----------:|:----------:|
| 0            | 0          | 1          | 0          | 
| 1            | 0          | 0          | 1          | 
| 2            | 1          | 0          | 0          | 
| 3            | 0          | 1          | 0          | 
| 4            | 0          | 0          | 1          | 
| 5            | 1          | 0          | 0          | 
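The encoding in the tables above can be reproduced with a short pure-Python sketch. `one_hot_encode` is a hypothetical helper written for illustration; it assumes that the columns are ordered by sorted category value, which matches the N==0, N==1, N==2 layout of the table.

```python
def one_hot_encode(values):
    """Return the sorted categories and one binary row per input value."""
    categories = sorted(set(values))
    column_of = {category: j for j, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1  # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows

N = [1, 2, 0, 1, 2, 0]  # the original feature column from the table above
categories, encoded = one_hot_encode(N)
for entry, row in enumerate(encoded):
    print(entry, row)
```

Each row sums to 1, so no ordering between the categories is implied by the encoded features.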


Here we need some insight into the data. The categorical features are:
- pclass : the Ticket class
- sex : the gender/sex of the passenger 
- embarked : the port where the passenger embarked from
    - C = Cherbourg
    - Q = Queenstown
    - S = Southampton

We encode sex with 0 or 1 and keep the fitted label encoder, which stores the mapping we have used, so we know how to transform new data we get in future.

In [None]:
data.Sex[0:5]

In [None]:
le_sex = preprocessing.LabelEncoder()
le_sex.fit(data.Sex.unique()) # assigns an integer to each unique label of the feature variable Sex
data['Sex_numeric'] = le_sex.transform(data.Sex) # transform the data from labels to numeric

In [None]:
data.Sex_numeric[0:5] # values are now encoded numerically

In [None]:
le_sex.inverse_transform(data.Sex_numeric[0:5]) # and the label encoder lets us reverse this if need be
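What the label encoder does can be sketched in plain Python: the unique labels are sorted, each label is mapped to its position in that sorted list, and the same list is used to map back. The helper names and the example labels below are hypothetical and only mimic the fit/transform/inverse_transform pattern used above.

```python
def fit_labels(labels):
    """Sorted list of unique labels; a label's position in the list is its integer code."""
    return sorted(set(labels))

def encode(classes, labels):
    """Map each label to its integer code."""
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels]

def decode(classes, codes):
    """Map integer codes back to the original labels."""
    return [classes[code] for code in codes]

sex = ["male", "female", "female", "male", "male"]  # made-up example column
classes = fit_labels(sex)
codes = encode(classes, sex)
print(classes, codes, decode(classes, codes))
```

Because the mapping is stored in `classes`, new data can later be encoded with exactly the same codes, which is why the fitted encoder is kept around.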

In [None]:
le_Embarked = preprocessing.LabelEncoder()
le_Embarked.fit(data.Embarked.unique()) # assigns an integer to each unique label of the feature variable Embarked
data['Embarked_numeric'] = le_Embarked.transform(data.Embarked) # transform the data from labels to numeric

In [None]:
prediction_kaggle_data = lr_model.predict(data_kaggle_test)

In [None]:
data_kaggle_test.index

In [None]:
with open('prediction_submission_LR.csv', 'w') as file:
    print('PassengerId,Survived', file=file)
    for i, id_ in enumerate(data_kaggle_test.index):
        print('{},{}'.format(id_, prediction_kaggle_data[i]), file=file)
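An alternative way to write the same two-column file is Python's standard `csv` module, sketched below. `write_submission` is a hypothetical helper, and the ids and predictions are made up; the sketch writes to an in-memory buffer so that it can be run without touching the disk.

```python
import csv
import io

def write_submission(file_obj, passenger_ids, predictions):
    """Write a PassengerId,Survived table to an open text file object."""
    writer = csv.writer(file_obj, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, prediction in zip(passenger_ids, predictions):
        writer.writerow([passenger_id, int(prediction)])  # int() so 1.0 is written as 1

buffer = io.StringIO()
write_submission(buffer, [892, 893, 894], [0, 1.0, 0])  # made-up ids and predictions
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with `open(path, 'w', newline='')`.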


The result is quite variable, but I get scores of 75.1%-77.5% from 4 submissions with different random train/validation/test splits.

## Let's now attempt to train a Support Vector Machine to predict survival

### Feature scaling

We will now perform feature scaling on the numeric variables, namely Age, SibSp, Parch, and Fare, such that they are all on roughly the same scale. This wasn't so important for logistic regression but IS important for SVMs.

In [None]:
# a pandas Series has no reshape, so we go via the underlying array (.values) and flatten the result with .ravel()
scaler_age = preprocessing.StandardScaler().fit(data['Age'].values.reshape(-1, 1))
data['Age'] = scaler_age.transform(data['Age'].values.reshape(-1, 1)).ravel()

scaler_SibSp = preprocessing.StandardScaler().fit(data['SibSp'].values.reshape(-1, 1))
data['SibSp'] = scaler_SibSp.transform(data['SibSp'].values.reshape(-1, 1)).ravel()

scaler_Parch = preprocessing.StandardScaler().fit(data['Parch'].values.reshape(-1, 1))
data['Parch'] = scaler_Parch.transform(data['Parch'].values.reshape(-1, 1)).ravel()

scaler_Fare = preprocessing.StandardScaler().fit(data['Fare'].values.reshape(-1, 1))
data['Fare'] = scaler_Fare.transform(data['Fare'].values.reshape(-1, 1)).ravel()

X_temp, X_test, y_temp, y_test = model_selection.train_test_split(X, y, test_size=0.2)

X_train, X_cv, y_train, y_cv = model_selection.train_test_split(X_temp, y_temp, train_size=0.75)

We now perform our feature scaling with the fitted scalers

In [None]:
# reuse the scalers fitted on the training data; .values.reshape is needed because a pandas Series has no reshape
data_kaggle_test['Age'] = scaler_age.transform(data_kaggle_test['Age'].values.reshape(-1, 1)).ravel()

data_kaggle_test['SibSp'] = scaler_SibSp.transform(data_kaggle_test['SibSp'].values.reshape(-1, 1)).ravel()

data_kaggle_test['Parch'] = scaler_Parch.transform(data_kaggle_test['Parch'].values.reshape(-1, 1)).ravel()

data_kaggle_test['Fare'] = scaler_Fare.transform(data_kaggle_test['Fare'].values.reshape(-1, 1)).ravel()
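As a rough sketch of what the fitted scalers do, standardisation subtracts the training mean and divides by the training standard deviation, and the same two numbers are then reused for new data. The values below are made up, and `fit_scaler` and `scale` are hypothetical helpers; the population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics

def fit_scaler(values):
    """Mean and population standard deviation of the training values."""
    return statistics.mean(values), statistics.pstdev(values)

def scale(values, mean, std):
    """Standardise values using a previously fitted mean and standard deviation."""
    return [(v - mean) / std for v in values]

train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up training feature
new_values = [3.0, 11.0]                                  # made-up unseen feature values

mean, std = fit_scaler(train_values)            # fitted on the training data only
scaled_train = scale(train_values, mean, std)
scaled_new = scale(new_values, mean, std)       # reuse the training mean and std
print(mean, std, scaled_train, scaled_new)
```

After scaling, the training values have mean 0 and standard deviation 1, while the new values are expressed on that same scale rather than being rescaled on their own.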

In [None]:
C = 1 # start with the penalty parameter equal to 1 as we don't know what value it should take yet: larger C -> lower bias, higher variance; smaller C -> higher bias, lower variance. Since we don't have that much data, a lower-variance model (smaller C) is probably best to avoid overfitting

kernel = 'linear' # we'll start with a simple linear kernel to see how this performs

svm_model = svm.SVC(C=C, kernel=kernel)

In [None]:
C_array = np.array([0.003, 0.01, 0.02, 0.03, 0.05, 0.1, 0.3, 1, 3])
train_score_array = np.zeros_like(C_array)
validation_score_array = np.zeros_like(C_array)
for i, C in enumerate(C_array):
    print(i, C)
    svm_model = svm.SVC(C=C, kernel=kernel)
    svm_model.fit(X_train, y_train)
    train_score_array[i] = svm_model.score(X_train, y_train)
    validation_score_array[i] = svm_model.score(X_cv, y_cv)

In [None]:
fig, ax = plt.subplots()
ax.plot(C_array, train_score_array, label='train accuracy')
ax.plot(C_array, validation_score_array, label='validation accuracy')
ax.semilogx()
ax.legend()

In [None]:
# We now use scikit learn to fit a regularised logistic regression 
# model with each of the feature variables being linear
# setting fit_intercept=True fits the Theta_0 term - an intercept term

lr_model = linear_model.LogisticRegression(fit_intercept=True)

# we now fit to the training data
lr_model.fit(X_train, y_train)

In [None]:
lr_model.score(X_train, y_train)

We get ~80% accuracy on our training data

We now evaluate our performance on the cross-validation data, which the model hasn't seen. We use our cross-validation data instead of our testing data to evaluate the classifier because we are using the cross-validation performance to pick our final algorithm, and we don't want to pick an algorithm that happens to work well on the testing data but doesn't generalise well to new data.

In [None]:
lr_model.score(X_cv, y_cv) # accuracy of this very simple logistic regression on the cross-validation data

We get ~80% accuracy, which is very good. Let's look at our truth table, precision, recall and F1score values as well to get a better idea of what's going on.

In [None]:
prediction_cv = lr_model.predict(X_cv)