In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib notebook

## Problem 3) My Heart Will Go On

For the next problem we will work with the famous [Titanic survival](https://www.kaggle.com/c/titanic/data?) data set. If you haven't already, please [download the csv file](https://northwestern.box.com/s/hx6airvgb4mukxvztvhlzvedkq8480ug) with the relevant information.

Briefly, [the Titanic](https://thefilmcricket.files.wordpress.com/2012/04/film-titanic_clar.jpg) is a [famous](https://wallpapercave.com/wp/jrF8rQK.jpg) historical ship that was thought to be unsinkable. **Spoiler alert:** it hit an iceberg and sank. The Titanic data set includes information about 891 passengers, as well as whether or not they survived. The aim of this data set is to build a machine learning model to predict which passengers survived and which did not.

The features include: 

|Feature    | Description |
|:---------:|:--------------------------------------:|
|PassengerId| Running index that describes the individual passengers|
|Pclass| A proxy for socio-economic status (1 = Upper class, 2 = Middle Class, 3 = Lower Class)|
|Name| The passenger's name|
|Sex | The passenger's sex|
|Age | The passenger's age - note that ages ending in .5 are estimated |
|SibSp| The total number of the passenger's siblings and spouses on board|
|Parch| The total number of the passenger's parents and children on board|
|Ticket| The ticket number for the passenger|
|Fare| The price paid for the ticket by the passenger|
|Cabin| The Cabin in which the passenger stayed|
|Embarked| The point of Origin for the Passenger: C = Cherbourg, S = Southampton, Q = Queenstown|

And of course, we are trying to predict:

|Label    | Description |
|:---------:|:--------------------------------------:|
|Survived| 1 = yes; 0 = no|


**Problem 3a**

Read in the Titanic training data and create the `scikit-learn` standard `X` and `y` arrays to hold the features and the labels, respectively.

In [124]:
titanic_df = pd.read_csv('titanic_kaggle_training_set.csv', comment='#')

feat_list = list(titanic_df.columns)
label = 'Survived'
feat_list.remove(label)
X = titanic_df[feat_list].values
y = titanic_df[label]

Now that we have the data in the appropriate `X` and `y` arrays, estimate the accuracy with which a [K nearest neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) classification model can predict whether or not a passenger would survive the Titanic disaster. Use 10-fold cross validation for the prediction.

**Problem 3b**

Train a $k=7$ nearest neighbors machine learning model on the Titanic training set.

In [95]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=7)
knn_clf.fit(X, y)

ValueError: could not convert string to float: 'Q'

Note - that should have failed! And for good reason - recall that `kNN` models measure the Euclidean distance between all points within the feature space. So, when considering the sex of a passenger, what is the *numerical* distance between male and female? 
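To make the failure concrete, here is a minimal sketch with two made-up passengers described only by (`Pclass`, `Fare`). The arrays `a` and `b` are illustrative and are not taken from the notebook's `X`. The distance is well defined for numbers, while a string category such as `'Q'` cannot be turned into a float at all.

```python
import numpy as np

# Two hypothetical passengers described by (Pclass, Fare)
a = np.array([3.0, 7.25])
b = np.array([1.0, 71.2833])

# The Euclidean distance is well defined for numeric features
dist = np.sqrt(np.sum((a - b) ** 2))
print('distance = {:.2f}'.format(dist))

# A string category cannot be converted to a float, hence the ValueError
try:
    float('Q')
    converted = True
except ValueError:
    converted = False
print('could convert "Q" to a float:', converted)
```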

In other words, given that this is a lecture on data wrangling, you should expect that we need to wrangle this data before we can run the machine learning model. 

Most of the features in this problem are non-numeric (i.e. we are dealing with categorical features), and therefore we need to figure out how to include them in the `kNN` model. 

The first step when wrangling for machine learning is to figure out if anything can be thrown away. We certainly want to avoid including any uninformative features in the model. 

*If you haven't already, now would be a good time to create a new cell and examine the contents of the csv*

In [96]:
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


**Problem 3c** 

Are there any features that *obviously* will not help with this classification problem?

*If you answer yes - ignore those features moving forward*

**Solution 3c**

*write your solution here*

The `Name` of the passenger will not be useful for this classification task. In particular, the useful information that might be learned from the name, such as the sex (e.g., Mr. vs. Mrs.), or the age (e.g., Mr. vs. Master), are already summarized elsewhere in the data. 

The `PassengerId`, which is just a running index for each person in the dataset, is also very unlikely to be useful when classifying this data.

Given that we have both categorical and numeric features, let's start with the numerical features and see how well they can predict survival on the Titanic.

One note - for now we are going to exclude `Age`, because as you saw when you examined the data, there are some passengers that do not have any age information. This problem, known as "missing data" is one that we will deal with before the end of this problem.
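A quick way to check which columns have missing entries is to count them. The sketch below uses a small stand-in data frame (`toy_df`, with illustrative values) rather than `titanic_df`, but the same call should work on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; the values are illustrative only
toy_df = pd.DataFrame({'Age': [22.0, np.nan, 26.0, np.nan],
                       'Fare': [7.25, 8.4583, 7.925, 30.0708]})

# Number of missing entries in each column
n_missing = toy_df.isna().sum()
print(n_missing)
```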

**Problem 3d**

How accurately can the numeric features, `Pclass`, `SibSp`, `Parch`, and `Fare` predict survival on the Titanic? Use a $k = 7$ Nearest Neighbors model, and estimate the model accuracy using 10-fold cross validation. 

*Hint 1 - you'll want to redefine your features vector `X`*

*Hint 2 - you may find [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) from `scikit-learn` helpful*

In [97]:
from sklearn.model_selection import cross_val_score

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare']]

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, X, y, cv=10)

print('The accuracy from numeric features = {:.2f}%'.format(100*np.mean(cv_results)))

The accuracy from numeric features = 68.51%


An accuracy of 68% isn't particularly inspiring. But, there's a lot of important information that we are excluding. As far as the Titanic is concerned, Kate and Leo taught us that [female passengers are far more likely to survive](https://qph.fs.quoracdn.net/main-qimg-93eb36091c7eec872b891fa51dc5722b), while [male passengers are not](http://hoycinema.abc.es/Media/201602/03/titanic-kate-dicaprio--644x362.jpg). So, if we can include gender in the model then we may be able to achieve more accurate predictions. 

**Problem 3e**

Create a new feature called `gender` that equals 1 for male passengers and 2 for female passengers. Add this feature to your dataframe, and include it in a `kNN` model with the other numeric features. Does the inclusion of this feature improve the 10-fold CV accuracy?

In [98]:
gender = np.ones(len(titanic_df['Sex']))
gender[np.where(titanic_df['Sex'] == 'female')] = 2

titanic_df['gender'] = gender

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'gender']]

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, X, y, cv=10)

print('The accuracy when including gender = {:.2f}%'.format(100*np.mean(cv_results)))

The accuracy when including gender = 77.28%


An improvement of almost 9 percentage points (from 68.51% to 77.28%) is pretty good! But, we can wrangle even more out of the gender feature. Recall that `kNN` models measure the Euclidean distance between sources, meaning the scale of each feature really matters. Given that the fare ranges from 0 up to 512.3292, the `kNN` model will see this feature as far more important than `gender`, for no other reason than the units that have been adopted. 

If women are far more likely to survive than men, then we want to be sure that gender is weighted at least the same as all the other features, which we can do with a minmax scaler. As a brief reminder - a minmax scaler scales all values of a feature to be between 0 and 1 by subtracting the minimum value of each feature and then dividing by the maximum minus the minimum. 
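As a small worked example of the formula just described, the sketch below scales a toy feature matrix by hand and compares the result with `MinMaxScaler`. The array `X_toy` is made up for illustration (columns standing in for `Fare` and `gender`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: columns stand in for (Fare, gender); values are illustrative
X_toy = np.array([[0.0, 1.0],
                  [8.05, 2.0],
                  [512.3292, 1.0]])

# Manual min-max scaling, column by column: (x - min) / (max - min)
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# The scikit-learn scaler should give the same result
X_sklearn = MinMaxScaler().fit_transform(X_toy)

print('manual and scikit-learn agree:', np.allclose(X_manual, X_sklearn))
```

After scaling, both columns span 0 to 1, so neither dominates the distance simply because of its units.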

**Problem 3f**

Scale all the features from the previous problem using a minmax scaler and evaluate the CV accuracy of the `kNN` model.

*Hint - you may find [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) helpful*

In [99]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy when scaling features = {:.2f}%'.format(100*np.mean(cv_results)))

The accuracy when scaling features = 80.43%


Scaling the features leads to a significant improvement in the performance of the model.
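One optional refinement, which is not what the cells in this notebook do: wrapping the scaler and the classifier in a `scikit-learn` pipeline means the scaler is re-fit on the training portion of each fold, so no information from the held-out fold enters the scaling. The sketch below uses synthetic stand-in data (`X_toy`, `y_toy`), not the Titanic features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (NOT the Titanic set):
# one wide-range feature and one binary feature
rng = np.random.default_rng(42)
X_toy = np.column_stack([rng.uniform(0, 500, size=200),
                         rng.integers(0, 2, size=200)])
y_toy = X_toy[:, 1].astype(int)  # the label depends only on the binary feature

# The scaler is fit inside each cross-validation fold
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=7))
cv_toy = cross_val_score(pipe, X_toy, y_toy, cv=10)

print('toy CV accuracy = {:.2f}%'.format(100 * np.mean(cv_toy)))
```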

Now turn your attention to another categorical feature, `Embarked`, which records the point of origin for each passenger and has three categories: `S`, `Q`, and `C`. We need to convert these values to a numeric representation for inclusion in the model.

**Problem 3g**

Convert the categorical feature `Embarked` to a numeric representation, and add it to the `titanic_df`.

In [100]:
# following previous example, set C = 0, S = 1, Q = 2

porigin = np.empty(len(titanic_df['Sex'])).astype(int)
porigin[np.where(titanic_df['Embarked'] == 'C')] = 0
porigin[np.where(titanic_df['Embarked'] == 'S')] = 1
porigin[np.where(titanic_df['Embarked'] == 'Q')] = 2

titanic_df['porigin'] = porigin

But wait! Does this actually make sense?

Our "numerification" has now introduced order where there previously was none. We are effectively telling the model that Cherbourg and Queenstown are far apart (not in distance but in terms of the similarity of the passengers that boarded the ship in each location), while both are equally close to Southampton. Is there actually any evidence to support this conclusion? 

By definition categorical features do not have order (e.g., cat, dog, horse, elephant), and therefore we should not impose any when converting these features to numeric values for inclusion in our model. Instead, we should be creating a new set of binary features for every category within the feature set. Thus, `Embarked` will now need to be represented by 3 different features, where the feature `Queenstown` equals one for passengers that boarded there and zero for everyone else. 
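The difference between the two encodings can be seen directly in the distances they imply. The sketch below uses a four-entry toy `Embarked` series and `pd.get_dummies` as one convenient way to build the binary columns; this is an illustration and not the function requested in the next problem.

```python
import numpy as np
import pandas as pd

# Toy Embarked column; values are illustrative
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# Integer codes impose an ordering: C = 0, S = 1, Q = 2
codes = embarked.map({'C': 0, 'S': 1, 'Q': 2}).values
print('|C - Q| =', abs(codes[1] - codes[2]),
      ' |C - S| =', abs(codes[1] - codes[0]))

# One binary column per category (columns are C, Q, S)
one_hot = pd.get_dummies(embarked).astype(int)
oh = one_hot.values

# With binary columns, every pair of distinct categories is equally far apart
d_CQ = np.linalg.norm(oh[1] - oh[2])
d_CS = np.linalg.norm(oh[1] - oh[0])
print('C-Q and C-S distances equal:', np.isclose(d_CQ, d_CS))
```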

**Problem 3h**

Complete the function below that will automatically create binary arrays for a categorical feature.

In [101]:
def create_bin_cat_feats(feature_array):
    # Return a dict that maps each category to a binary (0/1) integer array
    feature_array = np.asarray(feature_array)
    categories = np.unique(feature_array)
    feat_dict = {}
    for cat in categories:
        # 1 where the entry matches this category, 0 everywhere else
        feat_dict[str(cat)] = (feature_array == cat).astype(int)
    
    return feat_dict

**Problem 3i**

Use the `create_bin_cat_feats` function to convert the categorical features `Embarked` and `Sex` to a numeric representation. (Yes, we need to do this for `Sex` as well, because the earlier `gender` feature introduced order.) Add these features to the `titanic_df` data frame.

In [125]:
gender_dict = create_bin_cat_feats(titanic_df['Sex'])
porigin_dict = create_bin_cat_feats(titanic_df['Embarked'])

for feat in gender_dict.keys():
    titanic_df[feat] = gender_dict[feat]
    
for feat in porigin_dict.keys():
    titanic_df[feat] = porigin_dict[feat]

**Problem 3j**

Use the newly created `female`, `male`, `S`, `Q`, and `C` features in combination with the `Pclass`, `SibSp`, `Parch`, and `Fare` features to estimate the classification accuracy of a $k = 7$ nearest neighbors model with 10-fold cross validation.

How does the addition of the point of origin feature affect the final model output?

*Hint - don't forget to scale the features in the model*

In [126]:
from sklearn.preprocessing import MinMaxScaler

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C']]

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy with categorical features = {:.2f}%'.format(100*np.mean(cv_results)))

The accuracy with categorical features = 80.09%


The last thing we'd like to add to the model is the `Age` feature. Unfortunately, for 177 passengers we do not have a reported value for their age. This is a standard issue when building models known as "missing data" and this happens in astronomy all the time (for example, LSST is going to observe millions of L and T dwarfs that are easily detected in the $y$-band, but which do not have a detection in $u$-band).

There are several different strategies for dealing with missing data. The first and most straightforward is to simply remove observations with missing data (note - to simplify this example I already did this by removing the 2 passengers from the training set that did not have an entry for `Embarked`). 

This strategy is perfectly fine if only a few sources have missing information (2/891 for `Embarked` - and none of the test set sources are missing `Embarked`). If, however, a significant fraction are missing data, this strategy would remove a lot of useful data from the model.
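Removing incomplete rows is a one-liner in `pandas`. The sketch below uses a toy data frame (`toy_df`, illustrative values) and also reports the fraction of rows that would be lost, which is the number to check before adopting this strategy.

```python
import numpy as np
import pandas as pd

# Toy stand-in data; one row is missing its Embarked entry
toy_df = pd.DataFrame({'Embarked': ['S', np.nan, 'C', 'Q'],
                       'Fare': [7.25, 80.0, 71.2833, 8.4583]})

# Drop only the rows where Embarked is missing
clean_df = toy_df.dropna(subset=['Embarked'])

# Fraction of the data lost by removing incomplete rows
frac_removed = 1 - len(clean_df) / len(toy_df)
print('fraction removed = {:.2f}'.format(frac_removed))
```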

If you cannot remove the sources with missing data, then it is essential to ask the following question: 

Does the missing information have meaning? 

In the LSST L/T dwarf example, the lack of a $u$-band detection is meaningful: these stars are too faint to be detected. When this is the case, it makes sense to provide an indicator value (e.g., -999) to the model to recognize the non-detection. 

For the Titanic data, the lack of age information is not meaningful. Simply put, there are some passengers that did not have recorded ages. We will now show this to be the case. 

**Problem 3k**

Replace the unknown ages with a value of -999, and estimate the accuracy of the model via 10-fold cross validation.

In [127]:
age_impute = titanic_df['Age'].copy()
age_impute[np.isnan(age_impute)] = -999

titanic_df['age_impute'] = age_impute

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C', 'age_impute']]

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy with -999 for missing ages = {:.2f}%'.format(100*np.mean(cv_results)))

The accuracy with -999 for missing ages = 80.42%


The accuracy of the model hasn't improved by adding the age information (even though we know children were more likely to survive than adults). 

Given that the missing ages don't have meaning, we need to develop alternative strategies for "imputing" the missing data. The simplest approach in this regard is to replace the missing values with the mean value of the feature distribution for sources that do have measurements (use the median if the distribution has significant outliers).  
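As an illustration of the mean versus median choice, the sketch below fills a toy age column with `SimpleImputer` from `scikit-learn`. This is one possible tool and the values are made up. Note how the single large age pulls the mean well above the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy age column with two missing entries and one outlier (values are illustrative)
ages = np.array([[22.0], [np.nan], [26.0], [80.0], [np.nan]])

# Fill the missing entries with the mean or the median of the known values
mean_filled = SimpleImputer(strategy='mean').fit_transform(ages)
median_filled = SimpleImputer(strategy='median').fit_transform(ages)

print('mean fill   = {:.2f}'.format(mean_filled[1, 0]))
print('median fill = {:.2f}'.format(median_filled[1, 0]))
```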

**Problem 3l**

Replace the unknown ages with the mean age of passengers, and estimate the accuracy of the model via 10-fold cross validation.

In [216]:
age_impute = titanic_df['Age'].copy().values
age_impute[np.isnan(age_impute)] = np.mean(age_impute[np.isfinite(age_impute)])

titanic_df['age_impute'] = age_impute

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C', 'age_impute']]

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy with the mean for missing ages = {:.2f}%'.format(100*np.mean(cv_results)))

The accuracy with the mean for missing ages = 80.76%


Using the mean age for missing values provides a marginal improvement over the models with no age information. Is there anything else we can do? Yes - we can build a machine learning model to predict the values of the missing data. So there will be a machine learning model within the final machine learning model. In order to predict ages, we will need to build a regression model. Simple algorithms include Linear or Logistic Regression, while more complex examples include `kNN` or random forest regression.

I quickly tested the above four methods, and found that the `scikit-learn` [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) performs best when using the model defaults.

**Problem 3m**

Build a `LinearRegression` model to predict a passenger's age based on the `Pclass`, `SibSp`, `Parch`, `Fare`, `female`, `male`, `S`, `Q`, `C` features. The model should be trained with passengers that have known ages. Use 10-fold cross validation to determine the performance of this model.

*Hint - note that for regression models the typical metric of evaluation is the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error), and that for consistency within the `scikit-learn` API, the [negative mean squared error is returned rather than the mean squared error](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values).*
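To illustrate the sign convention in the hint, the sketch below runs `cross_val_score` on synthetic regression data (`X_toy`, `y_toy` are made up) and converts the negative mean squared errors into a root mean squared error for each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (NOT the Titanic ages)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(100, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(0, 0.1, size=100)

# scikit-learn returns the NEGATIVE mean squared error, so larger is better
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=10,
                          scoring='neg_mean_squared_error')

# Flip the sign before taking the square root to get the RMSE for each fold
rmse_per_fold = np.sqrt(-neg_mse)
print('mean RMSE = {:.3f}'.format(rmse_per_fold.mean()))
```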

In [217]:
from sklearn.linear_model import LinearRegression

has_ages = np.where(np.isfinite(titanic_df['Age']))[0]

impute_X_train = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C']].iloc[has_ages]
impute_y_train = titanic_df['Age'].iloc[has_ages]

scaler = MinMaxScaler()
scaler.fit(impute_X_train)
Xminmax = scaler.transform(impute_X_train)

lr_age = LinearRegression().fit(Xminmax, impute_y_train)

cv_results = cross_val_score(LinearRegression(), Xminmax, impute_y_train, cv=10, scoring='neg_mean_squared_error')

print('Missing ages have RMSE = {:.2f}'.format(np.mean((-1*cv_results)**0.5)))

Missing ages have RMSE = 12.77


**Problem 3n**

Use the age regression model to predict the ages for passengers with missing data.

In [218]:
missing_ages = np.where(np.isnan(titanic_df['Age']))[0]

impute_X_missing = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C']].iloc[missing_ages]

X_missing_minmax = scaler.transform(impute_X_missing)

age_preds = lr_age.predict(X_missing_minmax)

**Problem 3o**

Use the imputed age estimates to predict the passenger survival via 10-fold cross validation.

In [229]:
age_impute = titanic_df['Age'].copy().values
age_impute[missing_ages] = age_preds

titanic_df['age_impute'] = age_impute

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C', 'age_impute']]

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy with regression-imputed ages = {:.2f}%'.format(100*np.mean(cv_results)))

The accuracy with regression-imputed ages = 80.20%


As far as ages are concerned, imputation of the missing data does not significantly improve the model.

Which brings us to the concluding lesson in wrangling the Titanic data - not every piece of information is *useful*. This is critical to remember when building machine learning models. 

(As a quick aside - it wouldn't be entirely fair to say there is no useful information in the age feature. It is clear, for example, that "children," i.e. those with Age < 10, had a much higher probability of survival than adults. Perhaps the creation of a `child` feature based on age would improve the model, or using age in combination with other features, e.g., `Age` x `Pclass`, which would further highlight that 1st class passengers were more likely to survive than 3rd class passengers.)
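The two ideas in the aside could be sketched as follows. The column names `child` and `age_x_pclass`, the threshold of 10 years, and the toy data frame are all assumptions for illustration. Whether either feature actually improves the model is left untested here.

```python
import numpy as np
import pandas as pd

# Toy stand-in data; values are illustrative only
toy_df = pd.DataFrame({'Age': [2.0, 35.0, 9.0, np.nan],
                       'Pclass': [3, 1, 2, 3]})

# Hypothetical binary feature: 1 for passengers younger than 10.
# A missing age compares as False, so unknown ages are flagged as 0 here.
toy_df['child'] = (toy_df['Age'] < 10).astype(int)

# Hypothetical interaction feature: age multiplied by passenger class.
# Missing ages stay missing and would still need to be imputed.
toy_df['age_x_pclass'] = toy_df['Age'] * toy_df['Pclass']

print(toy_df)
```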

Finally - note that you can try to build a model and submit it to Kaggle to see how well you perform on blind data. 

https://www.kaggle.com/c/titanic - the classifications are not revealed, but from the [leaderboard](https://www.kaggle.com/c/titanic/leaderboard) it is clear that some people were able to build models that perfectly classified the blind data.

## Problem 4) Wrangling as astro machine learning model