## Kaggle Titanic Entry
### Daniel Walker, PhD
### 12 APRIL 2021

Purpose: Predict whether each passenger survived (1) or did not (0).
Output: A CSV file with 418 rows, one per passenger in the test set, and 2 columns: `PassengerId` and `Survived`
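
As a minimal sketch of that output format, the snippet below builds a two-column frame and serializes it without the index. The IDs and predictions are made-up placeholders, not model output; a real run would presumably take `PassengerId` from `test` and `Survived` from a fitted classifier.

```python
import pandas as pd

# Hypothetical stand-ins for the test-set IDs and a model's predictions.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# index=False keeps the file to exactly the two required columns.
# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path as the first argument to `to_csv` would write the same content to disk.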

### Import Data & Take a Look

In [1]:
import os
import pandas as pd

cwd = os.getcwd()
test_data_path = os.path.join(cwd, 'titanic_data', 'test.csv')
train_data_path = os.path.join(cwd, 'titanic_data', 'train.csv')
test = pd.read_csv(test_data_path)
train = pd.read_csv(train_data_path)
gender = pd.read_csv(os.path.join(cwd, 'titanic_data', 'gender_submission.csv'))
gender.head()
# gender_submission.csv is the example submission, which assumes that all female passengers, and only female passengers, survived.


   PassengerId  Survived
0          892         0
1          893         1
2          894         0
3          895         0
4          896         1


In [17]:
#Begin evaluating the data
test.shape

(418, 11)

In [18]:
train.shape

(891, 12)

In [20]:
#Why does train have one more column?
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [21]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [None]:
# Answer: test lacks the target variable, 'Survived', which train includes

The preceding two tables highlight a couple of things about the data:
- Test does not include the target variable, `Survived`
- `Cabin` & `Age` are going to be the two most problematic variables. `Age` is missing for 177 of 891 passengers in train and 86 of 418 in test; `Cabin` is missing for 687 of 891 in train and 327 of 418 in test
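
Both observations can be checked directly rather than read off the `info()` tables. The sketch below uses two tiny made-up frames (`train_demo` and `test_demo` are hypothetical names, not the notebook's data) to show a column set difference and a per-column missing share.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frames standing in for train and test.
train_demo = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})
test_demo = train_demo.drop(columns="Survived")

# Columns present in train but absent from test.
extra_cols = set(train_demo.columns) - set(test_demo.columns)

# Share of missing values per column, largest first.
missing_share = train_demo.isna().mean().sort_values(ascending=False)

print(extra_cols)
print(missing_share)
```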

In [44]:
test['Cabin'].sample(n=10)

45     NaN
96     C46
349    NaN
70     NaN
319    NaN
30     NaN
274    NaN
241    NaN
52     NaN
67     NaN
Name: Cabin, dtype: object

The sample above shows that the `Cabin` variable is an alphanumeric code. If required, I could split the leading letter & the following numbers into two variables that may be informative, but for now I will proceed without `Cabin` in my analyses.
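
If that split is ever needed, one possible approach is a regular expression with `Series.str.extract`. This is only a sketch: the column names `CabinDeck` and `CabinNumber` are hypothetical, the sample values are illustrative, and only the first letter and first run of digits are kept if an entry lists more than one cabin.

```python
import numpy as np
import pandas as pd

# Illustrative values; NaN rows stay NaN after extraction.
cabin = pd.Series(["C46", np.nan, "E101", "B57 B59"], name="Cabin")

# Leading letter, then (optionally) the first run of digits after it.
parts = cabin.str.extract(r"^(?P<CabinDeck>[A-Za-z])(?P<CabinNumber>\d+)?")

# The extracted digits are strings; convert them to numbers (NaN where absent).
parts["CabinNumber"] = pd.to_numeric(parts["CabinNumber"])
print(parts)
```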

### Analytical Plan

1. Train separate algorithms to predict missing `Age` and `Cabin` entries.
   - I suspect `Age` will give better results than `Cabin`.
   - In the test data, 331 observations have `Age` + `Fare`. Only one record is missing `Fare`. I will split these into training & validation sets, then make predictions on the remaining (418-331) passengers plus any passengers missing in
2. With the data sets more evenly populated, evaluate model accuracy with & without the imputed variables.
   - If my suspicion is correct, I will probably wind up dropping `Cabin`, or using another variable in its place, for being too correlated.
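
The first step of the plan, model-based imputation, can be sketched roughly as follows. The data here is synthetic, the feature list is cut down to two columns, and the `AgeImputed` flag is a hypothetical addition that makes it easy to compare models with and without imputed rows later; none of this is the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: Age loosely depends on Pclass.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=60),
    "Fare": rng.uniform(5, 100, size=60),
})
demo["Age"] = 50 - 8 * demo["Pclass"] + rng.normal(0, 3, size=60)
demo.loc[demo.index[::5], "Age"] = np.nan  # blank out every fifth Age

features = ["Pclass", "Fare"]
known = demo[demo["Age"].notna()]    # rows used to fit the imputer
missing = demo[demo["Age"].isna()]   # rows that need a predicted Age

model = GradientBoostingRegressor(random_state=0).fit(known[features], known["Age"])

imputed = demo.copy()
imputed["AgeImputed"] = imputed["Age"].isna()  # remember which rows were filled
imputed.loc[missing.index, "Age"] = model.predict(missing[features])
```

In practice the `known` rows would also be split into training and validation sets, as the plan describes, before trusting the predictions.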

In [32]:
# Train model to impute Age for Test data set
from sklearn import ensemble
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_absolute_error, median_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np

# One-hot encode (OHE) categorical vars, scale continuous vars
# Split out useful vars
age_all = train[["Age","Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]].dropna()
age_y = age_all["Age"]
age_x = age_all[["Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]]
# Note: Parch is selected above but is not included in either feature group below
age_x_cont = age_x[["Pclass", "SibSp", "Fare"]]
age_x_cat = age_x[["Sex", "Embarked"]]
# OHE cat vars
enc = OneHotEncoder()
enc.fit(age_x_cat)
age_x_cat_t = enc.transform(age_x_cat).toarray()
# Scale cont vars
scaler = StandardScaler().fit(age_x_cont)
age_x_cont_s = scaler.transform(age_x_cont)
# Rejoin two numpy arrays
age_x_format = np.concatenate((age_x_cat_t, age_x_cont_s), axis=1)
# Shape checks out
# Split into training & test sets
X_train_age, X_test_age, y_train_age, y_test_age = train_test_split(age_x_format, 
                                                                    age_y, 
                                                                    test_size=0.3, 
                                                                    random_state=865)

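# Init model with an absolute-error (MAE) loss; n_estimators, learning_rate and max_depth are set by hand, not defaults
# Note: loss='lad' was renamed 'absolute_error' in scikit-learn 1.0 and removed in 1.2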
gb_reg = ensemble.GradientBoostingRegressor(n_estimators=1000,
                                             loss='lad',
                                             learning_rate=0.01,
                                             max_depth=4,
                                             random_state=865).fit(X_train_age, y_train_age)
# Despite the _mae names, these are median absolute errors
tr_mae = median_absolute_error(y_train_age, gb_reg.predict(X_train_age))
ts_mae = median_absolute_error(y_test_age, gb_reg.predict(X_test_age))

print(f'Training error is {tr_mae} and test error is {ts_mae}')

Training error is 5.7459859615735365 and test error is 7.0


Documenting my age predictor's training (tr) & test (ts) errors here:
- First run: tr = 8.6, ts = 9.8
- Second run (`learning_rate` from 0.1 -> 0.01): worse
- Third run (`learning_rate` = 0.1, `max_depth` from 1 -> 6): marginally better
- ...repeat several times...
- Changed from mean absolute error to median absolute error and the values fell. That suggests a few large misses (outliers) are inflating the mean error.

Current errors: tr = 5.91 years, ts = 6.955 years, which rounds to 7 years on the test data.
If the model is over-predicting by that much, I should be able to get away with subtracting 7 years from all imputed ages in my final data to improve my predictions. Median absolute error is unsigned, though, so the direction of the error needs checking first.
Lots of additional manual attempts show that this model has a keen test error floo
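
A quick way to see how these metrics relate is to compute the mean absolute error, the median absolute error, and the mean signed error side by side. The numbers below are invented for illustration and are not the notebook's predictions: one large miss pulls the mean error well above the median, and the signed error shows whether a constant shift such as "subtract 7" would point in the right direction.

```python
import numpy as np

# Invented ages and predictions; the last prediction is a large miss.
y_true = np.array([22.0, 35.0, 4.0, 58.0, 30.0, 71.0])
y_pred = np.array([29.0, 28.0, 11.0, 51.0, 37.0, 30.0])

residuals = y_pred - y_true                     # positive = predicted too old
mean_abs_err = np.mean(np.abs(residuals))       # sensitive to the large miss
median_abs_err = np.median(np.abs(residuals))   # robust to it
mean_signed_err = np.mean(residuals)            # direction of any systematic bias

print(mean_abs_err, median_abs_err, mean_signed_err)
```

In this toy case the median absolute error is 7 years, yet the mean signed error is negative, so subtracting 7 from every prediction would not help. The same check on `gb_reg`'s validation predictions would show whether the shift is justified there.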


Hilariously, properly handling the categorical & continuous variables decreased my model accuracy.
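
For reference, one common way to bundle the encoding and scaling steps is scikit-learn's `ColumnTransformer`. The sketch below is an alternative arrangement, not the code used in this notebook, and the six-row frame is made up. An advantage of this form is that the transformer can be fitted on the training split only and then applied to the validation split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows using the same column names as the Titanic data.
demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Embarked": ["S", "C", "Q", "S", "C", "S"],
    "Pclass": [3, 1, 2, 3, 1, 2],
    "SibSp": [1, 0, 0, 2, 1, 0],
    "Fare": [7.25, 71.28, 8.05, 21.08, 30.0, 13.0],
})

preprocess = ColumnTransformer([
    # One-hot encode categoricals; unseen categories at predict time become all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    # Standardize the numeric columns to zero mean and unit variance.
    ("num", StandardScaler(), ["Pclass", "SibSp", "Fare"]),
])

X = preprocess.fit_transform(demo)
print(X.shape)  # 2 Sex + 3 Embarked one-hot columns, plus 3 scaled columns
```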

In [119]:
# Use gb_reg - 7 to impute missing data


(712, 7)

In this next section, I will see whether a simple neural net does a better job of imputing missing `Age` values than I have been able to do with the GBRegressor.

In [25]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

