# Titanic - Machine Learning from Disaster
## Author: Brody Taylor
## Date: Feb. 2023

---


## Dataset
### Data Dictionary
| Variable | Definition | Key |
| --- | --- | --- |
| Survived | Survival | 0 = No, 1 = Yes |
| Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Name | Name | |
| Sex | Sex | |
| Age | Age in years | |
| SibSp | # of siblings / spouses aboard the Titanic | |
| Parch | # of parents / children aboard the Titanic | |
| Ticket | Ticket number | |
| Fare | Passenger fare | |
| Cabin | Cabin number | |
| Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
### Variable Notes
* Pclass: A proxy for socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower
* Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
* SibSp: The dataset defines family relations as follows:
  * Sibling = brother, sister, stepbrother, stepsister
  * Spouse = husband, wife (mistresses and fiancés were ignored)
* Parch: The dataset defines family relations as follows:
  * Parent = mother, father
  * Child = daughter, son, stepdaughter, stepson
  * Some children travelled only with a nanny, therefore Parch=0 for them.
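
The Age convention in the notes above can be checked mechanically. The sketch below is illustrative only: `is_estimated_age` is a hypothetical helper that is not part of the dataset or this notebook, and it assumes that any age of at least 1 ending in .5 was estimated and that missing ages are NaN.

```python
import math


def is_estimated_age(age):
    """Return True if an age looks estimated under the xx.5 convention.

    Hypothetical helper: assumes ages >= 1 with a fractional part of .5
    are estimates, and that missing ages are None or NaN.
    """
    if age is None or math.isnan(age):
        return False
    return age >= 1 and age % 1 == 0.5


print(is_estimated_age(28.5))  # fits the xx.5 pattern
print(is_estimated_age(0.42))  # infant age: fractional, but below 1
print(is_estimated_age(22.0))  # whole-year age
```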

In [1399]:
import pandas


training_df = pandas.read_csv("data/train.csv")
testing_df = pandas.read_csv("data/test.csv")

In [1400]:
training_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Data Cleaning

In [1401]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Convert boolean columns

In [1402]:
# Survived: 1 -> True, 0 -> False
training_df["Survived"] = training_df["Survived"].map({1: True, 0: False})

# Sex: True means male, False means female
training_df["Sex"] = training_df["Sex"].map({'male': True, 'female': False})
testing_df["Sex"] = testing_df["Sex"].map({'male': True, 'female': False})

### Parse Cabin into Deck and Room Number

In [1403]:
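# First group: deck letter. Second group: room number (may be empty).
# str.extract keeps only the first match when several cabins are listed.
cabin_regex = r'([a-zA-Z])(\d*)'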
training_df[["Deck", "Room"]] = training_df["Cabin"].str.extract(cabin_regex)
testing_df[["Deck", "Room"]] = testing_df["Cabin"].str.extract(cabin_regex)

training_df = training_df.drop("Cabin", axis="columns")
testing_df = testing_df.drop("Cabin", axis="columns")

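# Treat an empty room number (deck letter only) as missing
training_df["Room"] = training_df["Room"].mask(training_df["Room"] == "")
testing_df["Room"] = testing_df["Room"].mask(testing_df["Room"] == "")

# Note: int cannot represent NaN, so with errors="ignore" the column is
# returned unchanged (as strings) whenever any room number is missing
training_df["Room"] = training_df["Room"].astype(int, errors="ignore")
testing_df["Room"] = testing_df["Room"].astype(int, errors="ignore")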
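
As an alternative to `astype(int, errors="ignore")`, the sketch below parses room numbers with `pandas.to_numeric` and pandas' nullable `Int64` dtype, which can hold integers next to missing values. The cabin strings are made up for illustration, and whether the downstream scikit-learn steps accept the nullable dtype is not verified here.

```python
import pandas

# Illustrative cabin values (not taken from the dataset)
cabins = pandas.Series(["C85", "C23 C25 C27", None, "D"])

# Same pattern as above: one deck letter followed by optional digits
parsed = cabins.str.extract(r"([a-zA-Z])(\d*)")
parsed.columns = ["Deck", "Room"]

# errors="coerce" turns empty or non-numeric strings into NaN;
# the nullable "Int64" dtype then keeps integers alongside missing values
parsed["Room"] = pandas.to_numeric(parsed["Room"], errors="coerce").astype("Int64")

print(parsed)
```

Only the first cabin of a multi-cabin entry is captured, which matches the behavior of the regex used in the cell above.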

### Convert enum columns to int
* Embarked
  * C -> 0
  * Q -> 1
  * S -> 2

In [1404]:
embarked_enum_map = {"C": 0, "Q": 1, "S": 2}
training_df["Embarked"] = training_df["Embarked"].map(embarked_enum_map)
testing_df["Embarked"] = testing_df["Embarked"].map(embarked_enum_map)
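
One side effect of `Series.map` with a dictionary is worth keeping in mind: values that are missing or absent from the dictionary become NaN, and the result is then stored as floats. The sketch below uses made-up port codes (including a hypothetical unmapped code `"X"`) to show this.

```python
import pandas

embarked_enum_map = {"C": 0, "Q": 1, "S": 2}

# Illustrative values: one missing entry and one code absent from the map
ports = pandas.Series(["S", "C", None, "X"])
codes = ports.map(embarked_enum_map)

print(codes.tolist())  # unmapped and missing entries become NaN
print(codes.dtype)     # a float dtype, because NaN forces a float column
```

This is why Embarked appears as `2.0` and `0.0` rather than `2` and `0` in the table further down.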

* Deck
  * A -> 0
  * B -> 1
  * ...
  * Z -> 25

In [1405]:
def deck_to_int(x):
    # Map a deck letter to its alphabet index (A -> 0); pass missing values through
    return ord(x.upper()) - ord('A') if isinstance(x, str) else x
training_df["Deck"] = training_df["Deck"].apply(deck_to_int)
testing_df["Deck"] = testing_df["Deck"].apply(deck_to_int)

In [1406]:
training_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Deck,Room
0,1,False,3,"Braund, Mr. Owen Harris",True,22.0,1,0,A/5 21171,7.2500,2.0,,
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",False,38.0,1,0,PC 17599,71.2833,0.0,2.0,85
2,3,True,3,"Heikkinen, Miss. Laina",False,26.0,0,0,STON/O2. 3101282,7.9250,2.0,,
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",False,35.0,1,0,113803,53.1000,2.0,2.0,123
4,5,False,3,"Allen, Mr. William Henry",True,35.0,0,0,373450,8.0500,2.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,False,2,"Montvila, Rev. Juozas",True,27.0,0,0,211536,13.0000,2.0,,
887,888,True,1,"Graham, Miss. Margaret Edith",False,19.0,0,0,112053,30.0000,2.0,1.0,42
888,889,False,3,"Johnston, Miss. Catherine Helen ""Carrie""",False,,1,2,W./C. 6607,23.4500,2.0,,
889,890,True,1,"Behr, Mr. Karl Howell",True,26.0,0,0,111369,30.0000,0.0,2.0,148


## Feature Selection
* Can improve accuracy by removing uninformative features
* Simplifies the model
* Reduces the risk of overfitting

### Isolate Eligible Features

In [1407]:
features = training_df.drop("Survived", axis="columns")
target = training_df["Survived"]

# Identifier and free-text columns are not used as features
ineligible_features = ["PassengerId", "Name", "Ticket"]
features.drop(ineligible_features, axis="columns", inplace=True)

### Remove Low Variance Features

In [1408]:
from sklearn.feature_selection import VarianceThreshold


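# The variance of a 0/1 feature is p * (1 - p); for such features this
# threshold drops any that take the same value in more than 80% of rows
variance_thres = 0.8 * (1 - 0.8)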
features = VarianceThreshold(threshold=variance_thres).set_output(transform="pandas").fit_transform(features)
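
The threshold `0.8 * (1 - 0.8)` comes from the variance of a 0/1 feature, which is `p * (1 - p)` when a fraction `p` of the rows equal 1. The sketch below applies the same threshold to a toy frame; the column names `rare` and `balanced` are hypothetical and exist only for this example.

```python
import pandas
from sklearn.feature_selection import VarianceThreshold

# Toy data: "rare" is 1 in 10% of rows, "balanced" is 1 in 50% of rows
toy = pandas.DataFrame({
    "rare": [1] + [0] * 9,
    "balanced": [1, 0] * 5,
})

# Population variance of a 0/1 column equals p * (1 - p)
print(toy.var(ddof=0))

selector = VarianceThreshold(threshold=0.8 * (1 - 0.8)).fit(toy)
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

For columns that are not 0/1, such as Fare or Age, the same threshold is simply a cutoff on the raw variance and no longer has the "80% identical" interpretation.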

### Univariate Feature Selection

Replace NaN values with the column mean before scoring, since `chi2` does not accept missing values

In [1409]:
from sklearn.impute import SimpleImputer


cleaned_features = SimpleImputer(missing_values=pandas.NA, strategy='mean').set_output(transform="pandas").fit_transform(features)
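
As a minimal illustration of mean imputation, the sketch below fills one missing value in a toy column. It relies on `SimpleImputer`'s default `missing_values=np.nan` rather than the `pandas.NA` setting used above, and the numbers are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value
ages = np.array([[20.0], [np.nan], [40.0]])

# The mean is computed from the observed values only: (20 + 40) / 2 = 30
imputed = SimpleImputer(strategy="mean").fit_transform(ages)
print(imputed.ravel())
```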

Score every feature and keep the six highest-scoring ones

In [1410]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


# k="all" keeps every feature so that each one receives a score
scores = SelectKBest(score_func=chi2, k="all").fit(cleaned_features, target).scores_
feature_scores = pandas.DataFrame({"Features": cleaned_features.columns, "Score": scores})
feature_scores.sort_values(by="Score", ascending=False, inplace=True)

# Keep the six highest-scoring features
features = features[feature_scores["Features"].head(6)]

feature_scores

Unnamed: 0,Features,Score
5,Fare,4518.319091
1,Sex,92.702447
0,Pclass,30.873699
2,Age,24.687926
6,Embarked,10.413856
4,Parch,10.097499
8,Room,3.020284
3,SibSp,2.581865
7,Deck,0.055073
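
When reading the scores, note that scikit-learn's `chi2` statistic depends on the scale of a feature: multiplying a non-negative feature by a constant multiplies its score by the same constant. This may partly explain why Fare, which has the widest numeric range, scores far above the rest; that is an interpretation of the table, not something the notebook tests. The sketch below demonstrates the scaling effect on synthetic data.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Synthetic non-negative feature and binary target (not Titanic data)
x = rng.uniform(0, 10, size=(200, 1))
y = rng.integers(0, 2, size=200)

score_original, _ = chi2(x, y)
score_rescaled, _ = chi2(x * 100, y)

# The same feature expressed in different units gets a 100x larger score
print(score_rescaled / score_original)
```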


## Training

Hold out 20% of the training set to evaluate the model

In [1411]:
from sklearn.model_selection import train_test_split


# No random_state is set, so the split (and the score below) varies between runs
train, test = train_test_split(training_df, test_size=0.2)
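
If a repeatable split is wanted, `train_test_split` accepts `random_state` and `stratify`. The sketch below uses a toy frame in place of `training_df`; the seed `42` is an arbitrary choice, and the notebook's reported score was not produced with these settings.

```python
import pandas
from sklearn.model_selection import train_test_split

# Toy frame standing in for training_df
toy = pandas.DataFrame({"x": range(10), "Survived": [0, 1] * 5})

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
train_a, test_a = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Survived"]
)
train_b, test_b = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Survived"]
)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))  # identical split on both calls
```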

In [1412]:
from sklearn.ensemble import HistGradientBoostingClassifier


x = train[features.columns]
y = train["Survived"]

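# Treat these columns as categories rather than ordered numbers,
# limited to the ones that survived feature selection
categorical_features = [f for f in ["Pclass", "Sex", "Embarked", "Deck", "Room"] if f in x]
fit = HistGradientBoostingClassifier(categorical_features=categorical_features).fit(x, y)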

## Testing

In [1413]:
x_test = test[x.columns]
y_test = test["Survived"]

fit.score(x_test, y_test)

0.8100558659217877
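
For a classifier, `score` returns the mean accuracy, so the value above can be read as a count of correct predictions. The arithmetic below is a sketch that assumes `train_test_split` rounds the 20% hold-out size up to a whole number of rows.

```python
import math

n_rows = 891                         # rows in train.csv, per info() above
n_holdout = math.ceil(0.2 * n_rows)  # assumes the hold-out size is rounded up
n_correct = round(0.8100558659217877 * n_holdout)

print(n_holdout, n_correct)
print(n_correct / n_holdout)
```

Under that assumption, the score corresponds to 145 correct predictions out of 179 held-out passengers.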

## Results

In [1414]:
results = pandas.DataFrame(testing_df["PassengerId"])
results["Survived"] = fit.predict(testing_df[x.columns])
results

Unnamed: 0,PassengerId,Survived
0,892,False
1,893,False
2,894,False
3,895,True
4,896,True
...,...,...
413,1305,False
414,1306,True
415,1307,False
416,1308,False


In [1415]:
from pathlib import Path


# Convert Survived bool back to int
results["Survived"] = results["Survived"].map({True: 1, False: 0})

filepath = Path('out/results.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
results.to_csv(filepath, index=False)
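
A quick way to sanity-check the written file is to read it back. The sketch below does this with a two-row toy frame and a temporary directory instead of `out/`, and it uses `astype(int)` as an equivalent to the True/False map above.

```python
import tempfile
from pathlib import Path

import pandas

# Toy predictions standing in for the real results frame
results = pandas.DataFrame({"PassengerId": [892, 893], "Survived": [False, True]})
results["Survived"] = results["Survived"].astype(int)  # True -> 1, False -> 0

with tempfile.TemporaryDirectory() as tmp:
    filepath = Path(tmp) / "out" / "results.csv"
    filepath.parent.mkdir(parents=True, exist_ok=True)
    results.to_csv(filepath, index=False)

    # Read the file back to confirm the header and the integer labels
    written = pandas.read_csv(filepath)

print(written.columns.tolist())
print(written["Survived"].tolist())
```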