<a href="https://www.kaggle.com/code/andrexibiza/titanic-machine-learning-from-disaster?scriptVersionId=218515899" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Titanic - Machine Learning from Disaster

**Andrex Ibiza, MBA**
2025-01-16

# v2.2 Notes
This is version 2.2 of this notebook. In version 2.1, I attempted to apply and tune a LightGBM model, but it did not go well, scoring only 0.52870 accuracy. Version 2.0 achieved a score of 0.76076, so I reverted to that version. Reviewing v2.0 with fresh eyes, a specific message in the output from the random forest model caught my attention: `"You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column."`

So my model was running regression on `Survived` instead of classification. In other words, it was estimating numbers on a continuous range from 0 to 1 instead of assigning a binary class of 0 or 1. Despite this shortcoming, the v2.0 model still scored 0.76076 simply by rounding the regression output. Before changing my model selection or engineering new features, I want to know how much the score improves from fixing this data type issue alone and rerunning the model.
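
The regression-versus-classification distinction can be seen in isolation with a small sketch. The data below are synthetic (not the Titanic files), and the column names `x1`, `x2`, `y`, and `y_factor` are made up for illustration. The only thing that changes between the two fits is the type of the outcome column.

```r
library(randomForest)  # same package the notebook already loads

set.seed(1)
# Toy data (hypothetical): a 0/1 outcome and two numeric predictors
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- as.integer(toy$x1 + rnorm(100, sd = 0.5) > 0)

# Numeric 0/1 outcome: randomForest fits a regression forest (and warns about it)
rf_reg <- suppressWarnings(randomForest(y ~ x1 + x2, data = toy, ntree = 50))

# Factor outcome: the same call fits a classification forest
toy$y_factor <- factor(toy$y, levels = c(0, 1))
rf_cls <- randomForest(y_factor ~ x1 + x2, data = toy, ntree = 50)

rf_reg$type  # "regression"
rf_cls$type  # "classification"
```

Predictions from the first model are numbers between 0 and 1 that have to be rounded, while the second returns class labels directly.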

# Introduction

This notebook documents my second attempt at working through the Titanic dataset to build an accurate predictive model for Titanic shipwreck survivors (https://www.kaggle.com/competitions/titanic). My v1 model scored around 70% accuracy. In this iteration, to build a more accurate model, I plan to take a more nuanced approach toward fully exploring the data, dealing with missing values, and engineering meaningful new features.

## Files
* `gender_submission.csv`: example of what the final submitted file should look like, with two columns: `PassengerId` and `Survived`.
* `train.csv`: labeled data (includes `Survived`) used to build the model. 12 columns.
* `test.csv`: unlabeled data to predict on. 11 columns (no `Survived`).

## Data dictionary
| Variable | Definition | Key | Notes |
| --- | --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes | --- |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | Proxy for socio-economic status (SES): 1st = upper, 2nd = middle, 3rd = lower |
| sex | Sex | --- | --- |
| Age | Age in years | --- | Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5 |
| sibsp | # of siblings / spouses aboard the Titanic | --- | Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored) |
| parch | # of parents / children aboard the Titanic | --- | Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch = 0 for them. |
| ticket | Ticket number | --- | --- |
| fare | Passenger fare | --- | --- |
| cabin | Cabin number | --- | --- |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | --- |

# Exploratory Data Analysis

The first step in working with this dataset is to load `train.csv` and `test.csv` into dataframes to check their structure and data types and to identify any missing values. The `Hmisc` package offers a robust `describe()` function that prints detailed summary statistics for each variable and reports how many values are missing.

In [None]:
# Load packages
library(caret)        # machine learning
library(dplyr)        # data manipulation
library(ggplot2)      # viz
library(Hmisc)        # robust describe() function
library(naniar)       # working with missing data
library(randomForest) # inference model

# Load train and test data
train <- read.csv("/kaggle/input/titanic/train.csv", stringsAsFactors = FALSE)
test <- read.csv("/kaggle/input/titanic/test.csv", stringsAsFactors = FALSE)
head(train) #--loaded successfully
head(test)  #--loaded successfully

# Evaluate structure and data types
# str(train)
# str(test)
# 
# describe(train)
# train has missing values: Age 177, Cabin 687, Embarked 2
# describe(test)
# test has missing values: Cabin 327, Fare 1, Age 86
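
Some gaps in these files are empty strings rather than `NA`, which is why `Cabin` and `Embarked` need extra handling later on. As a rough sketch, a per-column count that treats both as missing could look like the following. The data frame is a toy example with made-up values, and `count_missing` is a hypothetical helper, not part of any package.

```r
# Toy data frame (hypothetical values) mixing NA and empty-string gaps
toy <- data.frame(
  Age      = c(22, NA, 35, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)

# Count a value as missing if it is NA or, for character columns, an empty string
count_missing <- function(x) {
  sum(is.na(x) | (is.character(x) & x == ""))
}

sapply(toy, count_missing)
```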

# Data Cleaning and Preprocessing

## 1) Encode Categorical Variables
We need to encode the categorical variables correctly before using these variables to impute missing `Age` values with a random forest model.
* `Sex`: Binary *factor* (male = 1, female = 0).
* `Pclass`: Ordinal encode (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
* `Embarked`: One-hot encode (C, Q, S).

## 2) Data Transformation
* `Fare`: Highly skewed (95th percentile = 112.08, max = 512.33). Apply a log transformation (log(Fare + 1)) to reduce skew.

## 3) Missing Values
Preparing the data for modeling requires addressing missing values in the dataset. 
* `Age`: 177 missing values. We will use a random forest model to impute missing ages instead of a simpler method such as the median or mode. Cross-validation on the rows with known ages estimates how well the model predicts `Age`.
* `Cabin`: 687 missing values, too many to impute. This column will be replaced by a new binary column, `HasCabin`, equal to 1 if a cabin was recorded and 0 if not.
* `Embarked`: 2 missing values. These will be imputed with the mode, since only two are missing.

## 4) Feature Engineering
* `HasCabin`: 0 if `Cabin` entry missing, 1 if complete.
* `SibSp` and `Parch`: Combine into a new `FamilySize = SibSp + Parch + 1`. Family size may capture survival trends better than the individual components.

## 5) Remove Unnecessary Features
* `Cabin`: after extracting `HasCabin` feature.
* `Name`: We could consider extracting titles (`Mr.`, `Mrs.`, `Miss`, etc.) as a new feature. Titles may capture social status or age-related trends. For this iteration, we will drop the `Name` variable entirely without adding new features.
* `PassengerId`: purely an identifier
* `Ticket`: although there could potentially be useful patterns in the ticket prefixes, we will drop this column for this iteration since the data seem noisy.
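
Although `Name` is dropped in this iteration, the title idea mentioned above is easy to sketch for a later version. The example below assumes names follow a "Surname, Title. Given names" pattern; the three strings are only illustrations of that pattern, and the regular expression is one possible approach rather than a tested feature of this notebook.

```r
# Example strings in the "Surname, Title. Given names" pattern
passenger_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)

# Keep the text between the comma and the first period that follows it
titles <- sub("^[^,]*, *([^.]*)\\..*$", "\\1", passenger_names)
titles
```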


### Encode `Sex` as numeric factor

In [None]:
# DATA CLEANING AND PREPROCESSING
# 1) Encode categorical variables
# [X] Encode Sex as numeric factor
train$Sex <- as.factor(ifelse(train$Sex == "male", 1, 0)) # v2.2 added as.factor() to coerce output
test$Sex <- as.factor(ifelse(test$Sex == "male", 1, 0))
head(train[, "Sex"]) #--encoded successfully
head(test[, "Sex"]) #--encoded successfully

### Convert `Pclass` to an ordinal factor

In [None]:
# [X] Convert Pclass to an ordinal factor
train$Pclass <- factor(train$Pclass, levels = c(1, 2, 3), ordered = TRUE)
test$Pclass <- factor(test$Pclass, levels = c(1, 2, 3), ordered = TRUE)
head(train[, "Pclass"]) #--encoded successfully
head(test[, "Pclass"]) #--encoded successfully
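
One consequence of `ordered = TRUE` is worth knowing: formula interfaces that build their design matrix with `model.matrix()` expand ordered factors into polynomial contrasts rather than per-level dummy columns. The sketch below uses toy vectors with hypothetical names and assumes R's default contrast options.

```r
# Same values, once as an ordered factor and once as an unordered factor
pclass_ordered   <- factor(c(1, 2, 3, 3), levels = c(1, 2, 3), ordered = TRUE)
pclass_unordered <- factor(c(1, 2, 3, 3), levels = c(1, 2, 3))

# Ordered factor: linear (.L) and quadratic (.Q) polynomial terms
colnames(model.matrix(~ pclass_ordered))

# Unordered factor: one indicator column per non-reference level
colnames(model.matrix(~ pclass_unordered))
```

Either encoding can work with tree-based models; the point is only that the column names and meanings differ.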

### One-hot encode `Embarked`

In [None]:
# [X] Impute missing Embarked values, then one-hot encode
# Impute the 2 missing Embarked values with the mode BEFORE encoding,
# so that the dummy columns reflect the imputed port
train$Embarked[train$Embarked == ""] <- NA
embarked_mode <- names(which.max(table(train$Embarked))) # single most frequent value
train$Embarked[is.na(train$Embarked)] <- embarked_mode

# verify imputation
#describe(train$Embarked)

# Fix the factor levels so train and test produce the same three columns
embarked_levels <- c("C", "Q", "S")
train$Embarked <- factor(train$Embarked, levels = embarked_levels)
test$Embarked <- factor(test$Embarked, levels = embarked_levels)
embarked_train_one_hot <- model.matrix(~ Embarked - 1, data = train)
embarked_test_one_hot <- model.matrix(~ Embarked - 1, data = test)

# Add the one-hot encoded columns back to the dataset
train <- cbind(train, embarked_train_one_hot)
test <- cbind(test, embarked_test_one_hot)

# Verify encoding:
#head(train[, c("Embarked", "EmbarkedC", "EmbarkedQ", "EmbarkedS")])
#head(test[, c("Embarked", "EmbarkedC", "EmbarkedQ", "EmbarkedS")])

##v2.2 also want to explicitly cast the values in EmbarkedC, EmbarkedQ, and EmbarkedS as factors.
train$EmbarkedC <- as.factor(train$EmbarkedC)
test$EmbarkedC <- as.factor(test$EmbarkedC)
train$EmbarkedQ <- as.factor(train$EmbarkedQ)
test$EmbarkedQ <- as.factor(test$EmbarkedQ)
train$EmbarkedS <- as.factor(train$EmbarkedS)
test$EmbarkedS <- as.factor(test$EmbarkedS)

## SibSp and Parch should be integers
train$SibSp <- as.integer(train$SibSp)
test$SibSp <- as.integer(test$SibSp)
train$Parch <- as.integer(train$Parch)
test$Parch <- as.integer(test$Parch)
# Survived needs to be a factor
train$Survived <- as.factor(train$Survived)

# now drop the original Embarked column
train <- train %>% select(-Embarked)
test <- test %>% select(-Embarked)
str(train)
str(test)
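
Mode imputation followed by one-hot encoding can be sketched on a toy vector. The values and the names `embarked`, `mode_port`, and `one_hot` are hypothetical. Note that `which.max()` returns a single most frequent value, whereas sorting the whole frequency table would return every level.

```r
# Toy Embarked column (hypothetical values) with one empty string
embarked <- c("S", "C", "", "S", "Q")

# Treat empty strings as missing, then fill them with the most frequent port
embarked[embarked == ""] <- NA
mode_port <- names(which.max(table(embarked)))  # a single value, here "S"
embarked[is.na(embarked)] <- mode_port

# Fixing the levels guarantees the same dummy columns in every data set
embarked <- factor(embarked, levels = c("C", "Q", "S"))
one_hot <- model.matrix(~ embarked - 1)
colnames(one_hot)
```

Every row of `one_hot` contains exactly one 1, because no value is missing by the time the encoding runs.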

### Log Transform `Fare`

In [None]:
# 2) Apply log transformation to Fare
#--plot shape before transformation?
ggplot(train, aes(x = Fare)) +
  geom_histogram(bins=20) +
  theme_minimal() +
  ggtitle("Fare (before transforming)")

#--note an extreme outlier over 500!
train$Fare <- log(train$Fare + 1)
test$Fare <- log(test$Fare + 1)
head(train[, "Fare"])
head(test[, "Fare"])

ggplot(train, aes(x = Fare)) +
  geom_histogram(bins=20) +
  theme_minimal() +
  ggtitle("Log Transformed Fare")
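
The `+ 1` inside the logarithm keeps a fare of zero finite, since `log(0)` is `-Inf`. Base R's `log1p()` computes the same quantity and `expm1()` inverts it, as this small sketch with example fares shows.

```r
fare <- c(0, 7.25, 71.2833, 512.3292)  # example fares, including a zero

log_fare <- log1p(fare)                # same as log(fare + 1)
all.equal(log_fare, log(fare + 1))

expm1(log_fare)                        # inverse transform, recovers the fares
```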

### Use a random forest model to impute missing ages

After cleaning and transforming the rest of the data, I then trained a random forest model to impute missing Age values, with predictors: Pclass, Sex, SibSp, Parch, Fare, EmbarkedC, EmbarkedQ, and EmbarkedS.

In [None]:
# 3) Address missing values
# Age - Train
#--Predict missing ages using other features
train_age_data <- train %>% 
    select(Age, Pclass, Sex, SibSp, Parch, Fare, EmbarkedC, EmbarkedQ, EmbarkedS)

# head(train[, c("Age", "Pclass", "Sex", "SibSp", "Parch", "Fare", "EmbarkedC", "EmbarkedQ", "EmbarkedS")])
#--verified that all these columns are formatted properly

train_age_complete <- train_age_data %>% filter(!is.na(Age))
train_age_missing <- train_age_data %>% filter(is.na(Age))

set.seed(666)
cv_control <- trainControl(method = "cv", number = 10) #v2.2 10-fold cross-validation for imputing missing ages
train_age_cv_model <- train(
  Age ~ Pclass + Sex + SibSp + Parch + Fare + EmbarkedC + EmbarkedQ + EmbarkedS,
  data = train_age_complete,
  method = "rf",
  trControl = cv_control,
  tuneLength = 3
)
print(train_age_cv_model)

The R-squared on the age imputation for v2.2 shows a clear improvement, explaining roughly 31% of the variation versus 27% in v2.0.
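
To make the cross-validated R-squared less of a black box, here is a minimal sketch of 10-fold cross-validation written by hand. It runs on synthetic data and swaps in a linear model for the random forest to stay dependency-free, so the numbers are not comparable to the ones above. R-squared is computed per fold as the squared correlation between predictions and held-out values, which is my understanding of how `caret` reports it by default.

```r
set.seed(42)
# Synthetic stand-in data (not the Titanic files)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$age <- 30 + 5 * toy$x1 - 3 * toy$x2 + rnorm(n, sd = 4)

k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold assignment

fold_rsq <- sapply(seq_len(k), function(i) {
  fit  <- lm(age ~ x1 + x2, data = toy[fold_id != i, ])  # fit on k-1 folds
  pred <- predict(fit, newdata = toy[fold_id == i, ])    # predict the held-out fold
  cor(pred, toy$age[fold_id == i])^2                     # squared correlation
})

mean(fold_rsq)  # cross-validated R-squared, averaged over folds
```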

In [None]:
# Use the best model to predict missing ages
predicted_train_ages <- predict(train_age_cv_model, newdata = train_age_missing)

# Impute the predicted ages back into the train dataset
train$Age[is.na(train$Age)] <- predicted_train_ages
describe(train$Age)

In [None]:
#--Age in test data
# Preprocess the test data for Age imputation
test_age_data <- test %>% 
  select(Age, Pclass, Sex, SibSp, Parch, Fare, EmbarkedC, EmbarkedQ, EmbarkedS)

test_age_missing <- test_age_data %>% filter(is.na(Age))
test_age_complete <- test_age_data %>% filter(!is.na(Age))

# Use the trained train_age_cv_model to predict missing ages in the test dataset
predicted_test_ages <- predict(train_age_cv_model, newdata = test_age_missing)

# Impute the predicted ages back into the test dataset
test$Age[is.na(test$Age)] <- predicted_test_ages

n_miss(test$Age)
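
The imputation step relies on predictions coming back in the same row order as the rows selected by `is.na()`. A toy sketch of that pattern follows, using made-up values and a linear model as a stand-in for the random forest.

```r
# Toy data (hypothetical values): two ages are missing
toy <- data.frame(
  age  = c(20, NA, 40, 60, NA, 30),
  fare = c(10, 20, 30, 50, 40, 20)
)

missing_age <- is.na(toy$age)

# Stand-in model: a linear fit on the complete rows
fit <- lm(age ~ fare, data = toy[!missing_age, ])

# Predictions are returned in row order, so they line up with the TRUE
# positions of the logical mask
toy$age[missing_age] <- predict(fit, newdata = toy[missing_age, ])

sum(is.na(toy$age))  # 0
```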

In [None]:
# Create HasCabin feature
# any_na(train$Cabin) # returns FALSE
# describe(train$Cabin) # 687 missing - need to replace empty string values

# Convert empty strings to NA in Cabin
train$Cabin[train$Cabin == ""] <- NA
test$Cabin[test$Cabin == ""] <- NA

# n_miss(train$Cabin)
# n_miss(test$Cabin)

# Encode the HasCabin variable:
train$HasCabin <- ifelse(!is.na(train$Cabin), 1, 0)
test$HasCabin <- ifelse(!is.na(test$Cabin), 1, 0)

# describe(train$HasCabin) # - perfect
head(train[, c("Cabin", "HasCabin")])  #looks good
head(test[, c("Cabin", "HasCabin")]) 

n_miss(train$HasCabin)
n_miss(test$HasCabin)
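
An alternative to replacing empty strings after loading is to declare them as missing when the file is read. The sketch below uses inline CSV text with illustrative values in place of the real file; `na.strings` is a standard `read.csv()` argument.

```r
# Inline CSV text standing in for the real file (values are illustrative)
csv_text <- paste(
  "PassengerId,Cabin,Embarked",
  "1,,S",
  "2,C85,C",
  "3,,",
  sep = "\n"
)

toy <- read.csv(text = csv_text, na.strings = c("", "NA"), stringsAsFactors = FALSE)

is.na(toy$Cabin)                        # blank fields were read as NA
toy$HasCabin <- as.integer(!is.na(toy$Cabin))
toy$HasCabin
```

With this approach the blank `Embarked` entries would also arrive as `NA`, so a single `is.na()` check covers both columns.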

In [None]:
# Create the FamilySize feature
train$FamilySize <- as.integer(train$SibSp + train$Parch + 1)
test$FamilySize <- as.integer(test$SibSp + test$Parch + 1)

# Inspect the new feature
head(train[, "FamilySize"])
head(test[, "FamilySize"])

# describe(train)
# describe(test)
#--test still has 1 missing fare - impute with the median
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm = TRUE)
describe(test)


# Random Forest Model

In [None]:
# Data preprocessing is now complete and we are ready to model 
# the `Survived` variable for the `test` dataset!

# Drop Cabin
train <- train %>% select(-Cabin)
test <- test %>% select(-Cabin)

# Train the random forest model
rf_cv_control <- trainControl(method = "cv", number = 10)
set.seed(666)
rf_model <- train(
  Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + EmbarkedC + EmbarkedQ + EmbarkedS + HasCabin + FamilySize, 
  data = train,
  method = "rf",
  trControl = rf_cv_control,
  tuneLength = 5
)

# Print the cross-validation results
print(rf_model)

In [None]:
# Use the trained model to predict Survived in the test dataset
test$Survived <- predict(rf_model, newdata = test)

table(test$Survived)

In [None]:
# Save the updated test dataset with predictions
gender_submission <- test %>% select(PassengerId, Survived)
head(gender_submission)
write.csv(gender_submission, "submission.csv", row.names = FALSE)
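
Before uploading, it can help to read the file back and confirm it has the expected shape. This is a minimal sketch on a toy data frame with illustrative ids, written to a temporary file rather than the real `submission.csv`.

```r
# Toy predictions (hypothetical): a factor with levels "0" and "1",
# the kind of object predict() returns for a classification model
submission <- data.frame(
  PassengerId = 892:896,
  Survived    = factor(c("0", "1", "0", "0", "1"), levels = c("0", "1"))
)

path <- tempfile(fileext = ".csv")
write.csv(submission, path, row.names = FALSE)

# Read the file back and check the two-column, 0/1 format
check <- read.csv(path)
stopifnot(
  identical(names(check), c("PassengerId", "Survived")),
  all(check$Survived %in% c(0, 1)),
  !anyNA(check)
)
```

If the factor ever needs to become a number inside R, `as.integer(as.character(x))` preserves the 0/1 labels, whereas `as.integer(x)` alone returns the internal level codes 1 and 2.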