# Titanic - Machine Learning from Disaster

The goal is to use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

## Data

The data and feature explanations can be found at: https://www.kaggle.com/competitions/titanic/data

## Submission

We'll run our model on the `test.csv` file and write the predictions to a `submission.csv` file with 2 columns: `PassengerId` and `Survived`.

- `gender_submission.csv` contains a set of predictions that assumes all and only female passengers survived, and is an example of what the submission file should look like.
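As a rough illustration of that baseline, the "all and only female passengers survived" rule can be written in a few lines. The small `passengers` frame below is invented data standing in for `test.csv`; only the column names come from the competition.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of `test.csv` (values are invented).
passengers = pd.DataFrame({
  "PassengerId": [892, 893, 894],
  "Sex": ["male", "female", "male"],
})

# Baseline rule: predict 1 (survived) for females and 0 for everyone else.
baseline = pd.DataFrame({
  "PassengerId": passengers["PassengerId"],
  "Survived": (passengers["Sex"] == "female").astype(int),
})
print(baseline)
```

Any model we build should at least be compared against a simple rule like this one.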

We'll make a simple function to create the `submission.csv` file from the model and features we've decided on.

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Import data.
titanic_data = pd.read_csv("./train.csv")
# Select target value.
X = titanic_data.copy()
y = X.pop("Survived")

def create_submission(model, transformer):
  """Create a `submission.csv` file from a model trained on a set of features.

  Parameters
  ----------
  model : Model
    Fitted model we want to test.
  transformer : callable
    Function that transforms the test data, returning a DataFrame with only the columns used.
  """
  test_data = pd.read_csv("./test.csv")
  # Create predictions.
  predictions = model.predict(transformer(test_data))
  # Export predictions.
  output_df = pd.DataFrame({ "PassengerId": test_data.PassengerId, "Survived": predictions })
  output_df.to_csv("./submission.csv", index=False)

# Initial Model

For our initial model, we'll look at some features that intuitively may correspond to the result.

- `Sex`: Society at the time was largely male-dominated, so we might expect males to have had a higher chance of survival than females.
- `Age`: We might expect children and older people to have been left behind.
- `Fare`: This is a numeric variable and should correlate with the `Pclass` and `Cabin` variables.

As for the model, we're using `RandomForestClassifier` rather than `RandomForestRegressor`, since we're classifying whether a person survived or not.

In [2]:
# The list of features we want to use.
feats_1 = ["Sex", "Age", "Fare"]
# Select columns corresponding to features.
X_1 = X[feats_1]

X_1.describe(include="all")

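| | Sex | Age | Fare |
|---|---|---|---|
| count | 891 | 714.0 | 891.0 |
| unique | 2 | | |
| top | male | | |
| freq | 577 | | |
| mean | | 29.699118 | 32.204208 |
| std | | 14.526497 | 49.693429 |
| min | | 0.42 | 0.0 |
| 25% | | 20.125 | 7.9104 |
| 50% | | 28.0 | 14.4542 |
| 75% | | 38.0 | 31.0 |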


From the summary statistics, we see that some entries are missing an `Age` value (only 714 of 891 are present). However, [as of version 1.4, `scikit-learn`'s random forests support missing values](https://scikit-learn.org/dev/whats_new/v1.4.html#id7), so we can leave them as they are. In addition, since we're using a `RandomForestClassifier`, the `Sex` column needs to contain numeric values, or alternatively we can create features based on the values in the column. This can be done with `get_dummies()`, which creates boolean `Sex_female` and `Sex_male` columns.
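To make the encoding step concrete, here's a minimal sketch of what `get_dummies()` does to a string column. The rows are made up; only the column names mirror the Titanic data.

```python
import pandas as pd

# Made-up rows; only the column names mirror the Titanic data.
sample = pd.DataFrame({
  "Sex": ["male", "female", "female"],
  "Age": [22.0, 38.0, None],
})

encoded = pd.get_dummies(sample)
# Numeric columns pass through unchanged (missing values included);
# `Sex` is replaced by one indicator column per category.
print(encoded.columns.tolist())
```

Note that the numeric columns come first in the result, followed by the new indicator columns.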

In [3]:
# Convert unique string values in columns to boolean columns.
X_1 = pd.get_dummies(X_1)

X_1.describe(include="all")

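| | Age | Fare | Sex_female | Sex_male |
|---|---|---|---|---|
| count | 714.0 | 891.0 | 891 | 891 |
| unique | | | 2 | 2 |
| top | | | False | True |
| freq | | | 577 | 577 |
| mean | 29.699118 | 32.204208 | | |
| std | 14.526497 | 49.693429 | | |
| min | 0.42 | 0.0 | | |
| 25% | 20.125 | 7.9104 | | |
| 50% | 28.0 | 14.4542 | | |
| 75% | 38.0 | 31.0 | | |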


Now we can create our initial model design and initial submission.

In [4]:
create_submission(
  model=RandomForestClassifier(random_state=1).fit(X_1, y),
  transformer=lambda data: pd.get_dummies(data[feats_1])
  )

On submission, we received a score of `0.73684`, which isn't bad for a first attempt (a score of `1` would mean every prediction was correct).

# 2nd Model

This time, we'll add the `Pclass` feature, which indicates the socioeconomic status of the passenger, and see if anything changes. This feature might add useful context: passengers of a higher status may have had priority when boarding the lifeboats.

In [5]:
feats_2 = ["Pclass", "Sex", "Age", "Fare"]
X_2 = pd.get_dummies(X[feats_2])

create_submission(
  model=RandomForestClassifier(random_state=1).fit(X_2, y),
  transformer=lambda data: pd.get_dummies(data[feats_2])
  )

On submission, we received a score of `0.74162`, which is slightly better than what we had prior.

# 3rd Model

Although we learned that random forests tend to predict well with their default parameters, unlike single decision trees (i.e., we don't need to specify a maximum depth), it's worth seeing what happens if we do limit the tree depth.

In the [`Titanic Tutorial`](https://www.kaggle.com/code/alexisbcook/titanic-tutorial), the trees are limited to a depth of 5, so we'll do that as well.

In [6]:
create_submission(
  model=RandomForestClassifier(max_depth=5, random_state=1).fit(X_2, y),
  transformer=lambda data: pd.get_dummies(data[feats_2])
  )

On submission, we received a score of `0.77751`, which is a substantial improvement.

Since we saw an improvement when specifying a `max_depth` of 5, is there another value that gives an even better score? Let's try something larger, like 7.

In [7]:
create_submission(
  model=RandomForestClassifier(max_depth=7, random_state=1).fit(X_2, y),
  transformer=lambda data: pd.get_dummies(data[feats_2])
  )

On submission, we received a score of `0.78229`, which is a slight improvement.

# 4th Model

One last idea I have that may improve the score (without changing the model used) is to take the `Embarked` feature into account. This might contribute to the final result: passengers who embarked at the first port may have gotten a better cabin, closer to the lifeboats.
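One caveat with a multi-valued column like `Embarked`: calling `get_dummies()` separately on the training and test data only yields matching columns if both contain the same set of values. The sketch below, on invented rows, shows one possible way to guard against a mismatch by reindexing the test encoding to the training columns. It's an optional safeguard, not something the cells in this notebook do.

```python
import pandas as pd

# Invented rows: the "test" frame happens to contain no "Q" value.
train = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Embarked": ["S", "C", "S"]})

train_enc = pd.get_dummies(train)
# Reindex the test encoding to the training columns: categories missing from
# the test rows become all-zero columns, and unseen extras would be dropped.
test_enc = pd.get_dummies(test).reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())
```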

In [8]:
feats_4 = ["Pclass", "Sex", "Age", "Fare", "Embarked"]
X_4 = pd.get_dummies(X[feats_4])

create_submission(
  model=RandomForestClassifier(max_depth=7, random_state=1).fit(X_4, y),
  transformer=lambda data: pd.get_dummies(data[feats_4])
  )

On submission, we received a score of `0.78468`, which is a slight improvement. Back to the topic of `max_depth`: since we changed our features, the value of `max_depth` that maximizes our accuracy score might change too. Let's go back to using a `max_depth` of 5 and see if there's any difference.

In [9]:
create_submission(
  model=RandomForestClassifier(max_depth=5, random_state=1).fit(X_4, y),
  transformer=lambda data: pd.get_dummies(data[feats_4])
  )

With this change, we obtained a score of `0.79186`, which suggests that yes, the best hyperparameter values depend on the features selected.
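Rather than spending a Kaggle submission on every candidate value, one option is to compare depths locally with cross-validation. The following is only a sketch on synthetic data (`X_demo` and `y_demo` are made up, not the Titanic set), and the candidate depths are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the Titanic set): 200 rows, 4 numeric features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Mean 5-fold cross-validated accuracy for each candidate depth.
cv_scores = {}
for depth in [3, 5, 7, None]:
  model = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=1)
  cv_scores[depth] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()

best_depth = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_depth)
```

The local scores won't match the leaderboard exactly, but they give a cheaper way to rank settings whenever the feature set changes.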

# 5th Model

After finishing Kaggle Learn's ["Intermediate Machine Learning"](https://www.kaggle.com/learn/intermediate-machine-learning) course, let's see if using more advanced techniques such as XGBoost can help improve our accuracy score. In addition, we'll utilize pipelines to clean up the code.

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

# Split data to train & validation sets.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

# Create preprocessing logic for numeric & categorical features used.
preprocessor_5 = ColumnTransformer(transformers=[
  ("num", SimpleImputer(strategy="median"), ["Pclass", "Age", "Fare"]),
  ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"])
])

X_train_transform = preprocessor_5.fit_transform(X_train)
X_valid_transform = preprocessor_5.transform(X_valid)

# Use XGBoost to get the best results.
xg_model = XGBClassifier(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5, random_state=1)
xg_model.fit(X_train_transform, y_train, eval_set=[(X_valid_transform, y_valid)])

# Create submission.
create_submission(model=xg_model, transformer=preprocessor_5.transform)

[0]	validation_0-logloss:0.66584
[1]	validation_0-logloss:0.64601
[2]	validation_0-logloss:0.62825
[3]	validation_0-logloss:0.61281
[4]	validation_0-logloss:0.59845
[5]	validation_0-logloss:0.58568
[6]	validation_0-logloss:0.57368
[7]	validation_0-logloss:0.56346
[8]	validation_0-logloss:0.55366
[9]	validation_0-logloss:0.54473
[10]	validation_0-logloss:0.53561
[11]	validation_0-logloss:0.52840
[12]	validation_0-logloss:0.52113
[13]	validation_0-logloss:0.51492
[14]	validation_0-logloss:0.50910
[15]	validation_0-logloss:0.50412
[16]	validation_0-logloss:0.49929
[17]	validation_0-logloss:0.49519
[18]	validation_0-logloss:0.49155
[19]	validation_0-logloss:0.48835
[20]	validation_0-logloss:0.48499
[21]	validation_0-logloss:0.48297
[22]	validation_0-logloss:0.48023
[23]	validation_0-logloss:0.47835
[24]	validation_0-logloss:0.47641
[25]	validation_0-logloss:0.47435
[26]	validation_0-logloss:0.47229
[27]	validation_0-logloss:0.47127
[28]	validation_0-logloss:0.46962
[29]	validation_0-loglos

With this more "complex" model, we obtained a score of `0.78708`, which is slightly worse than before. Some fine-tuning may be needed to get the best results, so let's try a few parameter values.
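For reference, the preprocessing and the model can also be bundled into a single scikit-learn `Pipeline`. The sketch below uses invented rows and swaps in a `RandomForestClassifier` so it doesn't depend on `xgboost`; the `demo` and `labels` names are hypothetical. As far as I can tell, XGBoost's `eval_set` would still need to be transformed separately for early stopping, since the pipeline only transforms the data passed to `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented rows that only mimic the Titanic column names.
demo = pd.DataFrame({
  "Pclass": [1, 3, 2, 3, 1, 2],
  "Age": [38.0, np.nan, 27.0, 4.0, 54.0, np.nan],
  "Fare": [71.3, 7.9, 13.0, 16.7, 51.9, 26.0],
  "Sex": ["female", "male", "female", "female", "male", "male"],
  "Embarked": ["C", "S", "S", "S", "S", "Q"],
})
labels = pd.Series([1, 0, 1, 1, 0, 0])

preprocess = ColumnTransformer(transformers=[
  ("num", SimpleImputer(strategy="median"), ["Pclass", "Age", "Fare"]),
  ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
])

# The pipeline fits the preprocessing and the classifier together, and reuses
# the already-fitted preprocessing automatically at prediction time.
pipeline = Pipeline(steps=[
  ("preprocess", preprocess),
  ("model", RandomForestClassifier(n_estimators=20, random_state=1)),
])
pipeline.fit(demo, labels)
demo_predictions = pipeline.predict(demo)
print(demo_predictions)
```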

In [11]:
from sklearn.metrics import mean_absolute_error

def use_XGB(strategy, learning_rate):
  # Create preprocessing logic for numeric & categorical features used.
  preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy=strategy), ["Pclass", "Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"])
  ])

  X_train_transform = preprocessor.fit_transform(X_train)
  X_valid_transform = preprocessor.transform(X_valid)

  # Use XGBoost to get the best results.
  xg_model = XGBClassifier(n_estimators=1000, learning_rate=learning_rate, early_stopping_rounds=5, random_state=1)
  xg_model.fit(X_train_transform, y_train, eval_set=[(X_valid_transform, y_valid)], verbose=False)
  # Validation error with these parameters (for 0/1 labels, MAE is the misclassification rate).
  valid_predictions = xg_model.predict(X_valid_transform)
  mae = mean_absolute_error(y_valid, valid_predictions)
  print("Obtained MAE score of `{}` from parameters: `strategy={}`, `learning_rate={}`".format(mae, strategy, learning_rate))

# Parameter values to try with XGB.
all_strategies = ["mean", "median", "most_frequent", "constant"]
all_learning_rates = [0.01, 0.03, 0.05, 0.07, 0.1, 0.25, 0.5]

# for strategy in all_strategies:
#   for learning_rate in all_learning_rates:
#     use_XGB(strategy, learning_rate)

# for learning_rate in [i * 0.01 for i in range(1, 100)]:
#   use_XGB("median", learning_rate)

After testing XGB models with different parameter values using the function above, the best model turned out to be essentially the one we used previously. This suggests that we may need to do some feature engineering to receive a higher score.
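A quick note on the metric: the helper above reports mean absolute error, and for 0/1 labels that is simply the fraction of wrong predictions, i.e. `1 - accuracy`. The tiny example below uses invented labels to show the relationship.

```python
import numpy as np

# Invented labels and predictions, both restricted to 0/1.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

mae = np.mean(np.abs(y_true - y_pred))  # fraction of wrong predictions
accuracy = np.mean(y_true == y_pred)    # fraction of correct predictions
print(mae, accuracy)
```

So a lower MAE here corresponds directly to a higher accuracy score.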

# 6th Model

With this model, we plan on doing some feature engineering in hopes of obtaining a better score. With that in mind, we'll go back to using the Random Forest model instead of our XGB model as our best score did come from a Random Forest.

Some potential feature ideas that I had include:
- Binning the `Age` & `Fare` categories.
- Creating a new feature based on the 1st letter of `Cabin`.

#### `Cabin_Level` Feature

This feature is derived from the existing `Cabin` feature and indicates the "level" of the Titanic the passenger's cabin is on. According to diagrams of the Titanic, the levels run from `A` at the top to `G` at the bottom. We do notice that there's a rogue `T` entry, which we'll replace with `nan` afterwards.

Although the `Cabin` feature is populated for a little under 1/4 of the training data, it may still carry important information, especially since the dataset is small.
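The extraction described above can be sketched as follows. The cabin values are invented examples, and `valid_levels` is a name chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Invented cabin values, including a missing entry and the stray "T" level.
cabins = pd.Series(["C85", np.nan, "E46", "T", "G6"])
valid_levels = ["A", "B", "C", "D", "E", "F", "G"]

# First character of the cabin string; missing cabins stay missing.
levels = cabins.str[0]
# Keep only recognised level letters and blank out everything else.
levels = levels.where(levels.isin(valid_levels))
print(levels.tolist())
```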

#### `Fare` Binning

One problem with binning `Fare` is that its distribution is heavily skewed, so evenly spaced bin boundaries may not work well.

```python
import seaborn as sns
sns.histplot(data=X, x="Fare")
```

Something else we could do with the `Fare` feature is standardize it using `StandardScaler`. From the distribution, we see a very wide range of values, from 0 to 512, so the scale is way off compared to the other numeric features. For scale-sensitive models this is a problem, as the model may weigh this feature too heavily. Standardizing rescales the values to zero mean and unit variance, making them more comparable. (Tree-based models such as random forests split on thresholds, so they are much less sensitive to feature scale.)
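As a small sketch of what standardization does, each value `x` becomes `(x - mean) / std`. The fares below are invented, and the example follows the common convention of learning the mean and standard deviation from the training data only, then reusing them for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented fares; the large value mimics the long right tail of `Fare`.
train_fare = np.array([[7.25], [8.05], [13.0], [26.0], [512.33]])
test_fare = np.array([[7.9], [30.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training fares only...
train_scaled = scaler.fit_transform(train_fare)
# ...and reuse those same statistics for the test fares.
test_scaled = scaler.transform(test_fare)
print(train_scaled.ravel(), test_scaled.ravel())
```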

#### `Age` Binning

`Age` can be binned manually fairly easily, since there's a clear way to do it: we can use "common age ranges" from that time period. We specifically pick ranges suited to that period because average life expectancy then and now are very different (late 40s vs late 70s). Our age ranges will look something like:
- 0-9
- 10-15
- 16-25
- 26-45
- 46+
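As an alternative to a hand-written function, pandas' `pd.cut` can express the same ranges. This is only a sketch with invented ages; the bin labels 1 to 5 are an arbitrary choice for the illustration. One difference worth knowing is that `pd.cut` leaves missing ages missing instead of assigning them to a bin.

```python
import numpy as np
import pandas as pd

# Invented ages, including a missing value.
ages = pd.Series([0.42, 9.0, 12.0, 22.0, 45.0, 45.5, 80.0, np.nan])

# Right-inclusive bins: (-inf, 9], (9, 15], (15, 25], (25, 45], (45, inf).
age_bins = pd.cut(
  ages,
  bins=[-np.inf, 9, 15, 25, 45, np.inf],
  labels=[1, 2, 3, 4, 5],
)
print(age_bins.tolist())
```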

In [12]:
from sklearn.preprocessing import StandardScaler

feats_6 = ["Pclass", "Sex", "Age_binned", "Fare_normal", "Embarked", "Cabin_Level"]

cabin_levels = ["A", "B", "C", "D", "E", "F", "G"]
cabin_levels_cols = [f"Cabin_Level_{letter}" for letter in cabin_levels]

# Bin "Age".
def bin_age(age):
  if age > 45:
    return 5
  elif age > 25:
    return 4
  elif age > 15:
    return 3
  elif age > 9:
    return 2
  return 1

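def transform_6(data):
  # Work on a copy so the caller's DataFrame is left untouched.
  data = data.copy()
  # Note: a missing age (NaN) fails every comparison in `bin_age`, so it lands in bin 1.
  data["Age_binned"] = data["Age"].map(bin_age)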
  # Ensure we have all the valid cabin levels.
  data.loc[:, cabin_levels_cols] = 0
  data["Cabin_Level"] = data["Cabin"].str[0]
  # Remove invalid cabin level values.
  data.loc[~data["Cabin_Level"].isin(cabin_levels), "Cabin_Level"] = np.nan
  # Standardize fare.
  scaler = StandardScaler()
  data["Fare_normal"] = scaler.fit_transform(data[["Fare"]])
  return pd.get_dummies(data[feats_6])

create_submission(
  model=RandomForestClassifier(max_depth=5, random_state=1).fit(transform_6(X), y),
  transformer=lambda data: transform_6(data)
  )

With this feature engineering applied, we obtained a score of `0.80143`, which is a great improvement as we broke through the 80% accuracy point.