# Titanic - Machine Learning from Disaster

The goal is to use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

## Data

The data and feature explanations can be found at: https://www.kaggle.com/competitions/titanic/data

## Submission

We'll test our model on the `test.csv` file and write the predictions to a `submission.csv` file with 2 columns: `PassengerId` and `Survived`.

- `gender_submission.csv` contains a set of predictions that assumes all female passengers, and only female passengers, survived. It is an example of what the submission file should look like.
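As a rough illustration of the rule behind `gender_submission.csv`, the sketch below builds the same kind of two-column output from a few made-up rows. The `test_data` frame and its values are hypothetical stand-ins for `test.csv`, not real competition data.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of `test.csv` (values are made up).
test_data = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})

# Baseline rule: predict 1 (survived) for females and 0 for everyone else.
baseline = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": (test_data["Sex"] == "female").astype(int),
})

# Print in the same CSV layout a submission file would use.
print(baseline.to_csv(index=False))
```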

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Import data.
titanic_data = pd.read_csv("./train.csv")
# Select target value.
y = titanic_data.Survived

In addition, we'll make a simple function to create the `submission.csv` file from the model and features we've decided on.

In [2]:
def create_submission(model, usedFeatures):
  """
  Create a `submission.csv` file from the model created from a set of features.
  `pd.get_dummies()` will automatically be applied to the test data from `test.csv`.
  """
  test_data = pd.read_csv("./test.csv")
  test_X = pd.get_dummies(test_data[usedFeatures])
  # Create predictions.
  predictions = model.predict(test_X)
  # Export predictions.
  output_df = pd.DataFrame({ "PassengerId": test_data.PassengerId, "Survived": predictions })
  output_df.to_csv("./submission.csv", index=False)
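Note that `create_submission` encodes the test data independently of the training data, which relies on both files producing the same dummy columns. If a category were missing from one of them, the columns would no longer line up. One possible safeguard (not part of the original notebook) is to reindex the test columns to match the training columns. The frames below are made-up examples:

```python
import pandas as pd

# Made-up frames: the "test" frame lacks the category "C" that "train" has.
train = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Age": [34.5, 47.0], "Embarked": ["Q", "S"]})

train_X = pd.get_dummies(train)
# Reuse the training columns; dummy columns absent from the test data are filled with 0.
test_X = pd.get_dummies(test).reindex(columns=train_X.columns, fill_value=0)

print(list(test_X.columns))
```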

# Initial Model

For our initial model, we'll look at some features that intuitively may correlate with survival.

- `Sex`: Since society at the time was mostly male-dominated, we might expect males to have had a higher survival chance than females.
- `Age`: We might expect children and older people to have been left behind.
- `Fare`: This is a numeric variable and should correlate with the `Pclass` and `Cabin` variables.

As for the model, we're using `RandomForestClassifier` rather than `RandomForestRegressor`, since we're classifying whether a person survived or not.

In [3]:
# The list of features we want to use.
features = ["Sex", "Age", "Fare"]

# Select columns corresponding to features.
X = titanic_data[features]
X.describe(include="all")

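| | Sex | Age | Fare |
| --- | --- | --- | --- |
| count | 891 | 714.0 | 891.0 |
| unique | 2 | | |
| top | male | | |
| freq | 577 | | |
| mean | | 29.699118 | 32.204208 |
| std | | 14.526497 | 49.693429 |
| min | | 0.42 | 0.0 |
| 25% | | 20.125 | 7.9104 |
| 50% | | 28.0 | 14.4542 |
| 75% | | 38.0 | 31.0 |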


From the summary statistics, we see that some entries are missing an `Age` value (only 714 of 891 are present). However, [since version 1.4, `scikit-learn` supports missing values in random forests](https://scikit-learn.org/dev/whats_new/v1.4.html#id7), so we can leave them as they are. In addition, since we're using a `RandomForestClassifier`, the `Sex` column must contain numeric values, or alternatively, we need to create features based on the values in the column. This can be done with `get_dummies()`, which creates boolean `Sex_female` and `Sex_male` columns.

In [4]:
# Convert unique string values in columns to boolean columns.
X = pd.get_dummies(X)

X.describe(include="all")

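| | Age | Fare | Sex_female | Sex_male |
| --- | --- | --- | --- | --- |
| count | 714.0 | 891.0 | 891 | 891 |
| unique | | | 2 | 2 |
| top | | | False | True |
| freq | | | 577 | 577 |
| mean | 29.699118 | 32.204208 | | |
| std | 14.526497 | 49.693429 | | |
| min | 0.42 | 0.0 | | |
| 25% | 20.125 | 7.9104 | | |
| 50% | 28.0 | 14.4542 | | |
| 75% | 38.0 | 31.0 | | |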


Now we can create our initial model design and initial submission.

In [5]:
# Create and train model.
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

On submission, we received a score of `0.73684`, which isn't bad for a first attempt (a score of `1` would mean every prediction is correct).
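The competition score is an accuracy: the fraction of passengers whose outcome was predicted correctly. As a small worked example with made-up labels (not real submission data):

```python
# Made-up labels, purely for illustration.
actual = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]

# Accuracy: fraction of predictions that match the true labels.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

print(accuracy)  # 4 of 5 predictions match
```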

# 2nd Model

This time, we'll add the `Pclass` feature, which indicates the socioeconomic status of the passenger, and see if anything changes. This feature might add some context: passengers of higher status might have had priority when boarding the lifeboats.

In [6]:
features = ["Pclass", "Sex", "Age", "Fare"]
X = pd.get_dummies(titanic_data[features])
# Create and train model.
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

On submission, we received a score of `0.74162`, which is slightly better than before.

# 3rd Model

Although we learned that random forests make good predictions with the default parameters compared to decision trees (i.e., we don't need to specify a max depth value), it might be helpful to see what happens if we do limit the tree depth.

In the [`Titanic Tutorial`](https://www.kaggle.com/code/alexisbcook/titanic-tutorial), the trees are limited to a depth of 5, so we'll do that as well.

In [7]:
# Create and train model.
rf_model = RandomForestClassifier(max_depth=5, random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

On submission, we received a score of `0.77751`, which is a substantial improvement.

# 4th Model

Since we saw an improvement when using a `max_depth` of 5, is there a value that gives an even better result? Let's try something larger, like 7.

In [8]:
# Create and train model.
rf_model = RandomForestClassifier(max_depth=7, random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

On submission, we received a score of `0.78229`, which is a slight improvement.

# 5th Model

One last idea that may improve the score (without changing the model used) is to take into account the `Embarked` feature. This might contribute to the final result because passengers who embarked at the first port may have gotten a better cabin that's closer to the lifeboats.

In [9]:
features = ["Pclass", "Sex", "Age", "Fare", "Embarked"]
X = pd.get_dummies(titanic_data[features])
# Create and train model.
rf_model = RandomForestClassifier(max_depth=7, random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

On submission, we received a score of `0.78468`, which is a slight improvement. Back to the topic of `max_depth`: since we changed our features, the value of `max_depth` that maximizes our accuracy score might have changed too. Let's go back to using a `max_depth` of 5 and see if there's any difference.

In [10]:
# Create and train model.
rf_model = RandomForestClassifier(max_depth=5, random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

With this change, we obtained a score of `0.79186`, which indicates that yes, the best hyperparameter values depend on the features selected.
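Rather than submitting once per `max_depth` value, the candidates could also be compared locally with cross-validation. The sketch below is one way to do that, not the method used in this notebook. It runs on synthetic data from `make_classification` so that it is self-contained; `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption); the notebook would use `X` and `y`.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

cv_scores = {}
for depth in [3, 5, 7, None]:
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=1)
    # Mean accuracy over 5 cross-validation folds.
    cv_scores[depth] = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="accuracy"
    ).mean()

for depth, score in cv_scores.items():
    print(f"max_depth={depth}: mean CV accuracy = {score:.3f}")
```

Cross-validated accuracy will not match the leaderboard score exactly, but it gives a cheaper way to rank hyperparameter values before submitting.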

# 6th Model

After finishing Kaggle Learn's ["Intermediate Machine Learning"](https://www.kaggle.com/learn/intermediate-machine-learning) course, let's see if using more advanced techniques such as XGBoost can help improve our accuracy score. In addition, we'll utilize pipelines to clean up the code.

In [11]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

# Reset the information since we converted `Sex` to dummy features earlier on.
X = titanic_data[features]

# Split data to train & validation sets.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

# Create preprocessing logic for numeric & categorical features used.
preprocessor = ColumnTransformer(transformers=[
  ("num", SimpleImputer(strategy="median"), ["Pclass", "Age", "Fare"]),
  ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"])
])

X_train_transform = preprocessor.fit_transform(X_train)
X_valid_transform = preprocessor.transform(X_valid)

# Use XGBoost to get the best results.
xg_model = XGBClassifier(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5, random_state=1)
xg_model.fit(X_train_transform, y_train, eval_set=[(X_valid_transform, y_valid)])

# Predict results.
test_data = pd.read_csv("./test.csv")
predictions = xg_model.predict(preprocessor.transform(test_data))
# Export predictions.
output_df = pd.DataFrame({ "PassengerId": test_data.PassengerId, "Survived": predictions })
output_df.to_csv("./submission.csv", index=False)

[0]	validation_0-logloss:0.66584
[1]	validation_0-logloss:0.64601
[2]	validation_0-logloss:0.62825
[3]	validation_0-logloss:0.61281
[4]	validation_0-logloss:0.59845
[5]	validation_0-logloss:0.58568
[6]	validation_0-logloss:0.57368
[7]	validation_0-logloss:0.56346
[8]	validation_0-logloss:0.55366
[9]	validation_0-logloss:0.54473
[10]	validation_0-logloss:0.53561
[11]	validation_0-logloss:0.52840
[12]	validation_0-logloss:0.52113
[13]	validation_0-logloss:0.51492
[14]	validation_0-logloss:0.50910
[15]	validation_0-logloss:0.50412
[16]	validation_0-logloss:0.49929
[17]	validation_0-logloss:0.49519
[18]	validation_0-logloss:0.49155
[19]	validation_0-logloss:0.48835
[20]	validation_0-logloss:0.48499
[21]	validation_0-logloss:0.48297
[22]	validation_0-logloss:0.48023
[23]	validation_0-logloss:0.47835
[24]	validation_0-logloss:0.47641
[25]	validation_0-logloss:0.47435
[26]	validation_0-logloss:0.47229
[27]	validation_0-logloss:0.47127
[28]	validation_0-logloss:0.46962
[29]	validation_0-loglos

With this more "complex" model, we obtained a score of `0.78708`, which is slightly worse than before. However, it can likely be improved, as some fine-tuning is required to get the best results.
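The cell above imports `Pipeline` but applies the preprocessor by hand. As a minimal sketch of what bundling the two steps could look like, the example below chains a `ColumnTransformer` and a model in one `Pipeline`. It makes two assumptions: the rows in `toy_X` and `toy_y` are made up, and a `RandomForestClassifier` stands in for `XGBClassifier`, because early stopping needs an already-transformed `eval_set`, which complicates a plain pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows that mimic the notebook's columns (not real Titanic data).
toy_X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Age": [22.0, 38.0, np.nan, 35.0, 27.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1, 13.0, 8.05],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})
toy_y = pd.Series([0, 1, 1, 1, 0, 0])

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["Pclass", "Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
])

# Stand-in model (assumption): RandomForestClassifier instead of XGBClassifier.
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=20, random_state=1)),
])

# Fitting and predicting run the preprocessing automatically.
pipeline.fit(toy_X, toy_y)
toy_predictions = pipeline.predict(toy_X)
print(toy_predictions)
```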

In [12]:
from sklearn.metrics import mean_absolute_error

def use_XGB(strategy, learning_rate):
  # Create preprocessing logic for numeric & categorical features used.
  preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy=strategy), ["Pclass", "Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"])
  ])

  X_train_transform = preprocessor.fit_transform(X_train)
  X_valid_transform = preprocessor.transform(X_valid)

  # Use XGBoost to get the best results.
  xg_model = XGBClassifier(n_estimators=1000, learning_rate=learning_rate, early_stopping_rounds=5, random_state=1)
  xg_model.fit(X_train_transform, y_train, eval_set=[(X_valid_transform, y_valid)], verbose=False)
  # See error of model with these parameters (for 0/1 labels, MAE equals 1 - accuracy).
  # Reuse the validation data that was already transformed above.
  valid_predictions = xg_model.predict(X_valid_transform)
  mae = mean_absolute_error(y_valid, valid_predictions)
  print("Obtained MAE score of `{}` from parameters: `strategy={}`, `learning_rate={}`".format(mae, strategy, learning_rate))

# Parameters that we model XGB with.
all_strategies = ["mean", "median", "most_frequent", "constant"]
all_learning_rates = [0.01, 0.03, 0.05, 0.07, 0.1, 0.25, 0.5]

# for strategy in all_strategies:
#   for learning_rate in all_learning_rates:
#     use_XGB(strategy, learning_rate)

# for learning_rate in [i * 0.01 for i in range(1, 100)]:  # Start at 0.01; a rate of 0 learns nothing.
#   use_XGB("median", learning_rate)
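Since `use_XGB` reports `mean_absolute_error` on 0/1 predictions, it may help to see how that relates to the accuracy score used elsewhere in this notebook. For binary labels, each wrong prediction contributes an absolute error of 1, so MAE is the error rate, i.e. `1 - accuracy`. A small check with made-up labels:

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up binary labels, purely for illustration.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

mae = mean_absolute_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# Two of the eight predictions are wrong.
print(f"MAE = {mae}, accuracy = {acc}, 1 - accuracy = {1 - acc}")
```

A lower MAE therefore corresponds to a higher accuracy, so either metric ranks the parameter combinations the same way.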