# Titanic - Machine Learning from Disaster

The goal is to use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

## Data

The data and feature explanations can be found at: https://www.kaggle.com/competitions/titanic/data

## Submission

We'll run our model on the `test.csv` file and write its predictions to a `submission.csv` file with 2 columns: `PassengerId` and `Survived`.

- `gender_submission.csv` contains a set of predictions that assumes all and only female passengers survived, and is an example of what the submission file should look like.

In [11]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Import data.
titanic_data = pd.read_csv("./train.csv")
# Select target value.
y = titanic_data.Survived

In addition, we'll make a simple function to create the `submission.csv` file from the model and features we've decided on.

In [12]:
def create_submission(model, usedFeatures):
  """
  Create a `submission.csv` file from the model created from a set of features.
  `pd.get_dummies()` will automatically be applied to the test data from `test.csv`.
  """
  test_data = pd.read_csv("./test.csv")
  test_X = pd.get_dummies(test_data[usedFeatures])
  # Create predictions.
  predictions = model.predict(test_X)
  # Export predictions.
  output_df = pd.DataFrame({ "PassengerId": test_data.PassengerId, "Survived": predictions })
  output_df.to_csv("./submission.csv", index=False)
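
One thing to keep in mind with `create_submission()` is that `pd.get_dummies()` only creates columns for the values it actually sees, so the encoded test data is not guaranteed to have the same columns as the encoded training data. For `Sex` this is unlikely to matter, but for other categorical features it could. Below is a minimal sketch of one way to guard against it; the helper name `encode_like_training` and the toy data are hypothetical and are not part of the notebook.

```python
import pandas as pd

def encode_like_training(raw_df, training_columns):
    """One-hot encode raw_df, then align its columns with the training matrix."""
    encoded = pd.get_dummies(raw_df)
    # Add columns seen in training but missing here (filled with 0),
    # drop columns that were not seen in training, and match the column order.
    return encoded.reindex(columns=training_columns, fill_value=0)

# Toy training data containing both categories (made-up values).
train_X = pd.get_dummies(pd.DataFrame({"Sex": ["male", "female"], "Fare": [7.25, 71.28]}))
# Toy test data that happens to contain only one category.
test_raw = pd.DataFrame({"Sex": ["male", "male"], "Fare": [8.05, 13.0]})

test_X = encode_like_training(test_raw, train_X.columns)
print(list(test_X.columns))
```

Without the `reindex()` step, the toy test data would have no `Sex_female` column, and `model.predict()` would be given features that do not match the ones the model was trained on.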

# Initial Model

For our initial model, we'll look at some features that intuitively may correspond to the result.

- `Sex`: Society at the time was largely male-dominated, so we might expect males to have had a higher survival chance than females.
- `Age`: We might expect children and older people to have been left behind.
- `Fare`: This is a numeric variable that should correlate with the `Pclass` and `Cabin` variables.

As for the model, we're using `RandomForestClassifier` rather than `RandomForestRegressor`, since we're classifying whether a person survived or not.
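
To illustrate the difference between the two estimators, here is a small sketch on synthetic data (the `toy_X` and `toy_y` arrays are made up for illustration and have nothing to do with the Titanic dataset). Trained on the same 0/1 target, the classifier predicts discrete class labels, while the regressor predicts continuous values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data: the label is 1 when the first feature is positive.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 2))
toy_y = (toy_X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(random_state=1).fit(toy_X, toy_y)
reg = RandomForestRegressor(random_state=1).fit(toy_X, toy_y)

class_preds = clf.predict(toy_X[:5])  # discrete labels, each either 0 or 1
value_preds = reg.predict(toy_X[:5])  # continuous values between 0 and 1
print(class_preds)
print(value_preds)
```

Since the `Survived` column of the submission has to contain 0 or 1, the classifier's output can be written out directly.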

In [13]:
# The list of features we want to use.
features = ["Sex", "Age", "Fare"]

# Select columns corresponding to features.
X = titanic_data[features]
X.describe(include="all")

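| | Sex | Age | Fare |
|---|---|---|---|
| count | 891 | 714.0 | 891.0 |
| unique | 2 | | |
| top | male | | |
| freq | 577 | | |
| mean | | 29.699118 | 32.204208 |
| std | | 14.526497 | 49.693429 |
| min | | 0.42 | 0.0 |
| 25% | | 20.125 | 7.9104 |
| 50% | | 28.0 | 14.4542 |
| 75% | | 38.0 | 31.0 |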


The `count` row shows that some entries are missing an `Age` value: only 714 of the 891 rows have one. However, [since version 1.4, `scikit-learn`'s random forests support missing values](https://scikit-learn.org/dev/whats_new/v1.4.html#id7), so we can leave those entries as they are. In addition, `RandomForestClassifier` needs numeric input, so we have to either convert the `Sex` column to numeric values or create new features from the values in the column. This can be done with `get_dummies()`, which creates boolean `Sex_female` and `Sex_male` columns.

In [14]:
# Convert unique string values in columns to boolean columns.
X = pd.get_dummies(X)

X.describe(include="all")

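| | Age | Fare | Sex_female | Sex_male |
|---|---|---|---|---|
| count | 714.0 | 891.0 | 891 | 891 |
| unique | | | 2 | 2 |
| top | | | False | True |
| freq | | | 577 | 577 |
| mean | 29.699118 | 32.204208 | | |
| std | 14.526497 | 49.693429 | | |
| min | 0.42 | 0.0 | | |
| 25% | 20.125 | 7.9104 | | |
| 50% | 28.0 | 14.4542 | | |
| 75% | 38.0 | 31.0 | | |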
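
As a quick illustration of the two points above, the following sketch counts missing values per column and applies `get_dummies()` to a tiny, made-up data frame (the `toy` frame is hypothetical and only mimics the shape of the real data).

```python
import pandas as pd

# Made-up rows with the same columns as our feature set; one Age is missing.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Age": [22.0, None, 35.0],
    "Fare": [7.25, 71.28, 8.05],
})

# Count the missing entries in each column.
missing_per_column = toy.isna().sum()

# Numeric columns are kept as-is; each string value becomes its own column.
encoded = pd.get_dummies(toy)

print(missing_per_column)
print(list(encoded.columns))
```

Note that `get_dummies()` leaves the missing `Age` value untouched, so handling it is still up to the model.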


Now we can create our initial model design and initial submission.

In [15]:
# Create and train model.
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

On submission, we received a score of `0.73684`, which isn't bad for a first attempt (a score of `1` would indicate a perfect match).
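
The Kaggle score is the fraction of test passengers predicted correctly. Since each score requires a submission, it can be handy to get a rough local estimate first. The sketch below uses 5-fold cross-validation for that; the helper name `estimate_accuracy` and the synthetic `demo_X`/`demo_y` data are assumptions made for illustration, and in the notebook one would pass `X` and `y` instead. A local estimate will not necessarily match the leaderboard score.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def estimate_accuracy(model, X, y, folds=5):
    """Return the mean cross-validated accuracy of model on (X, y)."""
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return scores.mean()

# Synthetic stand-in data: the label is 1 when feature "a" is positive.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo_y = (demo_X["a"] > 0).astype(int)

demo_score = estimate_accuracy(RandomForestClassifier(random_state=1), demo_X, demo_y)
print(round(demo_score, 3))
```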

# 2nd Model

This time, we'll add on the `Pclass` feature and see if anything changes.

In [16]:
features = ["Pclass", "Sex", "Age", "Fare"]
X = pd.get_dummies(titanic_data[features])
# Create and train model.
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

On submission, we received a score of `0.74162`, which is slightly better than our previous score.

# 3rd Model

Although we learned that random forests make good predictions with their default parameters compared to decision trees (i.e., we don't need to specify a max depth value), it might be helpful to see what happens if we do limit the tree depth.

In the [`Titanic Tutorial`](https://www.kaggle.com/code/alexisbcook/titanic-tutorial), the trees are limited to a depth of 5, so we'll do that as well.

In [17]:
# Create and train model.
rf_model = RandomForestClassifier(max_depth=5, random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

On submission, we received a score of `0.77751`, which is an improvement of about 3.6 percentage points.

# 4th Model

Since we saw an improvement when using a `max_depth` of 5, is there another value that might give an even better result? Let's try something larger, like 7.

In [18]:
# Create and train model.
rf_model = RandomForestClassifier(max_depth=7, random_state=1)
rf_model.fit(X, y)
# Create submission.
create_submission(rf_model, features)

On submission, we received a score of `0.78229`, which is a slight improvement of about 0.5 percentage points.
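
Rather than submitting once per `max_depth` value, the candidates could also be compared locally. The following is a minimal sketch using `GridSearchCV` with 5-fold cross-validation; the depth values in the grid and the synthetic `demo_X`/`demo_y` data are assumptions chosen for illustration, and in the notebook one would fit on `X` and `y` instead. The depth that wins locally is not guaranteed to score best on the leaderboard.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: the label depends on the sum of two features.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo_y = ((demo_X["a"] + demo_X["b"]) > 0).astype(int)

# Try a few depth limits; None means the trees are not depth-limited.
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"max_depth": [3, 5, 7, None]},
    cv=5,
    scoring="accuracy",
)
search.fit(demo_X, demo_y)

# Map each candidate depth to its mean cross-validated accuracy.
depth_scores = {
    params["max_depth"]: score
    for params, score in zip(search.cv_results_["params"],
                             search.cv_results_["mean_test_score"])
}
print(depth_scores)
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds a model retrained on all of the data with the best depth found, so it could be passed to `create_submission()` in place of `rf_model`.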