# Spaceship Titanic - Kaggle Competition

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

## Task

In this competition your task is to **predict whether a passenger was transported to an alternate dimension** during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.


# Dataset

## train.csv
Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

- `PassengerId` - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

- `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.

- `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

- `Cabin` - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

- `Destination` - The planet the passenger will be debarking to.

- `Age` - The age of the passenger.

- `VIP` - Whether the passenger has paid for special VIP service during the voyage.

- `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

- `Name` - The first and last names of the passenger.

- `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
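
The `gggg_pp` structure of `PassengerId` described above can be unpacked into separate columns. The following is only a sketch on made-up Ids; the column names `Group`, `NumberInGroup`, and `GroupSize` are assumptions introduced for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy Ids in the gggg_pp format (made up for illustration).
passengers = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"]})

# Split each Id into the group part and the passenger's number within the group.
id_parts = passengers["PassengerId"].str.split("_", expand=True)
passengers["Group"] = id_parts[0]
passengers["NumberInGroup"] = id_parts[1].astype(int)

# Hypothetical derived feature: how many passengers travel in each group.
passengers["GroupSize"] = passengers.groupby("Group")["PassengerId"].transform("count")
print(passengers)
```

Whether the group size carries any signal for `Transported` would have to be checked on the real data.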

## test.csv
Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

## sample_submission.csv
A submission file in the correct format.

- `PassengerId` - Id for each passenger in the test set.

- `Transported` - The target. For each passenger, predict either True or False.



# Initial data exploration

In [1]:
from main import load_train_data, load_test_data

# Load data
titanic_train = load_train_data()

# Explore the dataset
titanic_train.describe()


There are some missing values. Imputation will be necessary later on.
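
One way to quantify the missing values is to count them per column. The sketch below uses a small made-up frame rather than the real training data; on the notebook's data the same two calls would be applied to `titanic_train`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up for illustration).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "VIP": [False, False, True, False],
})

# Absolute number and share of missing values per column.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```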

In [None]:
titanic_train.info()

In [None]:
titanic = titanic_train.copy()
# convert_dtypes() takes no target dtype; it infers the best nullable dtype per column (boolean here)
titanic[["CryoSleep", "VIP", "Transported"]] = titanic[["CryoSleep", "VIP", "Transported"]].convert_dtypes()
titanic.info()

First, note that several columns have the `object` or `bool` dtype and need to be converted or encoded.

## HomePlanet & Destination

These two features can be one-hot encoded. For HomePlanet we only have three values.
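
As a quick illustration of one-hot encoding, the sketch below uses `pd.get_dummies` on a made-up `HomePlanet` column. This is only meant to show the resulting columns; the pipeline later in this notebook uses scikit-learn's `OneHotEncoder` instead.

```python
import pandas as pd

# Toy column with the three home planets (rows are made up for illustration).
toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["HomePlanet"], dtype=int)
print(encoded)
```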

## CryoSleep

A boolean value indicating whether the passenger was in cryosleep.

## Cabin

Cabin is composed of three parts: deck, cabin number, and side (`deck/num/side`). The feature can be split into three columns, and `Deck` and `Side` can then be one-hot encoded.

## VIP

A boolean feature indicating whether the passenger paid for VIP service.

## RoomService, FoodCourt, ShoppingMall, Spa, VRDeck

These columns hold each passenger's expenses per amenity. A total-expenses feature might be meaningful, but the individual expense categories may correlate better with the `Transported` label.

## Name

Probably uninformative; it won't be used initially.

## Transported

The label, indicating whether the passenger was transported to another dimension.

## HomePlanet & Destination

In [None]:
titanic["HomePlanet"].value_counts()

In [None]:
titanic["Destination"].value_counts()

In [None]:
# Explore the correlation between "Transported" and home planet
earth_survivors = titanic.loc[titanic["HomePlanet"] == "Earth"]["Transported"]
rate_earth_survivors = sum(earth_survivors) / len(earth_survivors)
rate_earth_survivors

In [None]:
# Get the subset of the pax that were transported
transported_pax = titanic[titanic["Transported"]]
transported_rate = len(transported_pax["Transported"]) / len(titanic)
print(f"The rate of transported pax is: {transported_rate}")

About half of the pax were transported!

In [None]:
titanic_home_planet_group = titanic.groupby("HomePlanet")
transported_rate = titanic_home_planet_group["Transported"].sum() / len(transported_pax)
# The values are fractions of all transported passengers, not percentages within each home planet
transported_rate.plot.bar(xlabel="Home Planet", ylabel="Share of transported passengers")

In [None]:
# Explore the rate of transported pax by destination
titanic_destination_group = titanic.groupby(["Destination"])
transported_rate_by_destination = titanic_destination_group["Transported"].sum() / len(transported_pax)
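# The values are fractions of all transported passengers, not percentages within each destination
transported_rate_by_destination.plot.bar(xlabel="Destination",
                                         ylabel="Share of transported passengers")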

It seems that the majority (over 60%) of the transported passengers were travelling to TRAPPIST-1e!
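
Note that dividing by the total number of transported passengers gives each destination's share of all transported passengers, which also reflects how many passengers travel to each destination. The rate within each destination is a different quantity. The sketch below contrasts the two on made-up data; the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy data: 6 passengers to TRAPPIST-1e (3 transported), 2 to 55 Cancri e (both transported).
toy = pd.DataFrame({
    "Destination": ["TRAPPIST-1e"] * 6 + ["55 Cancri e"] * 2,
    "Transported": [True, True, True, False, False, False, True, True],
})

by_destination = toy.groupby("Destination")["Transported"]

# Share of all transported passengers that were heading to each destination.
share_of_transported = by_destination.sum() / toy["Transported"].sum()

# Fraction of each destination's passengers that were transported.
rate_within_group = by_destination.mean()

print(pd.DataFrame({"share": share_of_transported, "rate": rate_within_group}))
```

In this toy example TRAPPIST-1e accounts for most of the transported passengers simply because most passengers go there, while its within-group rate is the lower of the two.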

In [None]:
# Explore the Transported rate by CryoSleep
cryo_sleep_group = titanic.groupby("CryoSleep")
transported_rate_by_cryo_sleep = cryo_sleep_group["Transported"].sum() / len(transported_pax)
# The values are fractions of all transported passengers, not percentages within each CryoSleep group
transported_rate_by_cryo_sleep.plot.bar(xlabel="Cryo Sleep", ylabel="Share of transported passengers")

It seems that a majority of transported passengers were in cryogenic sleep.

In [None]:
# Explore transported rate by VIP status
vip_group = titanic.groupby("VIP")
vip_transported_rate = vip_group["Transported"].sum() / len(transported_pax)
# The values are fractions of all transported passengers, not percentages within each VIP group
vip_transported_rate.plot.bar(xlabel="VIP Status", ylabel="Share of transported passengers")

In [None]:
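# Select passengers who were both VIP and transported (missing VIP values count as non-VIP)
is_vip = titanic["VIP"].fillna(False).astype(bool)
transported_vip_pax = titanic[is_vip & titanic["Transported"].astype(bool)]
transported_vip_rate = len(transported_vip_pax) / len(transported_pax)
print(f"Of {len(transported_pax)} transported passengers only {len(transported_vip_pax)} were VIP")
print(f"Among the transported pax only {transported_vip_rate * 100:.2f}% were VIP")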

## Categorical and boolean features

The categorical features `HomePlanet` and `Destination` and the boolean features `CryoSleep` and `VIP` seem to be good indicators of whether a passenger was transported. Include them in the model.


In [None]:
# Prepare a few potentially relevant numeric features for plotting.
# Make the Transported column a numeric one
titanic["Transported"] = titanic_train["Transported"].astype(int)

# Create a feature for the total expenses
titanic["TotalExpenses"] = titanic["RoomService"] + titanic["FoodCourt"] + titanic[
    "ShoppingMall"] + titanic["Spa"] + titanic["VRDeck"]
titanic["TotalExpenses"].hist(bins=20)

In [None]:
import pandas as pd

# Create expense categories and see how many of each category were transported.
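# Bin the total expenses in steps of 5000.
# include_lowest=True keeps passengers with zero expenses in the first bin instead of turning them into NaN.
bins = range(0, int(titanic["TotalExpenses"].max()) + 5000, 5000)
titanic["ExpenseCat"] = pd.cut(titanic["TotalExpenses"], bins=bins, labels=range(len(bins) - 1),
                               include_lowest=True)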
expense_cat_group = titanic.groupby("ExpenseCat")
expense_cat_group["Transported"].sum().plot.bar(x="ExpenseCat", y="Transported", ylabel="Passengers transported")

The expenses don't seem to have a big effect: the distribution of transported passengers across expense categories looks very similar to the distribution of expenses among all passengers.

With regard to expenses, the transported passengers seem to be a representative sample of the whole ship population.

In [None]:
titanic_train["Age"].hist(bins=10)

In [None]:
import matplotlib.pyplot as plt

# Plot histogram of number of Transported people according to their Age
# Name the figure `fig` so the `plt` module is not overwritten
fig, ax = plt.subplots(2, 1, figsize=(12, 12))
ax[0].hist(titanic["Age"], weights=titanic["Transported"])
ax[1].hist(titanic["TotalExpenses"], weights=titanic["Transported"])

There seems to be no strong correlation between being transported and a passenger's age or expenses. In this regard, the transported passengers appear to be representative of the whole population.
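
A visual impression like this can be cross-checked numerically. The sketch below, on made-up values, computes the correlation of each numeric feature with a 0/1 `Transported` column and compares group medians; it is an illustration of the check, not a result from the real data.

```python
import pandas as pd

# Toy data (made up): Transported is encoded as 0/1, as in the cells above.
toy = pd.DataFrame({
    "Age": [20, 35, 50, 65, 25, 40],
    "TotalExpenses": [0.0, 1200.0, 0.0, 300.0, 0.0, 5000.0],
    "Transported": [1, 0, 1, 0, 1, 0],
})

# Pearson correlation of each numeric feature with the 0/1 label.
correlations = toy[["Age", "TotalExpenses"]].corrwith(toy["Transported"])

# Median of each feature for transported vs. non-transported passengers.
group_summary = toy.groupby("Transported")[["Age", "TotalExpenses"]].median()

print(correlations)
print(group_summary)
```

Pearson correlation only captures linear relationships, so a value near zero does not rule out other kinds of dependence.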

Inspect other features, such as CryoSleep, VIP, Deck and Side

## Other features

Explore the influence of the feature `Cabin`. The `Cabin` is a string that indicates the deck, cabin number and the side of the ship (P for port and S for starboard).

Separate the cabin number into three separate features: `Deck`, `CabinNumber`, and `Side`.

In [None]:
cabin_data = titanic["Cabin"].str.split("/", expand=True)
titanic[["Deck", "CabinNumber", "Side"]] = cabin_data
titanic.head()

In [None]:
# See how many of each deck were transported.
deck_group = titanic.groupby("Deck")
deck_group["Transported"].sum().plot.bar(x="Deck", y="Transported", ylabel="Passengers transported")

The number of transported passengers varies by deck, and the distribution has two peaks.

This looks like a multimodal distribution. For now, to keep things simple, it is not investigated further.

In [None]:
side_group = titanic.groupby("Side")
side_group["Transported"].sum().plot.bar(x="Side", y="Transported", ylabel="Passengers transported")

The side seems to also be a rough indicator for Transported.
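
The bar charts for `Deck` and `Side` show raw counts, which also depend on how many passengers are housed in each category. A normalized cross-tabulation separates the two effects. The following is a sketch on made-up data, using `Side` as the example.

```python
import pandas as pd

# Toy data (made up): 4 passengers on each side of the ship.
toy = pd.DataFrame({
    "Side": ["P", "P", "P", "P", "S", "S", "S", "S"],
    "Transported": [True, False, False, False, True, True, True, False],
})

# normalize="index" makes each row sum to 1, giving the rate within each side.
side_rates = pd.crosstab(toy["Side"], toy["Transported"], normalize="index")
print(side_rates)
```

The same call with `toy["Deck"]` in place of `toy["Side"]` would give per-deck rates, assuming a `Deck` column as created earlier.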

# Preparing the data

We've established that some features are better indicators than others. Most features have some correlation with `Transported`, but age, for instance, doesn't seem to have a big effect.

In [None]:
titanic.info()

In [None]:

from main import split_cabin_feature, calculate_total_expenses, make_expenses_categories
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore")
)

num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler()
)

boolean_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent")
)

preprocessing = ColumnTransformer([
    ("cat", cat_pipeline, ["HomePlanet", "Deck", "Side", "Destination"]),
    ("num", num_pipeline, ["Age", "TotalExpenses"]),
    # ("expense_cat", cat_pipeline, ["ExpenseCat"]),
    ("boolean", boolean_pipeline, ["CryoSleep", "VIP"])
])

titanic_train = load_train_data()
y = titanic_train["Transported"]

titanic_train = split_cabin_feature(titanic_train)
titanic_train = calculate_total_expenses(titanic_train)

expense_cat_bins = range(0, 55000, 5000)
expense_cat_labels = range(len(expense_cat_bins) - 1)

titanic_train = make_expenses_categories(titanic_train, bins=expense_cat_bins, labels=expense_cat_labels)

titanic_train = titanic_train.drop("Transported", axis=1)

titanic_prepared = preprocessing.fit_transform(titanic_train)
preprocessing.get_feature_names_out()

In [None]:
titanic_prepared.shape

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(titanic_prepared, y)
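
Before predicting on the test set, it can help to estimate how well such a model generalizes. The sketch below runs 5-fold cross-validation with the same classifier settings on synthetic data; the synthetic `X` and `y` are assumptions standing in for `titanic_prepared` and the real labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 samples, 4 features, label driven mostly by the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200)) > 0

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# Accuracy on each of 5 held-out folds; the model is refit for every fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

In the notebook, the same call with `titanic_prepared` and `y` would give a rough estimate of the accuracy to expect on the leaderboard.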

In [None]:
titanic_test = load_test_data()
titanic_test.info()

In [None]:

titanic_test = split_cabin_feature(titanic_test)
titanic_test = calculate_total_expenses(titanic_test)
titanic_test = make_expenses_categories(titanic_test, bins=expense_cat_bins, labels=expense_cat_labels)
titanic_test.shape

In [None]:
# Reuse the preprocessing fitted on the training data; do not refit on the test set
titanic_test_prepared = preprocessing.transform(titanic_test)
titanic_test_prepared.shape
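
Imputation values, scaling statistics, and one-hot categories should be learned from the training data only and then applied unchanged to the test data, which means calling `transform` rather than `fit_transform` on the test set. The sketch below shows the idea with a single `StandardScaler` on made-up ages.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train and test sets (values are made up for illustration).
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})
test = pd.DataFrame({"Age": [30.0, 50.0]})

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only.
train_scaled = scaler.fit_transform(train)

# Apply the training statistics to the test data without refitting.
test_scaled = scaler.transform(test)

print(scaler.mean_)
print(test_scaled.ravel())
```

A test age of 30 maps to 0 here because 30 is the training mean; refitting on the test set would shift that reference point and make the features inconsistent with what the model saw during training.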

In [None]:
predictions = model.predict(titanic_test_prepared)

In [None]:
output = pd.DataFrame({'PassengerId': titanic_test["PassengerId"], 'Transported': predictions})
output.to_csv('submission.csv', index=False)