# ***Spaceship Titanic - Final Term Project for LING 539***

This project is based on a Kaggle modeling challenge, the [Spaceship Titanic Competition](https://www.kaggle.com/competitions/spaceship-titanic).

The goal is to predict whether each passenger was transported to an alternate dimension during the accident.  
This is framed as a **binary classification problem** using structured passenger data.

___

## Author Information
- Name: Dimitri Mihaylov
- Email: dmihaylov@arizona.edu

## Dataset

- Source: [Spaceship Titanic Kaggle Competition](https://www.kaggle.com/competitions/spaceship-titanic)
- Files: `train.csv` and `test.csv` provided by Kaggle

## Modeling Approach

- Preprocessing includes:
  - Handling missing values
  - Feature engineering (extracting **Deck**, **Cabin Number**, and **Side** from the `Cabin` column)
  - Encoding categorical features

- Models used:
  - **Random Forest** with hyperparameter tuning (`GridSearchCV`)
  - **XGBoost** classifier

- Evaluation metrics used:
  - **Accuracy**
  - **F1-Score**
  - Classification reports

___

# Before Getting Started...
## Setup Instructions

Before running the notebook, please make sure all necessary packages are installed. The required packages include:

- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- xgboost

## Requirements

To install the dependencies manually, you can use either of the following methods:

1. **Using pip**:
```bash
pip install -r requirements.txt
```

2. **Using Conda**:
```bash
conda env create -f environment.yml
conda activate spaceship-titanic-env
```

___


In [2]:
# NOTE: If you see "ModuleNotFoundError: No module named 'xgboost'", uncomment and run the line below:
# %pip install xgboost

## Imported Libraries

We first import all the libraries needed for data handling, visualization, modeling, and evaluation.

In [4]:
# Data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning models and tools
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.preprocessing import LabelEncoder

## Load the Data

We load the training and test datasets. The training set includes the target variable (`Transported`), while the test set does not. 

In [6]:
# Load training and test datasets
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
train_df.head()

# Suppress an irrelevant FutureWarning about silent downcasting (e.g., from fillna)
pd.set_option('future.no_silent_downcasting', True)

Train shape: (8693, 14)
Test shape: (4277, 13)


## Data Preprocessing

We handle missing values and encode categorical features to prepare the data for machine learning models.

In [8]:
# Fill missing values with sensible defaults or median/mode values
train_df = train_df.fillna({
    'HomePlanet': 'Earth',
    'CryoSleep': False,
    'Cabin': 'A/0/P',
    'Destination': 'TRAPPIST-1e',
    'Age': train_df['Age'].median(),
    'VIP': False,
    'RoomService': 0,
    'FoodCourt': 0,
    'ShoppingMall': 0,
    'Spa': 0,
    'VRDeck': 0,
    'Name': 'Unknown'
}).infer_objects(copy=False)

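# For the test set: fill numeric columns with train medians, then remaining columns with train modes
# (both are computed on the already-filled training frame above)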
test_df = test_df.fillna(train_df.median(numeric_only=True).to_dict())
test_df = test_df.fillna(train_df.mode().iloc[0].to_dict())
test_df = test_df.infer_objects(copy=False)
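
The idea of reusing training-set statistics for the test set can be sketched on a toy frame. This is a minimal illustration, not the notebook's exact code: the frames `toy_train` and `toy_test` and the helper `fit_fill_values` are hypothetical names. Unlike the cell above, it computes the fill values before any filling takes place, so the medians and modes reflect only observed values.

```python
import numpy as np
import pandas as pd


def fit_fill_values(df):
    """Return {column: fill value}: median for numeric columns, mode otherwise."""
    fills = {}
    for col in df.columns:
        is_number = pd.api.types.is_numeric_dtype(df[col])
        is_bool = pd.api.types.is_bool_dtype(df[col])
        if is_number and not is_bool:
            fills[col] = df[col].median()
        else:
            fills[col] = df[col].mode(dropna=True).iloc[0]
    return fills


# Hypothetical toy data standing in for train.csv / test.csv
toy_train = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "HomePlanet": ["Earth", "Mars", "Earth", None],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 25.0],
    "HomePlanet": [None, "Europa"],
})

# Fit on the training frame only, then apply the same values to both frames
fill_values = fit_fill_values(toy_train)
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)
print(fill_values)
```

Fitting once and applying the same dictionary to both frames keeps the two sets consistent and avoids using any test-set information.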

## Feature Engineering: Extract Deck, Cabin Number, and Side

We create new features from the `Cabin` column to give the models more structured information.

In [10]:
# Function to split the Cabin column into three new features
def process_cabin(df):
    df[['Deck', 'CabinNum', 'Side']] = df['Cabin'].str.split('/', expand=True)
    df['CabinNum'] = pd.to_numeric(df['CabinNum'], errors='coerce')
    return df

# Apply to both datasets
train_df = process_cabin(train_df)
test_df = process_cabin(test_df)
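
To make the split concrete, here is a small sketch on a hypothetical three-row frame (`toy` is an illustrative name, and the cabin strings are made-up examples in the same `deck/number/side` layout as the default `'A/0/P'` used earlier).

```python
import pandas as pd

# Hypothetical cabin strings in the deck/number/side layout
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", "G/734/S"]})

# Split on '/' into three columns, then make the cabin number numeric
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]
parts["CabinNum"] = pd.to_numeric(parts["CabinNum"], errors="coerce")

toy = toy.join(parts)
print(toy)
```

`errors="coerce"` turns any non-numeric cabin number into `NaN` instead of raising, which is why the notebook's function can be applied without extra validation.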

## Encode Categorical Columns

We convert categorical text columns into numbers so that machine learning models can work with them.

In [12]:
# Encode boolean-like columns using Label Encoding (0 and 1)
label_cols = ['CryoSleep', 'VIP']
le = LabelEncoder()
for col in label_cols:
    train_df[col] = le.fit_transform(train_df[col])
    test_df[col] = le.transform(test_df[col])

# One-hot encode multi-class categorical columns
categorical_cols = ['HomePlanet', 'Destination', 'Deck', 'Side']
train_df = pd.get_dummies(train_df, columns=categorical_cols)
test_df = pd.get_dummies(test_df, columns=categorical_cols)

# Align columns in test set to match train set 
train_df, test_df = train_df.align(test_df, join='left', axis=1, fill_value=0)
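
The alignment step matters when a category appears in one set but not the other. A minimal sketch, assuming hypothetical toy frames in which deck `T` occurs only in the training data:

```python
import pandas as pd

# Hypothetical frames: deck 'T' is present in train but absent from test
toy_train = pd.DataFrame({"Deck": ["A", "B", "T"], "Age": [30, 41, 25]})
toy_test = pd.DataFrame({"Deck": ["A", "B", "B"], "Age": [22, 35, 58]})

train_enc = pd.get_dummies(toy_train, columns=["Deck"])
test_enc = pd.get_dummies(toy_test, columns=["Deck"])

# join='left' keeps exactly the training columns; columns missing
# from the test frame are created and filled with 0
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)
print(list(test_enc.columns))
```

Without the alignment, `test_enc` would lack a `Deck_T` column and a model trained on `train_enc` would reject it at prediction time.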

## Define Features and Target Variable

We select the columns we want to use for modeling and define our prediction target.

In [14]:
# Feature columns include numerical, engineered, and encoded categorical variables
features = [
    'Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'VIP', 'CryoSleep', 'CabinNum'
] + [col for col in train_df.columns if col.startswith(('HomePlanet_', 'Destination_', 'Deck_', 'Side_'))]

# Input features
X = train_df[features]

# Target: Transported (converted to 0 or 1)
y = train_df['Transported'].astype(int)

## Train/Validation Split

We split the data so we can evaluate our model on unseen data (validation set).

In [16]:
# Split into 80% training and 20% validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
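
The split above is not stratified. The validation classes turn out to be nearly balanced (861 vs. 878 in the reports further down), so this probably matters little here, but a stratified split is a common variant. The sketch below uses synthetic data (`X_demo`, `y_demo` are hypothetical) rather than the notebook's frames.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, perfectly balanced binary problem for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([0] * 500 + [1] * 500)

# stratify=y_demo preserves the class ratio in both partitions
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_va.shape, y_va.mean())
```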

## Random Forest: Hyperparameter Tuning

We use `GridSearchCV` with 3-fold cross-validation to search over `n_estimators` and `max_depth` for the Random Forest model and keep the combination with the best mean accuracy.

In [18]:
# Define grid of hyperparameters
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, None]
}

# Initialize Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Perform grid search
grid_search = GridSearchCV(rf_model, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best model from grid search
best_rf = grid_search.best_estimator_

# Evaluate on validation set
rf_preds = best_rf.predict(X_valid)
print("Random Forest Accuracy:", accuracy_score(y_valid, rf_preds))
print("Random Forest F1 Score:", f1_score(y_valid, rf_preds))

Random Forest Accuracy: 0.79700977573318
Random Forest F1 Score: 0.8033426183844011
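
Beyond `best_estimator_`, a fitted search also exposes the winning parameters and the per-combination cross-validation scores. The following sketch shows how they could be inspected; it runs on synthetic data from `make_classification` with a smaller grid, so the names `X_demo`, `y_demo`, `demo_grid`, and `demo_search` are assumptions and the numbers it prints say nothing about the Spaceship Titanic data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_grid = {"n_estimators": [20, 40], "max_depth": [5, None]}
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42), demo_grid, cv=3, scoring="accuracy"
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_n_estimators", "param_max_depth", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(demo_search.best_params_)
print(results)
```

Looking at the full table shows how close the runner-up configurations are, which helps decide whether a larger grid is worth the compute.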


## XGBoost Model

We next train XGBoost, a gradient-boosted tree model, to see whether it improves on the Random Forest scores.

In [20]:
# Initialize and train XGBoost model
xgb_model = XGBClassifier(random_state=42, eval_metric='logloss')
xgb_model.fit(X_train, y_train)

# Predictions and evaluation
xgb_preds = xgb_model.predict(X_valid)
print("\nXGBoost Accuracy:", accuracy_score(y_valid, xgb_preds))
print("XGBoost F1 Score:", f1_score(y_valid, xgb_preds))


XGBoost Accuracy: 0.8050603795284647
XGBoost F1 Score: 0.807495741056218


## Compare Models

We print detailed classification reports to compare Random Forest and XGBoost side by side.

In [22]:
print("\n--- Classification Reports ---")
print("Random Forest:\n", classification_report(y_valid, rf_preds))
print("XGBoost:\n", classification_report(y_valid, xgb_preds))


--- Classification Reports ---
Random Forest:
               precision    recall  f1-score   support

           0       0.81      0.77      0.79       861
           1       0.79      0.82      0.80       878

    accuracy                           0.80      1739
   macro avg       0.80      0.80      0.80      1739
weighted avg       0.80      0.80      0.80      1739

XGBoost:
               precision    recall  f1-score   support

           0       0.80      0.80      0.80       861
           1       0.81      0.81      0.81       878

    accuracy                           0.81      1739
   macro avg       0.81      0.81      0.81      1739
weighted avg       0.81      0.81      0.81      1739
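
The `f1-score` column in these reports is the harmonic mean of precision and recall. As a small worked check on hand-made labels (`y_true` and `y_pred` below are hypothetical, not taken from the validation set), the value can be rebuilt from the confusion matrix:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels: 5 positives and 5 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 4 / 6
recall = tp / (tp + fn)     # 4 / 5
f1_manual = 2 * precision * recall / (precision + recall)  # 8 / 11

print(f1_manual, f1_score(y_true, y_pred))
```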



## Predict on Test Set & Save Submission

We generate predictions on the test set and save them in the format required for Kaggle submission.

In [24]:
# Use XGBoost model (or best performing one)
test_preds = xgb_model.predict(test_df[features])
test_preds = test_preds.astype(bool)

# Create submission file
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Transported': test_preds})

# Save as CSV
submission.to_csv('../submission.csv', index=False)
print("Submission file saved!")

Submission file saved!
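
Before uploading, it can be worth checking that the frame has the layout built above: a `PassengerId` column and a boolean `Transported` column, one row per test passenger. The helper `check_submission` and the toy IDs below are hypothetical; for the real file, `expected_rows` would be the test-set row count printed earlier (4277).

```python
import pandas as pd


def check_submission(submission, expected_rows):
    """Raise AssertionError if the frame does not match the expected layout."""
    assert list(submission.columns) == ["PassengerId", "Transported"]
    assert len(submission) == expected_rows
    assert submission["PassengerId"].is_unique
    assert submission["Transported"].dtype == bool
    assert not submission.isna().any().any()
    return True


# Hypothetical three-row submission with placeholder IDs
toy_submission = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0019_01"],
    "Transported": [True, False, True],
})
ok = check_submission(toy_submission, expected_rows=3)
print("Submission layout OK:", ok)
```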


## Conclusion

In this notebook, we explored machine learning models to predict passenger transportation on the Spaceship Titanic dataset. 

- We began by preprocessing the data, handling missing values, and engineering new features (like extracting `Deck`, `CabinNum`, and `Side` from the `Cabin` column)
- We then encoded categorical variables and selected features for modeling
- Two models were trained and evaluated:
  - A **Random Forest Classifier** with hyperparameter tuning
  - An **XGBoost Classifier** trained with default hyperparameters

On the validation set, Random Forest reached an accuracy of about 0.797 (F1 0.803) and XGBoost about 0.805 (F1 0.807), so XGBoost slightly outperformed Random Forest in this case.

For future improvements:
- We could further engineer features (e.g., parsing passenger names or creating interaction terms)
- We could handle outliers in the spending columns
- We could experiment with model stacking or more advanced ensemble methods
- We could tune hyperparameters more exhaustively for XGBoost and other models
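
As one possible direction for the feature-engineering and outlier items above, the spending columns could be combined and log-transformed to dampen extreme values. This is an untested sketch: the helper `add_spend_features`, the new columns `TotalSpend` and `NoSpend`, and the choice of `log1p` are all assumptions and were not evaluated in this notebook.

```python
import numpy as np
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]


def add_spend_features(df):
    """Return a copy with a total-spend column, a zero-spend flag, and log1p columns."""
    out = df.copy()
    out["TotalSpend"] = out[SPEND_COLS].sum(axis=1)
    out["NoSpend"] = (out["TotalSpend"] == 0).astype(int)
    for col in SPEND_COLS + ["TotalSpend"]:
        # log1p compresses large values and maps 0 to 0
        out[f"{col}_log"] = np.log1p(out[col])
    return out


# Hypothetical passengers: one with no spending, one with some
toy = pd.DataFrame({
    "RoomService": [0.0, 109.0],
    "FoodCourt": [0.0, 9.0],
    "ShoppingMall": [0.0, 25.0],
    "Spa": [0.0, 549.0],
    "VRDeck": [0.0, 44.0],
})
toy_feats = add_spend_features(toy)
print(toy_feats[["TotalSpend", "NoSpend", "TotalSpend_log"]])
```

Tree-based models are insensitive to monotonic transforms of a single feature, so the log columns would mainly help if a linear model were added to a stacked ensemble; the total and the zero-spend flag are the parts more likely to matter for the models used here.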

Below is a screenshot of the Kaggle submission and its resulting score:


![SCREENSHOT](kaggle_submission.png)