### Introduction & Purpose 

The goal of this project is to predict whether passengers on the fictional Spaceship Titanic were transported to another dimension (`Transported`). We will load the data, preprocess it, train a machine learning model, and evaluate its performance.

### Dataset Acquisition

The dataset comes from Kaggle's [Spaceship Titanic competition](https://www.kaggle.com/competitions/spaceship-titanic).

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from IPython.display import display, Markdown

# Load training and test datasets
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
train_df.head()

# Suppress pandas' silent-downcasting FutureWarning, which is not relevant here
pd.set_option('future.no_silent_downcasting', True)

Train shape: (8693, 14)
Test shape: (4277, 13)
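Before filling anything in, it can help to see how many values are missing in each column. The snippet below is a minimal sketch on a made-up frame (`toy_df` and its values are illustrative stand-ins, not the competition data); the same two calls should work unchanged on `train_df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [24.0, np.nan, 31.0, np.nan],
    "Spa": [0.0, 12.0, 5.0, 0.0],
})

# Number of missing entries per column, largest first.
missing_counts = toy_df.isna().sum().sort_values(ascending=False)

# The same information expressed as a fraction of rows.
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share)
```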


### Data Preprocessing

We handle missing values and encode categorical features to prepare the data for machine learning models.

In [5]:
# Fill missing values
train_df.fillna({'HomePlanet': 'Earth', 
                 'CryoSleep': False, 
                 'Cabin': 'A/0/P', 
                 'Destination': 'TRAPPIST-1e',
                 'Age': train_df['Age'].median(),
                 'VIP': False,
                 'RoomService': 0,
                 'FoodCourt': 0,
                 'ShoppingMall': 0,
                 'Spa': 0,
                 'VRDeck': 0,
                 'Name': 'Unknown'}, inplace=True)

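# Fill the test set with statistics computed on the training set only:
# medians for numeric columns first, then the most frequent value for the rest
test_df.fillna(train_df.median(numeric_only=True).to_dict(), inplace=True)
test_df.fillna(train_df.mode().iloc[0].to_dict(), inplace=True)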

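# Encode categorical columns: refit the encoder on each training column,
# then apply the same mapping to the matching test column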
cat_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP']
le = LabelEncoder()
for col in cat_cols:
    train_df[col] = le.fit_transform(train_df[col])
    test_df[col] = le.transform(test_df[col])
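One caveat with the loop above is that `LabelEncoder.transform` raises an error if the test column contains a category that never appeared in training. A possible alternative, sketched here on made-up frames (`toy_train`, `toy_test`, and the category values are illustrative, not taken from the competition files), is scikit-learn's `OrdinalEncoder`, which encodes several feature columns at once and can map unseen categories to a sentinel value.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up stand-ins for train_df / test_df; the values are illustrative only.
toy_train = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "Europa", "Earth"],
    "Destination": ["TRAPPIST-1e", "Dest-B", "TRAPPIST-1e", "Dest-C"],
})
toy_test = pd.DataFrame({
    "HomePlanet": ["Mars", "Venus"],  # "Venus" never appears in toy_train
    "Destination": ["Dest-B", "TRAPPIST-1e"],
})

# Categories not seen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training frame only, then reuse the learned mapping on the test frame.
train_codes = encoder.fit_transform(toy_train)
test_codes = encoder.transform(toy_test)

print(encoder.categories_)
print(test_codes)
```

Whether `-1` is a sensible code for unseen categories depends on the model; tree-based models such as a random forest usually tolerate it.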

### Feature Selection

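We then choose the model inputs: the four encoded categorical columns, `Age`, and the five numeric columns from `RoomService` to `VRDeck`. `PassengerId`, `Cabin`, and `Name` are left out. The target is `Transported`, cast to an integer.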

In [7]:
# Define features and target
features = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP',
            'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

X = train_df[features]
y = train_df['Transported'].astype(int)

### Modeling Approach

Next, we hold out 20% of the training data for validation and train a Random Forest classifier on the remaining 80%.

In [9]:
# Split data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# GitHub struggles to render the estimator's HTML representation, so display a Markdown message instead
display(Markdown("✅ **Random Forest model trained successfully!**"))

✅ **Random Forest model trained successfully!**
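A fitted random forest exposes impurity-based importances through its `feature_importances_` attribute, which gives a rough check on which inputs the model relies on. The sketch below uses synthetic data (`toy_X`, `toy_y`, and the column names are hypothetical) rather than the notebook's `X_train`; with the real model, `pd.Series(model.feature_importances_, index=features)` should give the equivalent ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: one column fully determines the label, two are pure noise.
rng = np.random.default_rng(0)
n_rows = 500
toy_X = pd.DataFrame({
    "signal": rng.normal(size=n_rows),
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
})
toy_y = (toy_X["signal"] > 0).astype(int)

toy_model = RandomForestClassifier(n_estimators=100, random_state=42)
toy_model.fit(toy_X, toy_y)

# Impurity-based importances sum to 1; a higher value means the feature was used more in splits.
importances = pd.Series(
    toy_model.feature_importances_, index=toy_X.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can overstate high-cardinality features, so they are best read as a first look rather than a definitive ranking.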

### Model Evaluation

We can then evaluate our model using validation accuracy.

In [11]:
# Predict on validation set
y_pred = model.predict(X_valid)

# Evaluate using accuracy
accuracy = accuracy_score(y_valid, y_pred)
print("Validation Accuracy:", accuracy)

Validation Accuracy: 0.777458309373203
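A single 80/20 split gives one accuracy figure that depends on which rows landed in the validation set. K-fold cross-validation is a common way to get a steadier estimate. The following is a minimal sketch on synthetic data from `make_classification` (the dataset and the `toy_` names are assumptions for illustration); on the notebook's data the call would be `cross_val_score(model, X, y, cv=5)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem standing in for (X, y).
toy_X, toy_y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

toy_model = RandomForestClassifier(n_estimators=50, random_state=42)

# Five-fold cross-validation: the model is refit five times,
# each time scored on a different held-out fifth of the data.
scores = cross_val_score(toy_model, toy_X, toy_y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean(), "+/-", scores.std())
```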


### Make Predictions on Test Set

Finally, we use the trained model to predict on the unseen test data and prepare a submission file for Kaggle.

In [13]:
# Predict on test set
test_preds = model.predict(test_df[features])
test_preds = test_preds.astype(bool)

# Create submission dataframe
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Transported': test_preds})

# Save to CSV
submission.to_csv('../submission.csv', index=False)
print("Submission file saved!")

Submission file saved!


### Conclusion

We trained a simple Random Forest model that reached a validation accuracy of about 0.777 (77.7%) as a baseline. Future improvements could include feature engineering (such as extracting deck information from `Cabin`), hyperparameter tuning, and experimenting with other models such as XGBoost.
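As a starting point for the `Cabin` idea, the sketch below splits the column into three parts, assuming every value follows the same slash-separated pattern as the `'A/0/P'` default used during preprocessing (read here as deck, cabin number, and side). The new column names `Deck`, `CabinNum`, and `Side` are hypothetical, and `toy_df` is a made-up frame.

```python
import pandas as pd

# Made-up frame; the first value matches the default used in preprocessing.
toy_df = pd.DataFrame({"Cabin": ["A/0/P", "F/12/S", None]})

# Split "deck/number/side" into three columns; missing cabins stay missing.
cabin_parts = toy_df["Cabin"].str.split("/", expand=True)
cabin_parts.columns = ["Deck", "CabinNum", "Side"]  # hypothetical names

# The middle part is numeric, so convert it from string to a number.
cabin_parts["CabinNum"] = pd.to_numeric(cabin_parts["CabinNum"])

toy_df = toy_df.join(cabin_parts)
print(toy_df)
```

`Deck` and `Side` would still need encoding before they can be passed to the model, in the same way as the other categorical columns.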

Below is a screenshot of the Kaggle submission and its resulting score:
![SCREENSHOT](kaggle_result.png)