<a href="https://colab.research.google.com/github/fareedf/ML-spaceship-titanic/blob/main/spaceship_titanic_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Goal: Predict whether each passenger was transported

In [None]:
# Imports
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [4]:
# Read the training and test files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [5]:
# Display first few rows
print(train.head())
print(test.head())

  PassengerId HomePlanet CryoSleep  Cabin  Destination   Age    VIP  \
0     0001_01     Europa     False  B/0/P  TRAPPIST-1e  39.0  False   
1     0002_01      Earth     False  F/0/S  TRAPPIST-1e  24.0  False   
2     0003_01     Europa     False  A/0/S  TRAPPIST-1e  58.0   True   
3     0003_02     Europa     False  A/0/S  TRAPPIST-1e  33.0  False   
4     0004_01      Earth     False  F/1/S  TRAPPIST-1e  16.0  False   

   RoomService  FoodCourt  ShoppingMall     Spa  VRDeck               Name  \
0          0.0        0.0           0.0     0.0     0.0    Maham Ofracculy   
1        109.0        9.0          25.0   549.0    44.0       Juanna Vines   
2         43.0     3576.0           0.0  6715.0    49.0      Altark Susent   
3          0.0     1283.0         371.0  3329.0   193.0       Solam Susent   
4        303.0       70.0         151.0   565.0     2.0  Willy Santantines   

   Transported  
0        False  
1         True  
2        False  
3        False  
4         True  
  

Preprocessing (dropping columns that are not useful for prediction, filling missing values, and encoding categories):

In [6]:
# Not useful since each name is unique
train = train.drop("Name", axis=1)
test = test.drop("Name", axis=1)

In [7]:
# Splitting the Cabin column into Deck, CNum (cabin number), and Side for easier processing
train[['Deck', 'CNum', 'Side']] = train['Cabin'].str.split('/', expand=True)
test[['Deck', 'CNum', 'Side']] = test['Cabin'].str.split('/', expand=True)
train = train.drop("Cabin", axis=1)
test = test.drop("Cabin", axis=1)
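
The split above can be checked on a small toy frame. The sketch below uses made-up cabin strings (not rows from train.csv) and also shows one possible follow-up that the notebook itself does not perform: converting the cabin number from text to a numeric dtype.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", None]})

# expand=True returns one column per piece; a missing Cabin stays missing in all three
toy[["Deck", "CNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

# After the split CNum still holds text; to_numeric converts it to numbers
toy["CNum"] = pd.to_numeric(toy["CNum"])

print(toy.drop("Cabin", axis=1))
```

Whether the numeric conversion helps depends on the model; it is shown here only as an option.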

In [None]:
# Filling missing numerical columns with the training median (robust to outliers);
# the same training value is reused for test so both sets are imputed consistently
num_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
for col in num_cols:
    col_median = train[col].median()
    train[col] = train[col].fillna(col_median)
    test[col] = test[col].fillna(col_median)
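
One common convention is to compute the fill value on the training data only and reuse it for the test data. The minimal sketch below uses made-up spending numbers to show that convention, and why a median can be a safer fill value than a mean when a column has a few very large entries.

```python
import pandas as pd

# Toy data (hypothetical values) with one extreme spender and some gaps
toy_train = pd.DataFrame({"Spa": [0.0, 10.0, None, 5000.0]})
toy_test = pd.DataFrame({"Spa": [None, 20.0]})

# The statistic is learned from the training rows only; NaN is skipped
spa_median = toy_train["Spa"].median()
spa_mean = toy_train["Spa"].mean()

# The same training value fills the gaps in both frames
toy_train["Spa"] = toy_train["Spa"].fillna(spa_median)
toy_test["Spa"] = toy_test["Spa"].fillna(spa_median)

print("median:", spa_median, "mean:", spa_mean)
print(toy_test)
```

Here the single 5000.0 entry pulls the mean far above the typical value, while the median stays at 10.0.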

In [8]:
# Filling missing boolean columns with False, then converting to int
train['CryoSleep'] = train['CryoSleep'].fillna(False).astype(int)
test['CryoSleep'] = test['CryoSleep'].fillna(False).astype(int)

train['VIP'] = train['VIP'].fillna(False).astype(int)
test['VIP'] = test['VIP'].fillna(False).astype(int)

In [9]:
# Filling missing categorical columns with the most common training value
cat_cols = ['HomePlanet', 'Destination', 'Deck', 'Side']
for col in cat_cols:
    most_common = train[col].mode()[0]
    train[col] = train[col].fillna(most_common)
    test[col] = test[col].fillna(most_common)

In [10]:
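# Encode categorical columns as integers. The encoder is refit for each column,
# and test reuses the training mapping (a category unseen in train raises an error)
encoder = LabelEncoder()
for col in cat_cols:
    train[col] = encoder.fit_transform(train[col])
    test[col] = encoder.transform(test[col])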
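
As an alternative to the per-column loop, scikit-learn's `OrdinalEncoder` encodes several feature columns at once and can map categories it has not seen during fitting to a sentinel value instead of raising. This is a sketch of that swapped-in technique on made-up rows, not the method used in this notebook; the `-1` sentinel and the "Venus" category are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical rows); "Venus" never appears in the training frame
toy_train = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars"],
                          "Side": ["P", "S", "P"]})
toy_test = pd.DataFrame({"HomePlanet": ["Mars", "Venus"],
                         "Side": ["S", "P"]})

cols = ["HomePlanet", "Side"]

# Unknown categories at transform time are encoded as -1 instead of raising
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on train only, then apply the same mapping to test
train_codes = pd.DataFrame(enc.fit_transform(toy_train[cols]), columns=cols)
test_codes = pd.DataFrame(enc.transform(toy_test[cols]), columns=cols)

print(train_codes)
print(test_codes)
```

Both encoders assign integers in sorted category order, so tree-based models see an arbitrary ordering either way.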

In [11]:
# Convert target variable to 0 or 1
train['Transported'] = train['Transported'].astype(int)


In [12]:
# Prepare final datasets
X = train.drop(['Transported', 'PassengerId'], axis=1)
y = train['Transported']

test_ids = test['PassengerId']
X_test = test.drop('PassengerId', axis=1)

In [13]:
# Display processed data as a sanity check
print(train.head())
print(test.head())

  PassengerId  HomePlanet  CryoSleep  Destination   Age  VIP  RoomService  \
0     0001_01           1          0            2  39.0    0          0.0   
1     0002_01           0          0            2  24.0    0        109.0   
2     0003_01           1          0            2  58.0    1         43.0   
3     0003_02           1          0            2  33.0    0          0.0   
4     0004_01           0          0            2  16.0    0        303.0   

   FoodCourt  ShoppingMall     Spa  VRDeck  Transported  Deck CNum  Side  
0        0.0           0.0     0.0     0.0            0     1    0     0  
1        9.0          25.0   549.0    44.0            1     5    0     1  
2     3576.0           0.0  6715.0    49.0            0     0    0     1  
3     1283.0         371.0  3329.0   193.0            0     0    0     1  
4       70.0         151.0   565.0     2.0            1     5    1     1  
  PassengerId  HomePlanet  CryoSleep  Destination   Age  VIP  RoomService  \
0     0013

Training Model:

In [14]:
# Random forest with default hyperparameters; random_state makes the run reproducible
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
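
Fitting on every labeled row leaves nothing to measure generalization with. A hold-out split is one way to get that estimate. The sketch below is a minimal example on synthetic data from `make_classification`, used as a stand-in for `X` and `y`; the 80/20 split and all `*_demo` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (the real features are not reproduced here)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% of the labeled rows; stratify keeps the class balance in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_model = RandomForestClassifier(random_state=42)
demo_model.fit(X_tr, y_tr)

# Score on rows the model has seen and on rows it has not
train_acc = accuracy_score(y_tr, demo_model.predict(X_tr))
val_acc = accuracy_score(y_val, demo_model.predict(X_val))
print("Training accuracy:", train_acc)
print("Validation accuracy:", val_acc)
```

The validation figure is the one to compare across preprocessing or hyperparameter choices.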

Predictions:

In [15]:
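# Predict on the test set (these are the values to submit)
predictions = model.predict(X_test)

# Accuracy on the rows the model was trained on; this is an optimistic
# figure and does not estimate performance on unseen passengers
train_preds = model.predict(X)
acc = accuracy_score(y, train_preds)
print("Training accuracy:", acc)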

Training accuracy: 0.9993097894857932
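
The test-set predictions and `test_ids` are computed above but never written out. The sketch below shows one way to pair them into a submission table. It assumes the competition expects a `PassengerId` column and a boolean `Transported` column; the IDs and predictions are made-up stand-ins, and `demo_ids` and `demo_preds` are hypothetical names.

```python
import pandas as pd

# Stand-ins for test_ids and predictions (made-up values)
demo_ids = pd.Series(["9001_01", "9002_01"], name="PassengerId")
demo_preds = [1, 0]

# Convert the 0/1 predictions back to True/False to match the original target
submission = pd.DataFrame({
    "PassengerId": demo_ids,
    "Transported": pd.Series(demo_preds).astype(bool),
})

# With no path, to_csv returns the CSV text; pass a file name to write a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In the notebook itself, `demo_ids` and `demo_preds` would be replaced by `test_ids` and `predictions`.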
