# Spaceship Titanic

#### Kaggle:
Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

#### Me:
Challenge accepted!

Let's take a quick look at our dataset.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np

In [2]:
df = pd.read_csv("train.csv")
df

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8688 | 9276_01 | Europa | False | A/98/P | 55 Cancri e | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | Gravior Noxnuther | False |
| 8689 | 9278_01 | Earth | True | G/1499/S | PSO J318.5-22 | 18.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Kurta Mondalley | False |
| 8690 | 9279_01 | Earth | False | G/1500/S | TRAPPIST-1e | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | Fayey Connon | True |
| 8691 | 9280_01 | Europa | False | E/608/S | 55 Cancri e | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | Celeon Hontichre | False |


### Statistical Hypothesis Tests

Before building a prediction model, we should select the features that are most strongly associated with the "Transported" column.

First, we need to test whether our data is normally distributed. If it is, we will apply parametric statistical hypothesis tests; if not, we will continue with nonparametric ones.
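
As a minimal sketch of that decision rule, the snippet below runs SciPy's `normaltest` on numeric columns only and drops missing values first. The frame `toy`, its column names, and the helper `normality_report` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import normaltest

# Made-up data: one right-skewed numeric column and one categorical column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "skewed": rng.exponential(scale=100.0, size=500),  # spending-like values
    "label": ["a", "b"] * 250,                         # not testable for normality
})
toy.loc[::50, "skewed"] = np.nan                       # a few missing values

def normality_report(frame, alpha=0.05):
    """Return {column: (stat, p, looks_gaussian)} for numeric columns only."""
    report = {}
    for name in frame.select_dtypes(include="number").columns:
        stat, p = normaltest(frame[name].dropna())     # drop NaNs before testing
        report[name] = (float(stat), float(p), bool(p > alpha))
    return report

report = normality_report(toy)
print(report)
```

Restricting the loop to numeric columns keeps `nan` results out of the report, since a normality test is not defined for string labels.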

In [48]:
#numerical data
data_num1 = df["Age"]
data_num2 = df["RoomService"]
data_num3 = df["FoodCourt"]
data_num4 = df["ShoppingMall"]
data_num5 = df["Spa"]
data_num6 = df["VRDeck"]

#categorical data
data_cat1 = df["HomePlanet"]
data_cat2 = df["CryoSleep"]
data_cat3 = df["Destination"]
data_cat4 = df["VIP"]
data_cat5 = df["Cabin"]
outcome_data = df["Transported"]

In [50]:
from scipy.stats import normaltest

column_list = [data_num1, data_num2, data_num3, data_num4, data_num5, data_num6,
               data_cat1, data_cat2, data_cat3, data_cat4, data_cat5, outcome_data] 
def normality_test(column_list):
    data_num = 0
    for column in column_list:
        data_num += 1
        stat, p = normaltest(column)
        print("data", data_num,":")
        print('stat=%.3f, p=%.3f' % (stat, p))
        if p > 0.05:
            print('Probably Gaussian')
        else:
            print('Probably not Gaussian')
        print(" ")
            
normality_test(column_list)

data 1 :
stat=248.583, p=0.000
Probably not Gaussian
 
data 2 :
stat=10480.443, p=0.000
Probably not Gaussian
 
data 3 :
stat=11188.441, p=0.000
Probably not Gaussian
 
data 4 :
stat=15648.109, p=0.000
Probably not Gaussian
 
data 5 :
stat=11667.216, p=0.000
Probably not Gaussian
 
data 6 :
stat=11847.700, p=0.000
Probably not Gaussian
 
data 7 :
stat=nan, p=nan
Probably not Gaussian
 
data 8 :
stat=nan, p=nan
Probably not Gaussian
 
data 9 :
stat=nan, p=nan
Probably not Gaussian
 
data 10 :
stat=nan, p=nan
Probably not Gaussian
 
data 11 :
stat=nan, p=nan
Probably not Gaussian
 
data 12 :
stat=29953.173, p=0.000
Probably not Gaussian
 


As the output shows, none of the numerical columns is normally distributed (p < 0.05 in every case), so the assumptions of parametric statistical hypothesis tests are not met. The `nan` results for the categorical columns (data 7 to 11) only reflect that a normality test is defined for numerical data, not for category labels. We therefore go on with nonparametric statistical hypothesis tests. By applying Spearman's rank correlation test, we will see whether each quantitative column is associated with our outcome column, "Transported".

In [51]:
from scipy.stats import spearmanr
predictor_data = [data_num1, data_num2, data_num3, data_num4, data_num5, data_num6] 
outcome_datum = outcome_data

def spearmanr_test(predictor_data_list, outcome_data):
    data_num = 0
    for data in predictor_data_list:
        data_num += 1
        stat, p = spearmanr(data, outcome_data)
        print("comparison", data_num, ":")
        print('stat=%.3f, p=%.3f' % (stat, p))
        if p > 0.05:
            print('Probably independent')
        else:
            print('Probably dependent')
        print(" ")
            
spearmanr_test(predictor_data, outcome_datum)

comparison 1 :
stat=-0.070, p=0.000
Probably dependent
 
comparison 2 :
stat=-0.364, p=0.000
Probably dependent
 
comparison 3 :
stat=-0.176, p=0.000
Probably dependent
 
comparison 4 :
stat=-0.215, p=0.000
Probably dependent
 
comparison 5 :
stat=-0.366, p=0.000
Probably dependent
 
comparison 6 :
stat=-0.341, p=0.000
Probably dependent
 


In [52]:
# Kruskal-Wallis H Test
from scipy.stats import kruskal
predictor_cat_data = [data_cat1, data_cat2, data_cat3, data_cat4, data_cat5] 
outcome_datum = outcome_data

def kruskal_test(predictor_data_list, outcome_data):
    data_num = 0
    for data in predictor_data_list:
        data_num += 1
        stat, p = kruskal(data, outcome_data)
        print("comparison", data_num, ":")
        print('stat=%.3f, p=%.3f' % (stat, p))
        if p > 0.05:
            print('Probably the same distribution')
        else:
            print('Probably different distributions')
        print(" ")
            
kruskal_test(predictor_cat_data, outcome_datum)

comparison 1 :
stat=nan, p=nan
Probably different distributions
 
comparison 2 :
stat=nan, p=nan
Probably different distributions
 
comparison 3 :
stat=nan, p=nan
Probably different distributions
 
comparison 4 :
stat=nan, p=nan
Probably different distributions
 
comparison 5 :
stat=nan, p=nan
Probably different distributions
 


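To sum up, all of the numerical columns show a statistically significant rank correlation with the "Transported" column, although the correlation for "Age" is very weak (-0.070). The Kruskal-Wallis test returned `nan` for every categorical column, so it gives no evidence for or against a relationship with "Transported"; the "Probably different distributions" lines are printed only because a `nan` p-value fails the `p > 0.05` check and falls through to the `else` branch. For simplicity, we will continue with the numerical features only: "Age", "Spa", "RoomService", "ShoppingMall", "FoodCourt" and "VRDeck".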
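
For categorical predictors such as "HomePlanet" or "CryoSleep", a chi-square test of independence on a contingency table is one common option that works directly on labels. The sketch below uses made-up counts rather than rows from `train.csv`; the frame `toy` is an assumption for illustration only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 40 passengers in cryosleep, 60 awake
toy = pd.DataFrame({
    "CryoSleep":   [True] * 40 + [False] * 60,
    "Transported": [True] * 32 + [False] * 8 + [True] * 18 + [False] * 42,
})

# Contingency table of counts: rows = CryoSleep, columns = Transported
table = pd.crosstab(toy["CryoSleep"], toy["Transported"])

# Chi-square test of independence between the two categorical variables
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print("chi2=%.3f, p=%.6f, dof=%d" % (chi2, p, dof))
if p > 0.05:
    print("Probably independent")
else:
    print("Probably dependent")
```

Running the same two steps (`pd.crosstab`, then `chi2_contingency`) on a real categorical column would indicate whether it deserves a place among the features.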

### Missing Values

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

In [8]:
X= df[["Age", "Spa", "RoomService", "ShoppingMall", "FoodCourt", "VRDeck"]]
Y= df[["Transported"]]  # the target output

X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.4,random_state=100)

In [9]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(5215, 6)
Age             112
Spa             107
RoomService     116
ShoppingMall    133
FoodCourt       117
VRDeck          121
dtype: int64


In [10]:
col_with_missing_values = [col for col in X_train.columns
                             if X_train[col].isnull().any()]
print(col_with_missing_values)

['Age', 'Spa', 'RoomService', 'ShoppingMall', 'FoodCourt', 'VRDeck']


In [11]:
from sklearn.impute import SimpleImputer

# Define the imputers: column mean and most frequent value (mode)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Impute each numerical column in place ('Spa' uses the mode, the rest use the mean)
df['Age'] = imp_mean.fit_transform(df[['Age']]).ravel()
df['Spa'] = imp_freq.fit_transform(df[['Spa']]).ravel()
df['RoomService'] = imp_mean.fit_transform(df[['RoomService']]).ravel()
df['ShoppingMall'] = imp_mean.fit_transform(df[['ShoppingMall']]).ravel()
df['FoodCourt'] = imp_mean.fit_transform(df[['FoodCourt']]).ravel()
df['VRDeck'] = imp_mean.fit_transform(df[['VRDeck']]).ravel()
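
The imputation above is fitted on the full `df`, so the fill values are also computed from rows that later end up in the test split. A hedged alternative is to fit the imputer on the training split only and reuse its statistics on the test split. The frames `X_train_toy` and `X_test_toy` below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up splits with missing values
X_train_toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Spa": [0.0, 0.0, np.nan]})
X_test_toy = pd.DataFrame({"Age": [np.nan, 60.0], "Spa": [np.nan, 500.0]})

imputer = SimpleImputer(strategy="mean")

# Learn the column means from the training split only
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train_toy),
                           columns=X_train_toy.columns)

# Apply the same training means to the test split (no refitting)
X_test_imp = pd.DataFrame(imputer.transform(X_test_toy),
                          columns=X_test_toy.columns)
print(X_test_imp)
```

Here the missing test "Age" is filled with the training mean (30.0), not with a value influenced by the test rows.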

In [312]:
X= df[["Age", "Spa", "RoomService", "ShoppingMall", "FoodCourt", "VRDeck"]] 
Y= df[["Transported"]]  # the target output
passenger_ids = df['PassengerId'] # to keep track of passenger's IDs

X_train,X_test,y_train,y_test, ids_train, ids_test = train_test_split(X,Y, passenger_ids, test_size = 4277/8693,
                                                                      random_state=100)

In [313]:
col_with_missing_values = [col for col in X_train.columns
                             if X_train[col].isnull().any()]
print(col_with_missing_values)

[]


In [314]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(4416, 6)
Series([], dtype: int64)


### Prediction Algorithm

In [315]:
model = LogisticRegression()
model.fit(X_train, y_train.values.ravel())  # 1-D target avoids a DataConversionWarning
y_pred = model.predict(X_test)
y_pred.ravel()

array([False, False,  True, ...,  True, False, False])

In [316]:
from sklearn.metrics import accuracy_score
#y_pred_rounded = [round(value) for value in y_pred]

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 76.60%


In [317]:
import xgboost as xgb
model = xgb.XGBRegressor()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

In [318]:
from sklearn.metrics import accuracy_score
y_pred_rounded = [round(value) for value in y_pred]

accuracy = accuracy_score(y_test, y_pred_rounded)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 77.18%
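
Rounding a regressor's output gives valid class labels only while every score stays inside [0, 1]. Scores outside that range round to values such as -1 or 2, and Python's `round` sends an exact 0.5 to 0. A minimal sketch of an explicit 0.5 threshold, using made-up scores rather than the model's real predictions:

```python
import numpy as np

# Made-up regressor outputs, including values outside [0, 1]
scores = np.array([-0.7, 0.2, 0.5, 0.8, 1.6])

# Plain rounding can produce labels that are not 0 or 1
rounded = [int(round(float(s))) for s in scores]
print(rounded)

# An explicit threshold always yields a boolean label
labels = scores >= 0.5
print(labels)
```

If class labels are wanted directly, `xgb.XGBClassifier` is the classifier counterpart of `XGBRegressor` and its `predict` returns labels; whether it scores better on this dataset is not something this sketch shows.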


In [319]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train.values.ravel())  # 1-D target avoids a DataConversionWarning

y_pred = model.predict(X_test)
y_pred.ravel()

array([False, False,  True, ...,  True, False, False])

In [320]:
from sklearn.metrics import accuracy_score
#y_pred_rounded = [round(value) for value in y_pred]

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 77.74%
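
The three accuracies above (76.60%, 77.18%, 77.74%) come from a single train/test split and lie close together, so their ranking could change with a different split. One hedged way to check is k-fold cross-validation. The sketch below uses synthetic data from `make_classification`, not the Spaceship Titanic features, and the model settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the six numerical features
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# 5-fold cross-validated accuracy: mean and spread per model
results = {}
for name, clf in models.items():
    scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print("%s: %.2f%% (+/- %.2f%%)" % (name, scores.mean() * 100, scores.std() * 100))
```

When the spread across folds is larger than the gap between two models, the difference between them is probably not meaningful.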


In [321]:
import os
SOURCE = "/Users/cagriertem/Desktop/PROJECTS AND TUTORIALS/Kaggle Spaceship Titanic"
df_submit = pd.DataFrame({'PassengerId': ids_test,'Transported': y_pred})
df_submit.to_csv(os.path.join(SOURCE,"submit.csv"),index=False)


Before we finish, let us try a small neural network.

In [322]:
import torch
from torch import nn

# Converting data to PyTorch tensors
X_train_torch = torch.from_numpy(X_train.to_numpy()).float()
y_train_torch = torch.from_numpy(y_train.to_numpy()).float().view(-1, 1)  
X_test_torch = torch.from_numpy(X_test.to_numpy()).float()
y_test_torch = torch.from_numpy(y_test.to_numpy()).float().view(-1, 1)

# Define the model
model = nn.Sequential(
    nn.Linear(6, 288),  # 6 input features -> first hidden layer with 288 units
    nn.ReLU(),
    nn.Linear(288, 18), # Second hidden layer with 18 units
    nn.ReLU(),
    nn.Linear(18, 1),  # Output layer with 1 unit (binary classification)
    nn.Sigmoid()
)

# Set loss and optimizer
criterion = nn.BCELoss()  # Binary Cross Entropy Loss for binary classification
optimizer = torch.optim.Rprop(model.parameters())

# Train the model
epochs = 200
for epoch in range(epochs):
    optimizer.zero_grad()
    y_pred = model(X_train_torch)
    loss = criterion(y_pred, y_train_torch)
    loss.backward()
    optimizer.step()

# Evaluate the model
with torch.no_grad():
    y_pred = model(X_test_torch)
    y_pred_cls = y_pred.round()
    acc = y_pred_cls.eq(y_test_torch).sum().item() / y_test_torch.shape[0]
    print(f'Accuracy: {acc*100:.2f}%')

Accuracy: 77.09%


In [323]:
y_pred_cls_str = np.where(y_pred_cls==1, 'True', 'False')
y_pred_cls_str.ravel()

array(['False', 'False', 'True', ..., 'True', 'False', 'False'],
      dtype='<U5')

In [324]:
import os
SOURCE = "/Users/cagriertem/Desktop/PROJECTS AND TUTORIALS/Kaggle Spaceship Titanic"
df_submit = pd.DataFrame({'PassengerId': ids_test,'Transported': y_pred_cls_str.ravel()})
df_submit.to_csv(os.path.join(SOURCE,"submit.csv"),index=False)
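
A boolean column is written to CSV as `True`/`False` by pandas, so the conversion to strings is optional. A minimal sketch with made-up IDs and probabilities (the names `ids`, `probs` and `submission` are illustrative, not taken from the notebook):

```python
import numpy as np
import pandas as pd

# Made-up passenger IDs and sigmoid outputs of shape (n, 1)
ids = pd.Series(["0001_01", "0002_01", "0003_01"], name="PassengerId")
probs = np.array([[0.12], [0.91], [0.50]])

# Threshold at 0.5 and flatten to a 1-D boolean array
transported = probs.ravel() >= 0.5

submission = pd.DataFrame({"PassengerId": ids.to_numpy(),
                           "Transported": transported})

# With no path argument, to_csv returns the CSV content as a string
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing `ids.to_numpy()` instead of the Series keeps the frame on a fresh 0..n-1 index, which avoids misaligned rows when the IDs carry a shuffled index from `train_test_split`.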

In [325]:
df_submit

| | PassengerId | Transported |
|---|---|---|
| 6913 | 7330_01 | False |
| 7066 | 7520_01 | False |
| 8001 | 8559_03 | True |
| 1200 | 1281_01 | True |
| 4442 | 4723_02 | False |
| ... | ... | ... |
| 3323 | 3567_01 | True |
| 3389 | 3645_03 | True |
| 976 | 1035_01 | True |
| 4679 | 4987_01 | False |
