<a href="https://www.kaggle.com/code/vidhikishorwaghela/spaceship-titanic-predicting-transported-ids?scriptVersionId=116699279" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### PROBLEM STATEMENT:

The goal is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. The predictions are based on a set of personal records recovered from the ship's damaged computer system.

### IMPORTING LIBRARIES:

In [1]:
import pandas as pd

# Only public scikit-learn modules are imported; the private
# sklearn.ensemble._hist_gradient_boosting path is not a stable API.
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

### DIVING DEEP INTO DATASETS:

In [2]:
# Read the training dataset
train_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")

In [3]:
# Select the feature columns (X) and the target variable (y)
X = train_data[["HomePlanet", "CryoSleep", "Cabin", "Destination", "Age", "VIP", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]]
y = train_data["Transported"]

In [4]:
# Split the data into training and validation sets (80% / 20%)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# Creating the preprocessing pipeline
preprocess = make_column_transformer(
    (SimpleImputer(strategy='most_frequent'), ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]),
    (OneHotEncoder(handle_unknown='ignore'), ["HomePlanet", "CryoSleep", "Cabin", "Destination", "VIP"])
)

In [6]:
# Creating the final pipeline
pipeline = make_pipeline(preprocess, RandomForestClassifier(n_estimators=100, random_state = 42))

In [7]:
# fit the pipeline on the training data
pipeline.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('simpleimputer',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  ['Age', 'RoomService',
                                                   'FoodCourt', 'ShoppingMall',
                                                   'Spa', 'VRDeck']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['HomePlanet', 'CryoSleep',
                                                   'Cabin', 'Destination',
                                                   'VIP'])])),
                ('randomforestclassifier',
                 RandomForestClassifier(random_state=42))])

In [8]:
# Make predictions on the validation set
val_predictions = pipeline.predict(X_val)

In [9]:
# Print the accuracy of the model on the validation set
print("Validation Accuracy:", accuracy_score(y_val, val_predictions))
print("Confusion Matrix:", confusion_matrix(y_val, val_predictions))

Validation Accuracy: 0.7883841288096607
Confusion Matrix: [[667 194]
 [174 704]]
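
The confusion matrix can be read by hand. As a rough sketch, assuming scikit-learn's usual layout (rows are true labels, columns are predicted labels, ordered `False` then `True`), the counts printed above give the accuracy and the per-class precision and recall for `Transported = True`:

```python
# Counts copied from the confusion matrix printed above
cm = [[667, 194],
      [174, 704]]

tn, fp = cm[0]  # true label False: predicted False / predicted True
fn, tp = cm[1]  # true label True:  predicted False / predicted True

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the predicted True rows, share that are correct
recall = tp / (tp + fn)     # of the actual True rows, share that are found

print("accuracy :", round(accuracy, 4))
print("precision:", round(precision, 4))
print("recall   :", round(recall, 4))
```

The accuracy computed this way reproduces the validation accuracy reported by `accuracy_score`.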


In [10]:
# Define the scoring metrics (binary precision, recall and F1 for the True class, plus AUC-ROC)
scoring = ['precision', 'recall', 'f1', 'roc_auc']

# Cross-validate the model with 10 folds on the full training data
scores = cross_validate(pipeline, X, y, cv=10, scoring=scoring, return_train_score=False)

# Print the average precision, recall, f1-score and AUC-ROC
print("Precision: %0.2f (+/- %0.2f)" % (scores['test_precision'].mean(), scores['test_precision'].std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores['test_recall'].mean(), scores['test_recall'].std() * 2))
print("F1-Score: %0.2f (+/- %0.2f)" % (scores['test_f1'].mean(), scores['test_f1'].std() * 2))
print("AUC-ROC: %0.2f (+/- %0.2f)" % (scores['test_roc_auc'].mean(), scores['test_roc_auc'].std() * 2))


Precision: 0.80 (+/- 0.05)
Recall: 0.79 (+/- 0.07)
F1-Score: 0.79 (+/- 0.02)
AUC-ROC: 0.86 (+/- 0.03)


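WHAT DO THE ABOVE RESULTS ACTUALLY TELL US?

These figures are means over the 10 cross-validation folds, with twice the standard deviation across folds shown in parentheses. A precision of 0.80 and a recall of 0.79 for the positive class (Transported = True), together with an F1-score of 0.79, indicate that the model balances false positives and false negatives fairly evenly. An AUC-ROC of 0.86 indicates that it ranks transported passengers above non-transported ones in most cases. These values are consistent with the hold-out validation accuracy of about 0.79.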
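
For readers unfamiliar with `cross_validate`, the following sketch shows where the `test_<metric>` arrays come from and how the "mean (+/- 2 std)" summary is derived. It uses synthetic data from `make_classification` and a smaller forest, both assumptions made so that the example runs without the competition files; its scores say nothing about the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (hypothetical, not the competition dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

toy_model = RandomForestClassifier(n_estimators=20, random_state=42)
metric_names = ["precision", "recall", "f1", "roc_auc"]

# cross_validate returns a dict with one array per metric, one entry per fold
toy_scores = cross_validate(toy_model, X_toy, y_toy, cv=5, scoring=metric_names)

for name in metric_names:
    fold_scores = toy_scores["test_" + name]
    # Mean over folds, with twice the fold-to-fold standard deviation as spread
    print("%s: %0.2f (+/- %0.2f)" % (name, fold_scores.mean(), fold_scores.std() * 2))
```

Each array holds one score per fold, so the spread in parentheses reflects how much the metric varies between folds rather than a formal confidence interval.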

In [11]:
#Read in the test data
test_data = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")

In [12]:
test_data.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name'],
      dtype='object')

In [13]:
# Make predictions on the test set, using the same feature columns the pipeline was trained on
# (PassengerId and Name are not model features)
test_predictions = pipeline.predict(test_data[X.columns])

In [14]:
# Save the submission file in the expected format: one row per PassengerId with its predicted Transported value
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Transported': test_predictions})
submission.to_csv('/kaggle/working/submission.csv', index=False)
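
As a quick sanity check on the file layout, the sketch below builds the same two-column frame from made-up values and inspects the CSV text. The passenger IDs and predictions here are hypothetical placeholders, not real test rows.

```python
import pandas as pd

# Hypothetical IDs and predictions, used only to illustrate the file layout
toy_ids = pd.Series(["0001_01", "0002_01", "0003_01"], name="PassengerId")
toy_predictions = [True, False, True]

toy_submission = pd.DataFrame({"PassengerId": toy_ids, "Transported": toy_predictions})

# With no path argument, to_csv returns the CSV content as a string
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so the header contains only `PassengerId` and `Transported`.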
