<center><img src='https://images.pexels.com/photos/2159/flight-sky-earth-space.jpg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1' width=500, height=300 /></center>

<center><h2 style='font-family:monospace;'><b>SPACESHIP TITANIC TRANSPORTATION PREDICTION USING ML</b></h2></center>
<center>Dataset Link : <a style='color:blue;' href='https://www.kaggle.com/competitions/spaceship-titanic'>Spaceship Titanic</a></center>

<p style='font-family:Verdana;'>
Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.<br><br>The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.<br><br> While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!
    <br><br>To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.<br><br>Help save them and change history!
</p>    



<h5> <b>TASK:</b>  In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.</h5>


### Data Field Descriptions
* train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.

* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
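The `gggg_pp` and `deck/num/side` formats described above can be split into separate columns with pandas string methods. The following is a minimal sketch on a small made-up frame (the rows and the new column names `Group`, `NumberInGroup`, `Deck`, `CabinNum`, `Side` and `GroupSize` are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Hypothetical rows in the documented formats (not taken from train.csv)
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S", None],
})

# gggg_pp -> group id and position within the group
sample[["Group", "NumberInGroup"]] = sample["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three columns; a missing Cabin stays missing in all three
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

# Group size: how many passengers share the same gggg prefix
sample["GroupSize"] = sample.groupby("Group")["PassengerId"].transform("count")

print(sample)
```

Keeping the parts in separate columns makes it possible to analyse deck, side and group size independently instead of treating each full cabin string as its own category.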

### Files
`test.csv` - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

`sample_submission.csv` - A submission file in the correct format.
* `PassengerId` - Id for each passenger in the test set.
* `Transported` - The target. For each passenger, predict either True or False.

In [None]:
### Importing Necessary Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import random

import plotly
import plotly.express as px

In [None]:
# loading datasets
df_train = pd.read_csv('../input/spaceship-titanic/train.csv')

In [None]:
### Top 5 Rows
df_train.head()

> Looking at the top 5 rows, the data is a mixture of categorical and numerical features. The Transported column holds True/False values rather than numbers, so we will need to encode it as well.

In [None]:
### dtypes info
df_train.info()

> The non-null counts differ between features, which indicates our data contains some missing values.

In [None]:
### Missing Values
df_train.isna().sum()

> Data seems to have some missing values, let's see it in percentage to get a better idea

In [None]:
### Percentage Null Values
df_train.isnull().sum()*100/len(df_train)

> Except for the PassengerId and Transported columns, every column has roughly 2-2.5% null values. Dropping those rows would not be a big issue, but since they may still contain useful data, let's try replacing the missing values instead.

> We will handle missing values as we encounter them in our data while Analyzing it.
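The count and percentage views above can also be combined into a single table, which makes it easier to see which columns need attention. A small sketch, assuming a made-up frame `toy` in place of `df_train` (the column names `n_missing` and `pct_missing` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "Age": [27.0, np.nan, 33.0, 41.0],
    "VIP": [False, False, None, None],
    "Transported": [True, False, True, False],
})

missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(2),
})

# Show only the columns that actually have gaps, worst first
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)
```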

**You can check out my previous note on [10 Different Techniques to Handle Missing Values In Datasets Using Python](https://www.kaggle.com/code/abhayparashar31/fe-10-ways-to-handle-missing-values)**

Before performing any feature engineering or EDA, let's build a basic model with default parameters and see how well it performs.

In [None]:
###### MAKING A COPY OF DATA #############
temp = df_train.copy()

##### DROPPING NAN VALUES ###############
temp = temp.dropna()

####### DROPPING UNNECESSARY COLUMNS ######
X = temp.drop(['PassengerId','Transported','Name'],axis=1)
y = temp['Transported']

###### ONE HOT ENCODING FEATURES ###########
from sklearn.preprocessing import OneHotEncoder
oh = OneHotEncoder()
X = oh.fit_transform(X)

######## LOADING RANDOM FOREST CLASSIFIER ##########
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

####### SPLITTING DATA ###############
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

####### BUILDING A SIMPLE RANDOM FOREST MODEL #######
model = RandomForestClassifier()

####### TRAINING THE MODEL AND GENERATING PREDICTIONS #########
model.fit(X_train,y_train)
prediction_rf=model.predict(X_test)

####### EVALUATING BASE MODEL USING ACCURACY SCORE #############
# sklearn metrics take (y_true, y_pred); the order matters for the per-class report
print("Base Model Accuracy",round(accuracy_score(y_test,prediction_rf)*100,2))
print(classification_report(y_test,prediction_rf))

Great! We got about 75% accuracy, which we can surely improve using feature engineering and other ML techniques.
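One thing to keep in mind about this baseline: `OneHotEncoder` was applied to every column, so numeric columns such as Age and the full Cabin strings were also turned into one column per distinct value. A possible alternative, sketched here on a made-up frame with pandas `get_dummies` (the toy data and the `cat_cols` / `num_cols` split are assumptions), is to encode only the categorical columns and keep the numbers as they are:

```python
import pandas as pd

# Hypothetical toy frame with a mix of categorical and numeric columns
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth"],
    "CryoSleep": [False, True, False, True],
    "Age": [24.0, 58.0, 33.0, 16.0],
    "RoomService": [109.0, 0.0, 43.0, 0.0],
})

cat_cols = ["HomePlanet", "CryoSleep"]   # treated as categories
num_cols = ["Age", "RoomService"]        # kept as plain numbers

# One-hot encode the categorical columns only, then put the numeric ones back
encoded = pd.concat(
    [toy[num_cols], pd.get_dummies(toy[cat_cols].astype(str))],
    axis=1,
)
print(encoded.columns.tolist())
```

This keeps the feature matrix small and lets a tree model split on the actual numeric values.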

# EDA
- Types
    1. Univariate Analysis
    2. Bivariate Analysis
    3. Multivariate Analysis

We will perform analysis on each column.

In [None]:
#### Creating a new separate copy for EDA Purpose Only
df = df_train.copy()

In [None]:
df.columns

### Transported

In [None]:
df['Transported'].unique()

In [None]:
# Figure size
plt.figure(figsize=(5,5))

# Pie plot
df['Transported'].value_counts().plot.pie(explode=[0.1,0.1], 
                                          autopct='%1.1f%%', 
                                          shadow=True, 
                                          textprops={'fontsize':13}).set_title("Target distribution");

**Analysis**
> The data is split almost equally between the two classes of the Transported column (the outcome).

### Home planet

In [None]:
df['HomePlanet'].unique()

In [None]:
df['HomePlanet'].value_counts()

In [None]:
##### Filling Data Proportionally
def fill_proportionally(col, dataset):
    # value_counts() keeps each category aligned with its own frequency
    # (unique() returns categories in order of appearance, which would not match the weights)
    counts = dataset[col].value_counts()
    values = counts.index.tolist()
    
    # getting weights for probability weighting
    weights = (counts / counts.sum()).tolist()
    
    # filling
    dataset[col] = dataset[col].apply(lambda x: random.choices(values, weights=weights)[0] if pd.isnull(x) else x)

In [None]:
fill_proportionally('HomePlanet', df)

In [None]:
df['HomePlanet'].isna().sum()
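The proportional fill above draws from Python's global `random` state, so the filled values change from run to run. A seeded, vectorized variant could look like the sketch below; the function name `fill_proportionally_seeded` and the toy series are assumptions for illustration, and it returns a new series instead of modifying the dataframe in place:

```python
import numpy as np
import pandas as pd

def fill_proportionally_seeded(series, seed=0):
    """Return a copy of `series` with NaNs drawn from the observed category frequencies."""
    rng = np.random.default_rng(seed)
    freqs = series.value_counts(normalize=True)   # categories aligned with their shares
    missing = series.isna()
    filled = series.copy()
    # Draw one category per missing entry, weighted by how often it was observed
    filled[missing] = rng.choice(
        freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy()
    )
    return filled

# Hypothetical toy column with two gaps
toy = pd.Series(["Earth", "Earth", "Mars", None, None], name="HomePlanet")
print(fill_proportionally_seeded(toy, seed=42).tolist())
```

Fixing the seed makes the imputed values, and therefore any downstream model scores, reproducible.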

In [None]:
data = df['HomePlanet'].value_counts()
px.bar(x=data.index,y=data.values,labels={
    'x':'HomePlanet',
    'y':'Number of Passengers'
},title='Number of Passengers On Different HomePlanets')

In [None]:
hp_t_grp = df.groupby(['HomePlanet','Transported'])['PassengerId'].count()
hp_t_grp = hp_t_grp.reset_index().rename(columns={'PassengerId':'Number of Passengers'})


px.bar(data_frame=hp_t_grp,x='HomePlanet',y='Number of Passengers',color='Transported',labels={
    'x':'HomePlanet',
},title='Number of Passengers Transported vs HomePlanet')

**Analysis**
> Passengers from Earth have about a 42% chance of being transported.

> Passengers from Europa have about a 65% chance of being transported.

> Passengers from Mars have about a 52% chance of being transported.

> Most passengers come from Earth, approximately 53% of them.
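Percentages like the ones above do not have to be read off the chart. Because `Transported` is boolean, its group mean is the share of transported passengers. A minimal sketch on made-up data (the toy rows are assumptions, so the numbers differ from the real dataset):

```python
import pandas as pd

# Hypothetical toy data, not the real passenger records
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Europa", "Europa", "Mars"],
    "Transported": [True, False, False, True, True, False],
})

# Mean of a boolean column = share of True values, here per home planet (in %)
rate = toy.groupby("HomePlanet")["Transported"].mean().mul(100).round(1)

# Share of passengers coming from each planet (in %)
share = toy["HomePlanet"].value_counts(normalize=True).mul(100).round(1)

print(rate)
print(share)
```

The same pattern works for any other categorical column, such as CryoSleep, Cabin or Destination.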

In [None]:
plt.figure(figsize=(8,5))
temp = df.groupby('HomePlanet')['Age'].mean()
ax = sns.barplot(x=temp.index,y=temp.values)
ax.bar_label(ax.containers[0])
ax.set_title('Average Age On Different HomePlanets');

**Analysis**
> Passengers from Europa are, on average, much older than passengers from the other planets, whereas passengers from Earth are the youngest on average.

### VIP

In [None]:
ax = sns.countplot(x=df['VIP'])
ax.bar_label(ax.containers[0]);

**Analysis**
> The number of VIP passengers in our dataframe is very low.

In [None]:
df['VIP'].isna().sum()

In [None]:
df['VIP'] = df['VIP'].fillna(False)

In [None]:
df['VIP'].isna().sum()

In [None]:
hp_t_grp = df.groupby(['HomePlanet','VIP'])['PassengerId'].count()
hp_t_grp = hp_t_grp.reset_index().rename(columns={'PassengerId':'Number of Passengers'})


px.bar(data_frame=hp_t_grp,x='HomePlanet',y='Number of Passengers',color='VIP',labels={
    'x':'HomePlanet',
},title='Number of VIPs On Different HomePlanets')

**Analysis**
> Earth has NO VIPs.

> Most VIPs are from Europa Planet.

In [None]:
hp_t_grp = df.groupby(['HomePlanet','VIP','Transported'])['PassengerId'].count()
hp_t_grp = hp_t_grp.reset_index().rename(columns={'PassengerId':'Number of Passengers'})


px.bar(data_frame=hp_t_grp,x='HomePlanet',y='Number of Passengers',color='VIP',facet_col='Transported',labels={
    'x':'HomePlanet',
},title='Number of VIPs, HomePlanets vs Transported')

**Analysis**
> The VIP tag has little effect on transportation.


### CryoSleep

In [None]:
df['CryoSleep'].unique()

In [None]:
df['CryoSleep'].value_counts()

In [None]:
fill_proportionally('CryoSleep',df)

In [None]:
px.histogram(df['CryoSleep'],title='Count Distribution of CryoSleep')

In [None]:
temp = df.groupby(['CryoSleep','Transported'])['PassengerId'].count()
temp = temp.reset_index().rename(columns={'PassengerId':'Number of Passengers'})

px.bar(data_frame=temp,
      x='CryoSleep',
      y='Number of Passengers',
      color='Transported',title='CryoSleep vs Transported')

> Passengers in CryoSleep have a higher chance of being transported, approximately 81%.

### Cabin

In [None]:
df['Cabin'].unique()

In [None]:
df['Cabin'] = df['Cabin'].apply(lambda x:str(x).split('/')[0])

In [None]:
### Handling Missing Values
df['Cabin'] = df['Cabin'].replace('nan','other')

In [None]:
df['Cabin'].isna().sum()

In [None]:
df.Cabin.unique()

In [None]:
df.Cabin.isna().sum()

In [None]:
### Considering Most Frequent Categories
keep = df['Cabin'].value_counts().index[:5]
df['Cabin'] = np.where(df['Cabin'].isin(keep), df['Cabin'], 'other')

In [None]:
df['Cabin'].value_counts()

In [None]:
px.histogram(df['Cabin'],title='Distribution of Passengers In Different Cabin')

**Analysis**
> Cabin deck F has the most passengers, closely followed by deck G.

In [None]:
temp = df.groupby(['Cabin','Transported'])['PassengerId'].count()
temp = temp.reset_index().rename(columns={'PassengerId':'Number of Passengers'})

px.bar(data_frame=temp,
      x='Cabin',
      y='Number of Passengers',
      color='Transported',title='Cabin vs Transported')

**Analysis**
> Almost 73% of the passengers on deck B were transported, closely followed by deck C with 68%.

> Deck E has the lowest transported rate, only about 35%.

> Decks F and G, which together hold almost 50% of the passengers, have an average transported rate of 48%.

### Destination

In [None]:
df['Destination'].unique()

In [None]:
fill_proportionally('Destination',df)

In [None]:
df['Destination'].unique()

In [None]:
df["Destination"].isna().sum()

In [None]:
###### Destination Count plot
px.histogram(df['Destination'],title='Number of Passengers Traveling to Different Destinations Distribution')

**Analysis**
> Almost 68% of the passengers are traveling to `TRAPPIST-1e`.


In [None]:
temp = df.groupby(['Destination','Transported'])['PassengerId'].count()
temp = temp.reset_index().rename(columns={'PassengerId':'Number of Passengers'})

px.bar(data_frame=temp,
      x='Destination',
      y='Number of Passengers',
      color='Transported',title='Destination vs Transported')

**Analysis**
> Approximately 47% of the passengers traveling to `TRAPPIST-1e` were transported.

### Age

In [None]:
sns.histplot(df['Age'],bins=20,kde=True).set_title('Age Distribution');

**Analysis**
> Most passengers are between 20 and 40 years old.


In [None]:
### Filling NAN Values
##### Distribution of the age is close to normal distribution, also it contains some outliers so we should use median.
median = df['Age'].median()
median

In [None]:
df['Age'] = df['Age'].fillna(median)

In [None]:
df.isna().sum()

We are going to split this column into 4 categories so we don't have to care about Outliers.

In [None]:
df['Age'] = pd.cut(df['Age'], bins=[-1,12,20,40,100], labels=['Children','Teenage','Adult','Elder'])

In [None]:
px.histogram(df['Age'],labels={
    'value':'Type',
    'variable':'Column'
},title='Age Distribution')

**Analysis**
> Majority of the passengers are adults, almost 52%
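It helps to know exactly where the boundaries of these age groups fall. By default `pd.cut` uses right-closed intervals, so with the bins used above an age of exactly 12 lands in `Children` and an age of exactly 20 in `Teenage`. A small sketch with a few hand-picked ages (the sample values are assumptions chosen to sit on the bin edges):

```python
import pandas as pd

# Ages chosen to sit on and just past each bin edge
ages = pd.Series([0, 12, 13, 20, 21, 40, 41, 79])

groups = pd.cut(
    ages,
    bins=[-1, 12, 20, 40, 100],   # intervals (-1, 12], (12, 20], (20, 40], (40, 100]
    labels=["Children", "Teenage", "Adult", "Elder"],
)

print(pd.DataFrame({"Age": ages, "Group": groups}))
```

The lower edge of -1 is what keeps an age of 0 inside the first bin, since the left side of each interval is open.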

In [None]:
temp = df.groupby(['HomePlanet','Age','Transported'])['PassengerId'].count()
temp = temp.reset_index().rename(columns={'PassengerId':'Number of Passengers'})

px.bar(data_frame=temp,
      x='HomePlanet',
      y='Number of Passengers',
      color='Age',
       facet_col='Transported',title='HomePlanet, Age vs Transported')

### RoomService

In [None]:
df['RoomService'].dtype

In [None]:
max_fare = df['RoomService'].max()
max_fare

In [None]:
df['RoomService'].min()

In [None]:
mean = df['RoomService'].mean()
mean

In [None]:
median = df['RoomService'].median()
median

In [None]:
df['RoomService'].isna().sum()

In [None]:
df['RoomService'].hist()

Because most of the values are 0, the median is 0 as well, so filling with the median fills these missing values with 0.

In [None]:
####### Let's fill missing values using median
df['RoomService'] = df['RoomService'].fillna(median).astype(int)

In [None]:
df['RoomService'].isna().sum()

In [None]:
df['RoomService'] = pd.cut(df['RoomService'], bins=[-1,224,15000], labels=['NoCharge','SomeCharge'])

In [None]:
px.histogram(df['RoomService'],labels={
    'value':'Room Service Charge'
},title='Room Service Distribution')

In [None]:
temp = df.groupby(['HomePlanet','RoomService','Transported'])['PassengerId'].count()
temp = temp.reset_index().rename(columns={'PassengerId':'Number of Passengers'})

px.bar(data_frame=temp,
      x='HomePlanet',
      y='Number of Passengers',
      color='RoomService',
      facet_col='Transported',title='Room Service Charge, Homeplanet vs Transported')

**Analysis**
> Mostly RoomService charge is low.

### FoodCourt

In [None]:
df['FoodCourt'].isna().sum()

In [None]:
median = df['FoodCourt'].median()
median

In [None]:
df['FoodCourt'] = df['FoodCourt'].fillna(median)

In [None]:
df['FoodCourt'] = df['FoodCourt'].astype(int)

In [None]:
df['FoodCourt'].isna().sum()

In [None]:
df['FoodCourt'].max()

In [None]:
df['FoodCourt'].describe()

In [None]:
(df['FoodCourt']==0).sum()

In [None]:
df['FoodCourt'] = pd.cut(df['FoodCourt'], bins=[-1,1,30000], labels=['NoCharge','SomeCharge'])

In [None]:
px.histogram(df['FoodCourt'],labels={
    'value':'Food Court Charge'
},title='Food Court Charge Distribution')

In [None]:
temp = df.groupby(['HomePlanet','FoodCourt','Transported'])['PassengerId'].count()
temp = temp.reset_index().rename(columns={'PassengerId':'Number of Passengers'})

px.bar(data_frame=temp,
      x='HomePlanet',
      y='Number of Passengers',
      color='FoodCourt',pattern_shape='Transported',title='Food Court Charge, Homeplanet vs Transported')

### Shopping mall

In [None]:
df['ShoppingMall'].isna().sum()

In [None]:
df['ShoppingMall'].max()

In [None]:
median = df['ShoppingMall'].median()
df['ShoppingMall'] = df['ShoppingMall'].fillna(median)

In [None]:
df['ShoppingMall'] = df['ShoppingMall'].astype(int)

In [None]:
(df['ShoppingMall']==0).sum()

There are 5587 rows with a value of 0, meaning no spending was recorded, so we will divide the column into only two categories: `NoCharge` and `SomeCharge`.

In [None]:
df['ShoppingMall'] = pd.cut(df['ShoppingMall'], bins=[-1,1,24000], labels=['NoCharge','SomeCharge'])

In [None]:
px.histogram(df['ShoppingMall'],labels={
    'value':'Shopping Mall'
})

### Spa

In [None]:
df['Spa']

In [None]:
df['Spa'].isna().sum()

In [None]:
df['Spa'] = df['Spa'].fillna(df['Spa'].median())

In [None]:
df['Spa'].describe()

In [None]:
(df['Spa']==0).sum()

In [None]:
df['Spa'].max()

We will divide this as well into two categories only.

In [None]:
df['Spa'] = pd.cut(df['Spa'], bins=[-1,1,24000], labels=['NoCharge','SomeCharge'])

In [None]:
px.histogram(df['Spa'],labels={
    'value':'Spa'
})

### VRDeck

In [None]:
df['VRDeck']

In [None]:
df['VRDeck'].isna().sum()

In [None]:
median = df['VRDeck'].median()

In [None]:
median

In [None]:
df['VRDeck'] = df['VRDeck'].fillna(median)

In [None]:
(df['VRDeck']==0).sum()

In [None]:
df['VRDeck'].max()

Likewise, we will divide this column into the same two categories.

In [None]:
df['VRDeck'] = pd.cut(df['VRDeck'], bins=[-1,1,25000], labels=['NoCharge','SomeCharge'])

In [None]:
px.histogram(df['VRDeck'],labels={
    'value':'VRDeck'
})

This is not the end: copy and edit the notebook to generate more insights.
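As a starting point for further experiments, the five amenity columns all went through the same two steps above: fill missing values with the median, then reduce to `NoCharge` / `SomeCharge`. That repetition can be folded into one helper. The sketch below is an assumption-laden variant, not the exact code above: the helper name `binarize_spend` and the toy frame are made up, and it uses a simple "greater than 0" rule for every column instead of the individual bin edges chosen earlier.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Hypothetical toy frame standing in for a fresh copy of df_train
toy = pd.DataFrame({
    "RoomService": [0.0, 120.0, np.nan],
    "FoodCourt": [0.0, np.nan, 35.0],
    "ShoppingMall": [np.nan, 0.0, 0.0],
    "Spa": [500.0, 0.0, 0.0],
    "VRDeck": [0.0, 0.0, 12.0],
})

def binarize_spend(frame, cols):
    """Median-fill each spend column, then label it NoCharge / SomeCharge."""
    out = frame.copy()
    for col in cols:
        filled = out[col].fillna(out[col].median())
        # Assumed rule: any positive amount counts as a charge
        out[col] = np.where(filled > 0, "SomeCharge", "NoCharge")
    return out

result = binarize_spend(toy, spend_cols)
print(result)
```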

# Feature Engineering

In [None]:
df_train = pd.read_csv('../input/spaceship-titanic/train.csv')
df_train.head()

In [None]:
### Loading Test Data
df_test = pd.read_csv('../input/spaceship-titanic/test.csv')
df_test.head()

In [None]:
### Combining Train and Test for Preprocessing
##### Test rows have no Transported value, so that column is later used for splitting train and test data
concat_df = pd.concat([df_train,df_test])
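The combined frame can be split back later because test rows have no `Transported` value. If a more explicit marker is preferred, one option is to add a flag column before concatenating and to pass `ignore_index=True` so the combined frame gets a clean 0..n-1 index. A minimal sketch on made-up frames (the names `train_toy`, `test_toy` and `is_train` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for df_train and df_test
train_toy = pd.DataFrame({"Age": [20.0, 31.0], "Transported": [True, False]})
test_toy = pd.DataFrame({"Age": [45.0]})

# Tag each row with its origin and build one frame with a fresh index
combined = pd.concat(
    [train_toy.assign(is_train=True), test_toy.assign(is_train=False)],
    ignore_index=True,
)

# ... shared preprocessing on `combined` would go here ...

# Split back using the flag instead of relying on missing targets
train_part = combined[combined["is_train"]].drop(columns="is_train")
test_part = combined[~combined["is_train"]].drop(columns=["is_train", "Transported"])

print(len(train_part), len(test_part))
```

With a fresh index there are no duplicate labels to worry about when the pieces are joined side by side later on.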

In [None]:
concat_df.head()

In [None]:
concat_df.tail()

In [None]:
concat_df.drop(['PassengerId', 'Name', 'Cabin'], axis=1, inplace=True)

Missing Values

In [None]:
#### Let's fill Categorical Columns First
cat_cols = ['HomePlanet','CryoSleep','Destination','VIP']    
for col in cat_cols:
    fill_proportionally(col,concat_df)

One-hot encoding HomePlanet and Destination

In [None]:
hp_des_df = pd.get_dummies(concat_df[['HomePlanet','Destination']])

In [None]:
hp_des_df

In [None]:
concat_df.head()

Numerical Cols

In [None]:
!pip install missingpy

In [None]:
num_df = concat_df[['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']].copy()

In [None]:
num_df.iloc[:,:]

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import sys
import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest

# Impute
imputer = MissForest()
data_imputed = imputer.fit_transform(num_df.iloc[:,:])
data_imputed = pd.DataFrame(data=data_imputed, columns=num_df.iloc[:,:].columns)

data_imputed

In [None]:
data_imputed.isna().sum()

In [None]:
concat_df

In [None]:
concat_df.drop(['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck'],axis=1,inplace=True)

In [None]:
concat_df.drop(['HomePlanet','Destination'],axis=1,inplace=True)

Encoded Data with no missing values.

In [None]:
### Numerical
data_imputed

In [None]:
hp_des_df = hp_des_df.reset_index()

In [None]:
hp_des_df.drop(['index'],axis=1,inplace=True)

In [None]:
### Categorical Data
hp_des_df

In [None]:
### Bool data 
concat_df = concat_df.reset_index().drop(['index'],axis=1)
concat_df

Preparing Final DataFrame

In [None]:
merged_df = pd.concat([data_imputed,concat_df,hp_des_df],axis=1)

In [None]:
merged_df.isna().sum()

In [None]:
### Extracting train_data
train = merged_df[merged_df.Transported.notna()].copy()
train.head()

In [None]:
len(train),len(df_train)

In [None]:
### Extracting Test Data
test = merged_df[merged_df.Transported.isna()]
test.head()

In [None]:
### Dropping Transported Column
test = test.drop(['Transported'],axis=1)

In [None]:
len(test),len(df_test)

### MODELING

In [None]:
train['Transported'] = train.Transported.map({
    True:1,
    False:0
})

In [None]:
from sklearn.model_selection import train_test_split
y = train['Transported']
X = train.drop('Transported', axis=1)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

boost_model = XGBClassifier(n_jobs=-1, random_state=42,max_depth = 5)

#Fitting the model
boost_model.fit(X_train,y_train)

In [None]:
#Prediction
pred = boost_model.predict(X_val)

In [None]:
pred

In [None]:
#Evaluation
accuracy = accuracy_score(y_val, pred)
print("Validation Accuracy",round(accuracy*100,2))

In [None]:
### Predicting on test data
to_submit = boost_model.predict(test)

In [None]:
to_submit = pd.DataFrame(to_submit, columns=["Transported"])

In [None]:
to_submit['Transported'] = to_submit['Transported'].map({1:True,0:False})

In [None]:
submission = pd.concat([pd.read_csv("/kaggle/input/spaceship-titanic/test.csv"), pd.DataFrame(to_submit)], axis=1)[["PassengerId", "Transported"]]

In [None]:
submission.to_csv('submission.csv',index=False)

### VOTE

* Give an Upvote 🙌 if You Liked The Notebook.

### CONNECT WITH ME

[LinkedIN](https://www.linkedin.com/in/abhayparashar31/) | [Medium](https://medium.com/@abhayparashar31) | [Twitter](https://twitter.com/abhayparashar31) | [Github](https://github.com/Abhayparashar31)

##### HOPE TO SEE YOU IN MY NEXT KAGGLE NOTEBOOK 😀