#Predict which passengers are transported to an alternate dimension

##1. Problem definition

How well can we predict whether passengers on a spaceship were transported to an alternate dimension?

##2. Data

The data for this competition is provided by Kaggle (https://www.kaggle.com/competitions/spaceship-titanic).

The competition data includes the following files:

*   train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data
*   test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.





##3. Evaluation

The evaluation metric for this problem is classification accuracy: the fraction of passengers whose Transported value is predicted correctly.
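
As a small illustration of the metric, accuracy can be computed by comparing predicted and true labels element by element. The labels below are made up for the example, and the helper name `accuracy` is hypothetical, not part of the notebook.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels, for illustration only.
y_true = [True, False, True, True, False]
y_pred = [True, False, False, True, True]

score = accuracy(y_true, y_pred)
print(score)  # prints 0.6: 3 of the 5 predictions match
```

scikit-learn's `accuracy_score` in `sklearn.metrics` computes the same quantity.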

##4. Features

The columns in the given data are:

* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
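
The PassengerId and Cabin formats described above pack several pieces of information into one string. The sketch below shows one possible way to split them with pandas. The rows are invented for illustration, and the new column names (`Group`, `NumberInGroup`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are assumptions, not columns of the original data. The notebook itself does not perform this step.

```python
import pandas as pd

# Hypothetical rows in the documented formats, for illustration only.
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", None],
})

# gggg_pp -> group id and number within the group
sample[["Group", "NumberInGroup"]] = sample["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three separate columns; a missing cabin stays missing
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

# Number of passengers travelling in each group
sample["GroupSize"] = sample.groupby("Group")["Group"].transform("size")

print(sample)
```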

## Importing data and preparing it 

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
df_1 = pd.read_csv("train.csv")
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [6]:
df_1.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
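
Each column above is missing roughly 2% of its 8693 values. Counts per column do not show how many rows are affected overall, so it can help to look at the share of missing values and the number of rows with at least one gap. A minimal sketch on a made-up frame (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame, for illustration only.
sample = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "VIP": [False, False, None, None],
})

# Share of missing values per column (the mean of a boolean mask)
missing_share = sample.isnull().mean()

# Number of rows with at least one missing value
rows_with_missing = int(sample.isnull().any(axis=1).sum())

print(missing_share)
print(rows_with_missing)
```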

In [7]:
df_columns = list(df_1.columns.values)
df_columns.remove("PassengerId")
df_columns

['HomePlanet',
 'CryoSleep',
 'Cabin',
 'Destination',
 'Age',
 'VIP',
 'RoomService',
 'FoodCourt',
 'ShoppingMall',
 'Spa',
 'VRDeck',
 'Name',
 'Transported']

In [8]:
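def prepare_df(df):
  """Return a copy of df with missing values filled and text columns integer-encoded.

  Relies on the global df_columns list to decide which columns to process.
  """
  df_temp = df.copy()
  for name in df_columns:
    if pd.api.types.is_numeric_dtype(df_temp[name]):
      # Numeric (and boolean) columns: fill missing values with the column median.
      if df_temp[name].isnull().any():
        df_temp[name] = df_temp[name].fillna(df_temp[name].median())
    else:
      # Non-numeric columns: replace each distinct value with an integer code.
      # Missing values have code -1, so after adding 1 they become 0.
      df_temp[name] = pd.Categorical(df_temp[name]).codes + 1

  return df_temp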

df_1["PassengerId"] = df_1["PassengerId"].astype("string")
df_train = prepare_df(df_1)
df_train.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8683,8684,8685,8686,8687,8688,8689,8690,8691,8692
PassengerId,0001_01,0002_01,0003_01,0003_02,0004_01,0005_01,0006_01,0006_02,0007_01,0008_01,...,9272_02,9274_01,9275_01,9275_02,9275_03,9276_01,9278_01,9279_01,9280_01,9280_02
HomePlanet,2,1,2,2,1,1,1,1,1,2,...,1,0,2,2,2,2,1,1,2,2
CryoSleep,1,1,1,1,1,1,1,2,1,2,...,1,2,1,1,0,1,2,1,1,1
Cabin,150,2185,2,2,2187,2184,3426,4560,3566,151,...,3407,5291,145,145,145,147,5281,5286,2132,2132
Destination,3,3,3,3,3,2,3,3,3,1,...,3,3,3,3,3,1,2,3,1,3
Age,39.0,24.0,58.0,33.0,16.0,44.0,26.0,28.0,35.0,14.0,...,21.0,23.0,0.0,32.0,30.0,41.0,18.0,26.0,32.0,44.0
VIP,1,1,2,1,1,1,1,1,1,1,...,1,1,1,1,1,2,1,1,1,1
RoomService,0.0,109.0,43.0,0.0,303.0,0.0,42.0,0.0,0.0,0.0,...,86.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,126.0
FoodCourt,0.0,9.0,3576.0,1283.0,70.0,483.0,1539.0,0.0,785.0,0.0,...,3.0,0.0,0.0,1146.0,3208.0,6819.0,0.0,0.0,1049.0,4688.0
ShoppingMall,0.0,25.0,0.0,371.0,151.0,0.0,3.0,0.0,17.0,0.0,...,149.0,0.0,0.0,0.0,0.0,0.0,0.0,1872.0,0.0,0.0


In [9]:
df_test = pd.read_csv("test.csv")
df_columns.remove("Transported")
df_test = prepare_df(df_test)
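
One thing to be aware of: because `prepare_df` is called separately on the training and test frames, the integer assigned to a category depends on which values occur in that frame, so the same value can receive different codes in the two files. The sketch below illustrates the effect on made-up data and shows one possible remedy, which is to reuse the category list from the training data. This is an assumption about how the encoding could be aligned, not something the notebook does.

```python
import pandas as pd

# Hypothetical miniature frames, for illustration only.
train = pd.DataFrame({"HomePlanet": ["Europa", "Earth", None, "Mars"]})
test = pd.DataFrame({"HomePlanet": ["Mars", "Mars", "Europa"]})

# Encoding each frame on its own: codes follow that frame's sorted unique values.
separate_train = pd.Categorical(train["HomePlanet"]).codes + 1
separate_test = pd.Categorical(test["HomePlanet"]).codes + 1
# Mars is 3 in train but 2 in test

# Encoding both frames with the category list taken from the training data.
categories = pd.Categorical(train["HomePlanet"]).categories
shared_train = pd.Categorical(train["HomePlanet"], categories=categories).codes + 1
shared_test = pd.Categorical(test["HomePlanet"], categories=categories).codes + 1
# Mars is 3 in both; values unseen in train would get 0, like missing values

print(list(separate_train), list(separate_test))
print(list(shared_train), list(shared_test))
```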


In [10]:
df_test.columns.values

array(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination',
       'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa',
       'VRDeck', 'Name'], dtype=object)

In [11]:
# temp_array = []
# for val in df_train["Transported"]:
#   if val:
#     temp_array.append(1)
#   else:
#     temp_array.append(0)

# df_train["Transported"] = temp_array

In [12]:
df_train.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8683,8684,8685,8686,8687,8688,8689,8690,8691,8692
PassengerId,0001_01,0002_01,0003_01,0003_02,0004_01,0005_01,0006_01,0006_02,0007_01,0008_01,...,9272_02,9274_01,9275_01,9275_02,9275_03,9276_01,9278_01,9279_01,9280_01,9280_02
HomePlanet,2,1,2,2,1,1,1,1,1,2,...,1,0,2,2,2,2,1,1,2,2
CryoSleep,1,1,1,1,1,1,1,2,1,2,...,1,2,1,1,0,1,2,1,1,1
Cabin,150,2185,2,2,2187,2184,3426,4560,3566,151,...,3407,5291,145,145,145,147,5281,5286,2132,2132
Destination,3,3,3,3,3,2,3,3,3,1,...,3,3,3,3,3,1,2,3,1,3
Age,39.0,24.0,58.0,33.0,16.0,44.0,26.0,28.0,35.0,14.0,...,21.0,23.0,0.0,32.0,30.0,41.0,18.0,26.0,32.0,44.0
VIP,1,1,2,1,1,1,1,1,1,1,...,1,1,1,1,1,2,1,1,1,1
RoomService,0.0,109.0,43.0,0.0,303.0,0.0,42.0,0.0,0.0,0.0,...,86.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,126.0
FoodCourt,0.0,9.0,3576.0,1283.0,70.0,483.0,1539.0,0.0,785.0,0.0,...,3.0,0.0,0.0,1146.0,3208.0,6819.0,0.0,0.0,1049.0,4688.0
ShoppingMall,0.0,25.0,0.0,371.0,151.0,0.0,3.0,0.0,17.0,0.0,...,149.0,0.0,0.0,0.0,0.0,0.0,0.0,1872.0,0.0,0.0


##Building a machine learning model

In [13]:
x_train = df_train.drop("Transported", axis = 1)
y_train = df_train["Transported"]

###Random Forest Classifier

In [31]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Seed NumPy's global random state, which scikit-learn uses when random_state is not set.
np.random.seed(305)

# Hyperparameter values sampled by the randomized search.
rfc_params = {
    "max_features" : [None, "sqrt", 0.25, 0.5, 0.75],
    "n_estimators" : np.arange(100,1000,100),
    "min_samples_split" : np.arange(10,20,5),
    #"min_samples_leaf" : np.arange(10,20,5),
    "max_samples": [1000],
    "max_depth" : np.arange(5,35,5)
}

# Try 30 random hyperparameter combinations, each scored with 5-fold cross-validation.
rfc_rscv_model = RandomizedSearchCV(RandomForestClassifier(), rfc_params,
                                    n_iter = 30,
                                    cv = 5,
                                    verbose = True)

rfc_rscv_model.fit(x_train,y_train)

# predict() uses the best estimator, refit on the full training data.
df_final_rscv = pd.DataFrame({
    "PassengerId" : df_test["PassengerId"],
    "Transported" : rfc_rscv_model.predict(df_test)
})

Fitting 5 folds for each of 30 candidates, totalling 150 fits
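
A fitted `RandomizedSearchCV` exposes the winning hyperparameters and their mean cross-validated score through `best_params_` and `best_score_`, which gives a local accuracy estimate before submitting. A minimal sketch on synthetic data: `make_classification` stands in for `x_train` and `y_train`, and the small search space is chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for x_train / y_train, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [20, 40], "max_depth": [3, 5, None]},
    n_iter=3,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)  # hyperparameters of the best candidate
print(search.best_score_)   # its mean cross-validated accuracy
```

In the notebook, the same attributes would be read from `rfc_rscv_model`.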


In [32]:
df_final_rscv

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,False
...,...,...
4272,9266_02,True
4273,9269_01,False
4274,9271_01,True
4275,9273_01,True


In [33]:
df_final_rscv.to_csv("df_final_3.csv", index = False)

In [20]:
df_temp = pd.read_csv("test.csv")

###Linear SVC

In [21]:
from sklearn.svm import LinearSVC
lsvc_model = LinearSVC()

In [22]:
lsvc_model.fit(x_train,y_train)
df_final = pd.DataFrame({
    "PassengerId" : df_test["PassengerId"],
    "Transported" : lsvc_model.predict(df_test)
})

df_final.to_csv("df_final.csv", index = False)
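
Linear models such as `LinearSVC` are sensitive to feature scale, and the columns here range from small integer codes to spending amounts in the thousands. One common option, sketched below on synthetic data, is to standardize the features inside a pipeline. The data and the name `scaled_lsvc` are illustrative assumptions. The notebook fits `LinearSVC` on unscaled features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for x_train / y_train, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1000.0  # exaggerate one feature's scale, like a spending column

# The scaler is fit on each training fold only, then applied to the held-out fold.
scaled_lsvc = make_pipeline(StandardScaler(), LinearSVC())
scores = cross_val_score(scaled_lsvc, X, y, cv=5, scoring="accuracy")

print(scores.mean())
```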



###K-Neighbors Classifier

In [23]:
from sklearn.neighbors import KNeighborsClassifier
knc_model = KNeighborsClassifier()

knc_model.fit(x_train,y_train)
df_final = pd.DataFrame({
    "PassengerId" : df_test["PassengerId"],
    "Transported" : knc_model.predict(df_test)
})

df_final.to_csv("df_final.csv", index = False)
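
Since the test labels are not available, one way to estimate accuracy locally is to hold out part of the training data for validation. A minimal sketch with `KNeighborsClassifier` on synthetic data: the split size and the variable names are assumptions chosen for illustration, and the notebook does not include this step.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for x_train / y_train, for illustration only.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Keep 25% of the labelled rows aside, preserving the class balance.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knc = KNeighborsClassifier()
knc.fit(X_fit, y_fit)

val_accuracy = accuracy_score(y_val, knc.predict(X_val))
print(val_accuracy)
```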


