<a href="https://www.kaggle.com/code/yashkaul/spaceship-titanic-dataset-with-pycaret-low-code?scriptVersionId=140218511" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Spaceship Titanic Kaggle Competition with PyCaret

This notebook demonstrates how PyCaret, an open-source, low-code machine learning library for Python, can speed up machine learning workflows, from building baseline models through to deployment.

You can find the PyCaret documentation [here](https://pycaret.gitbook.io/docs/).

### Key items from the competition description:

1. **Year and Context**: 
   - The events are set in the year 2912.
   - A cosmic mystery needs solving.

2. **Transmission**:
   - Received from a distance of four light-years.
   - Indicates a potential problem.

3. **Spaceship Titanic**: 
   - Interstellar passenger liner.
   - Launched a month prior to the events.
   - Had almost 13,000 passengers on board.
   - Purpose: Transport emigrants from our solar system to other habitable exoplanets.

4. **Destination and Route**:
   - En route to its first destination, 55 Cancri E, after rounding Alpha Centauri.
   - Journey involved traveling to three newly habitable exoplanets orbiting nearby stars.

5. **Accident Details**:
   - Collided with a spacetime anomaly hidden within a dust cloud.
   - Mirrored a similar fate of another entity (its namesake) from 1000 years ago.
   - Result: Almost half of the passengers were transported to an alternate dimension, but the ship remained intact.

6. **Challenge**:
   - Predict which passengers were transported to the alternate dimension.
   - Base predictions on records from the spaceship’s damaged computer system.

In [1]:
# Installing pycaret
!pip install -q pycaret[full]

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tfx-bsl 1.12.0 requires google-api-python-client<2,>=1.7.11, but you have google-api-python-client 2.79.0 which is incompatible.
tensorflow 2.11.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.
tensorflow-serving-api 2.11.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.
pytoolconfig 1.2.5 requires packaging>=22.0, but you have packaging 21.3 which is incompatible.
pydocstyle 6.2.3 requires importlib-metadata<5.0.0,>=2.0.0; python_version < "3.8", but you have importlib-metadata 5.2.0 which is incompatible.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.8.1 which is incompatible.
onnx 1.13.1 requires protobuf<4,>=3.20.2, but you have protobuf 3.20.1 which is incompatible.
mxnet 1.9.1 requires graphviz<0.9.0,>=0.8.

In [2]:
# Importing dependencies
import pandas as pd
from pycaret.classification import *

In [3]:
# Load the training and test data into pandas DataFrames
train = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
test = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
print("Full train dataset shape is {}".format(train.shape))
print("Full test dataset shape is {}".format(test.shape))

Full train dataset shape is (8693, 14)
Full test dataset shape is (4277, 13)


In [4]:
# Take a look at the first 5 rows of the training data
train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [5]:
# Take a look at the first 5 rows of the test data
test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Brence Harperez


In [6]:
# Take a look at the first 5 rows of the sample submission file
sample = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')
sample.head()

Unnamed: 0,PassengerId,Transported
0,0013_01,False
1,0018_01,False
2,0019_01,False
3,0021_01,False
4,0023_01,False


### File and Data Field Descriptions

**train.csv** 
- Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

**Fields in train.csv**:
- **PassengerId**: A unique Id for each passenger. Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is travelling with and `pp` is their number within the group. People in a group are often family members, but not always.
- **HomePlanet**: The planet the passenger departed from, typically their planet of permanent residence.
- **CryoSleep**: Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- **Cabin**: The cabin number where the passenger is staying. Takes the form `deck/num/side`, where side can be either `P` for Port or `S` for Starboard.
- **Destination**: The planet the passenger will be debarking to.
- **Age**: The age of the passenger.
- **VIP**: Whether the passenger has paid for special VIP service during the voyage.
- **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck**: Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- **Name**: The first and last names of the passenger.
- **Transported**: Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

**test.csv** 
- Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of `Transported` for the passengers in this set.
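
The `gggg_pp` and `deck/num/side` formats described above can be split with plain string operations. The following is a minimal sketch; `parse_passenger_id` and `parse_cabin` are hypothetical helper names used only for illustration, and a missing cabin is assumed to arrive as `None`.

```python
from typing import Optional, Tuple


def parse_passenger_id(passenger_id: str) -> Tuple[str, str]:
    """Split 'gggg_pp' into (group, number within the group)."""
    group, number = passenger_id.split("_")
    return group, number


def parse_cabin(cabin: Optional[str]) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Split 'deck/num/side' into its parts; a missing cabin yields three Nones."""
    if cabin is None:
        return None, None, None
    deck, num, side = cabin.split("/")
    return deck, num, side


print(parse_passenger_id("0003_02"))  # ('0003', '02')
print(parse_cabin("A/0/S"))           # ('A', '0', 'S')
```

Both parts of the ID come back as strings, so leading zeros are preserved; converting them to integers is a separate choice.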


Notes - 
- We should split `PassengerId` into a group ID and the passenger's number within the group. We can also add a label for whether group members are family, based on whether their last names match, but we first need to check how often last names within a group actually match. This may or may not give the models additional information.
- We can create a Yes/No in-cabin label: passengers who elected cryosleep are confined to their cabins, so they can be set to `Yes`.
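
One way to try the last-name idea from the notes is to flag passengers who share a surname with someone else in their group. This is a rough sketch on a toy frame, assuming names have the form "First Last"; the `toy` data (based on the first rows shown earlier, with one name blanked out) and the `SharesLastName` column are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01"],
    "Name": ["Maham Ofracculy", "Altark Susent", "Solam Susent", None],
})

# The group is the part of PassengerId before the underscore.
toy["GroupID"] = toy["PassengerId"].str.split("_").str[0]
# The last name is the final whitespace-separated token; missing names stay missing.
toy["LastName"] = toy["Name"].str.split().str[-1]

# keep=False marks every row of a repeated (group, last name) pair.
# Rows without a last name are excluded so that two missing names never "match".
toy["SharesLastName"] = (
    toy.duplicated(subset=["GroupID", "LastName"], keep=False)
    & toy["LastName"].notna()
)

print(toy[["PassengerId", "LastName", "SharesLastName"]])
```

Taking the mean of `SharesLastName` over multi-person groups would show how often the match actually occurs before committing to the feature.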

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB
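
The non-null counts above imply that every feature column except `PassengerId` has some missing values. As a small sketch, the arithmetic below turns the counts from the `train.info()` output into missing counts and shares; in a live session, `train.isna().sum()` would give the same counts directly.

```python
total_rows = 8693

# Non-null counts copied from the train.info() output above.
non_null = {
    "HomePlanet": 8492, "CryoSleep": 8476, "Cabin": 8494, "Destination": 8511,
    "Age": 8514, "VIP": 8490, "RoomService": 8512, "FoodCourt": 8510,
    "ShoppingMall": 8485, "Spa": 8510, "VRDeck": 8505, "Name": 8493,
}

missing = {col: total_rows - count for col, count in non_null.items()}

# List columns from most to least missing, with the share of all rows.
for col, count in sorted(missing.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<13}{count:>4}  {count / total_rows:.1%}")
```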


In [8]:
def preprocess_data(df):
    """
    Steps - 
    1. Extracting GroupID and IndividualID from PassengerId.
    2. Splitting the Cabin column into Deck, Num, and Side.
    3. Processing the Name column to extract family status and name length.
    4. Binning the Age into categories.
    5. Creating a feature for total expenditure.
    
    Parameters:
    - df: DataFrame to preprocess
    
    Returns:
    - Preprocessed DataFrame
    """
    # Work on a copy so the caller's DataFrame is not partially modified
    df = df.copy()

    # Step 1: Extract 'gggg' and 'pp' from the 'PassengerId' column
    df['GroupID'] = df['PassengerId'].str.split('_').str[0]
    df['IndividualID'] = df['PassengerId'].str.split('_').str[1]
    
    # Step 2: Split the 'Cabin' column into 'Deck', 'Num', and 'Side'
    df['Deck'] = df['Cabin'].str.split('/').str[0]
    df['Num'] = df['Cabin'].str.split('/').str[1]
    df['Side'] = df['Cabin'].str.split('/').str[2]
    
    # Step 3: Process the 'Name' column
    df['LastName'] = df['Name'].str.split(' ').str[1]  # Splitting by space to get the last name
    family_counts = df.groupby(['GroupID', 'LastName']).size().reset_index(name='FamilyCount')
    df = df.merge(family_counts, on=['GroupID', 'LastName'], how='left')
    df['Family'] = df['FamilyCount'] > 1
    df.drop(columns=['FamilyCount'], inplace=True)  # Drop the temporary count column
    df['NameLength'] = df['Name'].str.len().fillna(0).astype(int)  # Missing names count as length 0
    
    # Step 4: Binning the Age into categories
    bins = [0, 12, 18, 35, 60, 100]
    labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']
    df['AgeCategory'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
    
    # Step 5: Creating a feature for total expenditure
    amenities = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
    df['TotalExpenditure'] = df[amenities].sum(axis=1)
    
    return df
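
Because Step 4 passes `right=False`, each age bin includes its left edge and excludes its right edge. The sketch below reuses the same `bins` and `labels` on a few hand-picked boundary ages (the `ages` values are made up for illustration) to show where each one lands.

```python
import pandas as pd

bins = [0, 12, 18, 35, 60, 100]
labels = ["Child", "Teenager", "Young Adult", "Adult", "Senior"]

# Boundary ages plus one missing value.
ages = pd.Series([11, 12, 17, 18, 35, 60, 99, 100, None])
categories = pd.cut(ages, bins=bins, labels=labels, right=False)

for age, category in zip(ages, categories):
    print(age, category)
```

An age of exactly 12 is a `Teenager` rather than a `Child`, while an age of 100 or a missing age gets no category at all, so `AgeCategory` inherits the missing values of `Age`.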


In [9]:
train = preprocess_data(train)
train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,GroupID,IndividualID,Deck,Num,Side,LastName,Family,NameLength,AgeCategory,TotalExpenditure
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,...,1,1,B,0,P,Ofracculy,False,15,Adult,0.0
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,...,2,1,F,0,S,Vines,False,12,Young Adult,736.0
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,...,3,1,A,0,S,Susent,True,13,Adult,10383.0
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,...,3,2,A,0,S,Susent,True,12,Young Adult,5176.0
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,...,4,1,F,1,S,Santantines,False,17,Teenager,1091.0


In [10]:
s = setup(train, target = 'Transported', session_id = 123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Transported
2,Target type,Binary
3,Original data shape,"(8693, 24)"
4,Transformed data shape,"(8693, 46)"
5,Transformed train set shape,"(6085, 46)"
6,Transformed test set shape,"(2608, 46)"
7,Ordinal features,3
8,Numeric features,8
9,Categorical features,14
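
The train and test set shapes reported by `setup` are consistent with a 70/30 hold-out split of the 8693 rows. A quick check of the arithmetic, assuming the default `train_size` of 0.7 (no other value was passed to `setup` above):

```python
import math

n_rows = 8693
train_size = 0.7  # assumption: PyCaret's default, since setup() was called without train_size

n_train = math.floor(n_rows * train_size)
n_test = n_rows - n_train

print(n_train, n_test)  # 6085 2608
```

The 2608 held-out rows are separate from Kaggle's `test.csv`; they come out of `train.csv` and are used for PyCaret's own evaluation.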


In [11]:
# compare baseline models
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
knn,K Neighbors Classifier,0.7704,0.8371,0.7733,0.7721,0.7724,0.5408,0.5412,0.805
nb,Naive Bayes,0.7597,0.8386,0.86,0.7186,0.7828,0.5187,0.5297,0.753
lr,Logistic Regression,0.7394,0.8241,0.7697,0.7286,0.7485,0.4785,0.4794,1.916
svm,SVM - Linear Kernel,0.7067,0.0,0.7496,0.6976,0.7115,0.4132,0.4264,0.639
ridge,Ridge Classifier,0.6703,0.0,0.7484,0.6505,0.6958,0.3399,0.3441,0.611
et,Extra Trees Classifier,0.5975,0.6603,0.6284,0.6171,0.6065,0.1945,0.2037,1.104
rf,Random Forest Classifier,0.5057,0.6979,0.9941,0.5047,0.6695,0.0041,0.0158,1.213
dt,Decision Tree Classifier,0.5037,0.5,1.0,0.5037,0.6699,0.0,0.0,0.659
ada,Ada Boost Classifier,0.5037,0.5,1.0,0.5037,0.6699,0.0,0.0,0.696
gbc,Gradient Boosting Classifier,0.5037,0.5,1.0,0.5037,0.6699,0.0,0.0,1.072
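
The identical rows for `dt`, `ada`, and `gbc` (Recall 1.0, Kappa 0.0, Accuracy equal to Prec.) match what a model scores when it predicts `True` for every passenger. The sketch below reproduces those numbers from first principles, treating 0.5037 as the share of transported passengers (an assumption read off the table); `always_positive_metrics` is a hypothetical helper, not a PyCaret function.

```python
def always_positive_metrics(prevalence: float) -> dict:
    """Metrics for a classifier that predicts the positive class for every row."""
    # Confusion-matrix cells expressed as fractions of all rows.
    tp, fp = prevalence, 1.0 - prevalence
    fn, tn = 0.0, 0.0

    accuracy = tp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: agreement beyond what chance alone would produce.
    expected = (tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)
    kappa = (accuracy - expected) / (1.0 - expected)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


metrics = always_positive_metrics(0.5037)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Scores like these say nothing about whether those algorithms suit the problem; they only show that the fitted models carried no signal in this run.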



In [12]:
# tune model
best = tune_model(best)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7734,0.8406,0.8208,0.7522,0.785,0.5464,0.5488
1,0.7767,0.8465,0.8502,0.7436,0.7933,0.5528,0.5587
2,0.78,0.8581,0.8534,0.7464,0.7964,0.5594,0.5653
3,0.7915,0.8563,0.8827,0.7486,0.8102,0.5823,0.592
4,0.7865,0.8714,0.8404,0.7611,0.7988,0.5727,0.5759
5,0.7911,0.8635,0.8366,0.7688,0.8013,0.582,0.5843
6,0.7812,0.8558,0.8529,0.7479,0.7969,0.5621,0.5678
7,0.8026,0.8673,0.8627,0.7719,0.8148,0.6049,0.6092
8,0.7862,0.8514,0.8856,0.7404,0.8065,0.5718,0.5833
9,0.7961,0.8554,0.866,0.7615,0.8104,0.5917,0.5975
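
The per-fold rows can be summarised with a mean and a spread. A small sketch using the Accuracy column above, copied by hand:

```python
from statistics import mean, stdev

# Accuracy for folds 0-9, copied from the tune_model output above.
fold_accuracy = [0.7734, 0.7767, 0.78, 0.7915, 0.7865,
                 0.7911, 0.7812, 0.8026, 0.7862, 0.7961]

mean_accuracy = mean(fold_accuracy)
spread = stdev(fold_accuracy)  # sample standard deviation across folds

print(f"mean accuracy: {mean_accuracy:.4f}")
print(f"std across folds: {spread:.4f}")
```

The mean comes out at roughly 0.7865, compared with 0.7704 for the untuned `knn` row in the comparison table.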



Fitting 10 folds for each of 10 candidates, totalling 100 fits


In [13]:
# Evaluate the model 
evaluate_model(best)


In [14]:
# Preprocessing on test data 
test = preprocess_data(test)

In [15]:
# predict model on test dataset
predictions = predict_model(best, data = test, raw_score=True, encoded_labels=False)
predictions.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,Num,Side,LastName,Family,NameLength,AgeCategory,TotalExpenditure,prediction_label,prediction_score_0,prediction_score_1
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,...,3,S,Carsoning,False,15,Young Adult,0.0,1,0.3469,0.6531
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,...,4,S,Peckers,False,14,Young Adult,2832.0,0,0.9388,0.0612
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,...,0,S,Unhearfus,False,15,Young Adult,0.0,1,0.0612,0.9388
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,...,1,S,Caltilter,False,16,Adult,7418.0,1,0.102,0.898
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,...,5,S,Harperez,False,15,Young Adult,645.0,1,0.3265,0.6735
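
With `raw_score=True`, each row carries a score per class, and the labels shown are consistent with a 0.5 cut-off on `prediction_score_1`. The sketch below applies a cut-off by hand to the five rows above; `label_at` is a hypothetical helper, and the `>=` comparison is an assumption. `predict_model` also accepts a `probability_threshold` argument for changing the cut-off inside PyCaret.

```python
# (PassengerId, prediction_score_1) copied from the output above.
rows = [
    ("0013_01", 0.6531),
    ("0018_01", 0.0612),
    ("0019_01", 0.9388),
    ("0021_01", 0.898),
    ("0023_01", 0.6735),
]


def label_at(threshold: float, score_1: float) -> int:
    """Return 1 when the positive-class score reaches the threshold, else 0."""
    return int(score_1 >= threshold)


default_labels = [label_at(0.5, score) for _, score in rows]
strict_labels = [label_at(0.7, score) for _, score in rows]

print(default_labels)  # [1, 0, 1, 1, 1]
print(strict_labels)   # [0, 0, 1, 1, 0]
```

Raising the cut-off to 0.7 flips the two passengers whose scores sit between 0.65 and 0.68, which shows how sensitive borderline predictions are to the threshold.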


In [16]:
predictions['prediction_label'] = predictions['prediction_label'].map({1: True, 0: False})
predictions = predictions[['PassengerId', 'prediction_label']]
predictions = predictions.rename(columns={'prediction_label': 'Transported'})
predictions.head()

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True


In [17]:
predictions.to_csv('/kaggle/working/submission.csv', index=False)
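
Before uploading, it can help to confirm that the submission has the same layout as the sample file. This is a minimal sketch; `check_submission` is a hypothetical helper, and the toy frames stand in for `sample` and `predictions` from the cells above.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the layout matches."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ from the sample")
        return problems
    if submission["PassengerId"].tolist() != sample["PassengerId"].tolist():
        problems.append("PassengerId values or order differ from the sample")
    # Unmapped labels become NaN after .map(), which turns the column into object dtype.
    if submission["Transported"].dtype != bool:
        problems.append("Transported is not a pure boolean column")
    return problems


toy_sample = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                           "Transported": [False, False]})
toy_submission = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                               "Transported": [True, False]})

print(check_submission(toy_submission, toy_sample))  # []
```

In the notebook itself the call would be `check_submission(predictions, sample)`, run just before `to_csv`.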