Dataset Description
-------------------

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the *Spaceship Titanic*'s collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

File and Data Field Descriptions
================================

-   **train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
    -   `PassengerId` - A unique Id for each passenger. Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is travelling with and `pp` is their number within the group. People in a group are often family members, but not always.
    -   `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.
    -   `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    -   `Cabin` - The cabin number where the passenger is staying. Takes the form `deck/num/side`, where `side` can be either `P` for *Port* or `S` for *Starboard*.
    -   `Destination` - The planet the passenger will be debarking to.
    -   `Age` - The age of the passenger.
    -   `VIP` - Whether the passenger has paid for special VIP service during the voyage.
    -   `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the *Spaceship Titanic*'s many luxury amenities.
    -   `Name` - The first and last names of the passenger.
    -   `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
-   **test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of `Transported` for the passengers in this set.
-   **sample_submission.csv** - A submission file in the correct format.
    -   `PassengerId` - Id for each passenger in the test set.
    -   `Transported` - The target. For each passenger, predict either `True` or `False`.

## EDA

In [32]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn

from sklearn.metrics import get_scorer_names
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold
#get_scorer_names()

In [33]:
dfraw = pd.read_csv("data/train.csv")

target = "Transported"

df, y = dfraw.drop(columns=target), dfraw[target]

df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines


In [34]:
# balanced classes
y.value_counts().agg({"count": lambda x: x, "pct": lambda x: (x/x.sum()).apply("{:.1%}".format)}).unstack(0)

Unnamed: 0,count,pct
True,4378,50.4%
False,4315,49.6%
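
The same count/percentage summary can also be built from two `value_counts` calls, which may be easier to read than the chained `agg`. This is a minimal sketch on a toy boolean series (hypothetical values standing in for `Transported`), not the notebook's actual cell:

```python
import pandas as pd

# Toy target standing in for the `Transported` column (hypothetical values).
y_toy = pd.Series([True, False, True, True, False, False, True, False],
                  name="Transported")

# Absolute counts and relative frequencies, aligned on the class label.
balance = pd.DataFrame({
    "count": y_toy.value_counts(),
    "pct": y_toy.value_counts(normalize=True),
})
print(balance)
```

A `pct` close to 0.5 for both labels is what "balanced classes" means here, and it is why plain accuracy remains a usable secondary metric.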


## Data Prep and Feature Engineering

In [50]:
def prep_data(df):
    df = df.copy()

    # feature: size of each passenger's group
    df['Group'] = df['PassengerId'].str.split('_').str[0]
    df['n_group'] = df.groupby(['Group'])['Group'].transform('count')

    # split cabin features
    df[["deck","num","side"]] = df['Cabin'].str.split("/", expand=True)
    df['num'] = df['num'].astype('Int64')
    #df['deck'].unique(), df['side'].unique(), df['num'].unique()

    # treat 'Europa' as a mislabel of 'Earth' and merge the two planets; TODO: convert to Categorical with 2 categories
    df['HomePlanet'] = df['HomePlanet'].replace("Europa", "Earth")
    #df['HomePlanet'].unique()

    # convert booleans
    df['CryoSleep'] = df['CryoSleep']*1
    df['VIP'] = df['VIP']*1

    # TODO: feat: sum of expenses

    df = df.convert_dtypes().replace(np.nan, pd.NA)

    return df



features = pd.Index([
    # 'PassengerId',
    #'Cabin',
    # 'Name',
    'HomePlanet',
    'Destination',
    'side',#binary
    'deck',
    #'Group',

    'CryoSleep',
    'VIP',
    
    'Age',
    'RoomService',
    'FoodCourt',
    'ShoppingMall',
    'Spa',
    'VRDeck',
    
    
    'n_group',
    #'num',
])

# TODO: more caution to not leak data
X = prep_data(df)[features].copy()
X.info()

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

X_train.head(3)




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   HomePlanet    8492 non-null   string
 1   Destination   8511 non-null   string
 2   side          8494 non-null   string
 3   deck          8494 non-null   string
 4   CryoSleep     8476 non-null   Int64 
 5   VIP           8490 non-null   Int64 
 6   Age           8514 non-null   Int64 
 7   RoomService   8512 non-null   Int64 
 8   FoodCourt     8510 non-null   Int64 
 9   ShoppingMall  8485 non-null   Int64 
 10  Spa           8510 non-null   Int64 
 11  VRDeck        8505 non-null   Int64 
 12  n_group       8693 non-null   Int64 
dtypes: Int64(9), string(4)
memory usage: 959.4 KB


Unnamed: 0,HomePlanet,Destination,side,deck,CryoSleep,VIP,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,n_group
8486,Earth,TRAPPIST-1e,P,B,0,0,44,0,4313,0,568,7,5
51,Earth,TRAPPIST-1e,S,F,0,0,25,0,0,1938,0,1,1
3536,Earth,55 Cancri e,S,G,1,0,23,0,0,0,0,0,1
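
The `TODO` in `prep_data` about a sum of expenses could be approached roughly as follows. This is only a sketch on toy rows: the column name `total_spend` and the helper `add_total_spend` are hypothetical, and treating a missing bill as zero is an assumption that may not suit every model.

```python
import pandas as pd

# Amenity columns named in the dataset description.
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy rows (hypothetical values); the None in RoomService marks a missing bill.
toy = pd.DataFrame({
    "RoomService": pd.array([0, 109, None], dtype="Int64"),
    "FoodCourt": pd.array([0, 9, 3576], dtype="Int64"),
    "ShoppingMall": pd.array([0, 25, 0], dtype="Int64"),
    "Spa": pd.array([0, 549, 6715], dtype="Int64"),
    "VRDeck": pd.array([0, 44, 49], dtype="Int64"),
})

def add_total_spend(df):
    """Return a copy with `total_spend`: the row-wise sum of the amenity bills."""
    df = df.copy()
    # sum() skips missing values by default, i.e. a missing bill counts as zero.
    df["total_spend"] = df[spend_cols].sum(axis=1)
    return df

out = add_total_spend(toy)
print(out["total_spend"].tolist())
```

Because the sum only looks at one row at a time, it can be computed before the train/test split without leaking information between rows.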


In [54]:
# pipe:
# - impute missing data
# - encode the categorical variables
# - standardize/normalize the data?
# - model

cat_features = X.select_dtypes(include='string').columns
num_features = X.select_dtypes(exclude='string').columns
print(f"{cat_features=}\n{num_features=}")


cat_prep = Pipeline([
    ("imp", SimpleImputer(strategy="most_frequent", missing_values=pd.NA)),
    ("oh", OneHotEncoder())
])

num_prep = SimpleImputer(strategy="median", missing_values=pd.NA)

prep = ColumnTransformer(
    [
        ("cat", cat_prep, cat_features),
        ("num", num_prep, num_features),
    ],
    verbose_feature_names_out=False
)



pipe = Pipeline([
    ("prep", prep),
    ("model", None)
])

#pd.DataFrame(prep.fit_transform(X), X.index, prep.get_feature_names_out())

cat_features=Index(['HomePlanet', 'Destination', 'side', 'deck'], dtype='object')
num_features=Index(['CryoSleep', 'VIP', 'Age', 'RoomService', 'FoodCourt', 'ShoppingMall',
       'Spa', 'VRDeck', 'n_group'],
      dtype='object')
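
On the open question of standardizing: a scaler could be added to the numeric branch as sketched below. The toy frame and its values are hypothetical, missing values are encoded as `np.nan` rather than `pd.NA` to keep the example simple, and `handle_unknown="ignore"` is an extra safeguard that the notebook's pipeline does not use.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values) with one categorical and one numeric column.
toy = pd.DataFrame({
    "deck": ["B", "F", np.nan, "F"],
    "Age": [39.0, 24.0, np.nan, 16.0],
})

cat_prep = Pipeline([
    ("imp", SimpleImputer(strategy="most_frequent")),
    # Categories unseen during fit are encoded as all zeros instead of raising.
    ("oh", OneHotEncoder(handle_unknown="ignore")),
])

num_prep = Pipeline([
    ("imp", SimpleImputer(strategy="median")),
    # Centre each numeric column and scale it to unit variance.
    ("scale", StandardScaler()),
])

prep_scaled = ColumnTransformer([
    ("cat", cat_prep, ["deck"]),
    ("num", num_prep, ["Age"]),
])

Xt = prep_scaled.fit_transform(toy)
print(Xt.shape)

# A deck value that was never seen during fit still transforms cleanly.
unseen = prep_scaled.transform(pd.DataFrame({"deck": ["Z"], "Age": [30.0]}))
```

Tree ensembles are insensitive to monotone rescaling of their inputs, so the scaler mostly matters if a model such as `LogisticRegression` is added to the search.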


## GridSearch

In [55]:
#TODO: fix random states
param_grid = [
    {
        "model": [RandomForestClassifier()],
        "model__ccp_alpha": np.linspace(0.001,0.05, num=10),
    },
    {
        "model": [GradientBoostingClassifier()],
        "model__learning_rate": np.logspace(-3,0, num=5, endpoint=False),
    },
]


gs = GridSearchCV(
    pipe,
    param_grid,
    scoring=["roc_auc", "accuracy"],
    refit="roc_auc",
    cv=9,
    verbose=2,
    n_jobs=-1,
).fit(X_train, y_train)


Fitting 9 folds for each of 15 candidates, totalling 135 fits
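
The `TODO` about random states could be handled by passing one seed to every component that draws random numbers: the split, the CV folds, and the estimator. The sketch below uses synthetic data from `make_classification`, a hypothetical `SEED` constant, and a much smaller grid than the one above, purely to show that two runs then agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

SEED = 0  # hypothetical seed value; any fixed integer works

# Synthetic stand-in for the prepared features and target.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=SEED
)

def run_search():
    """Run a small seeded grid search and return the mean CV scores."""
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=20, random_state=SEED),
        {"ccp_alpha": [0.001, 0.01]},
        scoring="roc_auc",
        cv=cv,
    )
    return search.fit(X_tr, y_tr).cv_results_["mean_test_score"]

first, second = run_search(), run_search()
print(np.allclose(first, second))
```

Without fixed seeds, the ranking of close candidates can change from run to run, which makes small differences in the mean CV score hard to interpret.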


In [56]:
score = 'roc_auc'
scores = pd.DataFrame(gs.cv_results_).sort_values(f'rank_test_{score}')
scores = scores.rename(columns= lambda x: x.replace('param_model__', '_'))

select = lambda x, pat: x.loc[:,x.columns.str.contains(pat)]
scores.set_index('params').pipe(select, "mean_test_").head(10)


params,mean_test_roc_auc,mean_test_accuracy
"{'model': GradientBoostingClassifier(learning_rate=0.25118864315095796), 'model__learning_rate': 0.25118864315095796}",0.879704,0.793261
"{'model': GradientBoostingClassifier(learning_rate=0.25118864315095796), 'model__learning_rate': 0.0630957344480193}",0.878622,0.795563
"{'model': RandomForestClassifier(), 'model__ccp_alpha': 0.001}",0.871728,0.786527
"{'model': GradientBoostingClassifier(learning_rate=0.25118864315095796), 'model__learning_rate': 0.015848931924611134}",0.854348,0.773213
"{'model': RandomForestClassifier(), 'model__ccp_alpha': 0.0064444444444444445}",0.841567,0.740841
"{'model': GradientBoostingClassifier(learning_rate=0.25118864315095796), 'model__learning_rate': 0.003981071705534973}",0.840266,0.74988
"{'model': GradientBoostingClassifier(learning_rate=0.25118864315095796), 'model__learning_rate': 0.001}",0.828611,0.746754
"{'model': RandomForestClassifier(), 'model__ccp_alpha': 0.01188888888888889}",0.818052,0.739526
"{'model': RandomForestClassifier(), 'model__ccp_alpha': 0.017333333333333333}",0.799995,0.740183
"{'model': RandomForestClassifier(), 'model__ccp_alpha': 0.028222222222222225}",0.784145,0.740677


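AUC scores for the grid search's best estimator on the train and held-out test splits. Both are computed from hard `predict` labels rather than probabilities, so they equal balanced accuracy and are not directly comparable to the cross-validated `roc_auc` above: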

In [58]:
from sklearn.metrics import roc_auc_score

gs_model = gs.best_estimator_

roc_auc_score(y_train, gs_model.predict(X_train)), roc_auc_score(y_test, gs_model.predict(X_test))

(0.8413774406985758, 0.8068181015920082)
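
For a ROC AUC that is comparable to the `roc_auc` scorer used during the grid search, the metric needs continuous scores such as `predict_proba(...)[:, 1]`. The sketch below illustrates the difference on synthetic data with a `LogisticRegression` stand-in; it does not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class.
auc_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# With hard labels the ROC curve has a single interior point,
# and the area under it reduces to balanced accuracy.
auc_labels = roc_auc_score(y_te, model.predict(X_te))
bal_acc = balanced_accuracy_score(y_te, model.predict(X_te))

print(auc_scores, auc_labels, bal_acc)
```

In the notebook this would amount to replacing `gs_model.predict(...)` with `gs_model.predict_proba(...)[:, 1]` inside `roc_auc_score`.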

## Predict Test Data

TODO
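
One way this step might look, sketched on toy frames: fit a model, predict on the test rows, and write the two columns that `sample_submission.csv` expects. The passenger rows, the reduced feature list, and the in-memory buffer used instead of a file path are all assumptions made for illustration. In the notebook, the test file would go through `prep_data` and the fitted `gs.best_estimator_` instead.

```python
import io

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-ins (hypothetical values) for the prepared train and test frames.
train = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
    "CryoSleep": [0, 0, 1, 1],
    "Spa": [0.0, 549.0, 0.0, 0.0],
    "Transported": [False, False, True, True],
})
test = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01"],
    "CryoSleep": [1, 0],
    "Spa": [0.0, 2823.0],
})
feature_cols = ["CryoSleep", "Spa"]

model = GradientBoostingClassifier(random_state=0)
model.fit(train[feature_cols], train["Transported"])

# The submission needs the passenger id plus a True/False prediction.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Transported": model.predict(test[feature_cols]).astype(bool),
})

# Written to a buffer here; a real run would pass a file path instead.
buf = io.StringIO()
submission.to_csv(buf, index=False)
print(buf.getvalue())
```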