# Spaceship Titanic vs LightGBM

- PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
- CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination - The planet the passenger will be debarking to.
- Age - The age of the passenger.
- VIP - Whether the passenger has paid for special VIP service during the voyage.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- Name - The first and last names of the passenger.
- Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
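
The `gggg_pp` layout of `PassengerId` described above can be split into a group and a number within the group. A minimal sketch, assuming every Id contains exactly one underscore; the helper name `split_passenger_id` is hypothetical and not part of the notebook:

```python
def split_passenger_id(passenger_id):
    """Split an Id of the form gggg_pp into (group, number within group)."""
    group, number = passenger_id.split("_")
    return group, int(number)


# Ids taken from the rows displayed in this notebook.
for pid in ["0003_01", "0003_02", "0004_01"]:
    print(pid, split_passenger_id(pid))
```

Passengers sharing the same group string (here `0003`) travel together, so the group could serve as an extra feature or as a key for group-aware splitting.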

In [5]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.read_csv('train.csv', index_col=0)

categorical_feature =  ['HomePlanet','Destination','VIP','CryoSleep','Deck','Side']
feature_name = ['HomePlanet','CryoSleep','Destination','Age','VIP','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Deck','Side']

# Cabin - split into three parts: Deck, Num, Side
data_cabin = data['Cabin'].str.split('/', n=2, expand=True).rename(columns={0: "Deck", 1: "Num", 2: "Side"})
data = data.join(data_cabin, how='left')

# Encode them all
enc = OrdinalEncoder()
data[categorical_feature] = enc.fit_transform(data[categorical_feature])

# drop extra columns
data = data.drop(columns=['Cabin', 'Name', 'Num'])


data.head(100)

PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Side
0001_01,1.0,0.0,2.0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,False,1.0,0.0
0002_01,0.0,0.0,2.0,24.0,0.0,109.0,9.0,25.0,549.0,44.0,True,5.0,1.0
0003_01,1.0,0.0,2.0,58.0,1.0,43.0,3576.0,0.0,6715.0,49.0,False,0.0,1.0
0003_02,1.0,0.0,2.0,33.0,0.0,0.0,1283.0,371.0,3329.0,193.0,False,0.0,1.0
0004_01,0.0,0.0,2.0,16.0,0.0,303.0,70.0,151.0,565.0,2.0,True,5.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
0103_01,0.0,0.0,2.0,24.0,0.0,0.0,,0.0,0.0,17.0,True,5.0,1.0
0103_02,0.0,1.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,True,6.0,1.0
0103_03,0.0,1.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,False,6.0,1.0
0105_01,0.0,,2.0,27.0,0.0,0.0,0.0,570.0,2.0,131.0,False,5.0,0.0


In [6]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Index: 8693 entries, 0001_01 to 9280_02
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    8492 non-null   float64
 1   CryoSleep     8476 non-null   float64
 2   Destination   8511 non-null   float64
 3   Age           8514 non-null   float64
 4   VIP           8490 non-null   float64
 5   RoomService   8512 non-null   float64
 6   FoodCourt     8510 non-null   float64
 7   ShoppingMall  8485 non-null   float64
 8   Spa           8510 non-null   float64
 9   VRDeck        8505 non-null   float64
 10  Transported   8693 non-null   bool   
 11  Deck          8494 non-null   float64
 12  Side          8494 non-null   float64
dtypes: bool(1), float64(12)
memory usage: 1.1+ MB
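
The non-null counts above imply that every feature column has some missing values, while the target has none. A small sketch of that arithmetic, using counts copied from the `data.info()` output (the dictionary name `non_null` is only illustrative):

```python
total_rows = 8693  # "Index: 8693 entries" in the output above

# Non-null counts copied from the data.info() output above.
non_null = {
    "HomePlanet": 8492, "CryoSleep": 8476, "Destination": 8511, "Age": 8514,
    "VIP": 8490, "RoomService": 8512, "FoodCourt": 8510, "ShoppingMall": 8485,
    "Spa": 8510, "VRDeck": 8505, "Transported": 8693, "Deck": 8494, "Side": 8494,
}

# Missing values per column, largest first, with their share of all rows.
missing = {col: total_rows - count for col, count in non_null.items()}
for col, n in sorted(missing.items(), key=lambda kv: -kv[1]):
    print(f"{col:<13}{n:>4}  ({n / total_rows:.1%})")
```

On the real frame, `data.isna().sum()` should give the same counts. The notebook leaves these NaNs in place; LightGBM handles missing values natively by default, so it can train without imputation.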


In [7]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('Transported', axis=1), data['Transported'], test_size=0.3, random_state=84)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, label=y_train, feature_name=feature_name, categorical_feature=categorical_feature)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    # a list (not a set) keeps the metric order deterministic
    "metric": ["binary_error", "binary_logloss"],
    # Optional settings, currently disabled:
    # "metric": {"l2", "l1"},
    # "num_leaves": 31,
    # "learning_rate": 0.5,
    # "feature_fraction": 0.9,
    # "bagging_fraction": 0.8,
    # "bagging_freq": 5,
    # "verbose": 0,
}

print("Starting training...")
# train
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=20,
    valid_sets=[lgb_eval],
    callbacks=[lgb.early_stopping(stopping_rounds=5)],
)

print("Starting train predicting...")
# predict probabilities, then threshold at 0.5 to get class labels
y_pred_train = gbm.predict(X_train, num_iteration=gbm.best_iteration)
y_pred_train = (y_pred_train > 0.5).astype("int")
# accuracy_score expects (y_true, y_pred)
lgbm_train = accuracy_score(y_train, y_pred_train)
print(lgbm_train)

Starting training...
[LightGBM] [Info] Number of positive: 3010, number of negative: 3075
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001227 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1380
[LightGBM] [Info] Number of data points in the train set: 6085, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.494659 -> initscore=-0.021365
[LightGBM] [Info] Start training from score -0.021365
Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[19]	valid_0's binary_error: 0.202837	valid_0's binary_logloss: 0.438649
Starting train predicting...
0.8277732128184059
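
The `pavg` and `initscore` values in the log can be reproduced from the class counts on the first `[Info]` line. A sketch, assuming (as the logged values suggest) that the starting score for the binary objective is the log-odds of the positive-class rate:

```python
import math

n_pos, n_neg = 3010, 3075  # from "Number of positive ... number of negative"

pavg = n_pos / (n_pos + n_neg)              # share of positive labels in the train set
init_score = math.log(pavg / (1.0 - pavg))  # log-odds of that share

print(f"pavg={pavg:.6f} -> initscore={init_score:.6f}")
```

A starting score this close to zero reflects the nearly balanced classes. The log also says early stopping was not triggered and the best iteration is 19 of 20, which suggests the validation metrics were still improving when `num_boost_round` ran out.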


In [8]:

print("Starting predicting...")
# predict probabilities, then threshold at 0.5 to get class labels
y_pred_test = gbm.predict(X_test, num_iteration=gbm.best_iteration)
y_pred_test = (y_pred_test > 0.5).astype("int")
# accuracy_score expects (y_true, y_pred)
lgbm_test = accuracy_score(y_test, y_pred_test)
print(lgbm_test)

Starting predicting...
0.7971625766871165
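
Because `X_test` doubles as the validation set, the accuracy printed here should be the complement of the best `binary_error` reported during training. A quick check using only numbers from the outputs above (8693 rows in total, 6085 of them in the train set):

```python
test_accuracy = 0.7971625766871165   # printed by this cell
train_accuracy = 0.8277732128184059  # printed by the training cell
best_binary_error = 0.202837         # valid_0 at the best iteration

n_total, n_train = 8693, 6085
n_test = n_total - n_train

# Accuracy at a 0.5 threshold is the complement of binary_error.
print(abs((1 - test_accuracy) - best_binary_error) < 1e-6)
# Number of correctly classified hold-out rows.
print(round(test_accuracy * n_test), "of", n_test)
# Gap between train and hold-out accuracy, a rough overfitting indicator.
print(f"{train_accuracy - test_accuracy:.4f}")
```

Since the same hold-out rows were used for early stopping, this score is likely a slightly optimistic estimate; a separate validation split would give a cleaner number.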


In [9]:
y_pred_test

array([1, 1, 0, ..., 0, 1, 0])
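
The array above holds 0/1 integers, whereas the `Transported` column is boolean. A minimal sketch of mapping predictions back to booleans keyed by `PassengerId`; the probabilities and Ids here are made-up stand-ins for `gbm.predict(X_test)` and `X_test.index`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for gbm.predict(X_test) and X_test.index.
proba = np.array([0.91, 0.64, 0.12, 0.50])
passenger_ids = pd.Index(["0002_01", "0004_01", "0003_01", "0001_01"], name="PassengerId")

# Same 0.5 threshold as the notebook, kept as booleans instead of ints.
transported = pd.Series(proba > 0.5, index=passenger_ids, name="Transported")
print(transported)
```

Keeping the index attached makes it easy to compare predictions with `y_test` row by row or to write them out with `to_csv`.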