<h2>Gradient Boosted Trees Model</h2>
<h5>- using Yggdrasil Decision Forests</h5>

In [8]:
import numpy as np  # numerical computing
import pandas as pd  # data manipulation and analysis
import os  # operating system interfaces

import tensorflow as tf
import ydf  # Yggdrasil Decision Forests, v0.8.0
import matplotlib.pyplot as plt  # plotting

In [9]:
# load the training data and the serving (test) data
train_df = pd.read_csv("train.csv")
serving_df = pd.read_csv("test.csv")

In [10]:
def preprocess(df):
    # copy the dataframe to avoid modifying original data
    df = df.copy()
    
    # clean up names: split on spaces, strip the characters ,()[].\"' from
    # both ends of each token, then rejoin the tokens with single spaces
    def normalize_name(x):
        return " ".join([v.strip(",()[].\"'") for v in x.split(" ")])
    
    # extract the ticket number: the last space-separated token of the ticket
    def ticket_number(x):
        return x.split(" ")[-1]
    
    # extract the ticket prefix: every token except the last one, joined with
    # underscores; a ticket made of a single token gets "NONE"
    def ticket_item(x):
        items = x.split(" ")
        if len(items) == 1:
            return "NONE"
        return "_".join(items[0:-1])
    
    # apply the helpers to clean the names and derive the two ticket columns
    df["Name"] = df["Name"].apply(normalize_name)
    df["Ticket_number"] = df["Ticket"].apply(ticket_number)
    df["Ticket_item"] = df["Ticket"].apply(ticket_item)

    # return cleaned dataframe
    return df

# preprocess both the training and the serving data
preprocessed_train_df = preprocess(train_df)
preprocessed_serving_df = preprocess(serving_df)

preprocessed_train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_number,Ticket_item
0,1,0,3,Braund Mr Owen Harris,male,22.0,1,0,A/5 21171,7.25,,S,21171,A/5
1,2,1,1,Cumings Mrs John Bradley Florence Briggs Thayer,female,38.0,1,0,PC 17599,71.2833,C85,C,17599,PC
2,3,1,3,Heikkinen Miss Laina,female,26.0,0,0,STON/O2. 3101282,7.925,,S,3101282,STON/O2.
3,4,1,1,Futrelle Mrs Jacques Heath Lily May Peel,female,35.0,1,0,113803,53.1,C123,S,113803,NONE
4,5,0,3,Allen Mr William Henry,male,35.0,0,0,373450,8.05,,S,373450,NONE
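
<p>To make the three helpers concrete, the sketch below re-declares them outside <code>preprocess</code> and applies them to a few values. The raw name string is assumed to be the unprocessed form of the second row shown above, and the last ticket string is made up to show a multi-token prefix; the other tickets are taken from the table.</p>

```python
# Standalone copies of the helpers defined inside preprocess()
def normalize_name(x):
    return " ".join([v.strip(",()[].\"'") for v in x.split(" ")])

def ticket_number(x):
    return x.split(" ")[-1]

def ticket_item(x):
    items = x.split(" ")
    if len(items) == 1:
        return "NONE"
    return "_".join(items[0:-1])

# Assumed raw name (before cleaning) for the second passenger
raw_name = "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
clean_name = normalize_name(raw_name)
print(clean_name)

# "A B 123" is a hypothetical ticket with a two-token prefix
tickets = ["A/5 21171", "113803", "STON/O2. 3101282", "A B 123"]
parsed = [(ticket_number(t), ticket_item(t)) for t in tickets]
for ticket, (number, item) in zip(tickets, parsed):
    print(f"{ticket!r} -> number={number!r}, item={item!r}")
```

<p>Note that <code>Ticket_number</code> stays a string, so the model sees it as a categorical value rather than a numeric one unless it is converted.</p>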


In [11]:
# start from all columns of preprocessed_train_df
input_features = list(preprocessed_train_df.columns)

# remove the raw ticket, the passenger identifier, and the label
input_features.remove("Ticket")
input_features.remove("PassengerId")
input_features.remove("Survived")
#input_features.remove("Ticket_number")

print(f"Input features: {input_features}")

Input features: ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked', 'Ticket_number', 'Ticket_item']


<h3>Train model</h3>

In [34]:
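# train with default hyperparameters
# note: this uses the raw train_df, so the preprocessed columns and input_features are not used here
model = ydf.GradientBoostedTreesLearner(label="Survived").train(train_df)
model.describe()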

Train model on 891 examples
Model trained in 0:00:00.135719


In [35]:
# self-evaluation: the model is scored on the data it was trained on
evaluation = model.evaluate(train_df)
print(f"Accuracy: {evaluation.accuracy} Loss:{evaluation.loss}")

Accuracy: 0.9068462401795735 Loss:0.27726722110008484
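
<p>The accuracy above is measured on the same 891 examples the model was trained on, so it likely overstates performance on unseen passengers. A minimal sketch of a hold-out split over row indices follows; the 80/20 ratio, the seed, and the names <code>split_indices</code>, <code>train_idx</code> and <code>valid_idx</code> are illustrative assumptions, not part of the original notebook.</p>

```python
import random

def split_indices(num_rows, train_fraction=0.8, seed=0):
    """Shuffle the row indices and split them into train and validation parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    cut = int(num_rows * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(891)
print(len(train_idx), len(valid_idx))

# With the notebook's dataframe, the two parts could then be selected with
# train_df.iloc[train_idx] and train_df.iloc[valid_idx]: train on the first,
# evaluate on the second to get a held-out estimate.
```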


In [63]:
evaluation

Label \ Pred,0,1
0,526,60
1,23,282
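
<p>The reported accuracy can be recovered from the confusion matrix: it is the sum of the diagonal (the correct predictions) divided by the total number of examples. A small check using the counts above:</p>

```python
# Counts copied from the confusion matrix above
confusion = [
    [526, 60],
    [23, 282],
]

# Correct predictions sit on the diagonal
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(f"{correct} correct out of {total}: accuracy = {accuracy:.4f}")
```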


In [48]:
def prediction_to_kaggle_format(model, threshold=0.5):
    proba_survive = model.predict(serving_df)
    return pd.DataFrame({
        "PassengerId": serving_df["PassengerId"],
        "Survived": (proba_survive >= threshold).astype(int)
    })

def make_submission(kaggle_predictions):
    path="D:/aSE411/project/submission1.csv"
    kaggle_predictions.to_csv(path, index=False)
    print(f"Submission exported to {path}")
    
kaggle_predictions = prediction_to_kaggle_format(model)
make_submission(kaggle_predictions)
# `head` is a Unix command that the Windows shell does not provide; preview the file with pandas instead
pd.read_csv("D:/aSE411/project/submission1.csv").head()

Submission exported to D:/aSE411/project/submission1.csv


<h3>Training a model with hyperparameter tuning</h3>

In [57]:
tuner = ydf.RandomSearchTuner(num_trials=1000)
tuner.choice("min_examples", [2, 5, 7, 10])
tuner.choice("categorical_algorithm", ["CART", "RANDOM"])
tuner.choice("subsample", [1.0, 0.9, 0.8])

local_search_space = tuner.choice("growing_strategy", ["LOCAL"])
local_search_space.choice("max_depth", [3, 4, 5, 6, 8])

global_search_space = tuner.choice("growing_strategy", ["BEST_FIRST_GLOBAL"], merge=True)
global_search_space.choice("max_num_nodes", [16, 32, 64, 128, 256])

#tuner.choice("use_hessian_gain", [True, False])
tuner.choice("shrinkage", [0.02, 0.05, 0.10, 0.15])
tuner.choice("num_candidate_attributes_ratio", [0.2, 0.5, 0.9, 1.0])


tuner.choice("split_axis", ["AXIS_ALIGNED"])
oblique_space = tuner.choice("split_axis", ["SPARSE_OBLIQUE"], merge=True)
oblique_space.choice("sparse_oblique_normalization",["NONE", "STANDARD_DEVIATION", "MIN_MAX"])
oblique_space.choice("sparse_oblique_weights", ["BINARY", "CONTINUOUS"])
oblique_space.choice("sparse_oblique_num_projections_exponent", [1.0, 1.5])

# Tune the model. Notice the `tuner=tuner`.
# train() returns a model, so the result is named tuned_model rather than learner.
tuned_model = ydf.GradientBoostedTreesLearner(
    label="Survived",
    num_trees=100,
    tuner=tuner,
).train(train_df)

# self-evaluation on the training data
tuned_self_evaluation = tuned_model.evaluate(train_df)
print(f"Accuracy: {tuned_self_evaluation.accuracy} Loss:{tuned_self_evaluation.loss}")

Train model on 891 examples
Model trained in 0:04:05.536988
Accuracy: 0.9719416386083053 Loss:0.14614386160102255
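
<p>As a rough illustration of what the search space above describes, the sketch below counts its combinations and draws a few random configurations in plain Python. This is not how YDF implements <code>RandomSearchTuner</code>; <code>count_combinations</code>, <code>sample_config</code> and the dictionary layout are hypothetical, and sampling each parent value uniformly is an assumption. Only the hyperparameter names and candidate values come from the cell above.</p>

```python
import random

# Unconditional hyperparameters and their candidate values
base_space = {
    "min_examples": [2, 5, 7, 10],
    "categorical_algorithm": ["CART", "RANDOM"],
    "subsample": [1.0, 0.9, 0.8],
    "shrinkage": [0.02, 0.05, 0.10, 0.15],
    "num_candidate_attributes_ratio": [0.2, 0.5, 0.9, 1.0],
}

# Conditional hyperparameters: children are only sampled for the chosen parent value
conditional_space = {
    "growing_strategy": {
        "LOCAL": {"max_depth": [3, 4, 5, 6, 8]},
        "BEST_FIRST_GLOBAL": {"max_num_nodes": [16, 32, 64, 128, 256]},
    },
    "split_axis": {
        "AXIS_ALIGNED": {},
        "SPARSE_OBLIQUE": {
            "sparse_oblique_normalization": ["NONE", "STANDARD_DEVIATION", "MIN_MAX"],
            "sparse_oblique_weights": ["BINARY", "CONTINUOUS"],
            "sparse_oblique_num_projections_exponent": [1.0, 1.5],
        },
    },
}

def count_combinations():
    """Count the distinct configurations in the search space."""
    total = 1
    for values in base_space.values():
        total *= len(values)
    for branches in conditional_space.values():
        branch_total = 0
        for children in branches.values():
            size = 1
            for values in children.values():
                size *= len(values)
            branch_total += size
        total *= branch_total
    return total

def sample_config(rng):
    """Draw one configuration, sampling children only for the chosen branch."""
    config = {name: rng.choice(values) for name, values in base_space.items()}
    for name, branches in conditional_space.items():
        choice = rng.choice(list(branches))
        config[name] = choice
        for child, values in branches[choice].items():
            config[child] = rng.choice(values)
    return config

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
num_combinations = count_combinations()
print(f"{num_combinations} possible configurations")
for config in configs:
    print(config)
```

<p>With 49,920 possible configurations, <code>num_trials=1000</code> covers at most about 2% of the space, which is why a random search is used here rather than an exhaustive one.</p>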


In [69]:
tuned_self_evaluation

Label \ Pred,0,1
0,540,16
1,9,326


<h3>Making an ensemble</h3>

In [60]:
predictions = None
num_predictions = 0

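for i in range(100):
    print(f"i:{i}")
    # Possible learners: GradientBoostedTreesLearner or RandomForestLearner
    # use a different seed per iteration; with the same seed each run would repeat the same model
    model = ydf.GradientBoostedTreesLearner(label="Survived", random_seed=i).train(train_df)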

    sub_predictions = model.predict(serving_df)
    if predictions is None:
        predictions = sub_predictions
    else:
        predictions += sub_predictions
    num_predictions += 1

# average the predicted probabilities over all models
predictions /= num_predictions

kaggle_predictions = pd.DataFrame({
        "PassengerId": serving_df["PassengerId"],
        "Survived": (predictions >= 0.5).astype(int)
    })

make_submission(kaggle_predictions)
kaggle_predictions.tail()

i:0
Train model on 891 examples
Model trained in 0:00:00.148813
i:1
Train model on 891 examples
Model trained in 0:00:00.149803
i:2
Train model on 891 examples
Model trained in 0:00:00.133388
i:3
Train model on 891 examples
Model trained in 0:00:00.138371
i:4
Train model on 891 examples
Model trained in 0:00:00.133643
i:5
Train model on 891 examples
Model trained in 0:00:00.135948
i:6
Train model on 891 examples
Model trained in 0:00:00.146229
i:7
Train model on 891 examples
Model trained in 0:00:00.149384
i:8
Train model on 891 examples
Model trained in 0:00:00.155850
i:9
Train model on 891 examples
Model trained in 0:00:00.132423
i:10
Train model on 891 examples
Model trained in 0:00:00.137313
i:11
Train model on 891 examples
Model trained in 0:00:00.139091
i:12
Train model on 891 examples
Model trained in 0:00:00.144339
i:13
Train model on 891 examples
Model trained in 0:00:00.161944
i:14
Train model on 891 examples
Model trained in 0:00:00.136285
i:15
Train model on 891 examples
...

Unnamed: 0,PassengerId,Survived
413,1305,0
414,1306,1
415,1307,0
416,1308,0
417,1309,0
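
<p>The averaging step in the loop above can be checked on a toy example. The probabilities below are made up for illustration; each inner list stands in for one model's <code>predict</code> output on three passengers.</p>

```python
# Hypothetical survival probabilities from three models for three passengers
model_outputs = [
    [0.25, 0.5, 0.75],
    [0.5, 0.75, 1.0],
    [0.0, 0.25, 0.5],
]

num_models = len(model_outputs)
num_passengers = len(model_outputs[0])

# Average across models, passenger by passenger
averaged = [
    sum(output[j] for output in model_outputs) / num_models
    for j in range(num_passengers)
]

# Apply the same 0.5 threshold as the notebook
labels = [int(p >= 0.5) for p in averaged]

print(averaged)
print(labels)
```

<p>Averaging probabilities before thresholding lets a confident model outweigh hesitant ones, which a majority vote over already-thresholded labels would not.</p>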


<p>Reference: <a href="https://www.kaggle.com/code/gusthema/titanic-competition-w-tensorflow-decision-forests/notebook">Kaggle notebook using TensorFlow Decision Forests (tfdf)</a></p>
<p>P.S. The tfdf library could not be installed on Windows, so I used <i>ydf (Yggdrasil Decision Forests)</i> instead of tfdf.</p>