<h2>Gradient Boosted Trees Model</h2>
<h5>- using Yggdrasil Decision Forests</h5>

In [14]:
# imports
import numpy as np  # numerical computing
import pandas as pd  # data manipulation and analysis
import os  # operating system interfaces

import ydf  # Yggdrasil Decision Forests, v0.8.0
import matplotlib.pyplot as plt

In [15]:
# import dataset
train_df = pd.read_csv("data/train.csv")
serving_df = pd.read_csv("data/test.csv")

<h3>Preprocess</h3>

In [16]:
def preprocess(df):
    # copy the dataframe to avoid modifying original data
    df = df.copy()
    
    # clean up a name: split it on spaces, strip the characters , ( ) [ ] . " '
    # from both ends of each token, and rejoin the tokens with single spaces
    def normalize_name(x):
        return " ".join([v.strip(",()[].\"'") for v in x.split(" ")])
    
    # extract the ticket number: the last space-separated part of the ticket
    def ticket_number(x):
        return x.split(" ")[-1]
    
    # extract the ticket prefix: "NONE" if the ticket is a single word,
    # otherwise every part except the last one, joined with underscores ( _ )
    def ticket_item(x):
        items = x.split(" ")
        if len(items) == 1:
            return "NONE"
        return "_".join(items[0:-1])
    
    # apply the helper functions
    df["Name"] = df["Name"].apply(normalize_name)
    df["Ticket_number"] = df["Ticket"].apply(ticket_number)
    df["Ticket_item"] = df["Ticket"].apply(ticket_item)

    # fill missing ages with the mean age and missing ports with the most
    # frequent port; the result is assigned back to the column because
    # calling fillna(..., inplace=True) on a column selection stops working in pandas 3.0
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

    # "Cabin" is kept as a feature; to drop it, use df = df.drop(columns=["Cabin"])

    # return cleaned dataframe
    return df

# store cleaned data to preprocessed_..
preprocessed_train_df = preprocess(train_df)
preprocessed_serving_df = preprocess(serving_df)

preprocessed_train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_number,Ticket_item
0,1,0,3,Braund Mr Owen Harris,male,22.0,1,0,A/5 21171,7.25,,S,21171,A/5
1,2,1,1,Cumings Mrs John Bradley Florence Briggs Thayer,female,38.0,1,0,PC 17599,71.2833,C85,C,17599,PC
2,3,1,3,Heikkinen Miss Laina,female,26.0,0,0,STON/O2. 3101282,7.925,,S,3101282,STON/O2.
3,4,1,1,Futrelle Mrs Jacques Heath Lily May Peel,female,35.0,1,0,113803,53.1,C123,S,113803,NONE
4,5,0,3,Allen Mr William Henry,male,35.0,0,0,373450,8.05,,S,373450,NONE
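
<p>As a quick check of the helper functions, the sketch below copies them out of <code>preprocess</code> and runs them on a few strings. The raw name is assumed to be the form found in the public Titanic data, and <code>"SC/AH Basle 541"</code> is only an illustrative multi-part ticket.</p>

```python
# Standalone copies of the helpers defined inside preprocess().
def normalize_name(x):
    return " ".join([v.strip(",()[].\"'") for v in x.split(" ")])

def ticket_number(x):
    return x.split(" ")[-1]

def ticket_item(x):
    items = x.split(" ")
    if len(items) == 1:
        return "NONE"
    return "_".join(items[0:-1])

# Assumed raw name (before cleaning); punctuation is stripped per token.
print(normalize_name("Braund, Mr. Owen Harris"))  # Braund Mr Owen Harris

# Ticket with a prefix: the number is the last part, the item is the rest.
print(ticket_number("A/5 21171"), ticket_item("A/5 21171"))  # 21171 A/5

# Ticket without a prefix: the item falls back to "NONE".
print(ticket_number("113803"), ticket_item("113803"))  # 113803 NONE

# Illustrative ticket with two prefix parts, joined by an underscore.
print(ticket_item("SC/AH Basle 541"))  # SC/AH_Basle
```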


In [17]:
# start from all columns of preprocessed_train_df
input_features = list(preprocessed_train_df.columns)

# remove some unused columns
input_features.remove("Ticket")
input_features.remove("PassengerId")
input_features.remove("Survived")
#input_features.remove("Ticket_number")

print(f"Input features: {input_features}")

Input features: ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked', 'Ticket_number', 'Ticket_item']


<h3>Train model</h3>

In [18]:
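# train the model
# (input_features is not passed to the learner, so every column except the label is used as a feature)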
model = ydf.GradientBoostedTreesLearner(label="Survived").train(preprocessed_train_df)
model.describe()

Train model on 891 examples
Model trained in 0:00:00.210024


In [19]:
evaluation = model.evaluate(preprocessed_train_df)
print(f"Accuracy: {evaluation.accuracy} Loss:{evaluation.loss}")

Accuracy: 0.9281705948372615 Loss:0.2131330627314621


<img src="https://i.redd.it/1rxo44rmhec41.png">

In [20]:
evaluation

Label \ Pred,0,1
0,530,45
1,19,297
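
<p>The reported accuracy can be recovered from this confusion matrix: it is the sum of the diagonal divided by the total count, which does not depend on which axis holds the labels. A minimal sketch, with the counts copied from the output above (the function name is only illustrative):</p>

```python
# Confusion matrix counts copied from the evaluation output.
confusion = [
    [530, 45],
    [19, 297],
]

def accuracy_from_confusion(matrix):
    """Fraction of examples on the diagonal, i.e. correct predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

acc = accuracy_from_confusion(confusion)
print(f"{acc:.4f}")  # 0.9282
```

<p>Because the model is evaluated on the same data it was trained on, this number is an optimistic estimate.</p>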


In [21]:
# convert the predicted survival probabilities to the Kaggle submission format
def prediction_to_kaggle_format(model, threshold=0.5):
    proba_survive = model.predict(preprocessed_serving_df)
    return pd.DataFrame({
        "PassengerId": preprocessed_serving_df["PassengerId"],
        "Survived": (proba_survive >= threshold).astype(int)
    })

# write the submission file
def make_submission(kaggle_predictions):
    path = "data/submission1.csv"
    kaggle_predictions.to_csv(path, index=False)
    print(f"Submission exported to {path}")

kaggle_predictions = prediction_to_kaggle_format(model)
make_submission(kaggle_predictions)

Submission exported to data/submission1.csv
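
<p>The thresholding step in <code>prediction_to_kaggle_format</code> turns each predicted probability into a 0/1 label. A plain-Python sketch of the same rule, using made-up probabilities and a hypothetical helper name:</p>

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it reaches the threshold, else 0."""
    return [int(p >= threshold) for p in probabilities]

# Made-up probabilities, for illustration only.
probs = [0.12, 0.50, 0.87, 0.49]

print(to_labels(probs))                  # [0, 1, 1, 0]
print(to_labels(probs, threshold=0.9))   # [0, 0, 0, 0]
```

<p>A value exactly equal to the threshold counts as a survivor because the comparison is <code>&gt;=</code>.</p>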


<h3>Training a model with hyperparameter tuning</h3>

<h3>What is tuning?</h3>
<p>Model tuning, also called hyperparameter optimization (one component of AutoML), searches for the hyperparameters of a learner that maximize the quality of the resulting model. YDF supports model tuning out of the box.</p>
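
<p>The idea behind random search can be sketched in a few lines of plain Python. This is not how YDF implements its tuner: the search space, <code>sample</code>, <code>toy_score</code> and <code>random_search</code> are all hypothetical, and the score is a placeholder for training and validating a real model.</p>

```python
import random

# Hypothetical search space. A dict value marks a conditional choice:
# each option brings its own child hyperparameters.
SPACE = {
    "shrinkage": [0.02, 0.05, 0.10, 0.15],
    "min_examples": [2, 5, 7, 10],
    "growing_strategy": {
        "LOCAL": {"max_depth": [3, 4, 5, 6, 8]},
        "BEST_FIRST_GLOBAL": {"max_num_nodes": [16, 32, 64, 128, 256]},
    },
}

def sample(space, rng):
    """Draw one configuration, descending into conditional sub-spaces."""
    config = {}
    for name, values in space.items():
        if isinstance(values, dict):
            option = rng.choice(sorted(values))
            config[name] = option
            config.update(sample(values[option], rng))
        else:
            config[name] = rng.choice(values)
    return config

def toy_score(config):
    """Placeholder objective (higher is better); not a real model score."""
    return -abs(config["shrinkage"] - 0.10) - 0.001 * config["min_examples"]

def random_search(space, score_fn, num_trials, seed=0):
    """Evaluate num_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample(space, rng)
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(SPACE, toy_score, num_trials=50)
print(best_config, best_score)
```

<p>Each sampled configuration contains either <code>max_depth</code> or <code>max_num_nodes</code>, never both, because those parameters only exist under their parent growing strategy.</p>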

In [22]:
# Initializing the Random Search Tuner:
tuner = ydf.RandomSearchTuner(num_trials=1000)

# Defining Search Space for Hyperparameters:
# min_examples: Specifies minimum examples required in a leaf, with options 2, 5, 7, and 10.
tuner.choice("min_examples", [2, 5, 7, 10])
# categorical_algorithm: Specifies the algorithm for handling categorical features, with "CART" and "RANDOM" as choices.
tuner.choice("categorical_algorithm", ["CART", "RANDOM"])
# subsample: Controls the sampling ratio for data, with full data (1.0) and partial data (0.9 or 0.8) as options.
tuner.choice("subsample", [1.0, 0.9, 0.8])

# A local growing strategy, "LOCAL", is defined with options for tree depth
local_search_space = tuner.choice("growing_strategy", ["LOCAL"])
local_search_space.choice("max_depth", [3, 4, 5, 6, 8])

# A global strategy, "BEST_FIRST_GLOBAL", 
# is specified with options for the maximum number of nodes, 
# allowing it to explore configurations with different node limits:
global_search_space = tuner.choice("growing_strategy", ["BEST_FIRST_GLOBAL"], merge=True)
global_search_space.choice("max_num_nodes", [16, 32, 64, 128, 256])

# The tuner also explores choices for shrinkage (learning rate),
# the ratio of candidate attributes per split, and the split type
#tuner.choice("use_hessian_gain", [True, False])
tuner.choice("shrinkage", [0.02, 0.05, 0.10, 0.15])
tuner.choice("num_candidate_attributes_ratio", [0.2, 0.5, 0.9, 1.0])
tuner.choice("split_axis", ["AXIS_ALIGNED"])

# Additional tuning is done specifically for oblique splits, 
# adding normalization and weighting options when split_axis is set to "SPARSE_OBLIQUE":
oblique_space = tuner.choice("split_axis", ["SPARSE_OBLIQUE"], merge=True)
oblique_space.choice("sparse_oblique_normalization",["NONE", "STANDARD_DEVIATION", "MIN_MAX"])
oblique_space.choice("sparse_oblique_weights", ["BINARY", "CONTINUOUS"])
oblique_space.choice("sparse_oblique_num_projections_exponent", [1.0, 1.5])

# Tune the model. Notice the `tuner=tuner`.
learner = ydf.GradientBoostedTreesLearner(
    label="Survived",
    num_trees=100,
    tuner=tuner,
)

model = learner.train(preprocessed_train_df)
model.describe()


Train model on 891 examples
Model trained in 0:04:11.088029


trial,score,duration,min_examples,categorical_algorithm,subsample,growing_strategy,max_depth,shrinkage,num_candidate_attributes_ratio,split_axis,sparse_oblique_normalization,sparse_oblique_weights,sparse_oblique_num_projections_exponent,max_num_nodes
635,-0.580658,159.219,10,RANDOM,0.8,BEST_FIRST_GLOBAL,,0.15,0.2,SPARSE_OBLIQUE,NONE,BINARY,1.5,32.0
358,-0.618985,93.1543,7,CART,0.8,LOCAL,6.0,0.15,1.0,SPARSE_OBLIQUE,NONE,BINARY,1.0,
910,-0.624599,228.795,7,RANDOM,0.8,LOCAL,4.0,0.15,0.2,SPARSE_OBLIQUE,NONE,CONTINUOUS,1.5,
83,-0.631624,22.8026,10,RANDOM,0.8,BEST_FIRST_GLOBAL,,0.1,1.0,SPARSE_OBLIQUE,NONE,CONTINUOUS,1.5,256.0
852,-0.635241,214.413,7,CART,0.8,BEST_FIRST_GLOBAL,,0.15,0.2,SPARSE_OBLIQUE,STANDARD_DEVIATION,CONTINUOUS,1.5,128.0
571,-0.635946,143.572,7,RANDOM,1.0,BEST_FIRST_GLOBAL,,0.15,0.9,SPARSE_OBLIQUE,NONE,BINARY,1.5,64.0
226,-0.643665,59.5255,2,CART,0.9,LOCAL,8.0,0.15,0.2,SPARSE_OBLIQUE,NONE,BINARY,1.0,
65,-0.645094,17.548,2,RANDOM,1.0,LOCAL,4.0,0.15,0.2,SPARSE_OBLIQUE,NONE,BINARY,1.5,
136,-0.646216,36.4129,10,CART,0.9,BEST_FIRST_GLOBAL,,0.15,1.0,SPARSE_OBLIQUE,NONE,BINARY,1.5,32.0
498,-0.648146,125.358,10,RANDOM,0.8,LOCAL,6.0,0.15,1.0,SPARSE_OBLIQUE,MIN_MAX,CONTINUOUS,1.0,
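
<p>To put the 1000 trials in perspective, the number of distinct configurations in the search space defined above can be counted by hand. The sketch below assumes that each top-level <code>choice</code> contributes an independent factor and that each merged branch adds its own sub-space; the counts are read off the tuner cell.</p>

```python
from math import prod

# Number of options for each unconditional hyperparameter.
flat_choices = {
    "min_examples": 4,
    "categorical_algorithm": 2,
    "subsample": 3,
    "shrinkage": 4,
    "num_candidate_attributes_ratio": 4,
}

# growing_strategy: LOCAL has 5 max_depth options,
# BEST_FIRST_GLOBAL has 5 max_num_nodes options.
growing_strategy = 5 + 5

# split_axis: AXIS_ALIGNED has no children; SPARSE_OBLIQUE has
# 3 normalizations x 2 weight types x 2 projection exponents.
split_axis = 1 + 3 * 2 * 2

total = prod(flat_choices.values()) * growing_strategy * split_axis
print(total)                   # 49920
print(f"{1000 / total:.1%}")   # 2.0%
```

<p>Under these assumptions, 1000 random trials visit at most about 2% of the configurations.</p>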


In [23]:
tuned_self_evaluation = model.evaluate(preprocessed_train_df)
print(f"Accuracy: {tuned_self_evaluation.accuracy} Loss:{tuned_self_evaluation.loss}")

Accuracy: 0.9685746352413019 Loss:0.14998528368454364


In [24]:
tuned_self_evaluation

Label \ Pred,0,1
0,539,18
1,10,324


<h3>Making an ensemble</h3>

In [25]:
predictions = None
num_predictions = 0

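# Train the models and accumulate their predictions:
for i in range(100):
    print(f"i:{i}")
    # Possible learners: GradientBoostedTreesLearner or RandomForestLearner.
    # A different random_seed per iteration makes the models differ;
    # with the same seed, every iteration would train the same model.
    model = ydf.GradientBoostedTreesLearner(label="Survived", random_seed=i).train(preprocessed_train_df)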

    sub_predictions = model.predict(preprocessed_serving_df)
    # If predictions is None (first iteration), it’s set to the current model’s predictions.
    # Otherwise, the predictions are accumulated by adding sub_predictions to predictions
    if predictions is None:
        predictions = sub_predictions
    else:
        predictions += sub_predictions
    num_predictions += 1

# Average the Predictions:
# After the loop, predictions contains the summed predictions from 100 models. 
# Dividing by num_predictions gives the average prediction:
predictions/=num_predictions

# create submission dataframe
kaggle_predictions = pd.DataFrame({
        "PassengerId": preprocessed_serving_df["PassengerId"],
        "Survived": (predictions >= 0.5).astype(int)
    })

make_submission(kaggle_predictions)
kaggle_predictions.tail()

i:0
Train model on 891 examples
Model trained in 0:00:00.229880
i:1
Train model on 891 examples
Model trained in 0:00:00.196978
...

Unnamed: 0,PassengerId,Survived
413,1305,0
414,1306,1
415,1307,0
416,1308,0
417,1309,0
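
<p>The averaging step of the ensemble can be sketched without YDF. The per-model probabilities below are made up, and <code>average_predictions</code> is a hypothetical helper that mirrors the running-sum logic of the loop above.</p>

```python
def average_predictions(per_model_predictions):
    """Sum the predictions model by model, then divide by the model count."""
    total = None
    count = 0
    for preds in per_model_predictions:
        if total is None:
            total = list(preds)  # copy so the input is not modified
        else:
            total = [t + p for t, p in zip(total, preds)]
        count += 1
    return [t / count for t in total]

# Made-up survival probabilities from three models for three passengers.
per_model = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.4],
    [0.8, 0.3, 0.2],
]

avg = average_predictions(per_model)
labels = [int(p >= 0.5) for p in avg]
print(labels)  # [1, 0, 0]
```

<p>In this example the third passenger is predicted to survive by the first model alone, but the averaged probability falls below 0.5. Averaging only helps when the individual models actually differ.</p>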


<p>Reference: <a href="https://www.kaggle.com/code/gusthema/titanic-competition-w-tensorflow-decision-forests/notebook">Kaggle TF-DF notebook</a></p>
<p>P.S. The tfdf (TensorFlow Decision Forests) library could not be installed on Windows, so I used <i>ydf (Yggdrasil Decision Forests)</i> instead.</p>