This notebook is copied from: https://www.kaggle.com/code/gusthema/titanic-competition-w-tensorflow-decision-forests 

In [None]:
import numpy as np
import pandas as pd
import os

import tensorflow as tf
import tensorflow_decision_forests as tfdf

print(f"Found TF-DF {tfdf.__version__}")

In [None]:
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
serving_df = pd.read_csv("/kaggle/input/titanic/test.csv")

train_df.head(10)

In [None]:
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

## Prepare dataset
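Apply the following transformations:
1. Normalize the names by stripping punctuation from each word. For example, "Braund, Mr. Owen Harris" becomes "Braund Mr Owen Harris", which is later tokenized into ["Braund", "Mr", "Owen", "Harris"].
2. Split the ticket into its prefix and its number. For example, the ticket "STON/O2. 3101282" becomes the prefix "STON/O2." and the number 3101282. A ticket without a prefix gets the prefix "NONE".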

In [None]:
def preprocess(df):
    df = df.copy()
    
    def normalize_name(x):
        return " ".join([v.strip(",()[].\"'") for v in x.split(" ")])
    
    def ticket_number(x):
        return x.split(" ")[-1]
        
    def ticket_item(x):
        items = x.split(" ")
        if len(items) == 1:
            return "NONE"
        return "_".join(items[0:-1])
    
    df["Name"] = df["Name"].apply(normalize_name)
    df["Ticket_number"] = df["Ticket"].apply(ticket_number)
    df["Ticket_item"] = df["Ticket"].apply(ticket_item)                     
    return df
    
preprocessed_train_df = preprocess(train_df)
preprocessed_serving_df = preprocess(serving_df)

preprocessed_train_df.head(5)

In [None]:
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Ticket_number	Ticket_item
0	1	0	3	Braund Mr Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	21171	A/5
1	2	1	1	Cumings Mrs John Bradley Florence Briggs Thayer	female	38.0	1	0	PC 17599	71.2833	C85	C	17599	PC
2	3	1	3	Heikkinen Miss Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	3101282	STON/O2.
3	4	1	1	Futrelle Mrs Jacques Heath Lily May Peel	female	35.0	1	0	113803	53.1000	C123	S	113803	NONE
4	5	0	3	Allen Mr William Henry	male	35.0	0	0	373450	8.0500	NaN	S	373450	NONE
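As a quick check of the two transformations, the helpers can be exercised on a few raw values without pandas. The sketch below restates the three inner functions of `preprocess` at module level (an assumption made only so they can be called directly) and uses names and tickets taken from the tables above.

```python
def normalize_name(x):
    # Strip punctuation from each space-separated token.
    return " ".join([v.strip(",()[].\"'") for v in x.split(" ")])

def ticket_number(x):
    # The last space-separated item is the ticket number.
    return x.split(" ")[-1]

def ticket_item(x):
    # Everything before the number is the prefix; "NONE" when there is no prefix.
    items = x.split(" ")
    if len(items) == 1:
        return "NONE"
    return "_".join(items[0:-1])

name = normalize_name("Braund, Mr. Owen Harris")
tickets = ["A/5 21171", "STON/O2. 3101282", "113803"]
parsed = [(ticket_item(t), ticket_number(t)) for t in tickets]

print(name)
print(parsed)
```

Note that the ticket number stays a string, which is why the model summary later lists `Ticket_number` as a categorical feature.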

In [None]:
input_features = list(preprocessed_train_df.columns)
input_features.remove("Ticket")
input_features.remove("PassengerId")
input_features.remove("Survived")
#input_features.remove("Ticket_number")

print(f"Input features: {input_features}")

In [None]:
Input features: ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked', 'Ticket_number', 'Ticket_item']

## Convert Pandas dataset to TensorFlow Dataset

In [None]:
def tokenize_names(features, labels=None):
    """Divide the names into tokens. TF-DF can consume text tokens natively."""
    features["Name"] =  tf.strings.split(features["Name"])
    return features, labels

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(preprocessed_train_df,label="Survived").map(tokenize_names)
serving_ds = tfdf.keras.pd_dataframe_to_tf_dataset(preprocessed_serving_df).map(tokenize_names)

## Train model with default parameters
First, train a GradientBoostedTreesModel with the default parameters.

In [None]:
model = tfdf.keras.GradientBoostedTreesModel(
    verbose=0, # Very few logs
    features=[tfdf.keras.FeatureUsage(name=n) for n in input_features],
    exclude_non_specified_features=True, # Only use the features in "features"
    random_seed=1234,
)
model.fit(train_ds)

self_evaluation = model.make_inspector().evaluation()
print(f"Accuracy: {self_evaluation.accuracy} Loss:{self_evaluation.loss}")

Accuracy: 0.8260869383811951 Loss:0.8608942627906799

## Train model with improved default parameters


In [None]:
model = tfdf.keras.GradientBoostedTreesModel(
    verbose=0, # Very few logs
    features=[tfdf.keras.FeatureUsage(name=n) for n in input_features],
    exclude_non_specified_features=True, # Only use the features in "features"
    
    #num_trees=2000,
    
    # Only for GBT.
    # A bit slower, but great to understand the model.
    # compute_permutation_variable_importance=True,
    
    # Change the default hyper-parameters
    # hyperparameter_template="benchmark_rank1@v1",
    
    #num_trees=1000,
    #tuner=tuner
    
    min_examples=1,
    categorical_algorithm="RANDOM",
    #max_depth=4,
    shrinkage=0.05,
    #num_candidate_attributes_ratio=0.2,
    split_axis="SPARSE_OBLIQUE",
    sparse_oblique_normalization="MIN_MAX",
    sparse_oblique_num_projections_exponent=2.0,
    num_trees=2000,
    #validation_ratio=0.0,
    random_seed=1234,
    
)
model.fit(train_ds)

self_evaluation = model.make_inspector().evaluation()
print(f"Accuracy: {self_evaluation.accuracy} Loss:{self_evaluation.loss}")

In [None]:
Accuracy: 0.760869562625885 Loss:1.0154211521148682

In [None]:
model.summary()

In [None]:
Model: "gradient_boosted_trees_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
=================================================================
Total params: 1
Trainable params: 0
Non-trainable params: 1
_________________________________________________________________
Type: "GRADIENT_BOOSTED_TREES"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (11):
	Age
	Cabin
	Embarked
	Fare
	Name
	Parch
	Pclass
	Sex
	SibSp
	Ticket_item
	Ticket_number

No weights

Variable Importance: INV_MEAN_MIN_DEPTH:
    1.           "Sex"  0.576632 ################
    2.           "Age"  0.364297 #######
    3.          "Fare"  0.278839 ####
    4.          "Name"  0.208548 #
    5. "Ticket_number"  0.180792 
    6.        "Pclass"  0.176962 
    7.         "Parch"  0.176659 
    8.   "Ticket_item"  0.175540 
    9.      "Embarked"  0.172339 
   10.         "SibSp"  0.170442 

Variable Importance: NUM_AS_ROOT:
    1.  "Sex" 28.000000 ################
    2. "Name"  5.000000 

Variable Importance: NUM_NODES:
    1.           "Age" 406.000000 ################
    2.          "Fare" 290.000000 ###########
    3.          "Name" 44.000000 #
    4.   "Ticket_item" 42.000000 #
    5.           "Sex" 31.000000 #
    6.         "Parch" 28.000000 
    7. "Ticket_number" 22.000000 
    8.        "Pclass" 15.000000 
    9.      "Embarked" 12.000000 
   10.         "SibSp"  5.000000 

Variable Importance: SUM_SCORE:
    1.           "Sex" 460.497828 ################
    2.           "Age" 355.963333 ############
    3.          "Fare" 292.870316 ##########
    4.          "Name" 108.548952 ###
    5.        "Pclass" 28.132254 
    6.   "Ticket_item" 23.818676 
    7. "Ticket_number" 23.772288 
    8.         "Parch" 19.303155 
    9.      "Embarked"  8.155722 
   10.         "SibSp"  0.015225 



Loss: BINOMIAL_LOG_LIKELIHOOD
Validation loss value: 1.01542
Number of trees per iteration: 1
Node format: NOT_SET
Number of trees: 33
Total number of nodes: 1823

Number of nodes by tree:
Count: 33 Average: 55.2424 StdDev: 5.13473
Min: 39 Max: 63 Ignored: 0
----------------------------------------------
[ 39, 40) 1   3.03%   3.03% #
[ 40, 41) 0   0.00%   3.03%
[ 41, 42) 0   0.00%   3.03%
[ 42, 44) 0   0.00%   3.03%
[ 44, 45) 0   0.00%   3.03%
[ 45, 46) 0   0.00%   3.03%
[ 46, 47) 0   0.00%   3.03%
[ 47, 49) 2   6.06%   9.09% ###
[ 49, 50) 2   6.06%  15.15% ###
[ 50, 51) 0   0.00%  15.15%
[ 51, 52) 2   6.06%  21.21% ###
[ 52, 54) 5  15.15%  36.36% #######
[ 54, 55) 0   0.00%  36.36%
[ 55, 56) 5  15.15%  51.52% #######
[ 56, 57) 0   0.00%  51.52%
[ 57, 59) 4  12.12%  63.64% ######
[ 59, 60) 7  21.21%  84.85% ##########
[ 60, 61) 0   0.00%  84.85%
[ 61, 62) 3   9.09%  93.94% ####
[ 62, 63] 2   6.06% 100.00% ###

Depth by leafs:
Count: 928 Average: 4.8847 StdDev: 0.380934
Min: 2 Max: 5 Ignored: 0
----------------------------------------------
[ 2, 3)   1   0.11%   0.11%
[ 3, 4)  17   1.83%   1.94%
[ 4, 5)  70   7.54%   9.48% #
[ 5, 5] 840  90.52% 100.00% ##########

Number of training obs by leaf:
Count: 928 Average: 28.4127 StdDev: 70.8313
Min: 1 Max: 438 Ignored: 0
----------------------------------------------
[   1,  22) 731  78.77%  78.77% ##########
[  22,  44)  74   7.97%  86.75% #
[  44,  66)  37   3.99%  90.73% #
[  66,  88)   3   0.32%  91.06%
[  88, 110)   9   0.97%  92.03%
[ 110, 132)   8   0.86%  92.89%
[ 132, 154)  18   1.94%  94.83%
[ 154, 176)   8   0.86%  95.69%
[ 176, 198)   6   0.65%  96.34%
[ 198, 220)   2   0.22%  96.55%
[ 220, 241)   2   0.22%  96.77%
[ 241, 263)   1   0.11%  96.88%
[ 263, 285)   2   0.22%  97.09%
[ 285, 307)   5   0.54%  97.63%
[ 307, 329)   1   0.11%  97.74%
[ 329, 351)   2   0.22%  97.95%
[ 351, 373)   6   0.65%  98.60%
[ 373, 395)   6   0.65%  99.25%
[ 395, 417)   2   0.22%  99.46%
[ 417, 438]   5   0.54% 100.00%

Attribute in nodes:
	406 : Age [NUMERICAL]
	290 : Fare [NUMERICAL]
	44 : Name [CATEGORICAL_SET]
	42 : Ticket_item [CATEGORICAL]
	31 : Sex [CATEGORICAL]
	28 : Parch [NUMERICAL]
	22 : Ticket_number [CATEGORICAL]
	15 : Pclass [NUMERICAL]
	12 : Embarked [CATEGORICAL]
	5 : SibSp [NUMERICAL]

Attribute in nodes with depth <= 0:
	28 : Sex [CATEGORICAL]
	5 : Name [CATEGORICAL_SET]

Attribute in nodes with depth <= 1:
	39 : Age [NUMERICAL]
	28 : Sex [CATEGORICAL]
	21 : Fare [NUMERICAL]
	5 : Name [CATEGORICAL_SET]
	3 : Pclass [NUMERICAL]
	2 : Ticket_number [CATEGORICAL]
	1 : Parch [NUMERICAL]

Attribute in nodes with depth <= 2:
	102 : Age [NUMERICAL]
	65 : Fare [NUMERICAL]
	28 : Sex [CATEGORICAL]
	15 : Name [CATEGORICAL_SET]
	7 : Ticket_number [CATEGORICAL]
	5 : Pclass [NUMERICAL]
	4 : Parch [NUMERICAL]
	2 : Ticket_item [CATEGORICAL]
	2 : Embarked [CATEGORICAL]

Attribute in nodes with depth <= 3:
	206 : Age [NUMERICAL]
	156 : Fare [NUMERICAL]
	33 : Name [CATEGORICAL_SET]
	29 : Sex [CATEGORICAL]
	19 : Ticket_number [CATEGORICAL]
	11 : Ticket_item [CATEGORICAL]
	11 : Parch [NUMERICAL]
	7 : Pclass [NUMERICAL]
	3 : Embarked [CATEGORICAL]

Attribute in nodes with depth <= 5:
	406 : Age [NUMERICAL]
	290 : Fare [NUMERICAL]
	44 : Name [CATEGORICAL_SET]
	42 : Ticket_item [CATEGORICAL]
	31 : Sex [CATEGORICAL]
	28 : Parch [NUMERICAL]
	22 : Ticket_number [CATEGORICAL]
	15 : Pclass [NUMERICAL]
	12 : Embarked [CATEGORICAL]
	5 : SibSp [NUMERICAL]

Condition type in nodes:
	744 : ObliqueCondition
	122 : ContainsBitmapCondition
	29 : ContainsCondition
Condition type in nodes with depth <= 0:
	31 : ContainsBitmapCondition
	2 : ContainsCondition
Condition type in nodes with depth <= 1:
	64 : ObliqueCondition
	33 : ContainsBitmapCondition
	2 : ContainsCondition
Condition type in nodes with depth <= 2:
	176 : ObliqueCondition
	51 : ContainsBitmapCondition
	3 : ContainsCondition
Condition type in nodes with depth <= 3:
	380 : ObliqueCondition
	77 : ContainsBitmapCondition
	18 : ContainsCondition
Condition type in nodes with depth <= 5:
	744 : ObliqueCondition
	122 : ContainsBitmapCondition
	29 : ContainsCondition

Training logs:
Number of iteration to final model: 33
	Iter:1 train-loss:1.266350 valid-loss:1.360049  train-accuracy:0.624531 valid-accuracy:0.543478
	Iter:2 train-loss:1.213702 valid-loss:1.321897  train-accuracy:0.624531 valid-accuracy:0.543478
	Iter:3 train-loss:1.165783 valid-loss:1.286817  train-accuracy:0.624531 valid-accuracy:0.543478
	Iter:4 train-loss:1.122469 valid-loss:1.256133  train-accuracy:0.624531 valid-accuracy:0.543478
	Iter:5 train-loss:1.081461 valid-loss:1.229342  train-accuracy:0.808511 valid-accuracy:0.771739
	Iter:6 train-loss:1.045305 valid-loss:1.204601  train-accuracy:0.826033 valid-accuracy:0.728261
	Iter:16 train-loss:0.794952 valid-loss:1.058568  train-accuracy:0.914894 valid-accuracy:0.771739
	Iter:26 train-loss:0.646146 valid-loss:1.021539  train-accuracy:0.926158 valid-accuracy:0.793478
	Iter:36 train-loss:0.558627 valid-loss:1.023663  train-accuracy:0.929912 valid-accuracy:0.771739
	Iter:46 train-loss:0.493899 valid-loss:1.025164  train-accuracy:0.931164 valid-accuracy:0.760870
	Iter:56 train-loss:0.451528 valid-loss:1.032880  train-accuracy:0.938673 valid-accuracy:0.771739

(GPT) Here's a simple breakdown of the Keras model summary:

##### Model Type & Task:
- **Model:** Gradient Boosted Trees Model (like XGBoost or similar).
- **Task:** Classification problem (predicting categories).
- **Label:** The target label is called `__LABEL`.

##### Input Features:
- **11 Features:** These are the columns or variables used by the model for prediction, like `Age`, `Cabin`, `Sex`, `Ticket_number`, etc.

##### Variable Importance:
- **INV_MEAN_MIN_DEPTH:** Shows which features are most important based on their depth in the trees.
- **NUM_AS_ROOT:** Shows how many times each feature is used as the "root" feature at the top of the trees.
- **NUM_NODES:** Counts how often a feature is used across the total number of nodes in all trees.
- **SUM_SCORE:** Measures the overall contribution of each feature to the model’s predictions.

##### Model Loss & Performance:
- **Loss Function:** The model uses **Binomial Log Likelihood**, which is common for binary classification tasks.
- **Validation Loss:** The model's loss on the validation data is 1.01542. Lower values are better.
- **Number of Trees:** 33 trees in total, where each tree helps make a decision.
- **Total Nodes:** There are 1,823 decision points in the trees.
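To make the loss value concrete, here is a minimal sketch of a binomial log-likelihood loss for binary labels. The factor of 2 (`scale=2.0`) is an assumption: the logged values look consistent with twice the mean negative log-likelihood, but the summary itself does not state the scaling. The labels and probabilities are made up.

```python
import math

def binomial_log_likelihood_loss(labels, probabilities, scale=2.0):
    """Mean negative log-likelihood of binary labels, multiplied by `scale`.

    scale=2.0 is an assumption; scale=1.0 gives the usual log loss.
    """
    total = 0.0
    for y, p in zip(labels, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return scale * total / len(labels)

labels = [1, 0, 1, 0]  # hypothetical labels
uninformative = binomial_log_likelihood_loss(labels, [0.5] * 4)
confident = binomial_log_likelihood_loss(labels, [0.9, 0.1, 0.8, 0.2])

print(uninformative, confident)
```

Under this scaling, predicting 0.5 for everyone gives 2·ln 2 ≈ 1.386, which is close to the first validation loss in the training log (1.360). A validation loss of about 1.015 is therefore only a moderate improvement over an uninformative model.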

##### Tree Structure:
- **Number of Nodes per Tree:** There are 33 trees, and each tree has around 55 nodes on average.
- **Depth by Leafs:** Most leaves in the trees have depth 5 (indicating the final decision-making level).
- **Training Observations per Leaf:** The number of training examples (data points) handled by each leaf.

##### Attributes Used in Nodes:
- **Age, Sex, Fare, Name, etc.:** These are the attributes/features involved in the decision-making at different levels of the tree.
- **Attributes with Depth <= 0:** Features like `Sex` and `Name` appear early (shallow) in the decision trees.
- **Condition Types:** Oblique conditions are used to make complex splits in the data at the decision nodes.

##### Training Logs:
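- **Iterations:** The log runs to at least iteration 56, but the final model keeps only 33 trees ("Number of iteration to final model: 33"). Training loss keeps decreasing over the whole run, while validation loss stops improving after roughly iteration 30.
- **Final Model Accuracy:** Training accuracy reaches about 93% near the end of the log, but validation accuracy stays between about 76% and 79%, so the model fits the training data much better than held-out data.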
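The summary reports "Number of iteration to final model: 33" although the log continues to iteration 56, which suggests that the model is cut back to the iteration with the lowest validation loss. A minimal sketch of that selection, using the validation losses copied from the log above (the log is sampled, so most iterations are missing):

```python
# (iteration, validation loss) pairs copied from the training log.
logged = [
    (1, 1.360049), (2, 1.321897), (3, 1.286817), (4, 1.256133),
    (5, 1.229342), (6, 1.204601), (16, 1.058568), (26, 1.021539),
    (36, 1.023663), (46, 1.025164), (56, 1.032880),
]

# Keep the iteration whose validation loss is lowest.
best_iteration, best_loss = min(logged, key=lambda pair: pair[1])
print(best_iteration, best_loss)
```

Among the logged points the minimum is at iteration 26. The reported final model (33 trees, validation loss 1.01542) falls between logged iterations 26 and 36 and is lower than both, which is consistent with this reading.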

In summary, this model is a Gradient Boosted Trees classifier that uses several features to predict survival, with `Sex`, `Age`, and `Fare` being the most influential. Training accuracy is high (about 93%), but validation accuracy is about 76%, which is lower than the 82.6% reported for the default-parameter model above.

## Make predictions

In [None]:
def prediction_to_kaggle_format(model, threshold=0.5):
    proba_survive = model.predict(serving_ds, verbose=0)[:,0]
    return pd.DataFrame({
        "PassengerId": serving_df["PassengerId"],
        "Survived": (proba_survive >= threshold).astype(int)
    })

def make_submission(kaggle_predictions):
    path="/kaggle/working/submission.csv"
    kaggle_predictions.to_csv(path, index=False)
    print(f"Submission exported to {path}")
    
kaggle_predictions = prediction_to_kaggle_format(model)
make_submission(kaggle_predictions)
!head /kaggle/working/submission.csv

In [None]:
Submission exported to /kaggle/working/submission.csv

Final accuracy (submitted): **0.80143**

## Training a model with hyperparameter tuning
Hyper-parameter tuning is enabled by specifying the tuner constructor argument of the model. The tuner object contains all the configuration of the tuner (search space, optimizer, trial and objective).

In [None]:
tuner = tfdf.tuner.RandomSearch(num_trials=1000)
tuner.choice("min_examples", [2, 5, 7, 10])
tuner.choice("categorical_algorithm", ["CART", "RANDOM"])

local_search_space = tuner.choice("growing_strategy", ["LOCAL"])
local_search_space.choice("max_depth", [3, 4, 5, 6, 8])

global_search_space = tuner.choice("growing_strategy", ["BEST_FIRST_GLOBAL"], merge=True)
global_search_space.choice("max_num_nodes", [16, 32, 64, 128, 256])

#tuner.choice("use_hessian_gain", [True, False])
tuner.choice("shrinkage", [0.02, 0.05, 0.10, 0.15])
tuner.choice("num_candidate_attributes_ratio", [0.2, 0.5, 0.9, 1.0])


tuner.choice("split_axis", ["AXIS_ALIGNED"])
oblique_space = tuner.choice("split_axis", ["SPARSE_OBLIQUE"], merge=True)
oblique_space.choice("sparse_oblique_normalization",
                     ["NONE", "STANDARD_DEVIATION", "MIN_MAX"])
oblique_space.choice("sparse_oblique_weights", ["BINARY", "CONTINUOUS"])
oblique_space.choice("sparse_oblique_num_projections_exponent", [1.0, 1.5])

# Tune the model. Notice the `tuner=tuner`.
tuned_model = tfdf.keras.GradientBoostedTreesModel(tuner=tuner)
tuned_model.fit(train_ds, verbose=0)

tuned_self_evaluation = tuned_model.make_inspector().evaluation()
print(f"Accuracy: {tuned_self_evaluation.accuracy} Loss:{tuned_self_evaluation.loss}")

In [None]:
Accuracy: 0.9178082346916199 Loss:0.6503586769104004

In [None]:
kaggle_predictions = prediction_to_kaggle_format(tuned_model)
make_submission(kaggle_predictions)
!head /kaggle/working/submission.csv

For more information you can follow this tutorial: [Automated hyper-parameter tuning](https://www.tensorflow.org/decision_forests/tutorials/automatic_tuning_colab)

Final result: **0.80143**
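Conceptually, a random search draws configurations from the search space, scores each one, and keeps the best. The sketch below mirrors part of the search space defined above, including the conditional parameters (`max_depth` only for `LOCAL`, `max_num_nodes` only for `BEST_FIRST_GLOBAL`). The scoring function is a hypothetical stand-in for training and validating a model, and the sampling scheme is an assumption rather than a description of how `tfdf.tuner.RandomSearch` works internally.

```python
import random

def sample_config(rng):
    """Draw one configuration; child parameters depend on the parent choice."""
    config = {
        "min_examples": rng.choice([2, 5, 7, 10]),
        "shrinkage": rng.choice([0.02, 0.05, 0.10, 0.15]),
        "growing_strategy": rng.choice(["LOCAL", "BEST_FIRST_GLOBAL"]),
    }
    if config["growing_strategy"] == "LOCAL":
        config["max_depth"] = rng.choice([3, 4, 5, 6, 8])
    else:
        config["max_num_nodes"] = rng.choice([16, 32, 64, 128, 256])
    return config

def toy_validation_loss(config):
    """Hypothetical score; a real tuner would train a model and evaluate it."""
    loss = abs(config["shrinkage"] - 0.05) + 0.01 * config["min_examples"]
    # Configurations without max_depth get no depth penalty in this toy score.
    loss += 0.02 * abs(config.get("max_depth", 5) - 5)
    return loss

def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(num_trials)]
    return min(trials, key=toy_validation_loss)

best = random_search(num_trials=200)
print(best)
```

Because every trial is scored on the same validation data, the best trial's score is optimistic. That may explain part of the gap between the tuned model's self-evaluation (about 0.92) and its submitted score (0.80143).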

## Making an ensemble
Here you'll create 100 models with different seeds and average their predictions.
- This approach reduces some of the randomness inherent in training ML models.
- The GBT models are created with the honest parameter, which uses different training examples to infer the tree structure and the leaf values. This regularization technique trades examples for bias estimates.

In [None]:
predictions = None
num_predictions = 0

for i in range(100):
    print(f"i:{i}")
    # Possible models: GradientBoostedTreesModel or RandomForestModel
    model = tfdf.keras.GradientBoostedTreesModel(
        verbose=0, # Very few logs
        features=[tfdf.keras.FeatureUsage(name=n) for n in input_features],
        exclude_non_specified_features=True, # Only use the features in "features"

        #min_examples=1,
        #categorical_algorithm="RANDOM",
        ##max_depth=4,
        #shrinkage=0.05,
        ##num_candidate_attributes_ratio=0.2,
        #split_axis="SPARSE_OBLIQUE",
        #sparse_oblique_normalization="MIN_MAX",
        #sparse_oblique_num_projections_exponent=2.0,
        #num_trees=2000,
        ##validation_ratio=0.0,
        random_seed=i,
        honest=True,
    )
    model.fit(train_ds)
    
    sub_predictions = model.predict(serving_ds, verbose=0)[:,0]
    if predictions is None:
        predictions = sub_predictions
    else:
        predictions += sub_predictions
    num_predictions += 1

predictions/=num_predictions

kaggle_predictions = pd.DataFrame({
        "PassengerId": serving_df["PassengerId"],
        "Survived": (predictions >= 0.5).astype(int)
    })

make_submission(kaggle_predictions)

Final result: **0.80143**

(GPT)
In simple terms, this section is explaining how to **create an ensemble of models** to improve the accuracy and stability of the predictions:

1. **Making an Ensemble of 100 Models:**
   - The idea is to create **100 different models** with different **random seeds** (which are like starting points for the model's training process).
   - By using different seeds, each model will be slightly different, and this diversity helps to improve the overall performance when you combine their results. This is called an **ensemble**, where you take multiple models and combine their outputs to make a final prediction. It can reduce errors that happen because of randomness in how a single model might learn from the data.

2. **Removing Randomness:**
   - Normally, machine learning models can have some randomness in how they are trained, which can lead to different results each time you train a model (even on the same data).
   - By creating many models with different random seeds and combining their results, you **reduce the effect of randomness**. This helps make the final prediction more reliable and consistent.

3. **"Honest" Parameter in Gradient Boosted Trees (GBT):**
   - The **"honest" parameter** is a special setting in the model’s creation. It helps the model by using **different examples** (data points) to **learn the structure of the model** (how the tree splits at each point) and to **estimate the values** (what the predicted output should be for each branch of the tree).
   - This helps reduce the risk of the model **overfitting** the data (getting too specific to the training data and not generalizing well to new data).
   - The trade-off is that each step sees fewer examples, because the data is divided between choosing the structure and estimating the leaf values. In exchange, the leaf estimates are less biased toward the examples that chose the splits. This can make the model more **robust**, but it is not guaranteed to improve accuracy.
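The idea behind honest trees can be sketched with a depth-1 tree on one numerical feature. One part of the data picks the split threshold and a disjoint part sets the leaf values. The alternating even/odd split and the squared-error criterion are simplifying assumptions for illustration and are not how TF-DF implements `honest=True`.

```python
def sum_squared_error(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_threshold(xs, ys):
    """Pick the threshold whose two sides have the lowest total squared error."""
    def cost(t):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        return sum_squared_error(left) + sum_squared_error(right)
    return min(sorted(set(xs)), key=cost)

def mean_or_zero(values):
    return sum(values) / len(values) if values else 0.0

def fit_honest_stump(xs, ys):
    """Even-indexed rows choose the split; odd-indexed rows set the leaf values."""
    structure = range(0, len(xs), 2)
    estimation = range(1, len(xs), 2)
    threshold = best_threshold([xs[i] for i in structure],
                               [ys[i] for i in structure])
    left = [ys[i] for i in estimation if xs[i] < threshold]
    right = [ys[i] for i in estimation if xs[i] >= threshold]
    return threshold, mean_or_zero(left), mean_or_zero(right)

# Hypothetical data: the label switches from 0 to 1 at x = 10.
xs = [float(i) for i in range(20)]
ys = [0.0 if x < 10 else 1.0 for x in xs]
threshold, left_value, right_value = fit_honest_stump(xs, ys)
print(threshold, left_value, right_value)
```

A real implementation would assign examples to the two roles at random. The point of the sketch is only that no example influences both the split and the leaf value.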

### Summary:
- **Ensemble of 100 models**: This creates multiple models with different seeds, and then combines their predictions to make the final output more reliable.
- **Honest parameter**: This regularization technique makes the model more general by using different training examples for learning the structure and values of the tree. It helps reduce overfitting and makes the model better at generalizing to new data.

In short, it’s a way to make the model **stronger** and **more stable** by reducing randomness and improving how the model generalizes from the data.
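The effect of averaging over seeds can be illustrated with a small simulation. Each "model" below is a hypothetical stand-in that returns the true probability plus independent seed-dependent noise. Real models trained on the same data make correlated errors, so the improvement in practice is smaller than in this sketch.

```python
import random

true_probability = [0.2, 0.8, 0.55, 0.45, 0.9]  # hypothetical passengers

def noisy_model(seed, noise=0.1):
    """Stand-in for one trained model: the truth plus seed-dependent noise."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, noise) for p in true_probability]

def mean_abs_error(predictions):
    pairs = zip(predictions, true_probability)
    return sum(abs(a - b) for a, b in pairs) / len(predictions)

models = [noisy_model(seed) for seed in range(100)]
# Average the 100 predictions for each passenger.
ensemble = [sum(column) / len(models) for column in zip(*models)]

single_error = sum(mean_abs_error(m) for m in models) / len(models)
ensemble_error = mean_abs_error(ensemble)
print(single_error, ensemble_error)
```

Averaging shrinks the random part of the error but cannot remove a mistake that every model shares, which is consistent with the ensemble's submitted score staying at 0.80143.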

---

### (GPT) Further things to try
If using an ensemble of 100 models in **Gusthema's approach** still results in a model accuracy of about **80%** and you're looking to improve it further, there are a few strategies and improvements you could try. I’ll break them down into possible steps for you:

### 1. **Hyperparameter Tuning for TF-DF**
   - **Parameter tuning** is one of the most effective ways to improve model performance. While you mentioned using ensemble models, ensuring you're using **optimal hyperparameters** for **TensorFlow Decision Forests (TF-DF)** is critical.
   - Parameters you can tune:
     - **Learning rate** (`shrinkage` in TF-DF): A higher learning rate can make the model converge faster but may miss the optimal solution. A lower learning rate might improve precision but slow down training.
     - **Number of trees** (`num_trees`): This is the maximum number of trees inside each model, which is separate from the 100 models in the ensemble. With a lower learning rate, more trees are usually needed.
     - **Maximum depth of trees** (`max_depth`): Deeper trees might capture more complex patterns but also risk overfitting. Try experimenting with different depths.
     - **Leaf size** (`min_examples`): The minimum number of data points each leaf node contains can impact generalization.
   - Use a tuner such as `tfdf.tuner.RandomSearch` (shown in the tuning section above) to try different combinations of these parameters.

### 2. **Advanced Ensembling Techniques**
   - Instead of just averaging the predictions of 100 models, you can use more sophisticated ensembling techniques:
     - **Stacking**: Train a second-level model that combines the predictions from the individual models to make a final prediction. This can often improve accuracy by leveraging the strengths of individual models.
     - **Weighted Averaging**: Instead of giving each model equal weight, give more importance to models that perform better on a validation set.
     - **Bagging**: Bagging (Bootstrap Aggregating) can help by creating different subsets of your training data and training each model on a different subset. This often reduces variance and improves model generalization.
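As an illustration of weighted averaging, the sketch below weights each model's predicted probabilities by its validation accuracy. The probabilities and accuracies are made up, and weighting by accuracy is only one of several reasonable choices.

```python
def weighted_average(predictions, validation_accuracies):
    """Combine per-model probabilities, weighting each model by its accuracy."""
    total = sum(validation_accuracies)
    weights = [a / total for a in validation_accuracies]
    num_examples = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions))
        for i in range(num_examples)
    ]

# Hypothetical survival probabilities from three models for four passengers.
predictions = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.3, 0.4, 0.5],
    [0.6, 0.6, 0.3, 0.7],
]
validation_accuracies = [0.82, 0.80, 0.70]  # hypothetical

combined = weighted_average(predictions, validation_accuracies)
survived = [int(p >= 0.5) for p in combined]
print(combined, survived)
```

The validation accuracies must come from data that none of the models trained on. Otherwise the weights reward overfitting.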

### 3. **Feature Engineering**
   - **Feature scaling**: Decision trees are usually insensitive to feature scaling, but other models in your ensemble might benefit from it. Scaling or normalizing features could help some algorithms like **Logistic Regression** or **SVMs** that might be part of your ensemble.
   - **Feature selection**: You could try reducing the feature space by selecting only the most relevant features for training. Sometimes, less is more, and trimming irrelevant or noisy features can help boost accuracy.
   - **Creating new features**: You could try creating interaction terms or higher-order features (e.g., combining **Age** and **Fare**) to give the model more information.
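As an example of creating new features, the sketch below derives a title and a family size from columns that already exist in the Titanic data. Both features are illustrative suggestions and are not part of the original notebook. The title must be extracted from the raw `Name`, before punctuation is stripped.

```python
import re

def extract_title(name):
    """Return the honorific between the comma and the first period, e.g. "Mr"."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "NONE"

def family_size(sibsp, parch):
    """The passenger plus siblings/spouses and parents/children aboard."""
    return sibsp + parch + 1

# (Name, SibSp, Parch) taken from the first rows of train.csv shown earlier.
rows = [
    ("Braund, Mr. Owen Harris", 1, 0),
    ("Heikkinen, Miss. Laina", 0, 0),
    ("Palsson, Master. Gosta Leonard", 3, 1),
]
features = [(extract_title(n), family_size(s, p)) for n, s, p in rows]
print(features)
```

With pandas, the same functions could be applied column-wise, for example `df["Name"].apply(extract_title)`.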

### 4. **Try Other Models in the Ensemble**
   - **Include different types of models**: While TF-DF is a great library, sometimes combining its models with other model types can improve accuracy. Consider adding models like **XGBoost**, **Random Forests** (available in TF-DF as `RandomForestModel`), or **Logistic Regression** as part of your ensemble.
   - **Gradient Boosting**: The models above are already gradient boosted trees from TF-DF. Other implementations such as **XGBoost** also perform well on tabular data and might capture patterns that the TF-DF models miss.

### 5. **Cross-validation**
   - **K-fold cross-validation**: Instead of using a single training-validation split, use **K-fold cross-validation** to make sure your model is evaluated across multiple splits. This will give you a better sense of model performance and reduce overfitting.
   - **Stratified K-folds**: For classification problems, ensure you're using **Stratified K-fold cross-validation**, which ensures the distribution of the target variable is the same in each fold, preventing biased validation results.
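A minimal sketch of stratified fold assignment is shown below: the examples of each class are dealt round-robin across the folds, so every fold keeps a similar label mix. The function name is hypothetical, and a practical version would shuffle the examples within each class first.

```python
from collections import defaultdict

def stratified_folds(labels, num_folds):
    """Assign each example to a fold so that every fold has a similar label mix."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    fold_of = [None] * len(labels)
    for indices in by_label.values():
        # Deal the examples of each class round-robin across the folds.
        for position, index in enumerate(indices):
            fold_of[index] = position % num_folds
    return fold_of

# "Survived" for the first ten training rows shown earlier.
labels = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1]
fold_of = stratified_folds(labels, num_folds=2)
print(fold_of)
```

For each fold `k`, a model would then be trained on the rows where `fold_of != k` and evaluated on the rows where `fold_of == k`, and the scores averaged.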

### 6. **Early Stopping / Model Checkpoints**
   - Use **early stopping** to prevent overfitting by stopping training once performance on the validation set stops improving. The gradient boosted trees above already do this: the training log tracks a validation loss, and the final model keeps 33 trees even though training continued past that point.
   - If you tune early stopping, make sure the **patience** (how many iterations to wait without improvement) is set appropriately to avoid stopping before the model has learned enough.

### 7. **Data Augmentation / More Data**
   - If your model is still not reaching the desired accuracy, consider if there are ways to **increase the amount of data**.
     - **Synthetic data generation**: If you don’t have more data, you could try **synthetic data augmentation** techniques or **bootstrapping** (sampling with replacement) to increase the variety of data your models are trained on.
     - **Feature engineering** to create new data points, or transforming existing ones into more useful formats, can help in some cases.

### 8. **Model Diagnostics**
   - **Analyze model errors**: Look at **misclassified instances** and see if you can identify patterns in the errors. It might help you fine-tune your feature engineering or even spot outliers that are negatively affecting the model.
   - **Shapley Values** or **Feature Importance**: Use tools like **SHAP** or **LIME** to analyze which features are important for the model's predictions and if they can be improved.

### 9. **Hyperparameter Tuning on Multiple Models in the Ensemble**
   - Instead of just tuning parameters for one model in the ensemble, consider **hyperparameter optimization for each individual model** in the ensemble. If you're using a variety of models, you can apply **Bayesian Optimization** or **RandomizedSearch** for each model in the ensemble separately.

---

### Summary of Suggestions:
| **Improvement Area**                  | **Suggestions**                                                                                  |
|---------------------------------------|--------------------------------------------------------------------------------------------------|
| **Hyperparameter Tuning**             | Tune hyperparameters like learning rate, number of trees, depth, and leaf size. Use grid or random search. |
| **Ensemble Techniques**               | Try stacking, weighted averaging, or bagging for better results.                                  |
| **Feature Engineering**               | Normalize features, reduce noise, or create new meaningful features (e.g., Age + Fare interaction).|
| **Model Variety in Ensemble**        | Include models like XGBoost, Random Forest, or even Logistic Regression alongside TF-DF.          |
| **Cross-Validation**                  | Use K-fold cross-validation (Stratified) to reduce overfitting and improve model evaluation.     |
| **Early Stopping**                    | Implement early stopping with appropriate patience to avoid overtraining.                        |
| **More Data / Data Augmentation**    | Increase the data size through augmentation or synthetic data generation.                        |
| **Model Diagnostics**                 | Analyze misclassifications and use SHAP/LIME to understand feature importance.                   |
| **Hyperparameter Tuning on Ensemble** | Perform hyperparameter tuning separately on each individual model in the ensemble.               |

By implementing some of these techniques, especially tuning hyperparameters, diversifying the ensemble, and using advanced ensembling methods, you might see an improvement in your model accuracy.