# Build, train, and evaluate models with TensorFlow Decision Forests

Decision Forests (DF) are a large family of machine learning algorithms for supervised classification, regression, and ranking. They use decision trees as a building block.

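Today, the two most popular DF algorithms are Random Forests and Gradient Boosted Decision Trees. Both are ensemble techniques that combine multiple decision trees, but they differ in how they do so: a Random Forest trains its trees independently on random resamples of the data and averages their predictions, while Gradient Boosted Decision Trees train trees sequentially, each one correcting the errors of the trees before it.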
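To make the difference between the two ensemble styles concrete, here is a toy sketch on a 1-D regression problem. It is not how TF-DF implements either algorithm: the depth-1 trees ("stumps"), the synthetic sine data, and the helper names `fit_stump`, `fit_bagged`, and `fit_boosted` are all assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    if best is None:  # all x identical: fall back to a constant prediction
        const = y.mean()
        return lambda q: np.full(np.shape(q), const)
    _, t, lo, hi = best
    return lambda q: np.where(q <= t, lo, hi)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)  # synthetic data

def fit_bagged(x, y, n_trees=50):
    """Random-Forest style: independent stumps on bootstrap samples, averaged."""
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(0, x.size, size=x.size)  # sample rows with replacement
        stumps.append(fit_stump(x[idx], y[idx]))
    return lambda q: np.mean([s(q) for s in stumps], axis=0)

def fit_boosted(x, y, n_trees=50, learning_rate=0.5):
    """Gradient-boosting style: each stump is fit to the current residuals."""
    base = y.mean()
    pred = np.full_like(y, base)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)  # residuals of the ensemble so far
        pred = pred + learning_rate * stump(x)
        stumps.append(stump)
    return lambda q: base + learning_rate * np.sum([s(q) for s in stumps], axis=0)

def mse(model):
    """Mean squared error of a model on the training data."""
    return float(np.mean((model(x) - y) ** 2))

single = fit_stump(x, y)
bagged = fit_bagged(x, y)
boosted = fit_boosted(x, y)
print(f"single stump: {mse(single):.4f}")
print(f"bagged:       {mse(bagged):.4f}")
print(f"boosted:      {mse(boosted):.4f}")
```

With stumps this weak, averaging mostly reduces variance, while boosting can keep lowering the training error because every new stump targets what the previous ones missed. Real Random Forests use deep trees and random feature subsets, which this sketch leaves out.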

In this tutorial, you will learn how to:

1. Train a multi-class classification Random Forest on a dataset containing numerical, categorical and missing features.
2. Evaluate the model on a test dataset.
3. Prepare the model for TensorFlow Serving.
4. Examine the overall structure of the model and the importance of each feature.
5. Re-train the model with a different learning algorithm (Gradient Boosted Decision Trees).
6. Use a different set of input features.
7. Change the hyperparameters of the model.
8. Preprocess the features.
9. Train a model for regression.
10. Train a model for ranking.

url: https://www.tensorflow.org/decision_forests/tutorials/beginner_colab

### Import libraries

In [1]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

In [2]:
# Check the installed version of TensorFlow Decision Forests
print("Found tensorflow df v" + tfdf.__version__)

Found tensorflow df v1.0.1


## Training a Random Forest Model

We will train, evaluate, analyse and export a multi-class classification Random Forest trained on the Palmer Penguins dataset.


### Load the dataset and convert it into a tf.data.Dataset

This dataset is small and stored as a CSV-like file.

Let's load the CSV file into a pandas DataFrame.

In [5]:
# load a dataset into pandas dataframe
fname = "../../dataset/penguins.csv"
dataset_df = pd.read_csv(fname)

# display the first 3 examples
dataset_df.head(3)

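|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |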


The dataset contains a mix of numerical (e.g. bill_depth_mm), categorical (e.g. island) and missing features.

TF-DF supports all these feature types natively (unlike neural-network-based models), so there is no need for preprocessing such as one-hot encoding, normalization or an extra is_present feature.
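For contrast, the sketch below shows the kind of preprocessing a neural-network pipeline would typically need and that TF-DF lets you skip. The tiny DataFrame reuses values from the table above plus one hypothetical row with missing values; the derived column names (`bill_depth_mm_is_present`, `bill_depth_mm_norm`) are illustrative choices, not part of any library.

```python
import numpy as np
import pandas as pd

# Three rows from the table above plus one hypothetical row with missing values.
df = pd.DataFrame({
    "island": ["Torgersen", "Torgersen", "Torgersen", "Biscoe"],
    "bill_depth_mm": [18.7, 17.4, 18.0, np.nan],
})

# 1. Extra "is_present" indicator, then impute the missing value with the mean.
df["bill_depth_mm_is_present"] = df["bill_depth_mm"].notna().astype(int)
filled = df["bill_depth_mm"].fillna(df["bill_depth_mm"].mean())

# 2. Normalize the numerical feature to zero mean and unit variance.
df["bill_depth_mm_norm"] = (filled - filled.mean()) / filled.std()

# 3. One-hot encode the categorical feature.
one_hot = pd.get_dummies(df["island"], prefix="island").astype(int)

print(df)
print(one_hot)
```

None of these steps is required before passing the data to TF-DF.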

Labels are a bit different: Keras metrics expect integers. The label (species) is stored as a string, so let's convert it into an integer.

### Encode the categorical labels as integers

This step is necessary if your classification label is represented as a string, since Keras expects integer classification labels.

This step can be skipped when using pd_dataframe_to_tf_dataset, which converts string labels to integers if necessary.

In [6]:
# name of the label column
label = "species"

classes = dataset_df[label].unique().tolist()
print(f"label classes: {classes}")

dataset_df[label] = dataset_df[label].map(classes.index)

label classes: ['Adelie', 'Gentoo', 'Chinstrap']
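Each label is replaced by its position in the classes list, so that list is also what you need to turn integer predictions back into species names. A minimal sketch on a hypothetical five-row sample (the `class_to_id` dictionary is an illustrative alternative to `classes.index`, not something the notebook defines):

```python
import pandas as pd

# Hypothetical sample of the label column.
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo"])

# Classes in order of first appearance, as in the cell above.
classes = species.unique().tolist()

# Encode: class name -> integer id.
class_to_id = {name: i for i, name in enumerate(classes)}
encoded = species.map(class_to_id)

# Decode: integer id -> class name.
decoded = encoded.map(lambda i: classes[i])

print(classes)
print(encoded.tolist())
print(decoded.tolist())
```

Because the ids depend on the order in which classes first appear, store the classes list alongside the model if predictions need to be decoded later.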


Next, split the dataset into training and testing sets.

In [7]:
# split the dataset into a training and testing dataset

def split_dataset(dataset, test_ratio=0.30):
    """Split a pandas DataFrame in two."""
    test_indices = np.random.rand(len(dataset)) < test_ratio
    return dataset[~test_indices], dataset[test_indices]

train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing" .format(
    len(train_ds_pd), len(test_ds_pd)
))

236 examples in training, 108 examples for testing
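Note that this split assigns each row to the test set independently with probability test_ratio, so the exact sizes change from run to run (108 of 344 rows here, about 31%). If you need a reproducible split, one option is to pass a seed. The sketch below is a variant of the function above; the `seed` parameter and the toy DataFrame are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

def split_dataset(dataset, test_ratio=0.30, seed=None):
    """Split a pandas DataFrame in two, reproducibly when a seed is given."""
    rng = np.random.default_rng(seed)
    # Each row lands in the test set with probability test_ratio.
    test_indices = rng.random(len(dataset)) < test_ratio
    return dataset[~test_indices], dataset[test_indices]

# Hypothetical stand-in for dataset_df.
toy_df = pd.DataFrame({"value": range(100)})

train_a, test_a = split_dataset(toy_df, seed=42)
train_b, test_b = split_dataset(toy_df, seed=42)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))  # same seed, same split
```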


Finally, convert the pandas DataFrames (pd.DataFrame) into TensorFlow datasets (tf.data.Dataset).

Note: Recall that pd_dataframe_to_tf_dataset converts string labels to integers if necessary.

In [8]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)



If you want to create the tf.data.Dataset yourself, there are a couple of things to remember:
- The learning algorithms work with a one-epoch dataset and without shuffling.
- The batch size does not impact the training algorithm, but a small value might slow down reading the dataset.
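A minimal sketch of building such a dataset by hand, following the two points above: a single pass over the data (no repeat), no shuffling, and a reasonably large batch size. The toy DataFrame and the batch size of 1000 are assumptions for illustration, and the example uses rows without missing values to keep the conversion simple.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical stand-in for train_ds_pd, with the label already integer-encoded.
toy_df = pd.DataFrame({
    "bill_depth_mm": [18.7, 17.4, 18.0],
    "island": ["Torgersen", "Torgersen", "Torgersen"],
    "species": [0, 0, 0],
})
label = "species"

# Features as a dict of column name -> array; label as a separate array.
features = {
    name: toy_df[name].to_numpy() for name in toy_df.columns if name != label
}
labels = toy_df[label].to_numpy()

# One epoch, no shuffle; only batching is applied.
manual_ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1000)

for batch_features, batch_labels in manual_ds:
    print({name: tensor.shape for name, tensor in batch_features.items()})
    print(batch_labels.numpy())
```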

## Train the model

In [9]:
# specify the model
model_1 = tfdf.keras.RandomForestModel(verbose=2)

# train the model
model_1.fit(x=train_ds)

Use 8 thread(s) for training
Use /var/folders/sk/f7k402kx1wvdmcz91gdz6hs00000gn/T/tmp5bvbu0oq as temporary training directory
Reading training dataset...
Training tensor examples:
Features: {'island': <tf.Tensor 'data_4:0' shape=(None,) dtype=string>, 'bill_length_mm': <tf.Tensor 'data_1:0' shape=(None,) dtype=float64>, 'bill_depth_mm': <tf.Tensor 'data:0' shape=(None,) dtype=float64>, 'flipper_length_mm': <tf.Tensor 'data_3:0' shape=(None,) dtype=float64>, 'body_mass_g': <tf.Tensor 'data_2:0' shape=(None,) dtype=float64>, 'sex': <tf.Tensor 'data_5:0' shape=(None,) dtype=string>, 'year': <tf.Tensor 'data_6:0' shape=(None,) dtype=int64>}
Label: Tensor("data_7:0", shape=(None,), dtype=int64)
Weights: None
Normalized tensor features:
 {'island': SemanticTensor(semantic=<Semantic.CATEGORICAL: 2>, tensor=<tf.Tensor 'data_4:0' shape=(None,) dtype=string>), 'bill_length_mm': SemanticTensor(semantic=<Semantic.NUMERICAL: 1>, tensor=<tf.Tensor 'Cast:0' shape=(None,) dtype=float32>), 'bill_depth_

2022-12-22 21:01:20.666251: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2022-12-22 21:01:20.667602: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Training dataset read in 0:00:02.064181. Found 236 examples.
Training model...
Standard output detected as not visible to the user e.g. running in a notebook. Creating a training log redirection. If training get stuck, try calling tfdf.keras.set_training_logs_redirection(False).


[INFO kernel.cc:813] Start Yggdrasil model training
[INFO kernel.cc:814] Collect training examples
[INFO kernel.cc:422] Number of batches: 1
[INFO kernel.cc:423] Number of examples: 236
[INFO kernel.cc:836] Training dataset:
Number of records: 236
Number of columns: 8

Number of columns by type:
	NUMERICAL: 5 (62.5%)
	CATEGORICAL: 3 (37.5%)

Columns:

NUMERICAL: 5 (62.5%)
	0: "bill_depth_mm" NUMERICAL num-nas:1 (0.423729%) mean:17.1706 min:13.1 max:21.5 sd:2.01451
	1: "bill_length_mm" NUMERICAL num-nas:1 (0.423729%) mean:43.9362 min:32.1 max:58 sd:5.45471
	2: "body_mass_g" NUMERICAL num-nas:1 (0.423729%) mean:4177.02 min:2700 max:6000 sd:770.551
	3: "flipper_length_mm" NUMERICAL num-nas:1 (0.423729%) mean:200.745 min:174 max:230 sd:13.4213
	6: "year" NUMERICAL mean:2008.01 min:2007 max:2009 sd:0.820711

CATEGORICAL: 3 (37.5%)
	4: "island" CATEGORICAL has-dict vocab-size:4 zero-ood-items most-frequent:"Biscoe" 103 (43.6441%)
	5: "sex" CATEGORICAL num-nas:7 (2.9661%) has-dict vocab-size:

Model trained in 0:00:00.088352
Compiling model...
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code


Model compiled.


2022-12-22 21:01:21.678108: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2022-12-22 21:01:21.730975: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


<keras.callbacks.History at 0x17dc29d90>

### Remarks

- No input features are specified, so all columns except the label are used as input features. The features used by the model are shown in the training logs and in model.summary().
- DFs natively consume numerical, categorical and categorical-set features, as well as missing values. Numerical features do not need to be normalized. Categorical string values do not need to be encoded in a dictionary.
- No training hyperparameters are specified, so the default hyperparameters are used. Default hyperparameters provide reasonable results in most situations.
- Calling compile on the model before fit is optional. Compile can be used to provide extra evaluation metrics.
- Training algorithms do not need validation datasets. If a validation dataset is provided, it will only be used to show metrics.
- Tweak the verbose argument of RandomForestModel to control the amount of displayed training logs. Set verbose=0 to hide most of the logs. Set verbose=2 to show all the logs.

## Evaluate the model

Let's evaluate our model on the test dataset.

In [10]:
model_1.compile(metrics=["accuracy"])
evaluation = model_1.evaluate(test_ds, return_dict=True)
print()

for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")


loss: 0.0000
accuracy: 0.9815


2022-12-22 21:10:07.063254: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
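The accuracy metric is the fraction of test examples whose predicted class matches the true class. As a sanity check, the reported 0.9815 on 108 test examples is consistent with 106 correct predictions. The sketch below shows the computation; the `accuracy` helper and the small label arrays are hypothetical and are not produced by the model above.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels and predictions: 4 of 5 are correct.
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 2, 2, 0]
print(accuracy(y_true, y_pred))  # 0.8

# Consistency check against the evaluation above: 106 correct out of 108.
print(round(106 / 108, 4))  # 0.9815
```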
