## Classification with TensorFlow Decision Forests

TensorFlow Decision Forests (TF-DF) models include random forests, gradient boosted trees, and CART, and can be used for regression, classification, and ranking tasks.

The example below uses a gradient boosted trees model for binary classification of structured data, and covers the following:
- Build a decision forests model by specifying the input feature usages.
- Implement a custom binary target encoder.
- Encode the categorical features as embeddings, and train these embeddings in a simple neural network model.

In [1]:
import math
import urllib.request
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_decision_forests as tfdf

### Prepare the data

The dataset is the United States Census Income dataset provided by UC Irvine.

The task is binary classification: determine whether a person makes over 50K a year.

In [2]:
BASE_PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census-income"

CSV_HEADER = [
    l.decode("utf-8").split(":")[0].replace(" ", "_")
    for l in urllib.request.urlopen(f"{BASE_PATH}.names")
    if not l.startswith(b"|")
][2:]

CSV_HEADER.append("income_level")

train_data = pd.read_csv(f"{BASE_PATH}.data.gz", header=None, names=CSV_HEADER,)
test_data = pd.read_csv(f"{BASE_PATH}.test.gz", header=None, names=CSV_HEADER,)
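
The header comprehension above is fairly dense, so here is a minimal offline sketch of what it does. The sample lines are hypothetical stand-ins shaped like a `.names` file (a `|` comment, two leading non-attribute lines, then `name: values` entries); the exact layout of the real file is an assumption here.

```python
# Hypothetical sample lines; the real census-income.names file is not reproduced here.
sample_lines = [
    b"| This comment line is skipped because it starts with '|'\n",
    b"\n",
    b"- 50000, 50000+.\n",
    b"age: continuous.\n",
    b"class of worker: Not in universe, Private.\n",
    b"wage per hour: continuous.\n",
]

header = [
    line.decode("utf-8").split(":")[0].replace(" ", "_")
    for line in sample_lines
    if not line.startswith(b"|")
][2:]  # drop the two leading entries that are not attribute names

print(header)  # ['age', 'class_of_worker', 'wage_per_hour']
```

Each attribute name is the text before the first colon, with spaces replaced by underscores so the names work as dataframe column names.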

### Define dataset metadata

Define the metadata of the dataset that will be useful for encoding the input features.


In [4]:
# Target column name.
TARGET_COLUMN_NAME = "income_level"
# The labels of the target columns.
TARGET_LABELS = [" - 50000.", " 50000+."]
# Weight column name.
WEIGHT_COLUMN_NAME = "instance_weight"
# Numeric feature names.
NUMERIC_FEATURE_NAMES = [
    "age",
    "wage_per_hour",
    "capital_gains",
    "capital_losses",
    "dividends_from_stocks",
    "num_persons_worked_for_employer",
    "weeks_worked_in_year",
]
# Categorical feature names.
CATEGORICAL_FEATURE_NAMES = [
    "class_of_worker",
    "detailed_industry_recode",
    "detailed_occupation_recode",
    "education",
    "enroll_in_edu_inst_last_wk",
    "marital_stat",
    "major_industry_code",
    "major_occupation_code",
    "race",
    "hispanic_origin",
    "sex",
    "member_of_a_labor_union",
    "reason_for_unemployment",
    "full_or_part_time_employment_stat",
    "tax_filer_stat",
    "region_of_previous_residence",
    "state_of_previous_residence",
    "detailed_household_and_family_stat",
    "detailed_household_summary_in_household",
    "migration_code-change_in_msa",
    "migration_code-change_in_reg",
    "migration_code-move_within_reg",
    "live_in_this_house_1_year_ago",
    "migration_prev_res_in_sunbelt",
    "family_members_under_18",
    "country_of_birth_father",
    "country_of_birth_mother",
    "country_of_birth_self",
    "citizenship",
    "own_business_or_self_employed",
    "fill_inc_questionnaire_for_veteran's_admin",
    "veterans_benefits",
    "year",
]
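
The two lists together name 40 input features (7 numeric and 33 categorical); with the target and weight columns, that accounts for all 42 columns of the dataframes. A quick consistency check along the following lines can catch a misspelled or duplicated name before training. The sketch uses shortened stand-in lists and a hypothetical helper, `check_metadata`, so that it runs on its own.

```python
def check_metadata(columns, numeric, categorical, target, weight):
    """Return a list of problems found in the feature metadata (empty if none)."""
    problems = []
    overlap = set(numeric) & set(categorical)
    if overlap:
        problems.append(f"listed as both numeric and categorical: {sorted(overlap)}")
    declared = numeric + categorical + [target, weight]
    missing = [name for name in declared if name not in columns]
    if missing:
        problems.append(f"not present in the data: {missing}")
    unused = set(columns) - set(declared)
    if unused:
        problems.append(f"columns with no declared role: {sorted(unused)}")
    return problems


# Shortened stand-ins for CSV_HEADER and the lists above (illustrative only).
columns = ["age", "class_of_worker", "education", "instance_weight", "income_level"]
numeric = ["age"]
categorical = ["class_of_worker", "education"]

ok = check_metadata(columns, numeric, categorical, "income_level", "instance_weight")
print(ok)  # []

# A misspelled name shows up both as "missing" and as an undeclared column.
bad = check_metadata(
    columns, numeric, ["class_of_worker", "educaton"], "income_level", "instance_weight"
)
print(bad)
```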

### Basic data preparation

First, convert the target labels from strings to integers.

Then, cast the categorical features to strings.

In [5]:
def prepare_dataframe(dataframe):
    # convert the target labels from string to int
    dataframe[TARGET_COLUMN_NAME] = dataframe[TARGET_COLUMN_NAME].map(
        TARGET_LABELS.index
    )
    # Cast the categorical features to strings
    for feature_name in CATEGORICAL_FEATURE_NAMES:
        dataframe[feature_name] = dataframe[feature_name].astype(str)

prepare_dataframe(train_data)
prepare_dataframe(test_data)
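
As a small illustration of what `prepare_dataframe` does, the sketch below applies the same two steps to a tiny, made-up dataframe. Note that the labels keep their leading space and trailing period, exactly as listed in `TARGET_LABELS`; because `list.index` raises `ValueError` for an unknown value, a label that does not match exactly fails loudly instead of being silently mis-encoded.

```python
import pandas as pd

TARGET_LABELS = [" - 50000.", " 50000+."]

# Made-up rows for illustration; not taken from the census data.
df = pd.DataFrame(
    {
        "detailed_industry_recode": [0, 33, 4],
        "income_level": [" - 50000.", " 50000+.", " - 50000."],
    }
)

# Same two steps as prepare_dataframe: encode the target, cast a categorical column.
df["income_level"] = df["income_level"].map(TARGET_LABELS.index)
df["detailed_industry_recode"] = df["detailed_industry_recode"].astype(str)

print(df["income_level"].tolist())              # [0, 1, 0]
print(df["detailed_industry_recode"].tolist())  # ['0', '33', '4']

# A label without the leading space is not in TARGET_LABELS, so index() raises.
try:
    TARGET_LABELS.index("50000+.")
except ValueError as err:
    print(f"unknown label: {err}")
```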

Display the shapes of the training and test dataframes, along with a few instances.

In [6]:
print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")
print(train_data.head().T)

Train data shape: (199523, 42)
Test data shape: (99762, 42)
                                                                                    0  \
age                                                                                73   
class_of_worker                                                       Not in universe   
detailed_industry_recode                                                            0   
detailed_occupation_recode                                                          0   
education                                                        High school graduate   
wage_per_hour                                                                       0   
enroll_in_edu_inst_last_wk                                            Not in universe   
marital_stat                                                                  Widowed   
major_industry_code                                       Not in universe or children   
major_occupation_code                             

### Configure Hyperparameters

Set the hyperparameters for the gradient boosted trees model.

In [7]:
# Maximum number of decision trees.
# The effective number of trained trees can be smaller if early stopping is enabled.
NUM_TREES = 250

# Minimum number of examples in a node.
MIN_EXAMPLES = 6

# Maximum depth of the tree.
# max_depth=1 means that all trees will be roots.
MAX_DEPTH = 5

# Ratio of the dataset (sampling without replacement) used to train individual
# trees for the random sampling method.
SUBSAMPLE = 0.65

# Controls the sampling of the datasets used to train individual trees.
SAMPLING_METHOD = "RANDOM"

# Ratio of the training dataset used to monitor the training.
# Required to be > 0 if early stopping is enabled.
VALIDATION_RATIO = 0.1
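
To get a feel for these settings, here is some back-of-the-envelope arithmetic. It assumes binary splits and that the validation examples are set aside before the per-tree subsampling, and it ignores randomness in the actual split, so the numbers are approximations rather than what the library reports.

```python
NUM_TRAIN_EXAMPLES = 199_523  # size of the training dataframe printed earlier
MAX_DEPTH = 5
SUBSAMPLE = 0.65
VALIDATION_RATIO = 0.1

# With max_depth=1 meaning "root only", a binary tree of depth d
# has at most 2 ** (d - 1) leaves.
max_leaves = 2 ** (MAX_DEPTH - 1)

# Examples held out to monitor training (assumed to be removed first).
approx_validation = round(NUM_TRAIN_EXAMPLES * VALIDATION_RATIO)

# Examples left for fitting, and the share of them each tree samples.
approx_fit = NUM_TRAIN_EXAMPLES - approx_validation
approx_per_tree = round(approx_fit * SUBSAMPLE)

print(max_leaves)         # 16
print(approx_validation)  # 19952
print(approx_fit)         # 179571
print(approx_per_tree)    # 116721
```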

## Implement a training and evaluation procedure

The `run_experiment()` function is responsible for loading the train and test datasets, training a given model, and evaluating the trained model.

When training a decision forests model, only one epoch is needed to read the full dataset.
Any extra steps only slow training down unnecessarily.

In [8]:
def run_experiment(model, train_data, test_data, num_epochs=1, batch_size=None):

    train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
        train_data, label=TARGET_COLUMN_NAME, weight=WEIGHT_COLUMN_NAME
    )
    test_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
        test_data, label=TARGET_COLUMN_NAME, weight=WEIGHT_COLUMN_NAME
    )

    model.fit(train_dataset, epochs=num_epochs, batch_size=batch_size)
    _, accuracy = model.evaluate(test_dataset, verbose=0)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
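
Conceptually, the conversion treats the label and weight columns differently from the rest: they are split off, and only the remaining columns are passed to the model as features. The pandas-only sketch below illustrates that separation on made-up rows. It is not how `pd_dataframe_to_tf_dataset` is implemented, and `split_columns` is a hypothetical helper.

```python
import pandas as pd


def split_columns(dataframe, label, weight):
    """Separate features, labels, and example weights (illustrative helper)."""
    features = dataframe.drop(columns=[label, weight])
    return features, dataframe[label], dataframe[weight]


# Made-up rows for illustration; the weights are not real census weights.
df = pd.DataFrame(
    {
        "age": [73, 58],
        "education": ["High school graduate", "Some college but no degree"],
        "instance_weight": [1700.0, 1050.0],
        "income_level": [0, 0],
    }
)

features, labels, weights = split_columns(df, "income_level", "instance_weight")
print(list(features.columns))  # ['age', 'education']
```

This is why the weight column never needs to appear in the feature lists: it influences how much each example counts, not what the trees split on.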

## Experiment: Decision forests with raw features

### Specify model input feature usages

You can attach semantics to each feature to control how it is used by the model.
If not specified, the semantics are inferred from the representation type.

It is recommended to specify the feature usages explicitly to avoid incorrectly inferred semantics. For example, a categorical value identifier (an integer) will be inferred as numerical, even though it is semantically categorical.

For numerical features, you can set the `discretized` parameter to the number of buckets into which the numerical feature should be discretized.

This makes training faster but may lead to worse models.
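
To illustrate the trade-off, the sketch below buckets a numerical feature using fixed boundaries. After bucketing, a tree can only split between buckets, so there are fewer candidate thresholds to evaluate (faster), but values in the same bucket become indistinguishable (coarser). The boundaries and the `discretize` helper are hypothetical; this is not necessarily how TF-DF chooses its buckets.

```python
from bisect import bisect_right


def discretize(values, boundaries):
    """Map each value to the index of the bucket it falls into."""
    return [bisect_right(boundaries, v) for v in values]


ages = [17, 25, 38, 52, 73]  # made-up values
boundaries = [20, 40, 60]    # 3 boundaries -> 4 buckets

buckets = discretize(ages, boundaries)
print(buckets)  # [0, 1, 1, 2, 3]

# 25 and 38 share a bucket, so no split can separate them any more.
print(buckets[1] == buckets[2])  # True
```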

In [9]:
def specify_feature_usages():
    feature_usages = []

    for feature_name in NUMERIC_FEATURE_NAMES:
        feature_usage = tfdf.keras.FeatureUsage(
            name=feature_name, semantic=tfdf.keras.FeatureSemantic.NUMERICAL
        )
        feature_usages.append(feature_usage)

    for feature_name in CATEGORICAL_FEATURE_NAMES:
        feature_usage = tfdf.keras.FeatureUsage(
            name=feature_name, semantic=tfdf.keras.FeatureSemantic.CATEGORICAL
        )
        feature_usages.append(feature_usage)

    return feature_usages

### Create a gradient boosted trees model

When compiling a decision forests model, you may only provide extra evaluation metrics.

The loss is specified at model construction, and the optimizer is irrelevant to decision forests models.

In [10]:
def create_gbt_model():
    # see all the model parameters in
    gbt_model = tfdf.keras.GradientBoostedTreesModel(
        features=specify_feature_usages(),
        exclude_non_specified_features=True,
        num_trees=NUM_TREES,
        max_depth=MAX_DEPTH,
        min_examples=MIN_EXAMPLES,
        subsample=SUBSAMPLE,
        validation_ratio=VALIDATION_RATIO,
        task=tfdf.keras.Task.CLASSIFICATION,
    )

    # Only extra evaluation metrics are provided here; no loss or optimizer.
    gbt_model.compile(metrics=[keras.metrics.BinaryAccuracy(name="accuracy")])
    return gbt_model

### Train and evaluate the model


In [11]:
gbt_model = create_gbt_model()
run_experiment(gbt_model, train_data, test_data)

Metal device set to: Apple M1

systemMemory: 8.00 GB
maxCacheSize: 2.67 GB

Use /var/folders/sk/f7k402kx1wvdmcz91gdz6hs00000gn/T/tmpht8ghyrm as temporary training directory


2022-12-22 19:31:14.041865: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-22 19:31:14.043165: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
  features_dataframe = dataframe.drop(label, 1)
  features_dataframe = features_dataframe.drop(weight, 1)

Reading training dataset...


2022-12-22 19:31:18.645707: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2022-12-22 19:31:18.647797: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Training dataset read in 0:00:03.923574. Found 199523 examples.
Training model...
Model trained in 0:00:12.826989
Compiling model...


[INFO kernel.cc:1176] Loading model from path /var/folders/sk/f7k402kx1wvdmcz91gdz6hs00000gn/T/tmpht8ghyrm/model/ with prefix 0372030dab4540cc
[INFO abstract_model.cc:1249] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO kernel.cc:1022] Use fast generic engine


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code


Model compiled.


2022-12-22 19:31:34.502168: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2022-12-22 19:31:34.598669: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2022-12-22 19:31:34.808514: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Test accuracy: 95.81%


### Inspect the model

The `model.summary()` method displays information about your decision trees model: model type, task, input features, and feature importance.

In [12]:
# summary() prints the report itself and returns None, so no print() is needed.
gbt_model.summary()

Model: "gradient_boosted_trees_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1
Trainable params: 0
Non-trainable params: 1
_________________________________________________________________
Type: "GRADIENT_BOOSTED_TREES"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (40):
	age
	capital_gains
	capital_losses
	citizenship
	class_of_worker
	country_of_birth_father
	country_of_birth_mother
	country_of_birth_self
	detailed_household_and_family_stat
	detailed_household_summary_in_household
	detailed_industry_recode
	detailed_occupation_recode
	dividends_from_stocks
	education
	enroll_in_edu_inst_last_wk
	family_members_under_18
	fill_inc_questionnaire_for_veteran's_admin
	full_or_part_time_employment_stat
	hispanic_origin
	live_in_this_house_1_year_ago
	major_industry_code
	major_occupation_code
	marital_stat
	member_of_a_labor_union
	migration_code-change_in_msa
	migration_code-