In [None]:
!pip install tensorflow_decision_forests
!pip install tfds-nightly -U --quiet

# The Goal of This Notebook

I enjoy using tree-based algorithms for data science. They perform well, are interpretable and easy to use, and are relatively easy to tune. However, they do not apply directly to a wide range of tasks such as image processing, NLP, and signal processing.

In this notebook, I want to show how to model NLP tasks with tree-based algorithms, with the help of deep learning layers.

The TensorFlow website also has an example on the same topic, which you can find [here](https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab).

# The Inspiration

The inspiration for this kernel comes from [this](https://arxiv.org/pdf/2009.09991.pdf) article. If you are curious about the topic, I highly recommend checking it out.

# The Methodology

I used gradient boosted trees and random forests for a binary text classification task. The hyperparameters of the tree algorithms come from the article. I also included a couple of different deep learning algorithms for comparison. I hope you'll enjoy it!

# The Content

1. [Native Categorical Set Handling](#1)
2. [Pretrained Embedding](#2)
3. [Count Based Preprocessing](#3)
4. [Nontrained Embedding](#4)

[Final Words](#5)

In [5]:
import tensorflow_decision_forests as tfdf
import tensorflow_hub as hub
from tensorflow.keras import layers
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

2022-07-08 08:30:32.145855: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:
2022-07-08 08:30:32.145914: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [6]:
import tensorflow_datasets as tfds
dataset = tfds.load('imdb_reviews',
                          as_supervised=True)

2022-07-08 08:30:37.887451: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".


Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Dataset imdb_reviews downloaded and prepared to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


2022-07-08 08:31:37.204160: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:
2022-07-08 08:31:37.204225: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-07-08 08:31:37.204296: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (948127bc3417): /proc/driver/nvidia/version does not exist
2022-07-08 08:31:37.204675: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the

In [7]:
train_ds = dataset["train"].batch(100)
test_ds = dataset["test"].batch(100)

train_ds = train_ds.cache().prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.cache().prefetch(tf.data.AUTOTUNE)

In [8]:
for example, label in train_ds.take(3):
  for i in range(3):
    print(example[i])

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)
tf.Tensor(b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because t

2022-07-08 08:31:37.459320: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


<a id='1'></a>
# 1. Native Categorical Set Handling

* TensorFlow Decision Forests can handle categorical-set features natively [[1]](https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab). Let's see how this native handling performs.
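
A categorical-set feature is an unordered set of tokens per example rather than a single string. The sketch below illustrates the idea in plain Python, assuming simple whitespace tokenization (the linked tutorial does the equivalent with `tf.strings.split` before training). The helper names `to_token_set` and `contains_any` are hypothetical and only show the kind of "does the set contain any of these tokens?" condition a tree can split on. Note that the cells in this section pass the raw review strings as they are, so tokenizing first, as the tutorial does, is probably the safer setup.

```python
def to_token_set(text):
    """Lowercase a review and split it on whitespace into a set of tokens."""
    return set(text.lower().split())


def contains_any(token_set, candidates):
    """Split condition on a categorical set: true if any candidate token is present."""
    return not token_set.isdisjoint(candidates)


reviews = [
    "This was an absolutely terrible movie",
    "Both are great actors",
]
token_sets = [to_token_set(r) for r in reviews]

# A hypothetical split a tree might learn: "review mentions 'terrible' or 'worst'".
negative_words = {"terrible", "worst"}
for review, tokens in zip(reviews, token_sets):
    print(contains_any(tokens, negative_words), "<-", review)
```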

In [9]:
rf_params = {
    "num_trees": 500,
    "max_depth": 32,
    "categorical_algorithm": "RANDOM",
    "random_seed": 123,
}

model_1 = tfdf.keras.RandomForestModel(**rf_params)
model_1.fit(x=train_ds)
model_1.compile(metrics=["accuracy"])
evaluation = model_1.evaluate(test_ds)

Use /tmp/tmpewg437zt as temporary training directory
Reading training dataset...
Training dataset read in 0:00:07.074143. Found 25000 examples.
Training model...
Model trained in 0:00:00.906747
Compiling model...


[INFO kernel.cc:1176] Loading model from path /tmp/tmpewg437zt/model/ with prefix 7f0901fca6da4225
[INFO abstract_model.cc:1246] Engine "RandomForestOptPred" built
[INFO kernel.cc:1022] Use fast generic engine


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code
Model compiled.


In [10]:
gb_params = {
    "max_depth": 6,
    "shrinkage": 0.1,
    "sampling_method": None,
    "validation_ratio": 0.1,
    "num_trees": 500,
}

model_1_mart = tfdf.keras.GradientBoostedTreesModel(**gb_params)
model_1_mart.fit(train_ds)
model_1_mart.compile(metrics=["accuracy"])
evaluation = model_1_mart.evaluate(test_ds)

Use /tmp/tmpeo4zw5la as temporary training directory
Reading training dataset...
Training dataset read in 0:00:00.271786. Found 25000 examples.
Training model...
Model trained in 0:00:00.308375
Compiling model...
Model compiled.


[INFO kernel.cc:1176] Loading model from path /tmp/tmpeo4zw5la/model/ with prefix 98acda81e1f4433c
[INFO kernel.cc:1022] Use fast generic engine




<a id='2'></a>
# 2. Pretrained Embedding

* Neural networks are generally trained for several epochs because of the nature of SGD. Tree-based algorithms, on the other hand, need only a single pass over the data [[2]](https://www.tensorflow.org/decision_forests/migration). Because the embedding weights therefore cannot be updated across epochs, using a pre-trained embedding is a wise move for NLP modeling [[3]](https://www.tensorflow.org/decision_forests/text_features).

* It's important to note that tree algorithms cannot exploit semantic information the way neural nets do, because they solve problems by splitting on individual features rather than by matrix multiplications or dot products. Even so, they still perform well most of the time, because much of the information in an embedding goes unused anyway.
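
The contrast between splitting and dot products can be sketched in plain Python: a tree node looks at one embedding dimension at a time, while a dense neuron combines all dimensions through a weighted sum. The vector, threshold, and weights below are made-up illustration values, not taken from the models in this notebook, and the function names are hypothetical.

```python
def tree_split(embedding, feature_index, threshold):
    """Axis-aligned decision: compare a single embedding dimension to a threshold."""
    return embedding[feature_index] >= threshold


def dense_unit(embedding, weights, bias):
    """Linear neuron: weighted sum over all embedding dimensions."""
    return sum(w * x for w, x in zip(weights, embedding)) + bias


embedding = [0.2, -0.7, 0.5, 0.1]  # toy 4-dimensional sentence embedding

goes_right = tree_split(embedding, feature_index=2, threshold=0.3)
activation = dense_unit(embedding, weights=[1.0, 1.0, 1.0, 1.0], bias=0.0)

print("tree split on dimension 2:", goes_right)
print("dense unit activation:", round(activation, 3))
```

The tree only ever sees `embedding[2]` at this node, so any signal that lives in a combination of dimensions has to be approximated by many consecutive splits.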

In [11]:
# Universal Sentence Encoder: maps each review to a fixed-size vector.
hub_url = "http://tfhub.dev/google/universal-sentence-encoder/4"
embedding = hub.KerasLayer(hub_url)

# Wrap the encoder in a Keras model that serves as the preprocessing step.
sentence = tf.keras.layers.Input(shape = (), name = 'sentence', dtype = tf.string)
embedded_sentence = embedding(sentence)
preprocessor = tf.keras.Model(sentence,embedded_sentence)
model_2 = tfdf.keras.RandomForestModel(preprocessing = preprocessor,
                                     **rf_params)
model_2.fit(train_ds)
model_2.compile(metrics=["accuracy"])
model_2.evaluate(test_ds)

Use /tmp/tmp_2fbdnur as temporary training directory
Reading training dataset...
Training dataset read in 0:00:40.068571. Found 25000 examples.
Training model...


[INFO kernel.cc:1176] Loading model from path /tmp/tmp_2fbdnur/model/ with prefix ef13ae50cbbb4117
[INFO abstract_model.cc:1246] Engine "RandomForestOptPred" built
[INFO kernel.cc:1022] Use fast generic engine


Model trained in 0:02:32.919475
Compiling model...
Model compiled.


[0.0, 0.8382400274276733]

In [12]:
model_2_mart = tfdf.keras.GradientBoostedTreesModel(**gb_params,
                                                    preprocessing = preprocessor)
model_2_mart.fit(train_ds)
model_2_mart.compile(metrics=["accuracy"])
model_2_mart.evaluate(test_ds)

Use /tmp/tmpkcgden5f as temporary training directory
Reading training dataset...
Training dataset read in 0:00:40.878853. Found 25000 examples.
Training model...
Model trained in 0:08:53.112875
Compiling model...


[INFO kernel.cc:1176] Loading model from path /tmp/tmpkcgden5f/model/ with prefix cbb295e182d24030
[INFO abstract_model.cc:1246] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO kernel.cc:1022] Use fast generic engine


Model compiled.


[0.0, 0.8526399731636047]

<a id='3'></a>
# 3. Count Based Preprocessing

* Count-based preprocessing represents each text as a vector whose entry at a given index counts how many times the token with that vocabulary index appears in the text [[4]](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). The article calls this preprocessing "BagOfWord"; the relevant information is at the top of the second column on page 2.
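
To make the count representation concrete, here is a minimal sketch in plain Python. It assumes lowercase whitespace tokenization and leaves out details that `TextVectorization` handles on its own, such as punctuation stripping and out-of-vocabulary tokens. The functions `build_vocabulary` and `count_vector` are hypothetical helpers for illustration only.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Keep the max_tokens most frequent whitespace tokens across all texts."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return [token for token, _ in counts.most_common(max_tokens)]


def count_vector(text, vocabulary):
    """Entry i is how often vocabulary[i] occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[token] for token in vocabulary]


texts = ["great movie great cast", "terrible movie"]
vocabulary = build_vocabulary(texts, max_tokens=3)
vectors = [count_vector(t, vocabulary) for t in texts]

print(vocabulary)
print(vectors)
```

Each column of `vectors` becomes one numerical feature for the trees, so a split such as "count of `great` >= 1" is directly available.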

In [13]:
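# Adapt an unrestricted vectorizer only to report the full vocabulary size.
layer = tf.keras.layers.TextVectorization()
layer.adapt(train_ds.map(lambda x,y:x))
print("Number of words: ",len(layer.get_vocabulary()))

MAX_TOKENS = 5000

# Vectorizer limited to MAX_TOKENS entries that outputs per-review token counts.
count_layer = tf.keras.layers.TextVectorization(max_tokens = MAX_TOKENS,
                                                output_mode = 'count')
count_layer.adapt(train_ds.map(lambda x,y:x))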

sentence = tf.keras.layers.Input(shape = (), name = 'sentence', dtype = tf.string)
encoded_sentence = count_layer(sentence)
preprocess_model = tf.keras.Model(sentence,encoded_sentence)

model_3 = tfdf.keras.RandomForestModel(preprocessing = preprocess_model,
                                     **rf_params)
model_3.fit(train_ds)
model_3.compile(metrics=["accuracy"])
model_3.evaluate(test_ds)

Number of words:  121894
Use /tmp/tmps2pgr1u8 as temporary training directory
Reading training dataset...
Training dataset read in 0:01:11.667403. Found 25000 examples.
Training model...


[INFO kernel.cc:1176] Loading model from path /tmp/tmps2pgr1u8/model/ with prefix a64ae104f5d74587
[INFO abstract_model.cc:1246] Engine "RandomForestOptPred" built
[INFO kernel.cc:1022] Use fast generic engine


Model trained in 0:22:17.919885
Compiling model...
Model compiled.


[0.0, 0.8240799903869629]

In [14]:
model_3_mart = tfdf.keras.GradientBoostedTreesModel(**gb_params,
                                                    preprocessing = preprocess_model,
                                                    )
model_3_mart.fit(train_ds)
model_3_mart.compile(metrics=["accuracy"])
model_3_mart.evaluate(test_ds)

Use /tmp/tmpa5w9qkbu as temporary training directory
Reading training dataset...
Training dataset read in 0:01:07.693841. Found 25000 examples.
Training model...
Model trained in 1:07:37.362400
Compiling model...


[INFO kernel.cc:1176] Loading model from path /tmp/tmpa5w9qkbu/model/ with prefix 74239e1c48bc4619
[INFO abstract_model.cc:1246] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO kernel.cc:1022] Use fast generic engine


Model compiled.


[0.0, 0.8692399859428406]

<a id='4'></a>
# 4. Nontrained Embedding 

* I prepared this part just to show why using a pre-trained embedding is the better idea.

In [15]:
# Map each review to a fixed-length sequence of integer token ids.
# Note: output_sequence_length reuses MAX_TOKENS, so every review is padded
# (or truncated) to 5000 ids.
indexer = tf.keras.layers.TextVectorization(max_tokens = MAX_TOKENS,
                                            output_mode = 'int',
                                            output_sequence_length = MAX_TOKENS)
indexer.adapt(train_ds.map(lambda x,y: x))

# Randomly initialized embedding; its weights are never trained.
embedding = tf.keras.layers.Embedding(input_dim = MAX_TOKENS, output_dim = 512)

sentence = tf.keras.Input(shape = (), name = 'sentence', dtype = tf.string)
indexed_sentence = indexer(sentence)
embedded_sentence = embedding(indexed_sentence)
# Average over the sequence axis (padding positions included, since no mask is set).
output = tf.keras.layers.GlobalAveragePooling1D()(embedded_sentence)

non_trained_embedding_model = tf.keras.Model(sentence,output)

model_4 = tfdf.keras.RandomForestModel(preprocessing = non_trained_embedding_model,
                                     **rf_params)


Use /tmp/tmpf6wm3aff as temporary training directory


In [16]:
model_4.fit(train_ds)
model_4.compile(metrics=["accuracy"])
model_4.evaluate(test_ds)

Reading training dataset...


2022-07-08 10:21:49.696943: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1024000000 exceeds 10% of free system memory.
2022-07-08 10:21:50.621850: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1024000000 exceeds 10% of free system memory.
2022-07-08 10:21:51.058738: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1024000000 exceeds 10% of free system memory.
2022-07-08 10:21:51.489772: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1024000000 exceeds 10% of free system memory.
2022-07-08 10:21:51.936681: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1024000000 exceeds 10% of free system memory.


Training dataset read in 0:01:56.556941. Found 25000 examples.
Training model...


[INFO kernel.cc:1176] Loading model from path /tmp/tmpf6wm3aff/model/ with prefix 1c150f55bd544f41
[INFO abstract_model.cc:1246] Engine "RandomForestOptPred" built
[INFO kernel.cc:1022] Use fast generic engine


Model trained in 0:03:27.490342
Compiling model...
Model compiled.


[0.0, 0.673520028591156]
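
Besides the embedding being untrained, one more detail may contribute to the low score: the indexer pads every review to 5000 ids and the pooling layer averages over all of them, so the real tokens are heavily diluted by padding. Here is a rough sketch of the arithmetic in plain Python, assuming float32 activations and a hypothetical review length of 250 tokens (an illustration value, not measured from the dataset):

```python
BATCH_SIZE = 100        # from train_ds.batch(100)
SEQUENCE_LENGTH = 5000  # output_sequence_length = MAX_TOKENS
EMBEDDING_DIM = 512     # output_dim of the Embedding layer
BYTES_PER_FLOAT32 = 4   # assumption: activations are stored as float32

# Size of one batch of embedded sequences before pooling.
activation_bytes = BATCH_SIZE * SEQUENCE_LENGTH * EMBEDDING_DIM * BYTES_PER_FLOAT32
print("bytes per embedded batch:", activation_bytes)

# Share of the averaged positions that hold real tokens rather than padding,
# for a hypothetical review of 250 tokens.
assumed_review_length = 250
real_token_share = assumed_review_length / SEQUENCE_LENGTH
print("share of real tokens in the average:", real_token_share)
```

The first number matches the 1024000000-byte allocations reported in the warnings above, which suggests they come from this padded tensor.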

<a id='5'></a>
# Final Words

* Let's continue this experiment with different neural network models before wrapping it up.

* You can find the rest of the experiment [here](https://www.kaggle.com/code/egemenuurdalg/modeling-tree-algorithms-for-nlp-tasks-part-2?scriptVersionId=100331891). 