<a href="https://colab.research.google.com/github/forthmedia/notebooks/blob/main/TensorFlow_Decision_Forrests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Walkthrough of Keras RandomForestModel on Colab

<span style="font-size:large">The Ensemble Learning method</span> trains a group of Decision Trees to make *predictions* in a way that emulates *the wisdom of the crowd*. Despite its intuitive simplicity, **Random Forest** is one of the most powerful Machine Learning algorithms today. A new implementation by *TensorFlow* brings this method to the same *Keras* API used for *neural networks*.

![](https://1.bp.blogspot.com/-Ax59WK4DE8w/YK6o9bt_9jI/AAAAAAAAEQA/9KbBf9cdL6kOFkJnU39aUn4m8ydThPenwCLcBGAsYHQ/s0/Random%2BForest%2B03.gif)
[https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html](https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html)

## Let's get started
In this *notebook* I'm going to walk you through how to install and train a **RandomForestModel**, one of the new models in TensorFlow Decision Forests. Let's import the dataset and have a look at it.


In [1]:
import numpy as np
import pandas as pd

In [2]:
# import and display dataset
df = pd.read_csv('cars.csv')
df.head()

Unnamed: 0,mpg,cylinders,cubicinches,hp,weightlbs,time-to-60,year,brand
0,14.0,8,350,165,4209,12,1972,US.
1,31.9,4,89,71,1925,14,1980,Europe.
2,17.0,8,302,140,3449,11,1971,US.
3,15.0,8,400,150,3761,10,1971,US.
4,30.5,4,98,63,2051,17,1978,US.


## Analyzing your data

This was supposed to be *very easy*. However, when converting the data from *Pandas* to *Keras* format, the conversion raised an error: `The label "brand" is not a column of the dataframe.` Clearly, "brand" *is* one of the column labels. So what's wrong? If you examine the header values, you'll see that most have *leading white space*, so the column is actually named `' brand'`. This has to be cleaned up.

In [3]:
# column header values
display(list(df.columns.values))

['mpg',
 ' cylinders',
 ' cubicinches',
 ' hp',
 ' weightlbs',
 ' time-to-60',
 ' year',
 ' brand']

The *key takeaway* is that while it is impossible to predict what kind of issues you'll run into, plenty of code examples are available online, for instance on [Stack Overflow](https://stackoverflow.com), and they can help you construct a workaround.

In [4]:
# strip leading spaces from column names
df.columns = df.columns.str.lstrip(' ')
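
To see why the lookup failed, and one way to avoid the problem at load time, here is a minimal sketch. It uses a small inline CSV in place of `cars.csv`; the rows come from the table above, and the spacing after each comma is an assumption about how the real file is laid out. The `skipinitialspace` option of `pd.read_csv` skips spaces that follow a delimiter, in the header and in the values.

```python
import io
import pandas as pd

# Inline stand-in for cars.csv (assumed layout: a space after every comma).
raw = "mpg, cylinders, brand\n14.0, 8, US.\n31.9, 4, Europe.\n17.0, 8, US.\n"

# Default parsing keeps the space, so the column is ' brand', not 'brand'.
messy = pd.read_csv(io.StringIO(raw))
print(list(messy.columns))
print("brand" in messy.columns)

# Fix 1: clean the header after loading (same idea as the cell above).
cleaned = messy.copy()
cleaned.columns = cleaned.columns.str.strip()

# Fix 2: let the parser skip the spaces while reading.
clean_at_load = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
print(list(clean_at_load.columns))
```

Note that the first fix only touches the header, while the second also removes the leading space from string values such as the brand names.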

## Feature selection
Your model will learn successfully if the training data contains enough *relevant features* and not too many *irrelevant* ones. For my own take on this dataset, I reasoned that the *weight* of the vehicle and its *acceleration* were probably *correlated*, negatively or positively, with other features in the table, so they add little new information. Those columns can be dropped.

I thought that *year* might help identify US, Japanese, or European brands, the object of this dataset challenge.

In [5]:
# drop columns
columns = ['weightlbs','time-to-60']
df.drop(columns, axis=1, inplace=True)
df.head()

Unnamed: 0,mpg,cylinders,cubicinches,hp,year,brand
0,14.0,8,350,165,1972,US.
1,31.9,4,89,71,1980,Europe.
2,17.0,8,302,140,1971,US.
3,15.0,8,400,150,1971,US.
4,30.5,4,98,63,1978,US.
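
One way to check the reasoning behind dropping those two columns is to look at pairwise correlations before dropping them. The sketch below rebuilds only the five rows printed by the first `df.head()` call, so the numbers are illustrative at best; on the full dataset you would run the same two lines on `df` itself.

```python
import pandas as pd

# The five rows shown by df.head() earlier in this notebook.
sample = pd.DataFrame({
    "mpg": [14.0, 31.9, 17.0, 15.0, 30.5],
    "cylinders": [8, 4, 8, 8, 4],
    "cubicinches": [350, 89, 302, 400, 98],
    "hp": [165, 71, 140, 150, 63],
    "weightlbs": [4209, 1925, 3449, 3761, 2051],
    "time-to-60": [12, 14, 11, 10, 17],
    "year": [1972, 1980, 1971, 1971, 1978],
    "brand": ["US.", "Europe.", "US.", "US.", "US."],
})

# Pearson correlation between the numeric columns only.
corr = sample.select_dtypes(include="number").corr()

# How strongly do the two candidate columns track the other features?
print(corr[["weightlbs", "time-to-60"]].round(2))
```

Values close to 1 or -1 suggest that a column mostly repeats information already carried by another feature.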


## Splitting the data into Training and Test sets

This is where TensorFlow differs from what you might be used to. Scikit-learn's *model_selection* module provides **train_test_split** to split data into random *train* and *test* subsets, with separate *X* (features) and *y* (labels) data. TensorFlow Decision Forests does not require separate *X* and *y* data; the label stays in the dataframe as a regular column.

The following routine splits the data into training and test sets. You'll see how these are used soon.

In [6]:
# split the dataset into Training and Test sets

def split_dataset(dataset, ratio=0.25):
    # draw one uniform random number per row; rows below `ratio` go to the test set
    indices = np.random.rand(len(dataset)) < ratio
    return dataset[~indices], dataset[indices]

train_ds, test_ds = split_dataset(df)
print('{} samples for training, {} samples for test.'
      .format(len(train_ds),len(test_ds)))

201 samples for training, 60 samples for test.
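
Because the routine above draws fresh random numbers on every run, the sample counts (and later the accuracy) change each time. If you want a repeatable split, one option is to pass a seed to a NumPy random generator. This is a sketch of that variation, not part of the original notebook; `split_dataset_seeded`, the `seed` argument, and the placeholder dataframe are all names and data made up for the illustration.

```python
import numpy as np
import pandas as pd

def split_dataset_seeded(dataset, ratio=0.25, seed=42):
    """Same idea as split_dataset, but repeatable for a given seed."""
    rng = np.random.default_rng(seed)
    # One uniform draw per row; rows whose draw falls below `ratio` go to test.
    is_test = rng.random(len(dataset)) < ratio
    return dataset[~is_test], dataset[is_test]

# Placeholder frame standing in for the cars dataframe.
demo = pd.DataFrame({"mpg": np.arange(100.0), "brand": ["US."] * 100})

train_a, test_a = split_dataset_seeded(demo)
train_b, test_b = split_dataset_seeded(demo)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))  # the same rows are picked both times
```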


## Install and import TensorFlow Decision Forests

This is the point where I squeeze the new capability onto Colab. Scroll past all this output to continue.

In [7]:
# install TensorFlow Decision Forests
!pip install tensorflow_decision_forests

Collecting tensorflow_decision_forests
  Downloading tensorflow_decision_forests-0.1.8-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.3 MB)
Installing collected packages: tensorflow-decision-forests
Successfully installed tensorflow-decision-forests-0.1.8


In [8]:
import tensorflow_decision_forests as tfdf

## Convert Pandas to TensorFlow

The library's Keras API provides **pd_dataframe_to_tf_dataset** to prepare data for its Machine Learning models. Specify your dataset, and notice that you identify the target *column* by name through the **label** argument. In our case, we do not break out "brand" into separate *y data*.

In [9]:
# convert Pandas dataframe to Tensorflow Training set
train = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds, label="brand")
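
Conceptually, the conversion only needs the dataframe and the name of the label column; it separates features from the target for you. The sketch below mimics that idea in plain pandas so you can see the shape of the result. It is not the library's implementation, and encoding the class names as integers in sorted order is an assumption about how string labels are handled.

```python
import pandas as pd

# Three rows from the table above, after dropping the unused columns.
frame = pd.DataFrame({
    "mpg": [14.0, 31.9, 17.0],
    "cylinders": [8, 4, 8],
    "brand": ["US.", "Europe.", "US."],
})
label = "brand"

# Features: every column except the label, keyed by column name.
features = {name: frame[name].tolist() for name in frame.columns if name != label}

# Targets: class names mapped to integer indices (sorted order is assumed).
classes = sorted(frame[label].unique())
targets = frame[label].map(classes.index).tolist()

print(features)
print(classes, targets)
```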

## Set up and train the model

It really is just two lines of code.

In [10]:
# train the model
model = tfdf.keras.RandomForestModel()
model.fit(train)



<tensorflow.python.keras.callbacks.History at 0x7f5f07aeec50>

## How well does the model do?

Running the test set against the model requires the same conversion we did for training. There is an extra step here to **compile** the model for what you want it to do (here, report accuracy), which is common for *neural networks*.

Keras is actually a *Deep Learning* platform, and the TensorFlow team says that the Decision Forest library can serve as a *bridge* to the TensorFlow ecosystem.

In [11]:
# convert Pandas dataframe to Tensorflow Test set
test = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds, label="brand")

# evaluate the model
model.compile(metrics=["accuracy"])

evaluation = model.evaluate(test, return_dict=True)
print()

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")


loss: 0.0000
accuracy: 0.8167
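
Accuracy is simply the fraction of test rows whose predicted class matches the true class. As a sanity check on the number above, the sketch below computes it by hand on six made-up rows (the class indices are hypothetical) and then shows that 49 correct answers out of the 60 test samples rounds to the reported 0.8167.

```python
import numpy as np

# Hypothetical predicted and true class indices for six test rows.
predicted = np.array([2, 0, 2, 1, 2, 0])
actual = np.array([2, 0, 1, 1, 2, 2])

# Fraction of positions where the prediction equals the true label.
accuracy = float(np.mean(predicted == actual))
print(f"toy accuracy: {accuracy:.4f}")

# With 60 test samples, 49 correct predictions give the reported value.
reported = round(49 / 60, 4)
print(f"49 of 60: {reported}")
```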


## Your mileage may vary

I can get approximately 80% accuracy using this technique on this particular dataset, depending on how I split *training* and *test* sets. All that without setting any *hyperparameters*.

In Machine Learning, a *hyperparameter* is a setting you choose before training that controls how the model learns, while a *parameter* is a value *inside* the model that is adjusted as it learns. The **ratio** argument in the **split_dataset** routine *above* works like a *hyperparameter*: adjust it to change how much data goes to *training* and how much to *testing*.

Below, we run **predict** against the *test* set to see how well the model classifies. For each row, the Random Forest outputs three values, one probability per "brand" class, and the largest one indicates the predicted brand. The first item is shown.

In [12]:
classifications = model.predict(test)
print(f'Predict first item\n{classifications[0]}\n')

test_ds.head(1)

Predict first item
[0.28333315 0.4666663  0.24999984]



Unnamed: 0,mpg,cylinders,cubicinches,hp,year,brand
4,30.5,4,98,63,1978,US.
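
To turn a probability vector like this into a class, take the position of the largest value. The sketch below does that for the first prediction printed above. The class names and their order are an assumption (alphabetical, with a hypothetical "Japan." label that does not appear in the output shown here), so check the real order before trusting the name. It also assumes the library default of 300 trees, each casting one vote, which would explain why the three values look like vote fractions.

```python
import numpy as np

# First prediction, copied from the output above.
first_prediction = np.array([0.28333315, 0.4666663, 0.24999984])

# Assumed class order; "Japan." is a hypothetical label name.
assumed_classes = ["Europe.", "Japan.", "US."]

# Index of the most probable class.
best = int(np.argmax(first_prediction))
print(best, assumed_classes[best])

# Assuming 300 trees with one vote each, recover the vote counts.
votes = np.rint(first_prediction * 300).astype(int)
print(votes, votes.sum())
```

If the assumed order is right, the top class for this row is not "US.", so the first test item would be one of the misclassified ones.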


## Tools for analyzing the model

#### Model inspector

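The **variable_importances** method of the model inspector reports several importance measures for each feature. Judging by *MEAN_MIN_DEPTH*, where a lower value means the feature tends to be used closer to the root, the model favored *cylinders*, *mpg*, and *cubicinches*.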

In [13]:
model.make_inspector().variable_importances()

{'MEAN_MIN_DEPTH': [("__LABEL" (4; #5), 5.173963676718403),
  ("hp" (1; #2), 2.9866184541461),
  ("year" (1; #4), 2.847622407079466),
  ("cubicinches" (4; #0), 2.79901524733972),
  ("mpg" (1; #3), 2.079762596997287),
  ("cylinders" (1; #1), 1.2646946048291956)],
 'NUM_AS_ROOT': [("cylinders" (1; #1), 211.0),
  ("mpg" (1; #3), 73.0),
  ("hp" (1; #2), 11.0),
  ("cubicinches" (4; #0), 5.0)],
 'NUM_NODES': [("mpg" (1; #3), 1710.0),
  ("hp" (1; #2), 1641.0),
  ("year" (1; #4), 919.0),
  ("cubicinches" (4; #0), 818.0),
  ("cylinders" (1; #1), 424.0)],
 'SUM_SCORE': [("cylinders" (1; #1), 13135.69299292285),
  ("mpg" (1; #3), 11829.886550331197),
  ("hp" (1; #2), 8575.641763662628),
  ("cubicinches" (4; #0), 5621.042308934033),
  ("year" (1; #4), 5476.567160189501)]}
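
The raw dictionary is easier to read as shares. As a small worked example, the sketch below takes the *NUM_AS_ROOT* counts copied from the output above and converts them to the fraction of trees that start with each feature. Every tree has exactly one root, so the counts should add up to the number of trees in the forest.

```python
# NUM_AS_ROOT counts, copied from the inspector output above.
num_as_root = {"cylinders": 211.0, "mpg": 73.0, "hp": 11.0, "cubicinches": 5.0}

# One root per tree, so the total is the number of trees.
total_trees = sum(num_as_root.values())
print(f"trees: {total_trees:.0f}")

# Share of trees whose first split uses each feature, largest first.
shares = {
    feature: count / total_trees
    for feature, count in sorted(num_as_root.items(), key=lambda kv: kv[1], reverse=True)
}
for feature, share in shares.items():
    print(f"{feature}: {share:.1%}")
```

Note that *year* is missing from this list: no tree uses it for its very first split, even though it appears in many nodes further down.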

#### Model plotter

*TensorFlow Decision Forests* includes a model plotter for *Google Colab* that allows you to *interpret* tree structure.

After considering *horsepower*, the model branches to *engine size (cubic inches)* and quickly decides the case for US cars <span style="color:green;">(solid green)</span>. Over in the fourth column, we see that *year* helped decide the case between European <span style="color:blue;">(blue)</span> and Japanese cars <span style="color:red;">(red)</span>.

In [14]:
tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0)

#### Conclusion

What this seems to say is that although the *model inspector* did not calculate *year* to be very important, it was not a bad idea to include it in the model, as seen in the *model plotter*.

I have never yet encountered an AI that had any common sense.

## What do you think?

Machine Learning enthusiasts will be interested to know that *TensorFlow Decision Forests* does not currently run on *Mac* or *Windows*. But, hey, you can try it on Colab!
