<h1> Random Forest with TensorFlow Decision Forests on the Palmer Penguins Dataset </h1>

Author: Vaasudevan Srinivasan <br>
Created on: July 31, 2021

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" width="40%"/>

[Tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab) | [Dataset](https://allisonhorst.github.io/palmerpenguins/)

In [2]:
%%capture

# Install TensorFlow Decision Forests, plus wurlitzer to surface the C++ training logs
!pip install tensorflow_decision_forests wurlitzer

In [13]:
from sklearn.model_selection import train_test_split
from wurlitzer import sys_pipes
import tensorflow_decision_forests as tfdf
import tensorflow as tf
import pandas as pd

tf.__version__

'2.5.0'

In [4]:
df = pd.read_csv('https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv')
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


In [9]:
# Hold out 30% of the rows for testing; no random_state is set, so the split changes between runs
train_df, test_df = train_test_split(df, test_size=0.3)
print(train_df.shape, test_df.shape)

(240, 8) (104, 8)
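The shapes printed above can be checked by hand. This is a minimal sketch, assuming scikit-learn rounds the test share up and gives the remaining rows to training when only `test_size` is passed; the helper `split_sizes` is a hypothetical name used for illustration and is not part of scikit-learn.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the remainder is used
    for training, which matches the shapes shown in this notebook.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 344 rows in total (240 + 104, from the shapes printed above)
print(split_sizes(344, 0.3))  # (240, 104)
```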


In [10]:
# Convert the pandas DataFrames to TensorFlow datasets, using 'species' as the label column
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label='species')
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label='species')

In [15]:
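# Train a random forest with the default hyperparameters
tf_model = tfdf.keras.RandomForestModel()
tf_model.compile(metrics=["accuracy"])
# sys_pipes() forwards the C++ training logs to the notebook output
with sys_pipes():
  tf_model.fit(x=train_ds)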



[INFO kernel.cc:746] Start Yggdrasil model training
[INFO kernel.cc:747] Collect training examples
[INFO kernel.cc:392] Number of batches: 4
[INFO kernel.cc:393] Number of examples: 240
[INFO kernel.cc:769] Dataset:
Number of records: 240
Number of columns: 8

Number of columns by type:
	NUMERICAL: 5 (62.5%)
	CATEGORICAL: 3 (37.5%)

Columns:

NUMERICAL: 5 (62.5%)
	0: "bill_depth_mm" NUMERICAL num-nas:2 (0.833333%) mean:17.1819 min:13.1 max:21.5 sd:1.97381
	1: "bill_length_mm" NUMERICAL num-nas:2 (0.833333%) mean:43.7042 min:32.1 max:59.6 sd:5.25222
	2: "body_mass_g" NUMERICAL num-nas:2 (0.833333%) mean:4149.47 min:2700 max:6050 sd:768.014
	3: "flipper_length_mm" NUMERICAL num-nas:2 (0.833333%) mean:199.769 min:172 max:230 sd:13.5684
	6: "year" NUMERICAL mean:2008.01 min:2007 max:2009 sd:0.821542

CATEGORICAL: 3 (37.5%)
	4: "island" CATEGORICAL has-dict vocab-size:4 zero-ood-items most-frequent:"Biscoe" 112 (46.6667%)
	5: "sex" CATEGORICAL num-nas:9 (3.75%) has-dict vocab-size:3 zero-oo

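The percentages in the training log are counts divided by the number of training examples. A small sketch of that arithmetic, using only numbers copied from the log above (the helper `pct` is a hypothetical name):

```python
n_examples = 240  # "Number of examples" in the log


def pct(count: int, total: int = n_examples) -> float:
    """Express a count as a percentage of the training examples."""
    return 100.0 * count / total


print(f"{pct(2):.6f}")    # 0.833333 -> num-nas:2 on the numerical columns
print(f"{pct(9):.2f}")    # 3.75     -> num-nas:9 on "sex"
print(f"{pct(112):.4f}")  # 46.6667  -> "Biscoe" as the most frequent island
```

Read this way, only 2 of the 240 training rows lack the numerical measurements and 9 lack a value for `sex`.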
In [18]:
# Plot the first tree of the forest, limited to the top three levels
tfdf.model_plotter.plot_model_in_colab(tf_model, tree_idx=0, max_depth=3)

In [19]:
# Evaluate on the held-out test set
tf_model.evaluate(test_ds)



[0.0, 0.9807692170143127]
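The returned list can be read as `[loss, accuracy]`. The loss shows as 0.0, presumably because `compile` was called without a loss, so the second entry is the informative one. As a rough sketch, the accuracy can be converted back into a count of correct predictions on the 104 test rows:

```python
# Values copied from the outputs above.
accuracy = 0.9807692170143127  # second entry of the evaluate() result
n_test = 104                   # rows in test_df

n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct
print(n_correct, n_wrong)  # 102 2

# Sanity check: the recovered count reproduces the reported accuracy
# up to single-precision rounding (assumed to be how the metric is stored).
assert abs(n_correct / n_test - accuracy) < 1e-6
```

For this particular split, the forest misclassified 2 of the 104 held-out penguins. Because the split is not seeded, the figure will vary between runs.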

In [20]:
# List the input features used by the trained model
tf_model.make_inspector().features()

["bill_depth_mm" (1; #0),
 "bill_length_mm" (1; #1),
 "body_mass_g" (1; #2),
 "flipper_length_mm" (1; #3),
 "island" (4; #4),
 "sex" (4; #5),
 "year" (1; #6)]

In [22]:
# Show the variable importances, one ranking per importance measure
tf_model.make_inspector().variable_importances()

{'MEAN_MIN_DEPTH': [("__LABEL" (4; #7), 3.227749740999732),
  ("year" (1; #6), 3.20032723757723),
  ("sex" (4; #5), 3.196780811780803),
  ("body_mass_g" (1; #2), 2.8201488511488466),
  ("island" (4; #4), 2.2808163595663595),
  ("bill_depth_mm" (1; #0), 2.2031718929218926),
  ("flipper_length_mm" (1; #3), 1.4540613923113925),
  ("bill_length_mm" (1; #1), 1.0364080919080925)],
 'NUM_AS_ROOT': [("flipper_length_mm" (1; #3), 140.0),
  ("bill_length_mm" (1; #1), 111.0),
  ("bill_depth_mm" (1; #0), 41.0),
  ("island" (4; #4), 5.0),
  ("body_mass_g" (1; #2), 3.0)],
 'NUM_NODES': [("bill_length_mm" (1; #1), 704.0),
  ("bill_depth_mm" (1; #0), 439.0),
  ("flipper_length_mm" (1; #3), 340.0),
  ("island" (4; #4), 293.0),
  ("body_mass_g" (1; #2), 249.0),
  ("year" (1; #6), 24.0),
  ("sex" (4; #5), 22.0)],
 'SUM_SCORE': [("bill_length_mm" (1; #1), 28200.90860919468),
  ("flipper_length_mm" (1; #3), 21470.161531707272),
  ("island" (4; #4), 10846.209697145969),
  ("bill_depth_mm" (1; #0), 9357.9196