<a href="https://colab.research.google.com/github/chinge55/tf-decision-forest/blob/main/TFDF_Pokemon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Tensorflow recently announced Decision forests. So, I am testing out decision forests and comparing them with a basic neural network on a structured classification task(that's what decision trees are mostly used for).*
For this some problem is required. 

So, as a pokemon fan!
### Given the stats of the pokemon, could we know that they are legendary? 



The dataset required for this project.
Cloning from a github repo because when I hit enter, everything should run!

In [1]:
!git clone https://github.com/KeithGalli/pandas

Cloning into 'pandas'...
remote: Enumerating objects: 22, done.[K
remote: Total 22 (delta 0), reused 0 (delta 0), pack-reused 22[K
Unpacking objects: 100% (22/22), done.


In [2]:
# Install TensorFlow Decision Forests
!pip install tensorflow_decision_forests

# Load TensorFlow Decision Forests
import tensorflow_decision_forests as tfdf
from sklearn.model_selection import train_test_split

Collecting tensorflow_decision_forests
[?25l  Downloading https://files.pythonhosted.org/packages/ef/1e/7d2c17018512c8e6d583ccea433eecd5e93c18c7e273bc896a857789a098/tensorflow_decision_forests-0.1.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.2MB)
[K     |████████████████████████████████| 6.2MB 5.3MB/s 
Installing collected packages: tensorflow-decision-forests
Successfully installed tensorflow-decision-forests-0.1.5


In [3]:
import pandas as pd
df = pd.read_csv('pandas/pokemon_data.csv')

In [4]:
df.head(5)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False


For our classification task, we don't need the following columns.


In [5]:
new_df = df.drop(['Type 1', 'Type 2', 'Generation', 'Name','#'], axis = 1)
new_df.tail()

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Legendary
795,50,100,150,100,150,50,True
796,50,160,110,160,110,110,True
797,80,110,60,150,130,70,True
798,80,160,60,170,130,80,True
799,80,110,120,130,90,70,True


In [6]:
new_df.describe()

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
count,800.0,800.0,800.0,800.0,800.0,800.0
mean,69.25875,79.00125,73.8425,72.82,71.9025,68.2775
std,25.534669,32.457366,31.183501,32.722294,27.828916,29.060474
min,1.0,5.0,5.0,10.0,20.0,5.0
25%,50.0,55.0,50.0,49.75,50.0,45.0
50%,65.0,75.0,70.0,65.0,70.0,65.0
75%,80.0,100.0,90.0,95.0,90.0,90.0
max,255.0,190.0,230.0,194.0,230.0,180.0


### But, could we have not just taken a sum of all the stats? 
Well, yes but that's like taking a mean of something to describe the entire data. And that can be misleading.

In [7]:
new_df.columns

Index(['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Legendary'], dtype='object')

Division of the pokemons into two sets:
1. Train dataset(0.8 percent of the data)
2. Test dataset(0.2 percent of the data)

In [8]:
train_df, test_df = train_test_split(new_df, test_size = 0.2)

## The api below seems pretty easy to use. 
From the pandas dataset, choose one row as label(categorical) and you're done.

In [9]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="Legendary")

# Train the model
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)



<tensorflow.python.keras.callbacks.History at 0x7f27d393f810>

It has got an accuracy of 0.95. And just so we know, there are some very weak legendary pokemons. So in this case, I would not like accuracy of like 99 percent. Even 95 percent has gotten me asking questions. 

In [10]:
# Convert it to a TensorFlow dataset
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="Legendary")

# Evaluate the model
model.compile(metrics=["accuracy"])
print(model.evaluate(test_ds))

[0.0, 0.949999988079071]


In [None]:
tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0)


## If I were to try the same thing using a neural network. 

In [11]:
import tensorflow as tf

In [12]:
from tensorflow.keras import layers
from tensorflow.keras import models

In [13]:
train_df.head()

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Legendary
657,70,77,60,97,60,108,False
205,55,70,55,40,55,85,False
395,50,50,50,50,50,50,False
786,75,95,122,58,75,69,False
98,50,95,180,85,45,70,False


In the case of neural networks, I first need to separate target class with the dataset that we want. In this case trying to know if a pokemon is legendary or not. 

In [16]:
target = train_df.pop('Legendary')
test_target = test_df.pop('Legendary')

Creation of dataset. 

In [17]:
dataset = tf.data.Dataset.from_tensor_slices((train_df.values, target.values))
test_dataset = tf.data.Dataset.from_tensor_slices((test_df.values, test_target.values))

In [18]:
train_dataset = dataset.shuffle(len(train_df)).batch(1)
test_dataset = dataset.shuffle(len(train_df)).batch(1)

In [19]:
def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

  model.compile(optimizer='adam',
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=['accuracy'])
  return model

# The accuracy of both Decision Forest and Neural network is comparable. 
BUT!

In case of neural networks, I had to decide. 
1. What kind of layer am I going to use?
2. What kind of activation function?
3. What kind of loss?
4. What kind of optimization algorithm?
5. How many epochs?

On the other hand, for decision forests:
1. FIT
That's it!

In [20]:
model = get_compiled_model()
model.fit(train_dataset, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7f27d32166d0>

In [21]:
model.evaluate(test_dataset)



[0.2896032929420471, 0.9203125238418579]

## Lastly
Decision forests are fast and easy to use and for some problems they are going to be a better choice. 