## STEP 1: Set up the environment 👩🏾‍💻✨


### Install the TensorFlow Decision Forests library (TF-DF) 🌲🌲🌲📥

In [None]:
!pip install -q tensorflow_decision_forests

### Import libraries

In [None]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

## STEP 2: Get Data 🐧🐧🐧📊

### Load the dataset... and convert it into a tf.data.Dataset!

In [None]:
# Download the dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

# Load a dataset into a Pandas Dataframe.
df = pd.read_csv("/tmp/penguins.csv")

# Display the first 5 examples.
df.head()

In [None]:
#this returns the entireeeeee dataframe!
df

### Explore your data 🧭✨🕵🏻‍♀️

In [None]:
# Let's group the dataframe by the "species" column!
# count() reports how many non-null values each column has for each species
# (e.g., 'Adelie penguins with a bill length recorded' or 'Chinstrap penguins with a body mass recorded').
# Hint: the counts in a row should all match unless some values are null. Can you spot where they DON'T match?
df.groupby(['species']).count()
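
To see why the counts can differ between columns, here is a minimal sketch on a made-up three-row table (the `toy_*` names and values are illustrative assumptions, not part of the penguins data). It shows that `count()` skips missing entries, and that `isna().sum()` is another way to find them.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: one Adelie row is missing its body mass.
toy_counts_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 39.5, 46.1],
    "body_mass_g": [3750.0, np.nan, 4500.0],
})

# count() tallies the non-null entries per column within each group.
toy_counts = toy_counts_df.groupby("species").count()
print(toy_counts)

# isna().sum() reports the number of missing entries per column directly.
toy_missing = toy_counts_df.isna().sum()
print(toy_missing)
```

In the Adelie row of the output, `body_mass_g` has a lower count than `bill_length_mm`, which is the same pattern to look for in the real table.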

In [None]:
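# Here's another groupby, this time taking the mean of each column - but some of our columns are missing!
# Columns like "island" are left out because they are not numeric - they are strings (aka words).
# numeric_only=True skips them explicitly; recent pandas versions raise an error without it.
df.groupby(['species']).mean(numeric_only=True)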

In [None]:
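# Sometimes it's hard to just read a table, so let's look at a graph of the average values...
# numeric_only=True keeps only the numeric columns, which are the ones we can average and plot.
df.groupby('species').mean(numeric_only=True).plot(kind='bar')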

In [None]:
# Let's look at the FIRST row (row 0) of the dataframe - what is in that row?
df.iloc[0]
# The output reads as two columns:
# column name    value

In [None]:
#Let's look at all of the different variables
#this will let us know if there are any NaN (null/missing) values!
df.info()

### Data cleaning 🧹✨

In [None]:
# Drop every row that has at least one missing (NaN) value.
df_clean = df.dropna()

In [None]:
# Remove the "year" column from the dataframe.
df_clean = df_clean.drop(columns=['year'])

In [None]:
df_clean.head()

In [None]:
df_clean.info()

In [None]:
# since we only lost 11 "null" rows, let's make our df_clean be our main df!
df = df_clean
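
Before replacing the original dataframe, it can help to check how many rows `dropna()` removes. The sketch below does this on a made-up four-row table; the `toy_*` names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: two of the four rows contain a missing value.
toy_raw = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "body_mass_g": [3750.0, np.nan, 3800.0, 3700.0],
    "sex": ["male", "female", None, "female"],
})

# dropna() removes any row with at least one missing value.
toy_clean = toy_raw.dropna()

# Compare the row counts to see how much data the cleaning step cost.
rows_lost = len(toy_raw) - len(toy_clean)
print(f"{rows_lost} of {len(toy_raw)} rows dropped")
```

If the number of dropped rows is small relative to the dataset, as it is for the penguins, dropping them is a reasonable first choice.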

In [None]:
# Encode the categorical labels as integers.

# Details:
# This stage is necessary if your classification label is represented as a
# string since Keras expects integer classification labels.
# When using `pd_dataframe_to_tf_dataset` (see below), this step can be skipped.

# Name of the label column.
label = "species"

classes = df[label].unique().tolist()
print(f"Label classes: {classes}")

df[label] = df[label].map(classes.index)
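
As a small illustration of the encoding step, the sketch below applies the same `unique()` and `map(classes.index)` pattern to a made-up list of species names, then maps the integers back to names. The `toy_*` variables are hypothetical and kept separate from the notebook's own `classes` and `df`.

```python
import pandas as pd

# Made-up labels for illustration.
toy_species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap"])

# unique() keeps the order in which each class first appears.
toy_classes = toy_species.unique().tolist()

# Each name is replaced by its position in the class list.
toy_encoded = toy_species.map(toy_classes.index)

# Indexing back into the class list recovers the original names.
toy_decoded = toy_encoded.map(lambda i: toy_classes[i])

print(toy_classes)
print(toy_encoded.tolist())
print(toy_decoded.tolist())
```

Keeping the class list around matters: it is the only record of which integer stands for which species.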

### Training vs Testing Data 🚂🆚🧪

In [None]:
# Split the dataset into a training and a testing dataset.

def split_dataset(dataset, test_ratio=0.30):
  """Splits a pandas dataframe in two."""
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]


train_ds_pd, test_ds_pd = split_dataset(df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))
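
The split above draws fresh random numbers on every run, so the training and testing sets change each time. One possible variant, sketched here under the assumption that a repeatable split is wanted, uses a seeded NumPy generator. The function name `split_dataset_seeded` and its `seed` parameter are hypothetical additions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_dataset_seeded(dataset, test_ratio=0.30, seed=42):
    """Splits a pandas dataframe in two, repeatably for a given seed."""
    rng = np.random.default_rng(seed)
    # Each row lands in the test set with probability test_ratio.
    test_mask = rng.random(len(dataset)) < test_ratio
    return dataset[~test_mask], dataset[test_mask]

# Made-up data for illustration: 100 rows with a single column.
toy_split_df = pd.DataFrame({"x": range(100)})

train_a, test_a = split_dataset_seeded(toy_split_df)
train_b, test_b = split_dataset_seeded(toy_split_df)

# Same seed, same rows in each split.
print(test_a.index.equals(test_b.index))
print(len(train_a), len(test_a))
```

Because each row is assigned independently, the test set is only approximately 30% of the data, just as in the original function.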

In [None]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)

## STEP 3: Train the model 🏋🏻‍♀️💪🤖

In [None]:
# Specify the model.
# verbose controls how much detail is printed during training.
# random_seed allows your results to be reproducible
model_1 = tfdf.keras.RandomForestModel(verbose=1, random_seed=42)

# Train the model.
model_1.fit(x=train_ds)

## STEP 4: Evaluate your model 🕵🏾‍♀️🐧❓📈


### Evaluate the model with the Testing Data 🆕🐧

In [None]:
model_1.compile(metrics=["accuracy"])
evaluation = model_1.evaluate(test_ds, return_dict=True)
print()

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")
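
For intuition about the accuracy metric, here is a minimal sketch that computes it by hand: the fraction of predictions that match the true labels. The label arrays are made up for illustration and are not model output.

```python
import numpy as np

# Made-up integer labels for illustration (0, 1, 2 stand for three classes).
toy_true = np.array([0, 1, 2, 1, 0])
toy_pred = np.array([0, 1, 1, 1, 0])

# Accuracy is the share of positions where prediction and truth agree.
toy_accuracy = float(np.mean(toy_true == toy_pred))
print(f"accuracy: {toy_accuracy:.4f}")  # 4 of 5 correct -> 0.8000
```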

### Plot the model 📊📈👀

In [None]:
# Remember, these are the three "classes" of our penguins!
print(f"Label classes: {classes}")

In [None]:
tfdf.model_plotter.plot_model_in_colab(model_1, tree_idx=200, max_depth=5)

### Plotting the training logs 📉📈

In [None]:
import matplotlib.pyplot as plt

logs = model_1.make_inspector().training_logs()

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")

plt.subplot(1, 2, 2)
plt.plot([log.num_trees for log in logs], [log.evaluation.loss for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Logloss (out-of-bag)")

plt.show()
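
The axis labels above mention "out-of-bag" evaluation. As a rough sketch of the idea, assuming the usual random forest setup in which each tree trains on a bootstrap sample (rows drawn with replacement), the rows a tree never draws are its out-of-bag rows and can be used to evaluate it. The simulation below uses made-up sizes and only NumPy; it does not call TF-DF.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000  # made-up dataset size for illustration

# A bootstrap sample draws n_rows row indices with replacement.
drawn = rng.integers(0, n_rows, size=n_rows)

# Mark which rows were drawn at least once.
in_bag = np.zeros(n_rows, dtype=bool)
in_bag[drawn] = True

# Rows never drawn are "out-of-bag" for this tree.
oob_fraction = float(1 - in_bag.mean())
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

For large datasets this fraction tends toward roughly 37% (1/e), so every tree has a sizeable set of unseen rows to be scored on.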

### Let's use TensorBoard! 🏄✨

In [None]:
# This cell loads TensorBoard, which can be slow.
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
# Clear existing results (if any)
!rm -fr "/tmp/tensorboard_logs"

In [None]:
# Export the metadata to TensorBoard.
model_1.make_inspector().export_to_tensorboard("/tmp/tensorboard_logs")

In [None]:
# docs_infra: no_execute
# Start a tensorboard instance.
%tensorboard --logdir "/tmp/tensorboard_logs"

## STEP 5: What's next? 🐧🐧🐧🏆✨

### Re-train the model with a different learning algorithm

In [None]:
tfdf.keras.get_all_models()

### Using a subset of features

In [None]:
feature_1 = tfdf.keras.FeatureUsage(name="bill_length_mm")
feature_2 = tfdf.keras.FeatureUsage(name="island")

all_features = [feature_1, feature_2]

# Note: This model is only trained with two features. It will not be as good as
# the one trained on all features.

model_2 = tfdf.keras.GradientBoostedTreesModel(
    features=all_features, exclude_non_specified_features=True)

model_2.compile(metrics=["accuracy"])
model_2.fit(x=train_ds, validation_data=test_ds)

print(model_2.evaluate(test_ds, return_dict=True))

## ...but wait, there's more!!

The fun doesn't have to stop here! There are a TON of free, online resources to help you learn ML and get started with Generative AI!

🆓 Learning Resources:
* [DeepLearning.ai](https://www.deeplearning.ai/short-courses/): a series of 1-hour short courses for learning generative AI.
* [Introduction to Generative AI](https://www.cloudskillsboost.google/journeys/118): earn badges while following a learning path with videos and exercises.
* [ML Crash Course](https://developers.google.com/machine-learning/crash-course): Google's fast-paced, practical introduction to machine learning.
* [fast.ai](https://www.fast.ai/): if you already have some coding background, this is a practical guide for diving into deep learning.
* [Kaggle Competitions](https://www.kaggle.com/competitions): if you're ready to dive in and start coding, check out the "competitions" on Kaggle! It's a great way to apply what you've learned alongside a community of other learners.
* [SimpleML in Google Sheets](https://simplemlforsheets.com/tutorial.html): a no-code way to get started with ML.
* [Made with TFJS](https://goo.gle/made-with-tfjs): a YouTube series that highlights awesome projects made for the web!

### Thank you 💖✨✨

Congratulations on starting your ML journey with TF! We're excited to have you here!

Questions? Comments? Ideas? Inspirations?

For the full notebook with all the comments, see https://github.com/ccstan99/introML