<a href="https://colab.research.google.com/github/ccstan99/introML/blob/main/WTM23_introML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
##### Copyright 2022 The TensorFlow Authors.

This was adapted from [bit.ly/WiML22_introML](https://bit.ly/WiML22_introML)
by [Michelle Carney🐍](https://twitter.com/michellercarney) and [Soo Sung🐼](https://www.linkedin.com/in/soo-sung-98180a15/)!

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Welcome to Intro to ML for everyone! 🤖📊✨

We're your hosts, [ChengCheng Tan](https://www.linkedin.com/in/cheng2-tan/) and [Philippa Burgess](https://www.linkedin.com/in/philippaburgess/)!

Our goal is to share how easy it is to get started with ML. Thanks to awesome resources online and the amazing advancement in tools, anyone can get started with ML!

<center><img src="https://github.com/ccstan99/introML/blob/main/images/welcome.png?raw=true" width="100%"/></center>

### Wait... what is this tool we're using?! in the browser?! to CODE?!

For this workshop, we are using the coding language [Python](https://www.python.org/), one of the most common ML coding languages.

You can run the code yourself because we're using [Colab notebooks](https://research.google.com/colaboratory/faq.html), which are kinda like Google Docs for Python. Colab is based on the open source project [Jupyter](https://jupyter.org/), formerly known as IPython notebooks (notice how this file type is ipynb? now you know!).

We won't go into the details, but essentially colab makes it really easy to run and share our code in our browsers without needing to code in terminal or using a text editor, which you might use when making a website or coding with html/css and javascript.

### Intro to Machine Learning - Agenda 🤖

Here's what we will cover today:

1.   What is ML? 🤔💭🤖
2.   Data prep 🐧🐧🐧
3.   Training your model 🏋🏻‍♀️💪
4.   Model evaluation 📊🔍👀
5.   Next steps... and more! 🏆✨

In this notebook, we'll walk through a high level overview of how ML models *generally* work - esp. what are the different steps we take in order to build a model. This lays the foundation for talking about ML, key terms, and how to help you learn!

Hopefully at the end, you are able to better understand how models are made, and the vocabulary used in ML and AI, and have a little more confidence to start your ML journey 🏆




### What is machine learning? 🤔💭 🤖




Machine learning is the process of teaching a computer to learn patterns from data and then to apply those patterns to make predictions on new data. In traditional programming, you write rules to tell the computer exactly what to do. For example, if you want to write a program that converts temperature from Celsius to Fahrenheit, you would write a function that computes the following equation:

<center>
<img src="https://github.com/ccstan99/introML/blob/main/images/io_programming.png?raw=true" width="40%"/>
<img src="https://github.com/ccstan99/introML/blob/main/images/io_ML.png?raw=true" width="40%"/>
</center>

But in ML, instead of writing the rule, you give the computer lots of examples of input data along with the desired output, say pairs of Celsius and Fahrenheit values, and then let the computer learn the rule itself.
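
To make the contrast concrete, here is a minimal sketch in plain Python. The example temperatures are made up, and the "learning" step is a simple least-squares line fit chosen for illustration; it is not the algorithm used later in this notebook.

```python
# Traditional programming: we write the rule ourselves.
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

# Machine learning: we only provide example (input, output) pairs...
celsius_examples = [-40.0, -10.0, 0.0, 8.0, 15.0, 22.0, 38.0]
fahrenheit_examples = [celsius_to_fahrenheit(c) for c in celsius_examples]

# ...and let the computer estimate the rule. Here we fit a straight line
# (fahrenheit = slope * celsius + intercept) with ordinary least squares.
n = len(celsius_examples)
mean_c = sum(celsius_examples) / n
mean_f = sum(fahrenheit_examples) / n
covariance = sum(
    (c - mean_c) * (f - mean_f)
    for c, f in zip(celsius_examples, fahrenheit_examples)
)
variance = sum((c - mean_c) ** 2 for c in celsius_examples)
slope = covariance / variance
intercept = mean_f - slope * mean_c

print(f"Learned rule: fahrenheit = {slope:.2f} * celsius + {intercept:.2f}")
```

The learned slope and intercept should land very close to the 9/5 and 32 we typed by hand in the first function, even though the fitting code never saw that formula.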

This particular case is very simple but imagine if you had to write a program to recognize cat images or generate languages. You'd have to write many rules. You can see how the list of rules and their exceptions could grow very quickly.

ML is ideal for these types of problems, where you have lots of data that have complex relationships that would be very difficult for humans to manually create rules for.

* Image and video recognition
* Language generation
* Recommendations
* etc

If you're interested in a more in depth look at these concepts - check out the [Intro to ML](https://developers.google.com/machine-learning/intro-to-ml) course on the Google developer site!



### ML Model and Dataset 📊



Now that you've seen the difference between ML and traditional programming, let's get started on programming, or training, our own model!

For this workshop, we'll be using the [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) dataset to train a decision forest model to predict the penguin species.

<center><img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" width="100%"/></center>

A decision tree is an algorithm that learns by splitting the data into smaller and smaller subsets. The splits are determined by how well each subset helps to predict the desired outcome. Here's a simple decision tree for predicting different types of penguins! Let's presume class 0 is Adelie, class 1 is Gentoo, and class 2 is Chinstrap.

<center><img src="https://github.com/ccstan99/introML/blob/main/images/decision_tree.png?raw=true" width="50%"/></center>

Decision Forests (`DF`) are a large family of Machine Learning algorithms for
supervised classification, regression and ranking. As the name suggests, DFs use
decision trees as a building block.
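
As a rough sketch of the idea, each tree can be thought of as a set of nested if/else rules, and a forest can combine several trees by letting them vote. The three trees, their thresholds, and the `forest_predict` helper below are all invented for illustration; a real model learns its rules from data and may combine trees differently.

```python
from collections import Counter

# Three toy "trees". Each one is just nested if/else rules.
# The features used and the thresholds are invented for illustration.
def tree_a(penguin):
    return "Gentoo" if penguin["flipper_length_mm"] >= 206 else "Adelie"

def tree_b(penguin):
    if penguin["bill_length_mm"] < 43:
        return "Adelie"
    return "Gentoo" if penguin["flipper_length_mm"] >= 206 else "Chinstrap"

def tree_c(penguin):
    if penguin["bill_depth_mm"] < 16:
        return "Gentoo"
    return "Chinstrap" if penguin["bill_length_mm"] >= 44 else "Adelie"

def forest_predict(trees, penguin):
    """Return the species that receives the most votes."""
    votes = Counter(tree(penguin) for tree in trees)
    return votes.most_common(1)[0][0]

penguin = {"flipper_length_mm": 195, "bill_length_mm": 49, "bill_depth_mm": 18}
prediction = forest_predict([tree_a, tree_b, tree_c], penguin)
print(prediction)
```

Here one tree gets it "wrong" on its own, but the other two outvote it. That is the intuition behind using many trees instead of one.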

<center><img src="https://github.com/ccstan99/introML/blob/main/images/decision_forest.png?raw=true" width="100%"/></center>

[TensorFlow Decision Forests (`TF-DF`)](https://www.tensorflow.org/decision_forests) is a library for training Decision Forest models.


## STEP 1: Setup the environment 👩🏾‍💻✨


### Install TensorFlow Decision Forest library (TF-DF) 🌲🌲🌲📥

Run the following cell to install the library.

In [None]:
!pip install -q tensorflow_decision_forests

### Import libraries

Some common libraries we'll be using today are [pandas](https://pandas.pydata.org/), [TensorFlow](https://tensorflow.org/) (sometimes abbreviated ```tf```), [numpy](https://numpy.org/), and there are many more!

Let's load these libraries and then load our data into a pandas [dataframe (`df`)](https://www.geeksforgeeks.org/python-pandas-dataframe/), a table-like format that Python can work with directly (unlike a raw csv file). We can then train a model on this df.

In [None]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math


## STEP 2: Get Data 🐧🐧🐧📊

First things first - **all ML models need DATA!**

It might seem counterintuitive, but most data scientists spend the bulk of their time not on building ML models but on understanding the data and "cleaning" it to make it machine-readable. When training ML models, it's very common to spend ~80% of the time in the Data Preparation phase!

It's important to understand your data because this is the only source your Machine Learning model will learn patterns and trends from - so if it has some weird pattern (half of the values are missing, or in different formats, etc) it will learn those weird patterns.

While in the "Data" Phase, some design questions you might want to consider are:
1. **Where does your data come from? Who collected it? Why did they collect it?**

You can imagine that a zoologist or a penguin enthusiast might both have a database of penguin attributes and species, but they probably use them in different ways. Maybe the zoologist made the database to keep track of the penguin conditions in the zoo, while an enthusiast might be interested in features that identify different types of penguins.

2. **What is the type() of each of the columns (aka "features")? Are they all numbers (integers or "int", "float")? words ("strings")?**

This matters to ML models - we need to ensure that the computer is able to read each of the columns in the way that we're perceiving them (like, if it is a number '5' it should be read as an int, not a string)

3. **What questions can this data answer?**

This is important because we can't train a model on this data and ask questions about things not seen in the data! For instance, if we have a dataset that describes penguins, we can't ask questions about hippos - the model only knows about penguins!
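
For question 2, here is a small sketch of what checking and fixing column types can look like in pandas. The `toy` dataframe is made up for illustration and is not the penguin dataset.

```python
import pandas as pd

# A made-up table where the body mass was accidentally stored as text.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": ["3750", "5000", "3800"],
})

# Inspect how pandas currently reads each column.
print(toy.dtypes)

# Convert the text column into real numbers so we can do math on it.
toy["body_mass_g"] = pd.to_numeric(toy["body_mass_g"])
print(toy.dtypes)
print(toy["body_mass_g"].mean())
```

Before the conversion, asking for the mean of `body_mass_g` would not give a meaningful number, because the computer sees words rather than quantities.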

### Load the dataset ...and convert it in a tf.Dataset!

This dataset is very small (344 examples) and stored as a .csv-like file. Therefore, use Pandas to load it.

**Note:** Pandas is practical as you don't have to type in the names of the input features to load them. For larger datasets (>1M examples), using the
[TensorFlow Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) to read the files may be better suited.

Let's assemble the dataset into a csv file (i.e. add the header), and load it:

In [None]:
# Download the dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

# Load a dataset into a Pandas Dataframe.
df = pd.read_csv("/tmp/penguins.csv")

# Display the first 5 examples.
df.head()

In [None]:
#this returns the entireeeeee dataframe!
df

From this initial look, it doesn't seem like year is going to be that predictive of the species.

### Explore your data 🧭✨🕵🏻‍♀️

Exploring your data, or Exploratory Data Analysis (`EDA`), is an important part of the ML process!

When we show the entire dataframe, **it's too many animals to view all at once, and it's only 344!** (Remember, Python starts counting at 0, so our first animal is numbered 0 and the last one is numbered 343.) Just imagine if we had a dataset of 1000 animals, or 5 million! There's no way we could look at EVERY row!

One thing data scientists do is explore the data in the form of **summary statistics**, which we can use directly on the pandas dataframe.
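
As a quick sketch of what summary statistics look like, pandas can compute several at once with `describe()`. The four rows below are made up for illustration.

```python
import pandas as pd

# A tiny made-up table with two numeric columns.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.9],
    "body_mass_g": [3750, 5000, 3800, 3625],
})

# describe() returns count, mean, std, min, quartiles and max per column.
summary = toy.describe()
print(summary)
```

Instead of reading every row, we read one small table that tells us the typical value and the spread of each column.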

Below is some sample code to help you explore your data, using the function [groupby()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html). You can groupby() on the entire dataframe by a certain feature, like by each species, and do calculations per species.

In [None]:
#Let's group the dataframe by the column "species" of penguins!
# count() gives the number of non-null values in each column, for each species
#(i.e., 'Adelie Penguins...with a bill length recorded,' or 'Chinstrap penguins... with a body mass recorded')
#(hint: they should all be the same unless we have null! Can you spot where they AREN'T the same?)
df.groupby(['species']).count()

In [None]:
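#here's another way to use groupby: the mean of each column - but some of our columns are missing!
# we are missing columns like "island" because they are not numeric - they are strings (aka words)
# numeric_only=True tells pandas to skip those columns (newer pandas versions raise an error without it)
df.groupby(['species']).mean(numeric_only=True)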

In [None]:
#Sometimes it's hard to just read a table, let's look at a graph of the average values...
# (numeric_only=True skips the text columns such as "island")
df.groupby('species').mean(numeric_only=True).plot(kind='bar')

In [None]:
# let's look at the FIRST row (row 0) of the data frame - what is in that row?
df.iloc[0]
#this output reads:
#Colname        Value

In [None]:
#Let's look at all of the different variables
#this will let us know if there are any NaN (null/missing) values!
df.info()

### Data cleaning 🧹✨

The dataset contains a mix of numerical (e.g. `bill_depth_mm`) and categorical (e.g. `island`) features, and some values are missing. TF-DF supports all of these natively (unlike neural-network-based models), so there is no need for preprocessing such as one-hot encoding, normalization, or an extra `is_present` feature. To keep things simple in this workshop, we'll still drop the rows with missing values and the `year` column.
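
Before dropping anything, it helps to know how much data is actually missing. Here is a minimal sketch on a made-up dataframe (not the penguin data), using `isna()` to count missing values and `dropna()` to remove incomplete rows.

```python
import numpy as np
import pandas as pd

# A made-up table where one penguin is missing two measurements.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_depth_mm": [18.7, np.nan, 14.3, 18.4],
    "sex": ["male", None, "female", "female"],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
toy_clean = toy.dropna()
rows_dropped = len(toy) - len(toy_clean)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

If dropping rows would throw away a big share of the data, that is a sign to look for another way to handle the missing values.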



In [None]:
df_clean = df.dropna()

In [None]:
df_clean = df_clean.drop(columns=['year'])

In [None]:
df_clean.head()

In [None]:
df_clean.info()

In [None]:
# since we only lost 11 "null" rows, let's make our df_clean be our main df!
df = df_clean

**Labels are a bit different:** Keras metrics expect integers. The label (`species`) is stored as a string, so let's convert it into an integer to make it machine readable. This will allow us to use `TF-DF` to predict what species it is!

In [None]:
# Encode the categorical labels as integers.

# Details:
# This stage is necessary if your classification label is represented as a
# string since Keras expects integer classification labels.
# When using `pd_dataframe_to_tf_dataset` (see below), this step can be skipped.

# Name of the label column.
label = "species"

classes = df[label].unique().tolist()
print(f"Label classes: {classes}")

df[label] = df[label].map(classes.index)
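
Here is the same encoding trick on a tiny made-up column, as a sketch, along with how to turn the integers back into species names. The names `toy`, `toy_classes` and `species_id` are just for this example.

```python
import pandas as pd

toy = pd.DataFrame({"species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"]})

# unique() keeps the order in which each species first appears.
toy_classes = toy["species"].unique().tolist()

# Each species name is replaced by its position in toy_classes.
toy["species_id"] = toy["species"].map(toy_classes.index)
print(toy)

# To decode, look the integer up in the same list.
decoded = [toy_classes[i] for i in toy["species_id"]]
print(decoded)
```

Keep the list of classes around: it is the only record of which integer means which species.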

### Training vs Testing Data 🚂🆚🧪

In order to train an ML model we need data! However, **we don't want to give 100% of our data to the model to train and learn patterns on**. Instead, we give the model only 80% of the full data set to train on (the "training data") and hold out a "testing data" set from the same data source so we can see how well the model performs.

We hold the testing data out of the same data source because we already know the correct answers for it. That lets us calculate things like accuracy: how well the model predicts the categories of data it has not been trained on!

Typically, we use 80% of our data as training data and 20% as testing data: data the model has NEVER seen before, which we use to evaluate how well the model performs. We want to be sure the data is randomized as well. If, say, the last 20% of the rows were all Adelie penguins and we used those rows as our test data, our training data would not include any Adelie penguins!
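
The sketch below shows why randomizing matters, using a made-up list of 30 labels that happens to be sorted by species. The seed value is arbitrary.

```python
import random

# Made-up labels, sorted by species (the last 20% are all Adelie).
labels = ["Gentoo"] * 12 + ["Chinstrap"] * 12 + ["Adelie"] * 6
cut = len(labels) * 8 // 10  # index where the 80% training part ends

# Naive split: just cut the sorted list.
naive_train, naive_test = labels[:cut], labels[cut:]
print("naive train species:", sorted(set(naive_train)))
print("naive test species: ", sorted(set(naive_test)))

# Randomized split: shuffle a copy first, then cut.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
shuffled_train, shuffled_test = shuffled[:cut], shuffled[cut:]
print("shuffled train species:", sorted(set(shuffled_train)))
```

With the naive split, the model would never see an Adelie penguin during training and would then be tested only on Adelie penguins. After shuffling, the training set very likely contains every species.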

<img src='https://drive.google.com/file/d/1ZK3t60NOPiHSNBeqcT6jauO0ycabkXTi/view?usp=sharing'>


In [None]:
# Split the dataset into a training and a testing dataset.

def split_dataset(dataset, test_ratio=0.20):
  """Splits a pandas DataFrame in two, picking test rows at random."""
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]


train_ds_pd, test_ds_pd = split_dataset(df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))

And finally, convert the pandas dataframes (`pd.DataFrame`) into TensorFlow datasets (`tf.data.Dataset`):

In [None]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)

**Notes:** Recall that `pd_dataframe_to_tf_dataset` converts string labels to integers if necessary.

If you want to create the `tf.data.Dataset` yourself, there are a couple of things to remember:

- The learning algorithms work with a one-epoch dataset and without shuffling.
- The batch size does not impact the training algorithm, but a small value might slow down reading the dataset.


## STEP 3: Train the model 🏋🏻‍♀️💪🤖

When we say "train the model" it isn't really like we're writing the code ourselves to get the math equation. TensorFlow, Keras, and other libraries make it REALLY EASY to "train models" because they have functions already written to help us easily find those patterns and get the models! Let's run the one below to make a RandomForest!

In [None]:
# Specify the model.
# verbose is just an argument about how long you want the output to be!
# random_seed allows your results to be reproducible
model_1 = tfdf.keras.RandomForestModel(verbose=1, random_seed=42)

# Train the model.
model_1.fit(x=train_ds)

Remarks

-   No input features are specified. Therefore, all the columns will be used as
    input features except for the label. The features used by the model are shown
    in the training logs and in the `model.summary()`.
-   DFs consume natively numerical, categorical, categorical-set features and
    missing-values. Numerical features do not need to be normalized. Categorical
    string values do not need to be encoded in a dictionary.
-   No training hyper-parameters are specified. Therefore the default
    hyper-parameters will be used. Default hyper-parameters provide
    reasonable results in most situations.
-   Calling `compile` on the model before the `fit` is optional. Compile can be
    used to provide extra evaluation metrics.
-   Training algorithms do not need validation datasets. If a validation dataset
    is provided, it will only be used to show metrics.
-   Tweak the `verbose` argument to `RandomForestModel` to control the amount of
    displayed training logs. Set `verbose=0` to hide most of the logs. Set
    `verbose=2` to show all the logs.

**Note:** A *Categorical-Set* feature is composed of a set of categorical values (while a *Categorical* is only one value). More details and examples are given later.

## STEP 4: Evaluate your model 🕵🏾‍♀️🐧❓📈


### Evaluate the model with the Testing Data 🆕🐧

We are now ready to see how this model performs! We call this evaluating the model. Every type of model has a [different way it can be evaluated](https://keras.io/api/metrics/), but this Decision Forest can be evaluated by this [sample code from the tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab) :)

Let's evaluate our model on the Testing Data (the 20% we DID NOT train the model on)
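
The metric we'll look at is accuracy, which is just the fraction of examples the model got right. Here is a hand-computed sketch with made-up labels and predictions (they are not output from our model).

```python
# Made-up correct answers and model guesses, encoded as class integers.
true_labels = [0, 1, 2, 2, 0, 1, 0, 2]
predicted_labels = [0, 1, 2, 1, 0, 1, 0, 0]

# Count how many guesses match the correct answer.
correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)

print(f"{correct} of {len(true_labels)} correct, accuracy = {accuracy:.4f}")
```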

In [None]:
model_1.compile(metrics=["accuracy"])
evaluation = model_1.evaluate(test_ds, return_dict=True)
print()

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")

### Plot the model 📊📈👀

Plotting a decision tree and following the first branches helps learning about decision forests. In some cases, plotting a model can even be used for debugging.

Because of the difference in the way they are trained, some models are more interesting to plot than others. Because of the noise injected during training and the depth of the trees, plotting Random Forest is less informative than plotting a CART or the first tree of a Gradient Boosted Tree.

Nevertheless, let's plot the first tree of our Random Forest model:

In [None]:
#remember, these are the three "classes" of our Penguins!
print(f"Label classes: {classes}")

In [None]:
# tree_idx=0 selects the first tree in the forest
tfdf.model_plotter.plot_model_in_colab(model_1, tree_idx=0, max_depth=5)

The root node on the left contains the first condition, number of examples and label distribution (the red-blue-green bar).

Examples that evaluate true to the first condition are branched to the green path. The other ones are branched to the red path.

The deeper the node, the more `pure` it becomes, i.e. the label distribution is biased toward a subset of classes.
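
One common way to put a number on purity is Gini impurity, where 0 means a node contains a single class. The sketch below uses made-up label counts, and Gini is an assumption chosen for illustration; it is not necessarily the score TF-DF uses internally.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_class ** 2); 0.0 means the node is perfectly pure."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

# Made-up nodes, from the root down to a leaf.
root = ["Adelie"] * 4 + ["Gentoo"] * 4 + ["Chinstrap"] * 4
child = ["Adelie"] * 4 + ["Chinstrap"] * 2
leaf = ["Adelie"] * 4

for name, node in [("root", root), ("child", child), ("leaf", leaf)]:
    print(f"{name}: impurity = {gini_impurity(node):.3f}")
```

The impurity shrinks as we move from the root toward the leaf, which matches what the colored bars in the plot show.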

**Note:** Hover the mouse on top of the plot for details.

### Plotting the training logs 📉📈

The training logs show the quality of the model (e.g. accuracy evaluated on the out-of-bag or validation dataset) according to the number of trees in the model. These logs are helpful to study the balance between model size and model quality.
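
"Out-of-bag" needs a quick explanation. In a Random Forest, each tree is typically trained on a bootstrap sample, meaning rows drawn at random with replacement. The rows a tree never drew are its out-of-bag examples, and they can be used to evaluate it. The sketch below simulates one such sample with made-up row numbers; the sample size and seed are arbitrary.

```python
import random

rng = random.Random(42)
num_examples = 1000

# Draw a bootstrap sample: pick row numbers at random, with replacement.
bootstrap_sample = [rng.randrange(num_examples) for _ in range(num_examples)]

# Rows that were never picked are "out-of-bag" for this tree.
in_bag = set(bootstrap_sample)
out_of_bag = [i for i in range(num_examples) if i not in in_bag]
oob_fraction = len(out_of_bag) / num_examples

print(f"{len(out_of_bag)} rows are out-of-bag ({oob_fraction:.1%})")
```

Roughly a third of the rows end up out-of-bag for any given tree, so the forest gets a built-in test set without us holding out extra data.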

The logs are available in multiple ways:

1. Displayed during training if `fit()` is wrapped in `with sys_pipes():`.
1. At the end of the model summary i.e. `model.summary()`.
1. Programmatically, using the model inspector i.e. `model.make_inspector().training_logs()`.
1. Via TensorBoard.

Let's plot it using the `matplotlib` library!

In [None]:
import matplotlib.pyplot as plt

logs = model_1.make_inspector().training_logs()

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")

plt.subplot(1, 2, 2)
plt.plot([log.num_trees for log in logs], [log.evaluation.loss for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Logloss (out-of-bag)")

plt.show()

This dataset is small. You can see the model converging almost immediately.

### Let's use TensorBoard! 🏄✨

In [None]:
# This cell starts TensorBoard, which can be slow.
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
# Clear existing results (if any)
!rm -fr "/tmp/tensorboard_logs"

In [None]:
# Export the meta-data to tensorboard.
model_1.make_inspector().export_to_tensorboard("/tmp/tensorboard_logs")

In [None]:
# docs_infra: no_execute
# Start a tensorboard instance.
%tensorboard --logdir "/tmp/tensorboard_logs"

## STEP 5: What's next? 🐧🐧🐧🏆✨

### Re-train the model with a different learning algorithm

The learning algorithm is defined by the model class. For
example, `tfdf.keras.RandomForestModel()` trains a Random Forest, while
`tfdf.keras.GradientBoostedTreesModel()` trains a Gradient Boosted Decision
Trees model.

The learning algorithms are listed by calling `tfdf.keras.get_all_models()` or in the
[learner list](https://github.com/google/yggdrasil-decision-forests/blob/main/documentation/learners.md).

In [None]:
tfdf.keras.get_all_models()

The descriptions of the learning algorithms and their hyper-parameters are also available in the [API reference](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf) and the built-in help.

### Using a subset of features

The previous example did not specify the features, so all the columns were used
as input features (except for the label). The following example shows how to
specify input features.

In [None]:
feature_1 = tfdf.keras.FeatureUsage(name="bill_length_mm")
feature_2 = tfdf.keras.FeatureUsage(name="island")

all_features = [feature_1, feature_2]

# Note: This model is only trained with two features. It will not be as good as
# the one trained on all features.

model_2 = tfdf.keras.GradientBoostedTreesModel(
    features=all_features, exclude_non_specified_features=True)

model_2.compile(metrics=["accuracy"])
model_2.fit(x=train_ds, validation_data=test_ds)

print(model_2.evaluate(test_ds, return_dict=True))

**Note:** As expected, the accuracy is lower than previously.

**TF-DF** attaches a **semantics** to each feature. This semantics controls how
the feature is used by the model. The following semantics are currently supported:

-   **Numerical**: Generally for quantities or counts with full ordering. For
    example, the age of a person, or the number of items in a bag. Can be a
    float or an integer. Missing values are represented with `float("nan")` or with
    an empty sparse tensor.
-   **Categorical**: Generally for a type/class in a finite set of possible values
    without ordering. For example, the color RED in the set {RED, BLUE, GREEN}.
    Can be a string or an integer. Missing values are represented as "" (empty
    string), value -2 or with an empty sparse tensor.
-   **Categorical-Set**: A set of categorical values. Great to represent
    tokenized text. Can be a string or an integer in a sparse tensor or a
    ragged tensor (recommended). The order/index of each item doesn't matter.

If not specified, the semantics is inferred from the representation type and shown in the training logs:

- int, float (dense or sparse) → Numerical semantics.
- str (dense or sparse) → Categorical semantics
- int, str (ragged) → Categorical-Set semantics

In some cases, the inferred semantics is incorrect. For example, an Enum stored as an integer is semantically categorical, but it will be detected as numerical. In this case, you should specify the semantic argument in the input. The `education_num` field of the Adult dataset is a classic example.
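
To see why this happens, here is a simplified sketch of the inference rules listed above. `infer_semantic` is a hypothetical helper written for this notebook, not TF-DF's own code, and the sample values are made up.

```python
def infer_semantic(values):
    """Hypothetical helper that mimics the inference rules listed above."""
    if all(isinstance(v, (list, tuple)) for v in values):
        return "CATEGORICAL_SET"  # a set of values per example (ragged)
    if all(isinstance(v, (int, float)) for v in values):
        return "NUMERICAL"
    if all(isinstance(v, str) for v in values):
        return "CATEGORICAL"
    return "UNKNOWN"

# An enum stored as integers is detected as a quantity...
education_num = [9, 13, 10]
print(infer_semantic(education_num))

# ...while the same codes stored as text are detected as categories.
education_as_text = [str(v) for v in education_num]
print(infer_semantic(education_as_text))

# A tokenized sentence per example looks like a categorical set.
print(infer_semantic([["a", "small", "penguin"], ["a", "big", "one"]]))
```

The rules only look at how the values are stored, not at what they mean, which is why the meaning sometimes has to be stated explicitly.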

## ..but wait there's more!!

The fun doesn't have to stop here! There are a TON of free, online resources to help you learn ML and get started with Generative AI!

🆓 Learning Resources:
* [DeepLearning.ai](https://www.deeplearning.ai/short-courses/) series of 1-hour short courses to learn generative AI
* [Introduction to Generative AI](https://www.cloudskillsboost.google/journeys/118) earn badges while following a learning path with videos & exercises
* [ML Crash Course](https://developers.google.com/machine-learning/crash-course) Google's fast-paced, practical introduction to machine learning.
* [fast.ai](https://www.fast.ai/) if you already have some coding background, this is a practical guide to dive into deep learning.
* [Kaggle Competitions](https://www.kaggle.com/competitions) if you're ready to dive in and start coding, check out the "competitions" on Kaggle! It's a great way to apply what you've learned with a community of other learners.
* [SimpleML in Google sheets](https://simplemlforsheets.com/tutorial.html) for a no-code way to get started with ML.
* [Made with TFJS](https://goo.gle/made-with-tfjs) YouTube series that highlights awesome projects made for the web!

### Thank you 💖✨✨

Congratulations on starting your ML journey with TF! We're excited to have you here!

Questions? Comments? Ideas? Inspirations?

For the full notebook with all the comments, see:
https://github.com/ccstan99/introML