# Before you begin

Please <font color='red'>**MAKE A COPY**</font> of this Colab so that your progress is saved.



# Practical ML with YDF

In this tutorial, we'll see how to use YDF to build (part of) an ML pipeline.

## Introduction to YDF

[YDF](https://github.com/google/yggdrasil-decision-forests) is a Python library to train, serve, interpret and productionize decision forest models (such as Random Forests or Gradient Boosted Trees).

The YDF Documentation is available at [ydf.readthedocs.io](https://ydf.readthedocs.io/en/latest/). Please make sure to use the documentation when working on this tutorial.

**Note**: Chatbots tend to hallucinate when asked about YDF, so check anything they suggest against the documentation.

In [None]:
!pip install --quiet ydf

In [None]:
# Load libraries
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets

# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples
train_ds.head(5)

### Training models

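As discussed in class, YDF distinguishes between a **learner** and a **model**. A learner holds the training configuration (such as the label and the task), and calling its `train` method on a dataset returns a model. The model is the trained artifact that you then describe, evaluate and analyze.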
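
The learner/model split can be illustrated without YDF. The sketch below is a toy, not YDF code: `MajorityClassLearner` and `MajorityClassModel` are hypothetical names invented for this illustration. The learner stores the configuration and knows how to train, while the model it returns only knows how to predict.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MajorityClassModel:
    """Toy model (hypothetical): the trained artifact, which can only predict."""

    majority_label: str

    def predict(self, examples):
        # Always predicts the label seen most often during training.
        return [self.majority_label for _ in examples]


@dataclass(frozen=True)
class MajorityClassLearner:
    """Toy learner (hypothetical): holds the configuration and can train."""

    label: str

    def train(self, examples):
        # Count the label values and keep the most frequent one.
        counts = Counter(example[self.label] for example in examples)
        majority_label, _ = counts.most_common(1)[0]
        return MajorityClassModel(majority_label)


toy_examples = [
    {"age": 39, "income": "<=50K"},
    {"age": 52, "income": ">50K"},
    {"age": 28, "income": "<=50K"},
]
toy_learner = MajorityClassLearner(label="income")  # Configuration only
toy_model = toy_learner.train(toy_examples)         # Training yields a model
toy_predictions = toy_model.predict([{"age": 45}])
```

The same learner could be trained again on a different dataset to obtain a different model, which is the pattern the YDF code in this section follows.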

Let's create a GradientBoostedTrees learner and train a model on this dataset.

In [None]:
learner = ydf.GradientBoostedTreesLearner(label="income", task=ydf.Task.CLASSIFICATION)

model = learner.train(train_ds)

In [None]:
model.describe()

In [None]:
evaluation = model.evaluate(test_ds)
auc = evaluation.characteristics[0].auc  # Just the value of the AUC
evaluation  # Interactive report

In [None]:
analysis = model.analyze(test_ds)  # This might take a few seconds
print(analysis.variable_importances()['MEAN_DECREASE_IN_AUC_>50K_VS_OTHERS'])  # Just one of the variable importances
analysis  # Interactive report

## Feature Selection

When a dataset has many features, using only a subset of them in production can be useful for several reasons:
*   Features may be hard or expensive to acquire
*   The model quality might improve (why?)
*   Model size or inference speed might improve

### Exercise

In this exercise, you will implement an algorithm that heuristically finds a good set of 4 features for this dataset, where quality is measured by AUC.
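
Since the exercise is scored by AUC, it may help to recall what the metric measures. For binary classification, ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The sketch below computes it directly from that definition; `roc_auc` is a hypothetical helper written for illustration, not part of YDF, and its quadratic loop is only suitable for small inputs.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    `labels` holds 0/1 values and `scores` the predicted score of class 1.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # The positive example is ranked above the negative one.
            elif p == n:
                wins += 0.5   # Ties count as half a win.
    return wins / (len(positives) * len(negatives))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of the 4 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so small differences near the top of the range can still be meaningful.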

### Make sure to...
*   Keep the test data out of your algorithm. Use only the training data to choose features
*   Be efficient. Trying out all combinations of features is too slow, so design an iterative (heuristic) approach
*   Use the variable importances YDF provides (see above)
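
One iterative heuristic that fits these constraints is backward elimination: train on the current features, rank them by importance, drop the weakest, and repeat until the target count is reached. The sketch below is one possible approach, not the reference solution. `backward_elimination` and `ydf_importance_fn` are hypothetical helpers; the adapter assumes that the learner accepts a `features` argument and that `variable_importances()` returns lists of `(score, feature_name)` pairs, so check both against the YDF documentation.

```python
from typing import Callable, Dict, List, Sequence


def backward_elimination(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    num_features: int = 4,
) -> List[str]:
    """Drops the least important feature each round until num_features remain."""
    if num_features < 1:
        raise ValueError("num_features must be at least 1")
    selected = list(features)
    while len(selected) > num_features:
        # Re-rank on the current subset: importances shift as features go away.
        importances = importance_fn(selected)
        # Features missing from the ranking are treated as having importance 0.
        weakest = min(selected, key=lambda name: importances.get(name, 0.0))
        selected.remove(weakest)
    return selected


def ydf_importance_fn(train_df, valid_df, label="income",
                      importance_key="MEAN_DECREASE_IN_AUC_>50K_VS_OTHERS"):
    """Hypothetical YDF adapter (assumptions stated in the text above)."""
    import ydf  # Imported lazily so the generic part runs without YDF.

    def importance_fn(selected):
        learner = ydf.GradientBoostedTreesLearner(
            label=label, task=ydf.Task.CLASSIFICATION, features=list(selected))
        model = learner.train(train_df)
        # Analyze on held-out validation data, never on the test data.
        ranking = model.analyze(valid_df).variable_importances()[importance_key]
        return {name: score for score, name in ranking}  # Assumed pair order

    return importance_fn


# Toy check with fixed, made-up importances instead of a trained model.
toy_importances = {"a": 0.30, "b": 0.01, "c": 0.20, "d": 0.05, "e": 0.10}
toy_selected = backward_elimination(
    list(toy_importances),
    lambda selected: {name: toy_importances[name] for name in selected},
    num_features=3,
)
```

To use this on the real data, one option is to carve a validation set out of the training data with pandas, for example `valid_df = ds.sample(frac=0.2, random_state=0)` and `train_df = ds.drop(valid_df.index)`, and pass both to `ydf_importance_fn`. With n starting features the loop trains n minus `num_features` models, which is far fewer than trying every combination, at the cost of being greedy: a feature dropped early is never reconsidered.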

In [None]:
def feature_selection(ds, num_features=4):
  # Your code here!
  # Should return the model trained with the `num_features` features you chose
  ...

In [None]:
pruned_model = feature_selection(train_ds)

In [None]:
features_after_pruning = [x.name for x in pruned_model.input_features()]
print(f"The pruned model has features {features_after_pruning}")

In [None]:
pruned_model_eval = pruned_model.evaluate(test_ds)
full_auc = evaluation.characteristics[0].auc
pruned_auc = pruned_model_eval.characteristics[0].auc
print(f"The AUC of the full model is {full_auc}, the AUC of the pruned model is {pruned_auc}")
print(f"The loss of quality is {(1-pruned_auc/full_auc)*100:.2f}%")