# HuggingFace: Active Learning for NLP Classification

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/baal-org/baal/blob/master/notebooks/compatibility/nlp_classification.ipynb)

Learn how to use Baal with [HuggingFace](https://huggingface.co/transformers/main_classes/trainer.html).
Any model that can be trained with the HuggingFace `Trainer` and contains `Dropout` layers can be used in the same way.

We will use the `Yelp Review` dataset and a `distilbert-base-uncased` sequence classification model for this tutorial. As usual, we first need to download the dataset, which we then tokenize.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

pretrained_weights = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=pretrained_weights)

datasets = load_dataset("yelp_review_full", cache_dir="/tmp")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

datasets = datasets.map(tokenize_function, batched=True)

ds = datasets['train']
eval_ds = datasets['test']

## Active Learning Dataset
To create an active learning dataset, we wrap the dataset with `baal.ActiveLearningDataset`, which keeps track of which samples are labelled and which are still in the unlabelled pool.
The wrapper requires a `torch.utils.data.Dataset`, so we propose `baal.active.HuggingFaceDataset`, which takes a HuggingFace dataset
and performs the preprocessing steps.

In [None]:
from baal import ActiveLearningDataset

active_set = ActiveLearningDataset(dataset=ds)
# The dataset already contains labels (research setting), so we disable manual labelling.
active_set.can_label = False

# Let's randomly label 100 samples, so len(active_set) should be 100.
active_set.label_randomly(100)
assert len(active_set) == 100
print(len(active_set.pool))
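
To make the labelled/pool split concrete, here is a minimal pure-Python sketch of the bookkeeping such a wrapper performs. `ToyActiveSet` is a hypothetical stand-in written for this illustration, not Baal's implementation; it only mirrors the behaviour used above (`label_randomly`, `len`, and a `pool` view).

```python
import random


class ToyActiveSet:
    """Hypothetical stand-in for an active learning dataset wrapper."""

    def __init__(self, dataset, seed=0):
        self._dataset = dataset
        # One flag per sample: True once the sample has been labelled.
        self._labelled = [False] * len(dataset)
        self._rng = random.Random(seed)

    def label_randomly(self, n):
        """Move n randomly chosen samples from the pool to the labelled set."""
        pool_indices = [i for i, done in enumerate(self._labelled) if not done]
        for i in self._rng.sample(pool_indices, n):
            self._labelled[i] = True

    def __len__(self):
        # The length counts labelled samples only.
        return sum(self._labelled)

    def __getitem__(self, idx):
        labelled_indices = [i for i, done in enumerate(self._labelled) if done]
        return self._dataset[labelled_indices[idx]]

    @property
    def pool(self):
        """Samples that are still unlabelled."""
        return [x for x, done in zip(self._dataset, self._labelled) if not done]


toy_set = ToyActiveSet(list(range(1000)))
toy_set.label_randomly(100)
print(len(toy_set), len(toy_set.pool))
```

Training only ever sees the labelled part, while the heuristic is evaluated on the pool.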

## Active Learning Model
The process of making a model Bayesian is exactly the same as before. In this case, we load the pretrained `distilbert-base-uncased` model and use `baal.bayesian.dropout.patch_module` to keep its dropout layers stochastic at inference time.

In [None]:

import torch
from transformers import AutoModelForSequenceClassification
from baal.bayesian.dropout import patch_module

use_cuda = torch.cuda.is_available()

model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=pretrained_weights,
                                                           num_labels=5)
model = patch_module(model)
if use_cuda:
    model.cuda()
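
The effect of keeping dropout active at inference can be sketched without any deep learning library. The toy classifier below is hypothetical (fixed illustrative weights, two classes, and the made-up helpers `stochastic_forward` and `mc_predict`); it only shows that repeated forward passes with dropout enabled yield a set of differing predictions rather than a single one.

```python
import math
import random

# Fixed toy weights: 4 input features -> 2 classes (illustrative values).
WEIGHTS = [[0.9, -0.4], [-0.3, 0.8], [0.5, 0.5], [-0.7, 0.2]]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(features, rng, p_drop=0.5):
    """One forward pass with dropout still active (inverted dropout scaling)."""
    kept = [0.0 if rng.random() < p_drop else x / (1.0 - p_drop) for x in features]
    logits = [sum(x * w[c] for x, w in zip(kept, WEIGHTS)) for c in range(2)]
    return softmax(logits)


def mc_predict(features, iterations, seed=0):
    """Collect the class probabilities of several stochastic passes."""
    rng = random.Random(seed)
    return [stochastic_forward(features, rng) for _ in range(iterations)]


samples = mc_predict([1.0, 2.0, -1.0, 0.5], iterations=20)
print(len(samples), len(samples[0]))
```

The spread across these passes is what an uncertainty heuristic consumes.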

## Heuristic

`BALD` is already implemented in Baal and is useful in all classification cases, so we continue using it as our active learning heuristic.

Note: Active learning for NLP tasks is an open and challenging field, so designing a proper heuristic is out of the scope of this tutorial.
We encourage pull requests that propose better heuristics.

In [None]:
from baal.active import get_heuristic

heuristic = get_heuristic('bald')
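
For reference, BALD scores a sample by the mutual information between its prediction and the model parameters: the entropy of the mean prediction minus the mean entropy of the individual stochastic predictions. The function below is a plain-Python sketch of that formula for a single sample, not Baal's implementation; `bald_score` and its input layout (one probability vector per stochastic pass) are assumptions made for this illustration.

```python
import math


def entropy(probs, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -sum(p * math.log(p + eps) for p in probs)


def bald_score(mc_probs):
    """mc_probs: list of probability vectors, one per stochastic forward pass."""
    n_iter = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean_probs = [sum(p[c] for p in mc_probs) / n_iter for c in range(n_classes)]
    expected_entropy = sum(entropy(p) for p in mc_probs) / n_iter
    return entropy(mean_probs) - expected_entropy


agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.99, 0.01], [0.01, 0.99]]
print(bald_score(agree), bald_score(disagree))
```

Samples whose stochastic passes confidently disagree get a high score, while samples on which every pass agrees score close to zero.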

## HuggingFace Trainer Wrapper

If you are not familiar with the HuggingFace `Trainer` module, please start [here](https://huggingface.co/transformers/main_classes/trainer.html).
The HuggingFace `Trainer` is one of the most popular libraries for training Transformer models.
Active learning requires running prediction over every sample in the pool for a number of stochastic iterations, so our wrapper `baal.transformers_trainer_wrapper.BaalTransformersTrainer` adds this functionality on top of what the `Trainer` module already provides.
In the rest of this tutorial, we show how to initialize `baal.experiments.base.ActiveLearningExperiment` and how to run active training.

In [None]:
from transformers import TrainingArguments

from baal.active.stopping_criteria import LabellingBudgetStoppingCriterion
from baal.experiments.base import ActiveLearningExperiment
from baal.transformers_trainer_wrapper import BaalTransformersTrainer

# Initialization for the HuggingFace trainer
training_args = TrainingArguments(
    output_dir='.',  # output directory
    num_train_epochs=3,  # total # of training epochs per AL step
    per_device_train_batch_size=4,  # batch size per device during training
    per_device_eval_batch_size=4,  # batch size for evaluation
    weight_decay=0.01,  # strength of weight decay
    logging_dir='.',  # directory for storing logs
)

# create the trainer through Baal Wrapper
baal_trainer = BaalTransformersTrainer(model=model,
                                       args=training_args,
                                       train_dataset=active_set,
                                       tokenizer=tokenizer)

experiment = ActiveLearningExperiment(trainer=baal_trainer,
                                      al_dataset=active_set,
                                      eval_dataset=eval_ds,
                                      heuristic=heuristic,
                                      query_size=50,  # samples to label per AL step
                                      iterations=20,  # stochastic forward passes per sample
                                      criterion=LabellingBudgetStoppingCriterion(active_dataset=active_set,
                                                                                 labelling_budget=100))
experiment.start()
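
Conceptually, the experiment alternates between training, scoring the pool, and labelling the most uncertain samples until the budget is spent. The loop below is a simplified sketch of that cycle under stated assumptions: `toy_uncertainty` is a hypothetical placeholder for model training plus heuristic scoring, and the stopping rule simply counts newly labelled samples. It is not the code that `ActiveLearningExperiment` runs.

```python
import random


def toy_uncertainty(sample, rng):
    """Hypothetical stand-in for 'train, predict with MC Dropout, apply heuristic'."""
    return rng.random()


def active_learning_loop(pool, initial_labelled, query_size, labelling_budget, seed=0):
    rng = random.Random(seed)
    labelled = list(initial_labelled)
    pool = list(pool)
    newly_labelled = 0
    steps = 0
    while pool and newly_labelled < labelling_budget:
        # 1. (Re)train on `labelled` (omitted in this sketch).
        # 2. Score every pool sample; higher means more informative.
        scores = {sample: toy_uncertainty(sample, rng) for sample in pool}
        # 3. Label the top-scoring samples without exceeding the budget.
        n_query = min(query_size, labelling_budget - newly_labelled, len(pool))
        to_label = sorted(pool, key=scores.get, reverse=True)[:n_query]
        chosen = set(to_label)
        pool = [s for s in pool if s not in chosen]
        labelled.extend(to_label)
        newly_labelled += n_query
        steps += 1
    return labelled, pool, steps


labelled, pool, steps = active_learning_loop(
    pool=range(100, 1000), initial_labelled=range(100),
    query_size=50, labelling_budget=100)
print(len(labelled), len(pool), steps)
```

With a query size of 50 and a budget of 100 new labels, this toy loop stops after two labelling steps.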