# <a name="0">Fine-tuning BERT (Bidirectional Encoder Representations from Transformers) to classify reviews as positive or negative</a>

[HuggingFace Tutorial](https://huggingface.co/docs/transformers/training#train-in-native-pytorch)

We use a small version of BERT called **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)**.

Another well-known variant, intended for Portuguese text, is **[BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased)**.

__Keep in mind that BERT and its variants use more resources than the other models we have covered so far, and you may occasionally hit an out-of-memory error. If that happens, restart the kernel, reduce the batch_size, and run the code again.__

In [None]:
!pip install -q transformers==4.31.0 datasets==2.13.1 "pyarrow>=8.0.0" ipywidgets

In [None]:
import time
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from transformers import Trainer, TrainingArguments, DistilBertForSequenceClassification, DistilBertTokenizerFast
from torch.utils.data import DataLoader
from datasets import load_metric

Let's read the dataset.

In [None]:
df = pd.read_csv("./data/train.csv")

Let's print the dataset information.

In [None]:
df.info()

We drop rows with text field missing.

In [None]:
df.dropna(subset=["reviewText"], inplace=True)

BERT is computationally expensive, so in this demo we use only the first 1,000 data points.

In [None]:
# .copy() avoids a SettingWithCopyWarning when we modify a column of the subset later
df = df.head(1000).copy()

We cast the label column to int64, because the model's classification loss expects integer class labels.

In [None]:
df["isPositive"] = df["isPositive"].astype("int64")

Let's keep 10% of the data for validation.

In [None]:
# Hold out 10% of the data as a validation set, keeping the label ratio the same (stratify)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["reviewText"].tolist(),
    df["isPositive"].tolist(),
    test_size=0.10,
    shuffle=True,
    random_state=324,
    stratify=df["isPositive"].tolist(),
)
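
To make the effect of `stratify` concrete, here is a minimal standard-library sketch of a stratified split. It is not how scikit-learn implements `train_test_split`. The function name `stratified_split` and the toy data are made up for illustration. The idea is simply to hold out the same fraction of each class separately.

```python
import random
from collections import Counter, defaultdict


def stratified_split(texts, labels, test_size=0.10, seed=324):
    """Toy stratified split (hypothetical helper): hold out test_size of each class."""
    rng = random.Random(seed)

    # Group example indices by their label
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # per-class validation count
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    train_texts = [texts[i] for i in train_idx]
    val_texts = [texts[i] for i in val_idx]
    train_labels = [labels[i] for i in train_idx]
    val_labels = [labels[i] for i in val_idx]
    return train_texts, val_texts, train_labels, val_labels


# Made-up imbalanced data: 80 positive and 20 negative reviews
toy_labels = [1] * 80 + [0] * 20
toy_texts = [f"review {i}" for i in range(100)]
tr_x, va_x, tr_y, va_y = stratified_split(toy_texts, toy_labels)
print(Counter(tr_y), Counter(va_y))
```

With these toy inputs the validation set keeps the 80/20 class ratio (8 positive, 2 negative), which a purely random 10% sample would not guarantee.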

Let's load the tokenizer that matches our pretrained DistilBERT checkpoint and use it to encode the texts.

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(val_texts,
                          truncation=True,
                          padding=True)
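
The sketch below illustrates roughly what `truncation=True` and `padding=True` do to a batch of token-id lists. It is a simplified stand-in, not the library's implementation: the helper name `pad_and_truncate` and the token ids are made up, and the real tokenizer keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Hypothetical helper: cut each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)

    # Pad with pad_id and mark real tokens with 1, padding with 0
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, for illustration only
toy_batch = [[101, 2307, 102], [101, 2025, 2204, 2012, 2035, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])       # every row now has the same length
print(encoded["attention_mask"])  # zeros mark the padded positions
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside `input_ids`.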

Next, we wrap the encodings and labels in a PyTorch Dataset so that a DataLoader can batch them.

In [None]:
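# Define the compute device here, because __getitem__ below moves tensors to it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class ReviewDataset(torch.utils.data.Dataset):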
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx]).to(device)
        return item

    def __len__(self):
        return len(self.labels)
    
train_dataset = ReviewDataset(train_encodings, train_labels)
val_dataset = ReviewDataset(val_encodings, val_labels)
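
The dataset class above turns column-oriented encodings (one list per field) into row-oriented examples (one dict per review). The following sketch shows the same indexing protocol with plain Python lists instead of tensors. The class name `ToyReviewDataset` and the values are made up for illustration.

```python
class ToyReviewDataset:
    """Same __getitem__/__len__ protocol as ReviewDataset, without tensors."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # field name -> list with one entry per example
        self.labels = labels

    def __getitem__(self, idx):
        # Pick row idx out of every column and attach its label
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


# Made-up encodings for two short reviews
toy_encodings = {
    "input_ids": [[101, 2307, 102, 0], [101, 2919, 3185, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_dataset = ToyReviewDataset(toy_encodings, labels=[1, 0])
print(len(toy_dataset))
print(toy_dataset[0])
```

A DataLoader relies on exactly these two methods: `__len__` to know how many examples exist and `__getitem__` to fetch each one before collating them into a batch.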

Let's load the pretrained model. This may print a warning that some weights (those of the classification head) are newly initialized. That is expected, since we train them during fine-tuning below.

In [None]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                            num_labels=2)

Let's start the fine-tuning process. This code may take __a long time__ to complete with large datasets.

In [None]:
# Freeze the encoder: only parameters with "classifier" in their name (the classification head) stay trainable
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

# Hyperparameters
num_epochs = 10
learning_rate = 0.01

# Get the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create data loaders
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8, drop_last=True)
eval_dataloader = DataLoader(val_dataset, batch_size=8)  # no drop_last, so every validation example is scored

# Setup the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

metric = load_metric("accuracy")

model = model.to(device)

for epoch in range(num_epochs):
    start = time.time()
    training_loss = 0
    val_loss = 0
    # Training loop starts
    model.train() # put the model in training mode
    for batch in train_dataloader:
        # ** unpacks the batch dict into keyword arguments (input_ids, attention_mask, labels)
        outputs = model(**batch)
        loss = outputs.loss
        training_loss += loss.item()
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()
    
    # Validation loop starts
    model.eval() # put the model in prediction mode
    for batch in eval_dataloader:
        with torch.no_grad():
            # ** unpacks the batch dict into keyword arguments (input_ids, attention_mask, labels)
            outputs = model(**batch)
        loss = outputs.loss
        val_loss += loss.item()
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
        
    # Let's take the average losses
    training_loss = training_loss / len(train_dataloader)
    val_loss = val_loss / len(eval_dataloader)
    end = time.time()
    
    # Adjacent f-strings are joined, so the output has no stray indentation
    print(
        f"Epoch {epoch}. Train_loss {training_loss:.4f}. Val_loss {val_loss:.4f}. "
        f"Val_accuracy {metric.compute()['accuracy']:.4f}. Seconds {end-start:.3f}."
    )
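
The validation loop turns logits into predictions with an argmax and accumulates them batch by batch before computing one accuracy value per epoch. Here is a minimal standard-library sketch of that bookkeeping. The class `RunningAccuracy`, the helper `argmax`, and the logits are made up for illustration, and this is not the `datasets` metric implementation.

```python
def argmax(row):
    """Index of the largest value in a list, like torch.argmax applied to one row."""
    return max(range(len(row)), key=lambda i: row[i])


class RunningAccuracy:
    """Hypothetical stand-in for the metric object: accumulate, then compute once."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        self.correct += sum(p == r for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        accuracy = self.correct / self.total
        self.correct = self.total = 0  # reset so the next epoch starts fresh
        return {"accuracy": accuracy}


# Two made-up batches of logits (one row per review, one column per class)
toy_batches = [
    ([[0.2, 1.5], [2.0, -1.0]], [1, 0]),
    ([[0.1, 0.3], [1.2, 0.4]], [0, 0]),
]
toy_metric = RunningAccuracy()
for logits, labels in toy_batches:
    predictions = [argmax(row) for row in logits]
    toy_metric.add_batch(predictions=predictions, references=labels)
result = toy_metric.compute()
print(result)
```

Three of the four toy predictions match their labels, so the sketch reports an accuracy of 0.75.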

The fine-tuned BERT is able to correctly classify the sentiment of all records in the validation set. Let's print a few validation examples to see how each review is represented: the length of its encoding, its token ids, the original text, and its label.

In [None]:
k = 0
print(len(val_dataset.encodings["input_ids"][k]))
print(val_dataset.encodings["input_ids"][k])
print(val_texts[k])
print(val_labels[k])

In [None]:
k = 24
print(len(val_dataset.encodings["input_ids"][k]))
print(val_dataset.encodings["input_ids"][k])
print(val_texts[k])
print(val_labels[k])

Let's observe in more detail how sentences are tokenized.

In [None]:
st = val_texts[24]
print(st)
tok = tokenizer(st, truncation=True, padding=True)
print(tok)

In [None]:
# tokenizer.vocab maps each token to its id; vocab_size is the number of entries
tokenizer.vocab_size

In [None]:
# The methods convert_ids_to_tokens and convert_tokens_to_ids let us see how sentences are tokenized
print(tokenizer.convert_ids_to_tokens(tok['input_ids']))
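
BERT-family tokenizers split rare words into subword pieces using WordPiece, which is why some tokens printed above start with `##`. The sketch below shows the greedy longest-match-first idea on a single word. It is a simplification under stated assumptions: the function `wordpiece_tokenize` and the tiny vocabulary are made up, and the real tokenizer also handles lowercasing, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Hypothetical helper: greedy longest-match-first split of one word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up miniature vocabulary, for illustration only
toy_vocab = {"[UNK]", "un", "##believ", "##able", "good", "##ness"}
print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("goodness", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Splitting words this way keeps the vocabulary small while still covering words that never appeared as a whole during pretraining.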