<a href="https://colab.research.google.com/github/ftvalentini/itba-NLP/blob/master/SequenceClf_FineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Transfer Learning

We will fine-tune a pretrained BERT-style model (DistilBERT) for sequence classification.

Only the weights of the last layers are updated; the rest of the network is frozen.

In [None]:
!pip install transformers==4.24.0 datasets==2.6.1 watermark

In [None]:
import numpy as np
import pandas as pd
import torch
import datasets
from datasets import load_dataset, load_metric
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
)
from IPython.display import display, HTML
from sklearn.linear_model import LogisticRegression

In [None]:
%reload_ext watermark

In [None]:
%watermark -vp torch,transformers,datasets,sklearn

Python implementation: CPython
Python version       : 3.7.15
IPython version      : 7.9.0

torch       : 1.12.1+cu113
transformers: 4.24.0
datasets    : 2.6.1
sklearn     : 1.0.2



In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


## Dataset

We will solve one of the GLUE tasks:

[CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability). The goal is to determine whether a sentence is grammatically acceptable (1) or not (0).

In [None]:
full_dataset = load_dataset("glue", "cola")

In [None]:
full_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [None]:
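def show_random_elements(dataset, num_examples=10):
    """Display distinct random rows of a dataset (adapted from the HF notebook).
    """
    assert num_examples <= len(dataset), "Cannot pick more rows than the dataset has."
    picks = []
    for _ in range(num_examples):
        # np.random.randint excludes the upper bound, so use len(dataset)
        # (len(dataset)-1 would never select the last row)
        pick = np.random.randint(0, len(dataset))
        while pick in picks:
            pick = np.random.randint(0, len(dataset))
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    # replace integer class ids with their names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))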

show_random_elements(full_dataset["train"], num_examples=6)

| | sentence | label | idx |
|---|---|---|---|
| 0 | Reports on the covers of which the government prescribes the height of the lettering almost always put me to sleep. | unacceptable | 1350 |
| 1 | Rory ate muffins. | acceptable | 6062 |
| 2 | The cops spoke to the janitor about it yesterday, that robbery. | acceptable | 1837 |
| 3 | John wonders where him to go. | unacceptable | 418 |
| 4 | Who said he would give the cloak to Lee? | acceptable | 7700 |
| 5 | She has kissed she. | unacceptable | 8015 |


In [None]:
# the GLUE test split has no public labels, so every test label is -1
print("Class distribution:")
for k in full_dataset.keys():
    print(k)
    print(pd.Series(full_dataset[k]["label"]).value_counts())
    print("-"*70)

Class distribution:
train
1    6023
0    2528
dtype: int64
----------------------------------------------------------------------
validation
1    721
0    322
dtype: int64
----------------------------------------------------------------------
test
-1    1063
dtype: int64
----------------------------------------------------------------------
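As a point of reference for the counts above, here is a minimal sketch of the majority-class baseline, that is, always predicting the most frequent label (1). The counts are copied from the output above; the helper name `majority_baseline_accuracy` is made up for this illustration.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# label counts copied from the class distribution printed above
train_counts = {1: 6023, 0: 2528}
validation_counts = {1: 721, 0: 322}

train_baseline = majority_baseline_accuracy(train_counts)
validation_baseline = majority_baseline_accuracy(validation_counts)
print(f"train: {train_baseline:.3f}  validation: {validation_baseline:.3f}")
```

Because a trivial classifier already reaches roughly 0.69 accuracy on validation, plain accuracy is a weak yardstick for this task; a metric that accounts for class imbalance is more informative.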


In [None]:
print("Sentence length:")
for k in full_dataset.keys():
    print(k)
    # length in characters (not tokens) of each sentence
    lengths = pd.Series(full_dataset[k]["sentence"]).str.len()
    print(np.quantile(lengths, q=np.arange(0, 1.1, .1)).astype(int))
    print("-"*70)

Sentence length:
train
[  6  21  26  30  33  37  41  46  52  65 231]
----------------------------------------------------------------------
validation
[  9  20  25  29  33  36  42  47  56  69 157]
----------------------------------------------------------------------
test
[  7  20  25  29  33  36  41  46  53  66 152]
----------------------------------------------------------------------


## Tokenization and model

In [None]:
model_checkpoint = "distilbert-base-cased"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [None]:
def tokenize_fn(examples):
    """No padding here --> it is applied later, on each training batch
    """
    return tokenizer(examples["sentence"], truncation=True)

In [None]:
tokenize_fn(full_dataset['train'][:3])

{'input_ids': [[101, 3458, 2053, 1281, 112, 189, 4417, 1142, 3622, 117, 1519, 2041, 1103, 1397, 1141, 1195, 17794, 119, 102], [101, 1448, 1167, 23563, 1704, 2734, 1105, 146, 112, 182, 2368, 1146, 119, 102], [101, 1448, 1167, 23563, 1704, 2734, 1137, 146, 112, 182, 2368, 1146, 119, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
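Since the tokenized sentences above have different lengths (19, 14 and 14 tokens), they must be padded to a common length before they can be stacked into a batch. The sketch below illustrates per-batch ("dynamic") padding in plain Python. It is not the Hugging Face implementation: `pad_batch` is a hypothetical helper, the token ids are placeholders, and the padding id 0 is assumed from the `padding_idx=0` shown in the model summary.

```python
def pad_batch(input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build the mask."""
    max_len = max(len(ids) for ids in input_ids)
    padded, attention_mask = [], []
    for ids in input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": attention_mask}

# placeholder ids with the same lengths as the three sentences above
batch = pad_batch([
    [101] + [7] * 17 + [102],
    [101] + [7] * 12 + [102],
    [101] + [7] * 12 + [102],
])
print([len(ids) for ids in batch["input_ids"]])
```

Padding only to the longest sequence of each batch, instead of to a global maximum, avoids spending computation on padding tokens when most sentences are short.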

In [None]:
tokenized_dataset = full_dataset.map(tokenize_fn, batched=True, batch_size=32)

In [None]:
# map ignores tensor formatting while writing a cache file
tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
# del full_dataset

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
model.to(device)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifi

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

## Fine-tuning

We need to define a metric to evaluate the model on the validation set during training.

Since the best model is not necessarily the one at the end of training, we keep the checkpoint that scores best on the validation metric and load it once training finishes.

We do not run a hyperparameter search (learning rate, L2 regularization, etc.). For that, see [the HF notebook](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb).

In [None]:
# freeze all layers
for param in model.parameters():
    param.requires_grad = False

In [None]:
# unfreeze the last layers (the classification head)
for param in model.pre_classifier.parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True
# and the last transformer block:
for param in model.distilbert.transformer.layer[-1].parameters():
    param.requires_grad = True
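One way to check that the freeze took effect is to count the parameters that still require gradients. The sketch below does so on a toy network, which is a hypothetical stand-in and not DistilBERT; `count_parameters` is an illustrative helper, not part of any library.

```python
import torch.nn as nn

def count_parameters(module):
    """Return (trainable, total) parameter counts of a module."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    total = sum(p.numel() for p in module.parameters())
    return trainable, total

# toy stand-in for the real network (hypothetical, not DistilBERT)
toy = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2))
for param in toy.parameters():
    param.requires_grad = False   # freeze everything
for param in toy[-1].parameters():
    param.requires_grad = True    # unfreeze the last layer only

trainable, total = count_parameters(toy)
print(f"trainable: {trainable} / {total}")
```

Calling `count_parameters(model)` on the notebook's model right after the unfreezing cell should report far fewer trainable parameters than the total. The result can also be compared with the "Number of trainable parameters" line that the Trainer logs when training starts.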

In [None]:
metric_name = "matthews_correlation"
metric = load_metric(metric_name)

  


In [None]:
model_name = model_checkpoint.split("/")[-1]

In [None]:
args = TrainingArguments(
    f"{model_name}-finetuned-cola",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
    seed=33,
)

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    #print(predictions.mean())
    return metric.compute(predictions=predictions, references=labels)
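For reference, the Matthews correlation coefficient (MCC) can be computed directly from the binary confusion matrix. The sketch below is a plain-Python illustration, not the code behind `load_metric`; the function name and the toy labels are made up, and returning 0 when the denominator vanishes is a convention assumed here.

```python
import math

def matthews_corrcoef_binary(references, predictions):
    """MCC for binary labels (0/1), computed from the confusion matrix."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # assumed convention: 0 when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom if denom else 0.0

labels = [1, 1, 1, 0, 0, 1]                                # toy references
mcc_always_one = matthews_corrcoef_binary(labels, [1] * len(labels))
mcc_perfect = matthews_corrcoef_binary(labels, labels)
mcc_mixed = matthews_corrcoef_binary(labels, [1, 1, 0, 0, 1, 1])
print(mcc_always_one, mcc_perfect, mcc_mixed)
```

Unlike accuracy, MCC gives no credit to a classifier that always predicts the majority class, which makes it a reasonable choice for the imbalanced CoLA labels.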

In [None]:
# we pass the tokenizer so that padding is applied on each batch
# the alternative is to pass a custom data_collator
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 8551
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 5350
  Number of trainable parameters = 65783042
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5309 | 0.489371 | 0.435192 |
| 2 | 0.3603 | 0.523678 | 0.445489 |
| 3 | 0.2559 | 0.721148 | 0.458663 |
| 4 | 0.179 | 0.817184 | 0.496359 |
| 5 | 0.1358 | 1.060351 | 0.48097 |
| 6 | 0.1035 | 1.153641 | 0.503037 |
| 7 | 0.0798 | 1.279743 | 0.491332 |
| 8 | 0.067 | 1.350947 | 0.491932 |
| 9 | 0.0397 | 1.44963 | 0.495515 |
| 10 | 0.0385 | 1.508287 | 0.485945 |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to distilbert-base-cased-finetuned-cola/checkpoint-535
Configuration saved in distilbert-base-cased-finetuned-cola/checkpoint-535/config.json
Model weights saved in distilbert-base-cased-finetuned-cola/checkpoint-535/pytorch_model.bin
tokenizer config file saved in distilbert-base-cased-finetuned-cola/checkpoint-535/tokenizer_config.json
Special tokens file saved in distilbert-base-cased-finetuned-cola/checkpoint-535/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If id

TrainOutput(global_step=5350, training_loss=0.16955627013589733, metrics={'train_runtime': 354.5037, 'train_samples_per_second': 241.21, 'train_steps_per_second': 15.092, 'total_flos': 465498976814988.0, 'train_loss': 0.16955627013589733, 'epoch': 10.0})
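The `global_step=5350` reported above follows from simple arithmetic on the values in the training log. The sketch assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept rather than dropped.

```python
import math

num_examples = 8551   # training examples, from the log above
batch_size = 16       # per_device_train_batch_size on a single device
num_epochs = 10

# the last batch of each epoch is smaller but still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

The per-epoch step count also matches the name of the first saved checkpoint, `checkpoint-535`, since a checkpoint is written at the end of every epoch.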

In [None]:
# run evaluate() on the validation data to verify that the
# best-performing model was kept
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16


{'eval_loss': 1.1536407470703125,
 'eval_matthews_correlation': 0.5030366431605939,
 'eval_runtime': 1.0834,
 'eval_samples_per_second': 962.695,
 'eval_steps_per_second': 60.918,
 'epoch': 10.0}
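The evaluation result above can be cross-checked against the per-epoch training results. This small sketch copies the validation columns from the training output and looks up the best epoch under each criterion; the variable names are illustrative.

```python
# (epoch, validation loss, Matthews correlation) copied from the training output
history = [
    (1, 0.489371, 0.435192),
    (2, 0.523678, 0.445489),
    (3, 0.721148, 0.458663),
    (4, 0.817184, 0.496359),
    (5, 1.060351, 0.480970),
    (6, 1.153641, 0.503037),
    (7, 1.279743, 0.491332),
    (8, 1.350947, 0.491932),
    (9, 1.449630, 0.495515),
    (10, 1.508287, 0.485945),
]

best_by_mcc = max(history, key=lambda row: row[2])    # higher is better
best_by_loss = min(history, key=lambda row: row[1])   # lower is better
print("best epoch by MCC:", best_by_mcc[0])
print("best epoch by validation loss:", best_by_loss[0])
```

The `eval_loss` and `eval_matthews_correlation` returned by `trainer.evaluate()` agree with the row selected by MCC, which suggests the best checkpoint was indeed reloaded. Note that the two criteria disagree: validation loss is lowest much earlier, while it keeps growing as training loss shrinks, a pattern consistent with overfitting.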

In [None]:
# check performance on the training set:
trainer.evaluate(tokenized_dataset["train"])

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8551
  Batch size = 16


{'eval_loss': 0.03277109935879707,
 'eval_matthews_correlation': 0.9766473780275736,
 'eval_runtime': 9.2131,
 'eval_samples_per_second': 928.135,
 'eval_steps_per_second': 58.069,
 'epoch': 10.0}

In [None]:
# error analysis: examples with the highest loss

In [None]:
data_collator = trainer.data_collator

def loss_per_example(examples):
    """Add the positive-class probability, prediction and loss of each example to a batch
    """
    # the collator pads the batch and already returns tensors, so we move them
    # with .to(device) instead of re-wrapping them in torch.tensor (which warns)
    batch = data_collator(examples)
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)
    with torch.inference_mode():
        output = model(input_ids, attention_mask)
        batch["proba"] = torch.softmax(output.logits, dim=1)[:, 1]
        batch["predicted_label"] = torch.argmax(output.logits, dim=1)
        # reduction="none" --> one loss value per example
        batch["loss"] = torch.nn.functional.cross_entropy(
            output.logits, labels, reduction="none")
    # datasets requires lists or NumPy arrays
    for k, v in batch.items():
        batch[k] = v.cpu().numpy()
    return batch

In [None]:
model.eval()
errors_dataset = tokenized_dataset['validation'].map(
    loss_per_example, batched=True, batch_size=16)


In [None]:
errors_dataset.set_format('pandas')
errors_df = errors_dataset[:][['label', 'proba', 'predicted_label', 'loss']]
# The trainer removes in place any feature of type str
# --> we restore the column from the original dataset
errors_df['sentence'] = full_dataset['validation']['sentence']

In [None]:
pd.set_option("display.max_colwidth", None)

In [None]:
# false positives
errors_df.query("label == 0").sort_values("loss", ascending=False).head()

| | label | proba | predicted_label | loss | sentence |
|---|---|---|---|---|---|
| 546 | 0 | 0.999708 | 1 | 8.138898 | Since Jill said Joe had invited Sue, we didn't have to ask who. |
| 605 | 0 | 0.999699 | 1 | 8.108755 | Agnes wondered how John could eat but it's not clear what. |
| 533 | 0 | 0.999637 | 1 | 7.922245 | John ate dinner but I don't know who. |
| 754 | 0 | 0.999637 | 1 | 7.921374 | We found your letter to ourselves in the trash. |
| 588 | 0 | 0.999631 | 1 | 7.904643 | Sally asked if somebody was going to fail math class, but I can't remember who. |


In [None]:
# false negatives
errors_df.query("label == 1").sort_values("loss", ascending=False).head()

| | label | proba | predicted_label | loss | sentence |
|---|---|---|---|---|---|
| 247 | 1 | 0.00041 | 0 | 7.798517 | John placed Kim behind the garage. |
| 580 | 1 | 0.000495 | 0 | 7.610415 | She was dancing with somebody, but I don't know who with. |
| 544 | 1 | 0.00054 | 0 | 7.523472 | Joan ate dinner with someone but I don't know who with. |
| 398 | 1 | 0.000551 | 0 | 7.50464 | The man who Mary loves and Sally hates computed my tax. |
| 856 | 1 | 0.000583 | 0 | 7.446771 | They preferred them arrested. |
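The `loss` column in both error tables is the cross-entropy of each example, that is, the negative log of the probability the model assigned to the true label. A minimal sketch, using the top row of each table and a made-up helper name, reproduces the reported values up to the rounding of `proba`:

```python
import math

def binary_cross_entropy(proba_positive, label):
    """Per-example loss: -log of the probability assigned to the true label."""
    p_true = proba_positive if label == 1 else 1.0 - proba_positive
    return -math.log(p_true)

# top false positive: label 0, proba 0.999708, reported loss 8.138898
fp_loss = binary_cross_entropy(0.999708, label=0)
# top false negative: label 1, proba 0.00041, reported loss 7.798517
fn_loss = binary_cross_entropy(0.00041, label=1)
print(f"{fp_loss:.3f} {fn_loss:.3f}")
```

Sorting by this loss therefore surfaces the examples on which the model is most confidently wrong.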


## References

* [rasbt's notebooks](https://github.com/rasbt/deeplearning-models#transformers)
* [HuggingFace notebooks](https://huggingface.co/docs/transformers/notebooks)
* [Lewis Tunstall's blog](https://lewtun.github.io/blog/til/nlp/huggingface/transformers/2021/01/01/til-data-collator.html)