<a href="https://colab.research.google.com/github/ftvalentini/itba-NLP/blob/master/SequenceClf_FeatureExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Transfer Learning

We will use BERT as a feature extractor to solve a classification problem.

Once we have a vector representation of the input sequence, we train a classifier that we can use to predict on new data.

In [None]:
!pip install transformers==4.24.0 datasets==2.6.1 watermark


In [None]:
import numpy as np
import pandas as pd
import torch
import datasets
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoModel
from IPython.display import display, HTML
from sklearn.linear_model import LogisticRegression

In [None]:
%reload_ext watermark

In [None]:
%watermark -vp torch,transformers,datasets,sklearn

Python implementation: CPython
Python version       : 3.7.15
IPython version      : 7.9.0

torch       : 1.12.1+cu113
transformers: 4.24.0
datasets    : 2.6.1
sklearn     : 1.0.2



In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


## Dataset

We will tackle one of the GLUE tasks:

[CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability). The goal is to determine whether a sentence is grammatically acceptable (1) or not (0).

In [None]:
full_dataset = load_dataset("glue", "cola")


In [None]:
full_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [None]:
full_dataset["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [None]:
def show_random_elements(dataset, num_examples=10):
    # Sample distinct row indices without replacement
    picks = np.random.choice(len(dataset), size=num_examples, replace=False).tolist()
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(full_dataset["train"], num_examples=10)

| | sentence | label | idx |
|---|---|---|---|
| 0 | the student who everyone likes left. | acceptable | 4873 |
| 1 | Some sentences can go on. | acceptable | 3442 |
| 2 | Me gave it to him. | unacceptable | 7969 |
| 3 | Cora coiled the rope around the post. | acceptable | 2583 |
| 4 | Maxwell isn't half the doctor that his sister is a psychologist and his father was. | unacceptable | 1700 |
| 5 | He has seen his children. | acceptable | 4442 |
| 6 | The boat was sunk to collect the insurance. | acceptable | 461 |
| 7 | Mr Knightley suggested that thieves would break into Hartfield. | acceptable | 6697 |
| 8 | Louise broke the cup. | acceptable | 6821 |
| 9 | I explained it to Bill that she was lying. | acceptable | 1569 |


In [None]:
print("Class distribution:")
for k in full_dataset.keys():
    print(k)
    print(pd.Series(full_dataset[k]["label"]).value_counts())
    print("-"*70)

Class distribution:
train
1    6023
0    2528
dtype: int64
----------------------------------------------------------------------
validation
1    721
0    322
dtype: int64
----------------------------------------------------------------------
test
-1    1063
dtype: int64
----------------------------------------------------------------------
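
The counts above show that the classes are imbalanced: `acceptable` (1) is more than twice as frequent as `unacceptable` (0). As a quick sanity check, the sketch below computes the accuracy of a hypothetical baseline that always predicts the majority class. The counts are copied from the output above; `majority_baseline_accuracy` is an illustrative helper, not part of the notebook.

```python
# Class counts copied from the output above: {label: count}
counts = {
    "train": {1: 6023, 0: 2528},
    "validation": {1: 721, 0: 322},
}

def majority_baseline_accuracy(split_counts):
    """Accuracy of a classifier that always predicts the most frequent label."""
    return max(split_counts.values()) / sum(split_counts.values())

for split, split_counts in counts.items():
    print(split, round(majority_baseline_accuracy(split_counts), 3))
```

Such a baseline already reaches roughly 70% accuracy on train and 69% on validation without looking at the text, so accuracy alone would be a misleading score here. This is one reason CoLA is evaluated with the Matthews correlation coefficient.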


In [None]:
# the test split has no labels --> it is what gets submitted to the benchmark!
full_dataset["test"][:3]

{'sentence': ['Bill whistled past the house.',
  'The car honked its way down the road.',
  'Bill pushed Harry off the sofa.'],
 'label': [-1, -1, -1],
 'idx': [0, 1, 2]}

In [None]:
print("Sentence length:")
for k in full_dataset.keys():
    print(k)
    largos = pd.Series(full_dataset[k]["sentence"]).str.len()
    print(np.quantile(largos, q=np.arange(0, 1.1, .1)).astype(int))
    print("-"*70)

Sentence length:
train
[  6  21  26  30  33  37  41  46  52  65 231]
----------------------------------------------------------------------
validation
[  9  20  25  29  33  36  42  47  56  69 157]
----------------------------------------------------------------------
test
[  7  20  25  29  33  36  41  46  53  66 152]
----------------------------------------------------------------------


## Tokenization and feature extraction

We load a model without a head because we only need BERT (here, the `distilbert-base-cased` checkpoint) to extract features from the text.

In [None]:
model_checkpoint = "distilbert-base-cased"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [None]:
print("max length:", tokenizer.model_max_length)
print("Vocab size:", tokenizer.vocab_size)

max length: 512
Vocab size: 28996


In [None]:
def tokenize_fn(examples):
    return tokenizer(examples["sentence"], truncation=True, padding=True, return_tensors="pt")

In [None]:
# We will extract the features using the same batches we tokenize with
batch_size = 10

In [None]:
tokenized_dataset = full_dataset.map(tokenize_fn, batched=True, batch_size=batch_size)


In [None]:
# map ignores tensor formatting while writing a cache file
# --> convert the columns to tensors
tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
# each batch is padded to the length of its own longest sequence
# --> that is why the forward pass must use the same batch_size
print(tokenized_dataset["train"][:10]["input_ids"].shape)
print(tokenized_dataset["train"][10:20]["input_ids"].shape)

torch.Size([10, 19])
torch.Size([10, 12])
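
The two shapes differ because `padding=True` pads each tokenized batch only up to its own longest sequence. The sketch below illustrates that behavior in plain Python, under stated assumptions: `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` mirrors the zeros visible in the notebook's `input_ids`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Toy token ids (made up for illustration)
batch_a = [[101, 7, 8, 9, 102], [101, 7, 102]]
batch_b = [[101, 7, 102], [101, 102]]

ids_a, mask_a = pad_batch(batch_a)
ids_b, mask_b = pad_batch(batch_b)
print(len(ids_a[0]), len(ids_b[0]))  # 5 3
print(mask_a[1])                     # [1, 1, 1, 0, 0]
```

Rows from different tokenization batches therefore cannot be stacked into one tensor unless they are padded again, which is why the slices used for the forward pass need to line up with the tokenization batches.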


In [None]:
tokenized_dataset["train"][:2]

{'label': tensor([1, 1]),
 'input_ids': tensor([[  101,  3458,  2053,  1281,   112,   189,  4417,  1142,  3622,   117,
           1519,  2041,  1103,  1397,  1141,  1195, 17794,   119,   102],
         [  101,  1448,  1167, 23563,  1704,  2734,  1105,   146,   112,   182,
           2368,  1146,   119,   102,     0,     0,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])}

In [None]:
tokenized_dataset["train"][12:14]

{'label': tensor([1, 1]),
 'input_ids': tensor([[ 101, 2617, 3733, 1149, 1104, 1103, 1395,  119,  102,    0,    0,    0],
         [ 101, 1109, 4605, 1200, 1447, 1174, 1103, 4637, 3596,  119,  102,    0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}

In [None]:
# del full_dataset

In [None]:
model = AutoModel.from_pretrained(model_checkpoint)
_ = model.to(device)


Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# extract the CLS embedding for a test batch
batch_prueba = {
    "attention_mask": tokenized_dataset["train"][:batch_size]["attention_mask"].to(device),
    "input_ids": tokenized_dataset["train"][:batch_size]["input_ids"].to(device)
}
with torch.inference_mode(): # like no_grad(), but also disables view tracking and version counters
    output_prueba = model(**batch_prueba)
cls_token_output = output_prueba.last_hidden_state[:, 0]

print(output_prueba.last_hidden_state.shape)
print(cls_token_output.shape)

torch.Size([10, 19, 768])
torch.Size([10, 768])
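
The indexing `last_hidden_state[:, 0]` keeps, for every sequence in the batch, only the hidden vector at position 0, which is where the tokenizer places the `[CLS]` token (each row of `input_ids` above starts with the same id, 101). The sketch below reproduces the shapes with a random NumPy array standing in for the model output; it is an illustration of the slicing only, not of the model.

```python
import numpy as np

# Stand-in for the model output: (batch_size, seq_len, hidden_size)
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(10, 19, 768))

# Position 0 of every sequence holds the [CLS] token
cls_embeddings = last_hidden_state[:, 0]
print(cls_embeddings.shape)  # (10, 768)

# Row i of the result is exactly the first token vector of sequence i
same = np.array_equal(cls_embeddings[3], last_hidden_state[3, 0, :])
print(same)  # True
```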


In [None]:
def get_embeddings(examples):
    """Use the [CLS] token embedding to represent each sequence."""
    inputs = {key: tensor.to(device) for key,tensor in examples.items() if key != "label"}
    with torch.inference_mode():
        output = model(**inputs).last_hidden_state[:, 0]
    return {"features": output.cpu().numpy()}

In [None]:
model.eval()
featurized_dataset = tokenized_dataset.map(
    get_embeddings, batched=True, batch_size=batch_size)


In [None]:
featurized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask', 'features'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask', 'features'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask', 'features'],
        num_rows: 1063
    })
})

In [None]:
# use NumPy arrays to train/evaluate the model
X_train = np.array(featurized_dataset["train"]["features"])
y_train = np.array(featurized_dataset["train"]["label"])

X_val = np.array(featurized_dataset["validation"]["features"])
y_val = np.array(featurized_dataset["validation"]["label"])

X_test = np.array(featurized_dataset["test"]["features"])
y_test = np.array(featurized_dataset["test"]["label"])

## Model

Trained on the BERT embeddings extracted above.

We will do _error analysis_ (inspect the examples the model scores worst).

In [None]:
mod = LogisticRegression(max_iter=1000)
mod.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [None]:
metric = load_metric('glue', "cola") # matthews corr coefficient

In [None]:
scores_train = mod.predict_proba(X_train)[:, 1]
pred_train = scores_train.round()
metric.compute(predictions=pred_train, references=y_train)

{'matthews_correlation': 0.39034393927076266}
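
Calling `.round()` on the predicted probabilities amounts to thresholding them at 0.5. The sketch below checks this on a few made-up scores; note that NumPy rounds exact halves to the nearest even value, so a score of exactly 0.5 becomes class 0.

```python
import numpy as np

# Made-up probabilities for the positive class
scores = np.array([0.12, 0.5, 0.50001, 0.97])

rounded = scores.round()                      # what the notebook does
thresholded = (scores > 0.5).astype(float)    # explicit threshold at 0.5

print(rounded)      # [0. 0. 1. 1.]
print(thresholded)  # [0. 0. 1. 1.]
```

Writing the threshold explicitly makes it easier to experiment with cut-offs other than 0.5, which can matter on an imbalanced task like this one.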

In [None]:
scores_val = mod.predict_proba(X_val)[:, 1]
pred_val = scores_val.round()
metric.compute(predictions=pred_val, references=y_val)

{'matthews_correlation': 0.2741631371163882}
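
For reference, the Matthews correlation coefficient reported above can be computed from the four cells of the confusion matrix. The function below is a minimal pure-Python sketch of the binary case, not the implementation used by `load_metric`; it returns 0 when the denominator vanishes, which is the usual convention.

```python
import math

def matthews_corrcoef_binary(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Toy labels (made up for illustration)
y_true = [1, 1, 1, 1, 0, 0]
print(matthews_corrcoef_binary(y_true, [1, 1, 1, 1, 1, 1]))  # 0.0 (always predicts 1)
print(matthews_corrcoef_binary(y_true, [1, 1, 1, 0, 1, 0]))  # 0.25
print(matthews_corrcoef_binary(y_true, [1, 1, 1, 1, 0, 0]))  # 1.0 (perfect)
```

The score ranges from -1 to 1, and a constant prediction scores 0 regardless of the class balance, so the validation value of about 0.27 indicates a modest but real signal in the extracted features.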

In [None]:
df_val = pd.DataFrame({"y": y_val, "score": scores_val, "idx": featurized_dataset["validation"]["idx"]})

In [None]:
# most egregious false positives
top_fp = df_val.query("y == 0").sort_values("score", ascending=False).head(5)
top_fp

| | y | score | idx |
|---|---|---|---|
| 674 | 0 | 0.969737 | 674 |
| 78 | 0 | 0.953569 | 78 |
| 202 | 0 | 0.945956 | 202 |
| 218 | 0 | 0.932601 | 218 |
| 659 | 0 | 0.931154 | 659 |


In [None]:
featurized_dataset["validation"].select(top_fp["idx"])["sentence"]

["Gould's performance of Bach on the piano doesn't please me anywhere as much as Ross's on the harpsichord.",
 'Drowning cats, which is against the law, are hard to rescue.',
 'My heart is pounding me.',
 'John offers many advice.',
 'Millie will send the President an obscene telegram, and Paul, the Secretary a rude letter.']

In [None]:
# most egregious false negatives
top_fn = df_val.query("y == 1").sort_values("score", ascending=True).head(5)
top_fn

| | y | score | idx |
|---|---|---|---|
| 995 | 1 | 0.153679 | 995 |
| 398 | 1 | 0.154213 | 398 |
| 692 | 1 | 0.185759 | 692 |
| 332 | 1 | 0.194522 | 332 |
| 407 | 1 | 0.201296 | 407 |


In [None]:
featurized_dataset["validation"].select(top_fn["idx"])["sentence"]

['John counted on Bill to get there on time.',
 'The man who Mary loves and Sally hates computed my tax.',
 "This is the senator to whose mother's friend's sister's I sent the letter.",
 'With no job would John be happy.',
 'She asked was Alison coming to the party.']

## References

* [rasbt's notebooks](https://github.com/rasbt/deeplearning-models#transformers)
* [HuggingFace notebooks](https://huggingface.co/docs/transformers/notebooks)