<a href="https://colab.research.google.com/github/frank-lacriola/Natural-Language-Processing/blob/main/Handling_Transformers_with_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml

AutoClasses let us create tokenizer (and model) objects from a checkpoint name, without having to import and instantiate the model-specific tokenizer (or model) class ourselves.

In [2]:
from transformers import AutoTokenizer

tknzr = AutoTokenizer.from_pretrained("bert-base-cased") 
tokens = tknzr("I'm learning DNLP")
print(tokens)  # input_ids: the token ids of our sentence; token_type_ids: the segment (sentence A or B) each token belongs to; attention_mask: 1 for tokens the model should attend to

{'input_ids': [101, 146, 112, 182, 3776, 141, 20734, 2101, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


Each model configuration has a maximum number of tokens that it can process. Sentences commonly have different lengths, so the tokenizer exposes three parameters to handle this:

- `max_length` sets the maximum number of tokens to keep
- `truncation` enables truncation of sentences that exceed `max_length`
- `padding` enables padding; with `padding='max_length'`, sentences shorter than `max_length` are padded up to it

The tokenizer also returns the `attention_mask`, which lets the model compute attention weights only for real tokens (and not for padding).
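
To make the interplay of `max_length`, `truncation` and `padding` concrete, here is a minimal pure-Python sketch of what happens to a list of token ids. The helper `pad_or_truncate` is hypothetical (it is not part of 🤗 Transformers), and the special-token ids 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`) are read off the `bert-base-cased` outputs printed in this notebook.

```python
# Special-token ids as they appear in the bert-base-cased outputs of this notebook
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_or_truncate(body_ids, max_length):
    """Hypothetical helper: mimic padding='max_length' together with truncation=True."""
    # Truncation: keep room for [CLS] and [SEP], which are always preserved
    body = body_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(input_ids)
    n_pad = max_length - n_real
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * n_real + [0] * n_pad
    # Padding: fill the remaining positions with [PAD]
    input_ids = input_ids + [PAD_ID] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Word-piece ids of "I'm learning Deep NLP" without special tokens (from the output below)
short = pad_or_truncate([146, 112, 182, 3776, 7786, 21239, 2101], max_length=16)
print(short)

# 30 arbitrary ids, for illustration only: the sequence is cut and still ends with [SEP]
long_ = pad_or_truncate(list(range(1000, 1030)), max_length=16)
print(long_)
```

The short sequence is padded with zeros and masked accordingly, while the long one is cut to 16 ids with no padding at all, which is the same pattern shown by the two tokenizer calls that follow.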

In [3]:
tokens = tknzr("I'm learning Deep NLP", padding='max_length', max_length=16) 
print (tokens)

{'input_ids': [101, 146, 112, 182, 3776, 7786, 21239, 2101, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}


In [4]:
tokens = tknzr("I'm learning Deep NLP at Politecnico di Torino. I'm a 2nd year master student", padding='max_length', max_length=16, truncation=True) 
print (tokens)

# [CLS] special token for encoder model, used for classification/regression tasks
# [SEP] special token to separate multiple sentences
# [PAD] special token for padding

{'input_ids': [101, 146, 112, 182, 3776, 7786, 21239, 2101, 1120, 17129, 3150, 1665, 7770, 1186, 4267, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Tokenizers can also perform the opposite conversion: from the IDs we can reconstruct the sentence.

In [5]:
text = tknzr.decode(tokens.input_ids)
print(text)

[CLS] I'm learning Deep NLP at Politecnico di [SEP]


# Models

The AutoModel class takes care of instantiating the correct class for the checkpoint we want to use.

Since task-specific models exist on top of the same backbone architecture (e.g., BERT can be used for both sequence classification and token-level classification), the Auto class should be chosen with the task appended to its name (e.g., AutoModelForSequenceClassification).

In [6]:
from transformers import AutoModelForSequenceClassification

bert_model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

However, the pre-trained BERT checkpoint has not been fine-tuned for any specific task: some weights of the classification model are not initialized from the checkpoint, which is the reason behind the warning. Before using this model we need to fine-tune it (or we can use another model that has already been fine-tuned for the task).

In [7]:
from transformers import AutoModelForSequenceClassification

bert_model_sc = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

In [8]:
import numpy as np

sentences = ["Google stocks went up suddenly, I earned 30B$"]
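# Note: `tknzr` is the bert-base-cased tokenizer loaded above. The tokenizer should normally come
# from the same checkpoint as the model, e.g. AutoTokenizer.from_pretrained("ProsusAI/finbert").
tokenized_sentence = tknzr(sentences, return_tensors='pt', padding='max_length', truncation=True, max_length=16)
pred = bert_model_sc(**tokenized_sentence)  # unpack all the fields of the tokenized sentence as keyword arguments
logits = pred[0][0].detach().numpy()  # logits of the first (and only) sentence
print(logits, np.argmax(logits))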

[-0.12607424 -0.9667538   1.8209392 ] 2


In [9]:
pred

SequenceClassifierOutput([('logits',
                           tensor([[-0.1261, -0.9668,  1.8209]], grad_fn=<AddmmBackward0>))])

The `pred` object is a `SequenceClassifierOutput`. According to the documentation of that class, it has a `logits` attribute plus optional `loss`, `hidden_states` and `attentions` attributes.
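
The logits are unnormalized scores. A minimal sketch of how they can be turned into probabilities with a softmax, using the three values printed above (copied by hand and rounded, so this is an illustration rather than part of the original notebook). The mapping from class index to class name is not shown here; it can be read from the model's `config.id2label`.

```python
import math

# Logits printed above for the example sentence (rounded)
logits = [-0.1261, -0.9668, 1.8209]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
# Index of the most probable class, equivalent to np.argmax on the logits
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```

Softmax preserves the ordering of the scores, so the predicted class is the same index (2) that `np.argmax` returned on the raw logits.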

# Fine Tuning

The pretraining + fine-tuning paradigm is key to the success of the 🤗 Transformers library. The [Model Hub](https://huggingface.co/models) contains plenty of pre-trained models that can be used as they are or fine-tuned on new datasets.

The [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) lets users easily fine-tune the selected model for the task at hand.

In [10]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/transformers_overview/Corona_NLP_train.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/transformers_overview/Corona_NLP_test.csv

--2022-03-06 12:04:29--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/transformers_overview/Corona_NLP_train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10538325 (10M) [text/plain]
Saving to: ‘Corona_NLP_train.csv’


2022-03-06 12:04:29 (77.7 MB/s) - ‘Corona_NLP_train.csv’ saved [10538325/10538325]

--2022-03-06 12:04:29--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/transformers_overview/Corona_NLP_test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1006793 (983K

In [11]:
import pandas as pd

df_train = pd.read_csv("Corona_NLP_train.csv")
df_test = pd.read_csv("Corona_NLP_test.csv")

# Drop every row that has at least one missing value
df_train = df_train.dropna(how='any')
df_test = df_test.dropna(how='any')

In [12]:
df_train.head()

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 5 | 3804 | 48756 | ÜT: 36.319708,-82.363649 | 16-03-2020 | As news of the regions first confirmed COVID-... | Positive |
| 6 | 3805 | 48757 | 35.926541,-78.753267 | 16-03-2020 | Cashier at grocery store was sharing his insig... | Positive |


In [13]:
train_sentences = df_train['OriginalTweet'].tolist()
train_y = df_train['Sentiment'].tolist()
print(f"Train set: {len(train_sentences)}, {len(train_y)}")

Train set: 32567, 32567


In [14]:
eval_samples = int(0.05*len(train_sentences))
eval_sentences = train_sentences[:eval_samples]
eval_y = train_y[:eval_samples]

train_sentences = train_sentences[eval_samples:]
train_y = train_y[eval_samples:]
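
The split above takes the first 5% of the rows in the order they appear in the CSV. If the file happens to be ordered (for example by date), a shuffled split may be more representative. A minimal sketch of that alternative, assuming a fixed seed for reproducibility; `split_train_eval` is a hypothetical helper, not part of the original notebook, and the toy data stands in for the tweets.

```python
import random


def split_train_eval(sentences, labels, eval_fraction=0.05, seed=42):
    """Hypothetical helper: shuffle the indices once, then cut off the evaluation share."""
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)  # local RNG, so the global random state is untouched
    n_eval = int(eval_fraction * len(indices))
    eval_idx, train_idx = indices[:n_eval], indices[n_eval:]

    def pick(items, idx):
        return [items[i] for i in idx]

    # Sentences and labels are selected with the same indices, so they stay aligned
    return (pick(sentences, train_idx), pick(labels, train_idx),
            pick(sentences, eval_idx), pick(labels, eval_idx))


# Toy data for illustration only
toy_sentences = [f"tweet {i}" for i in range(100)]
toy_labels = [i % 5 for i in range(100)]

tr_s, tr_y, ev_s, ev_y = split_train_eval(toy_sentences, toy_labels)
print(len(tr_s), len(ev_s))
```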

In [15]:
test_sentences = df_test["OriginalTweet"].tolist()
test_y = df_test["Sentiment"].tolist()

In [16]:
print(f"Train set: {len(train_sentences)}, {len(train_y)}")
print(f"Eval set: {len(eval_sentences)}, {len(eval_y)}")
print(f"Test set: {len(test_sentences)}, {len(test_y)}")

Train set: 30939, 30939
Eval set: 1628, 1628
Test set: 2964, 2964


In [17]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels = len(set(train_y)))

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.

In [18]:
tokenized_train = tokenizer(train_sentences, padding='max_length', truncation=True, max_length=64)
tokenized_test = tokenizer(test_sentences, padding="max_length", truncation=True, max_length=64)
tokenized_eval = tokenizer(eval_sentences, padding="max_length", truncation=True, max_length=64)

In [19]:
# Label encoding

from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the training labels: each distinct class name gets an integer id
le = LabelEncoder()
le.fit(train_y)

# Apply the same mapping to every split
train_y = le.transform(train_y)
test_y = le.transform(test_y)
eval_y = le.transform(eval_y)
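
As a rough illustration of what the label encoder does, the pure-Python sketch below sorts the distinct class names and assigns each one its position as an integer id. The label set is an assumption: only "Neutral" and "Positive" are visible in the preview above, and the other names are used for illustration. `class_to_id` is a hypothetical name.

```python
# Assumed label values: only "Neutral" and "Positive" appear in the preview above,
# the remaining names are assumptions used for illustration.
labels = ["Neutral", "Positive", "Extremely Negative", "Negative", "Positive", "Extremely Positive"]

# Sorted unique classes define the ids, as scikit-learn's LabelEncoder does
classes = sorted(set(labels))
class_to_id = {name: i for i, name in enumerate(classes)}

encoded = [class_to_id[name] for name in labels]  # what transform() would return
decoded = [classes[i] for i in encoded]           # what inverse_transform() would return

print(class_to_id)
print(encoded)
```

Keeping the fitted encoder around is useful later: `le.inverse_transform` maps predicted ids back to the original class names.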

In [20]:
import torch

class SCDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    # Build one example: take row `idx` of every tokenizer field (input_ids, attention_mask, ...) and convert it to a tensor
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)

In [21]:
train_ds = SCDataset(tokenized_train, train_y)
eval_ds = SCDataset(tokenized_eval, eval_y)
test_ds = SCDataset(tokenized_test, test_y)

In [22]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

def compute_metrics(pred):
  predictions = np.argmax(pred.predictions, axis=-1)
  labels = pred.label_ids
  return {"acc": accuracy_score(labels, predictions),
          "f1_macro": f1_score(labels, predictions, average="macro"),
          "f1_weight": f1_score(labels, predictions, average="weighted")}

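The three metrics returned by `compute_metrics` can differ noticeably when classes are imbalanced. The sketch below recomputes them by hand on a tiny made-up example (the labels and predictions are illustrative, not taken from the dataset), assuming the usual definitions: macro F1 is the plain mean of the per-class F1 scores, weighted F1 weights each class by its number of true samples.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def f1_per_class(labels, preds):
    """Return ({class: F1}, {class: support}) computed from TP/FP/FN counts."""
    scores, support = {}, {}
    for c in sorted(set(labels) | set(preds)):
        tp = sum(l == c and p == c for l, p in zip(labels, preds))
        fp = sum(l != c and p == c for l, p in zip(labels, preds))
        fn = sum(l == c and p != c for l, p in zip(labels, preds))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
        support[c] = sum(l == c for l in labels)
    return scores, support


# Made-up example: the rare class 1 is never predicted
labels = [0, 0, 0, 1]
preds = [0, 0, 0, 0]

scores, support = f1_per_class(labels, preds)
acc = accuracy(labels, preds)
f1_macro = sum(scores.values()) / len(scores)
f1_weighted = sum(scores[c] * support[c] for c in scores) / len(labels)
print(acc, round(f1_macro, 4), round(f1_weighted, 4))
```

Macro F1 drops the most here because the missed class counts as much as the frequent one, which is why it is worth reporting next to accuracy.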
In [23]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
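    warmup_steps=10,                 # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
    evaluation_strategy="epoch",     # evaluate at the end of each epoch
    save_strategy="epoch",           # save a checkpoint at the end of each epoch
    eval_steps=100,                  # only used when evaluation_strategy="steps"
    save_steps=100,                  # only used when save_strategy="steps"
    load_best_model_at_end=True,     # reload the best checkpoint (lowest eval loss by default) after training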
)
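
To see what `warmup_steps=10` does, here is a minimal sketch of a linear schedule with warmup. It assumes the `TrainingArguments` defaults that are not set explicitly above (a base learning rate of 5e-5 and a linear scheduler) and takes the total of 967 optimization steps from the training log of this notebook. `linear_lr` is a hypothetical helper written for illustration.

```python
def linear_lr(step, base_lr=5e-5, warmup_steps=10, total_steps=967):
    """Hypothetical helper: learning rate at a given optimization step."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Afterwards: decay linearly from base_lr down to 0 at the last step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 5, 10, 500, 967):
    print(step, linear_lr(step))
```

With only 10 warmup steps out of 967, the model spends almost the whole epoch in the decay phase.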

In [24]:
from transformers import Trainer

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_ds,         # training dataset
    eval_dataset=eval_ds,             # evaluation dataset
    compute_metrics=compute_metrics
)

trainer.train()

***** Running training *****
  Num examples = 30939
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 967


| Epoch | Training Loss | Validation Loss | Acc | F1 Macro | F1 Weight |
|---|---|---|---|---|---|
| 1 | 0.6979 | 0.678821 | 0.738943 | 0.750261 | 0.737053 |


***** Running Evaluation *****
  Num examples = 1628
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-967
Configuration saved in ./results/checkpoint-967/config.json
Model weights saved in ./results/checkpoint-967/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-967 (score: 0.678820788860321).


TrainOutput(global_step=967, training_loss=0.9078488483054191, metrics={'train_runtime': 754.0133, 'train_samples_per_second': 41.032, 'train_steps_per_second': 1.282, 'total_flos': 1017576526211712.0, 'train_loss': 0.9078488483054191, 'epoch': 1.0})
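
The numbers in the training log can be checked with a little arithmetic. The sketch below uses only values reported above and assumes a single device with no gradient accumulation, as the log indicates.

```python
import math

num_examples = 30939      # "Num examples" in the training log
batch_size = 32           # per_device_train_batch_size
num_epochs = 1
train_runtime = 754.0133  # seconds, from TrainOutput

# The last batch is smaller than 32, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

The results agree with the 967 optimization steps, `train_samples_per_second` and `train_steps_per_second` reported by the Trainer.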

In [25]:
# The Trainer API can also be used to run the model on the test set
preds = trainer.predict(test_ds)
print(preds)

***** Running Prediction *****
  Num examples = 2964
  Batch size = 64


PredictionOutput(predictions=array([[ 2.6345268e-01, -3.2572780e+00,  3.4519835e+00, -1.0118518e-01,
        -2.4304702e-03],
       [-3.5613079e+00,  5.3849685e-01, -3.8158864e-01,  2.0037059e-01,
         3.8210883e+00],
       [ 8.5438071e-03, -3.1067011e+00,  3.1907582e+00,  1.3819455e-01,
         1.1883143e-01],
       ...,
       [-1.5491133e+00, -1.8113766e+00,  2.0123775e+00,  1.2820493e-01,
         1.5449066e+00],
       [-2.0696137e+00, -2.1130445e+00,  1.1042901e-01,  4.2395716e+00,
         7.7974743e-01],
       [-1.4342130e+00,  4.5992270e+00, -2.0125334e+00, -2.1654272e+00,
         3.8095638e-01]], dtype=float32), label_ids=array([0, 4, 2, ..., 2, 3, 1]), metrics={'test_loss': 0.6866244077682495, 'test_acc': 0.738191632928475, 'test_f1_macro': 0.7484057825081532, 'test_f1_weight': 0.7374019574862526, 'test_runtime': 22.4388, 'test_samples_per_second': 132.093, 'test_steps_per_second': 2.095})
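
`predictions` holds one row of five logits per test tweet, and the predicted class is the index of the largest value in each row. A minimal sketch using only the six rows visible in the truncated output above (values copied by hand and rounded), compared against the matching visible entries of `label_ids`:

```python
# The six logit rows visible in the truncated output above (rounded)
visible_logits = [
    [0.2635, -3.2573, 3.4520, -0.1012, -0.0024],
    [-3.5613, 0.5385, -0.3816, 0.2004, 3.8211],
    [0.0085, -3.1067, 3.1908, 0.1382, 0.1188],
    [-1.5491, -1.8114, 2.0124, 0.1282, 1.5449],
    [-2.0696, -2.1130, 0.1104, 4.2396, 0.7797],
    [-1.4342, 4.5992, -2.0125, -2.1654, 0.3810],
]
# The corresponding visible entries of label_ids
visible_labels = [0, 4, 2, 2, 3, 1]

# Row-wise argmax, the pure-Python counterpart of predictions.argmax(-1)
predicted = [max(range(len(row)), key=row.__getitem__) for row in visible_logits]
matches = sum(p == y for p, y in zip(predicted, visible_labels))
print(predicted, f"{matches}/{len(visible_labels)} correct")
```

Only the first visible row is misclassified (class 2 predicted, class 0 expected); six rows are of course far too few to say anything about the overall accuracy.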


# Dataset and metrics

In [26]:
# `load_metric` and `load_dataset` both ship with the `datasets` package; the `metrics` package on PyPI is an unrelated tool and is not needed below.
!pip install metrics datasets

Collecting metrics
  Downloading metrics-0.3.3.tar.gz (18 kB)
Collecting datasets
  Downloading datasets-1.18.3-py3-none-any.whl (311 kB)
Collecting Pygments==2.2.0
  Downloading Pygments-2.2.0-py2.py3-none-any.whl (841 kB)
Collecting pathspec==0.5.5
  Downloading pathspec-0.5.5.tar.gz (21 kB)
Collecting pathlib2>=2.3.0
  Downloading pathlib2-2.3.7.post1-py2.py3-none-any.whl (18 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.2.0-py3-none-any.whl (134 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 M

In [27]:
from datasets import load_metric

metric = load_metric('accuracy')

In [28]:
y_pred = preds.predictions.argmax(-1)
metric.add_batch(predictions=y_pred, references=test_y)
print(metric.compute())

{'accuracy': 0.738191632928475}


In [29]:
from datasets import load_dataset

dataset = load_dataset('scitldr')

No config specified, defaulting to: scitldr/Abstract


Downloading and preparing dataset scitldr/Abstract (download: 5.23 MiB, generated: 4.58 MiB, post-processed: Unknown size, total: 9.81 MiB) to /root/.cache/huggingface/datasets/scitldr/Abstract/0.0.0/79e0fa75961392034484808cfcc8f37deb15ceda153b798c92d9f621d1042fef...


Dataset scitldr downloaded and prepared to /root/.cache/huggingface/datasets/scitldr/Abstract/0.0.0/79e0fa75961392034484808cfcc8f37deb15ceda153b798c92d9f621d1042fef. Subsequent calls will reuse this data.


In [30]:
print(dataset["train"][0])
print("Source text:", dataset["train"][0]["source"])
print("Target text:", dataset["train"][0]["target"])

{'source': ['Due to the success of deep learning to solving a variety of challenging machine learning tasks, there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect.', 'Particularly, the properties of critical points and the landscape around them are of importance to determine the convergence performance of optimization algorithms.', 'In this paper, we provide a necessary and sufficient characterization of the analytical forms for the critical points (as well as global minimizers) of the square loss functions for linear neural networks.', 'We show that the analytical forms of the critical points characterize the values of the corresponding loss functions as well as the necessary and sufficient conditions to achieve global minimum.', 'Furthermore, we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural networks and shallow ReLU networks.', 'One partic

In [37]:
max_input_length = 512
max_output_length = 64

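def process_function(examples):
  # `source` is a list of sentences: join them into a single input string
  inputs = ' '.join(examples['source'])
  # `target` holds the reference summary text(s) of the example
  targets = examples['target']
  # Tokenize the input and the target, each padded/truncated to its own maximum length
  model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")
  labels = tokenizer(targets, max_length=max_output_length, truncation=True, padding='max_length')
  # Store the token ids of the target as the labels
  model_inputs['labels'] = labels['input_ids']
  return model_inputs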

In [38]:
dataset = dataset.map(process_function)

In [39]:
columns_to_return = ['input_ids', 'labels', 'attention_mask']
dataset.set_format(type='torch', columns=columns_to_return)
# Once a format is set with Dataset.set_format(), rows are returned in that format (here: torch tensors, restricted to the listed columns).

In [47]:
#print(dataset["train"][0])
print("Source text:", dataset["train"][0]["input_ids"])
print("Target text:", dataset["train"][0]["labels"])

Source text: tensor([    0, 28084,     7,     5,  1282,     9,  1844,  2239,     7, 15582,
           10,  3143,     9,  4087,  3563,  2239,  8558,     6,    89,    16,
           10,  2227,   773,    11,  2969,   872,  8047,    13,  1058, 26739,
         4836,    31,    10, 26534,  6659,     4, 36863,     6,     5,  3611,
            9,  2008,   332,     8,     5,  5252,   198,   106,    32,     9,
         3585,     7,  3094,     5, 33345,   819,     9, 25212, 16964,     4,
           96,    42,  2225,     6,    52,   694,    10,  2139,     8,  7719,
        34934,     9,     5, 23554,  4620,    13,     5,  2008,   332,    36,
          281,   157,    25,   720, 15970, 11574,    43,     9,     5,  3925,
          872,  8047,    13, 26956, 26739,  4836,     4,   166,   311,    14,
            5, 23554,  4620,     9,     5,  2008,   332, 33776,     5,  3266,
            9,     5, 12337,   872,  8047,    25,   157,    25,     5,  2139,
            8,  7719,  1274,     7,  3042,   720,  