
In [None]:
! pip install transformers

AutoClasses let us create tokenizer (and model) objects without instantiating the model-specific tokenizer (or model) class directly: the right class is selected from the checkpoint passed to `from_pretrained`.

In [3]:
from transformers import AutoTokenizer

tknzr = AutoTokenizer.from_pretrained("bert-base-cased") 
tokens = tknzr("I'm learning DNLP")
# The output contains the input_ids of our sentence, the token_type_ids (which segment,
# i.e. first or second sentence, each token belongs to) and the attention_mask.
print(tokens)

{'input_ids': [101, 146, 112, 182, 3776, 141, 20734, 2101, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
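
As a rough illustration of what `token_type_ids` encode, the sketch below builds them by hand for a sentence pair, following the BERT layout `[CLS] A [SEP] B [SEP]`. The helper `build_pair_inputs` and the split of the ids into two segments are assumptions made for this example, not part of the `transformers` API. With a single sentence, as in the output above, every entry is 0.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids visible in the output above


def build_pair_inputs(ids_a, ids_b):
    """Hypothetical helper: lay out two segments as [CLS] A [SEP] B [SEP]."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], A and the first [SEP];
    # segment 1 covers B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # no padding here, so all ones
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Illustrative ids only: the split into two segments is arbitrary.
pair = build_pair_inputs([146, 112, 182, 3776], [141, 20734, 2101])
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

With the real tokenizer, a pair is passed as two arguments, e.g. `tknzr("first sentence", "second sentence")`.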


Each model has a maximum number of tokens it can process in one sequence. Since sentences usually have different lengths, the tokenizer offers three parameters to handle this:

- `max_length` sets the maximum number of tokens per sequence
- `truncation` enables truncation of sentences that exceed `max_length`
- `padding` enables padding; with `padding='max_length'`, shorter sentences are padded up to `max_length`

The tokenizer also returns the `attention_mask`, which lets the model compute attention weights only over real tokens (1) and not over padding (0).

In [5]:
tokens = tknzr("I'm learning Deep NLP", padding='max_length', max_length=16) 
print(tokens)

{'input_ids': [101, 146, 112, 182, 3776, 7786, 21239, 2101, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
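
To see why the zeros in `attention_mask` matter, here is a minimal sketch of a masked softmax in plain Python. It is one common way to apply such a mask (masked scores are pushed to minus infinity before the softmax), not the exact code used inside BERT. The function name `masked_softmax` and the raw scores are made up for the example, and the sketch assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over `scores`, ignoring positions whose mask value is 0."""
    # Masked positions get -inf, so exp(-inf) == 0 and they receive no weight.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Mask taken from the padded output above: 9 real tokens, 7 padding tokens.
attention_mask = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.5] * 16  # made-up raw attention scores, identical for simplicity

weights = masked_softmax(scores, attention_mask)
print([round(w, 3) for w in weights])
```

The nine real tokens share the whole attention weight equally, while the seven padding positions get exactly zero.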


In [6]:
tokens = tknzr("I'm learning Deep NLP at Politecnico di Torino. I'm a 2nd year master student", padding='max_length', max_length=16, truncation=True) 
print(tokens)

# [CLS] special token for encoder model, used for classification/regression tasks
# [SEP] special token to separate multiple sentences
# [PAD] special token for padding

{'input_ids': [101, 146, 112, 182, 3776, 7786, 21239, 2101, 1120, 17129, 3150, 1665, 7770, 1186, 4267, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
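
The two outputs above suggest the following logic: the sentence body is cut so that `[CLS]` and `[SEP]` still fit, and any remaining slots are filled with `[PAD]` (id 0). The sketch below reproduces that behaviour in plain Python. `pad_or_truncate` is a hypothetical helper written for this example and is not how the library implements it internally.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids seen in the outputs above


def pad_or_truncate(body_ids, max_length):
    """Hypothetical helper: fit [CLS] body [SEP] into exactly max_length ids."""
    # Reserve two slots for [CLS] and [SEP], then cut the body if it is too long.
    body = body_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remaining slots with padding, masked out with zeros.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Body ids of "I'm learning Deep NLP", without the special tokens.
body_ids = [146, 112, 182, 3776, 7786, 21239, 2101]

padded = pad_or_truncate(body_ids, max_length=16)
truncated = pad_or_truncate(body_ids, max_length=6)
print(padded["input_ids"])
print(truncated["input_ids"])  # [101, 146, 112, 182, 3776, 102]
```

Note that truncation keeps `[SEP]` as the last id, which matches the truncated output above ending in 102.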


Tokenizers can also perform the opposite conversion: from the IDs we can reconstruct the sentence.

In [7]:
text = tknzr.decode(tokens.input_ids)
print(text)

[CLS] I'm learning Deep NLP at Politecnico di [SEP]
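
Conceptually, decoding maps each id back to its token and then glues sub-word pieces together. The sketch below shows the idea with a toy vocabulary. The ids, the tokens, and the `toy_decode` helper are all invented for this example (the real `bert-base-cased` vocabulary is much larger and uses different ids); only the `##` prefix for word continuations follows the WordPiece convention used by BERT.

```python
# Toy vocabulary, invented for illustration only.
toy_vocab = {
    0: "[PAD]", 1: "[CLS]", 2: "[SEP]",
    3: "I", 4: "learn", 5: "##ing", 6: "NLP",
}
SPECIAL_TOKENS = {"[PAD]", "[CLS]", "[SEP]"}


def toy_decode(ids, skip_special_tokens=False):
    """Hypothetical decoder: ids -> tokens -> text, merging '##' pieces."""
    words = []
    for token_id in ids:
        token = toy_vocab[token_id]
        if skip_special_tokens and token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: attach to previous word
        else:
            words.append(token)
    return " ".join(words)


ids = [1, 3, 4, 5, 6, 2, 0, 0]
print(toy_decode(ids))                            # [CLS] I learning NLP [SEP] [PAD] [PAD]
print(toy_decode(ids, skip_special_tokens=True))  # I learning NLP
```

The real `decode` method accepts a `skip_special_tokens` argument with the same purpose.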


The AutoModel class takes care of instantiating the correct class for the model we want to use.

Since task-specific models share the same backbone architecture (e.g., BERT can be used for both sequence classification and token-level classification), the AutoModel class should be instantiated with the task appended to its name (e.g., AutoModelForSequenceClassification).

In [8]:
from transformers import AutoModelForSequenceClassification

bert_model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

However, the pre-trained BERT model is not fine-tuned for any specific task: its classification head is newly initialized, which is the reason behind the warning. To use this model, we first need to fine-tune it (or we can use another model already fine-tuned for the task).

In [9]:
from transformers import AutoModelForSequenceClassification

bert_model_sc = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

In [13]:
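import numpy as np

sentences = ["Google stocks went up suddenly, I earned 30B$"]
# NOTE: tknzr was loaded from "bert-base-cased"; in general the tokenizer should be loaded
# from the same checkpoint as the model (here "ProsusAI/finbert") so that the vocabularies match.
tokenized_sentence = tknzr(sentences, return_tensors='pt', padding='max_length', truncation=True, max_length=16)
pred = bert_model_sc(**tokenized_sentence)  # unpack input_ids, token_type_ids and attention_mask as keyword arguments
logits = pred[0][0].detach().numpy()  # pred[0] holds the logits; take the first (and only) sentence
print(logits, np.argmax(logits))  # raw class scores and the index of the highest one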

[-0.12607424 -0.9667538   1.8209392 ] 2


In [16]:
pred[0]

tensor([[-0.1261, -0.9668,  1.8209]], grad_fn=<AddmmBackward0>)
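
The values in `pred[0]` are raw logits, not probabilities. A minimal sketch of how they could be turned into class probabilities with a softmax is shown below, in plain Python and using the numbers printed above. The `softmax` helper is written for this example, and mapping the class index to a label name is left out because it depends on the model configuration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-0.12607424, -0.9667538, 1.8209392]  # values printed above
probs = softmax(logits)
# Index of the most likely class; softmax does not change the argmax.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```

For Hugging Face models, the label names are usually stored in the model configuration (e.g. `bert_model_sc.config.id2label`), which can be used to translate the predicted index into a readable label.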

The `pred` object is a `SequenceClassifierOutput`. As we can see in the documentation of that class below, it has an optional `loss`, a `logits`, an optional `hidden_states` and an optional `attentions` attribute.