<a href="https://colab.research.google.com/github/aydawudu/Transformers_Practice/blob/main/Model_and_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1


In [2]:
from transformers import AutoTokenizer

In [4]:
checkpoint="bert-base-uncased"
tokenizer=AutoTokenizer.from_pretrained(checkpoint)

In [5]:
tokenizer

PreTrainedTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [6]:
tokenizer("hello world")

{'input_ids': [101, 7592, 2088, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

In [9]:
# Let's see how the tokenizer works. Step 1: it splits the string into tokens.
tokens=tokenizer.tokenize("hello world")
tokens

['hello', 'world']

In [10]:
# Step 2: it maps the tokens to integer IDs.
ids=tokenizer.convert_tokens_to_ids(tokens)
ids

[7592, 2088]

In [11]:
# Let's convert the integer IDs back into tokens.
tokenizer.convert_ids_to_tokens(ids)


['hello', 'world']

In [12]:
# decode() converts the IDs back into tokens and joins them into a single string.
tokenizer.decode(ids)

'hello world'

In [14]:
# Let's go the opposite way: encode() turns a string into IDs in one call.
ids=tokenizer.encode("hello world")
ids

[101, 7592, 2088, 102]

In [16]:
tokenizer.convert_ids_to_tokens(ids)  # encode() adds BERT's special tokens [CLS] and [SEP]

['[CLS]', 'hello', 'world', '[SEP]']

In [17]:
tokenizer.decode(ids)  

'[CLS] hello world [SEP]'
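
The round trip above can be imitated with a toy lookup table. This is only a sketch and not how the library is implemented: BERT's real tokenizer uses WordPiece subword splitting and a 30,522-entry vocabulary. The names `TOY_VOCAB`, `toy_tokenize`, `toy_encode`, and `toy_decode` are hypothetical, and the table holds only the four IDs that appear in the outputs above.

```python
# Toy vocabulary: only the four entries observed in the notebook outputs above.
TOY_VOCAB = {"[CLS]": 101, "[SEP]": 102, "hello": 7592, "world": 2088}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}

def toy_tokenize(text):
    """Lowercase and split on whitespace (a crude stand-in for WordPiece)."""
    return text.lower().split()

def toy_encode(text, add_special_tokens=True):
    """Map tokens to IDs, optionally wrapping them in [CLS] ... [SEP]."""
    ids = [TOY_VOCAB[tok] for tok in toy_tokenize(text)]
    if add_special_tokens:
        ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return ids

def toy_decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)

print(toy_tokenize("hello world"))
print(toy_encode("hello world", add_special_tokens=False))
print(toy_encode("hello world"))
print(toy_decode(toy_encode("hello world")))
```

The point of the sketch is the order of operations: split, look up IDs, then add the special tokens, with decoding as the reverse lookup.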

In [18]:
# Let's tokenize the "hello world" string to produce inputs for our model.
model_inputs=tokenizer("hello world")
model_inputs

{'input_ids': [101, 7592, 2088, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

In [20]:
# Let's try the tokenizer on a list of sentences.
data=[
    "I like dogs",
    "Do you like cats too?",
]

tokenizer(data)

{'input_ids': [[101, 1045, 2066, 6077, 102], [101, 2079, 2017, 2066, 8870, 2205, 1029, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

In [21]:
# Import from the package's top level rather than from an internal module path.
from transformers import AutoModelForSequenceClassification
model=AutoModelForSequenceClassification.from_pretrained(checkpoint)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [23]:
# Let's pass our model inputs into the model.
# This is a PyTorch model, so it expects torch tensors. model_inputs currently holds plain Python lists, so the call fails.
outputs=model(**model_inputs)

AttributeError: ignored

In [24]:
# Create the model inputs as torch tensors.
model_inputs=tokenizer("hello world", return_tensors='pt')
model_inputs

{'input_ids': tensor([[ 101, 7592, 2088,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

In [25]:
# The top (classification) layer has not been trained, so these logits are meaningless.
# By default the library assumes we want a binary classifier, hence two logits.
outputs=model(**model_inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-0.2141, -0.1220]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [26]:
# Let's create a model with 3 outputs instead of 2 by specifying num_labels.
model=AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Again, the warning tells us that the final layer of the model still needs to be trained.

In [27]:
# Let's pass our inputs into the new model.
outputs=model(**model_inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[0.3547, 0.1337, 0.4228]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Now we have three logits instead of two.

In [28]:
outputs.logits

tensor([[0.3547, 0.1337, 0.4228]], grad_fn=<AddmmBackward0>)

In [29]:
outputs['logits']

tensor([[0.3547, 0.1337, 0.4228]], grad_fn=<AddmmBackward0>)

In [32]:
outputs[0]

tensor([[0.3547, 0.1337, 0.4228]], grad_fn=<AddmmBackward0>)
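
All three access styles return the same tensor, including `outputs[0]`, even though `loss` is listed first in the printed output. To my understanding, integer indexing on the library's output objects skips fields that are `None`. The dataclass below is a toy imitation of that behavior, not the real `SequenceClassifierOutput`; `ToyOutput` and `_present` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class ToyOutput:
    loss: Optional[float] = None
    logits: Optional[List[List[float]]] = None

    def _present(self):
        # Keep only the fields that were actually populated, in declaration order.
        return [(f.name, getattr(self, f.name))
                for f in fields(self)
                if getattr(self, f.name) is not None]

    def __getitem__(self, key):
        if isinstance(key, str):
            return dict(self._present())[key]  # like outputs['logits']
        return self._present()[key][1]         # like outputs[0]

toy = ToyOutput(logits=[[0.3547, 0.1337, 0.4228]])
print(toy.logits == toy["logits"] == toy[0])

# With a loss present, index 0 would refer to the loss instead.
toy_with_loss = ToyOutput(loss=1.25, logits=[[0.0, 0.0, 0.0]])
print(toy_with_loss[0], toy_with_loss[1])
```

Because the meaning of an integer index depends on which fields are populated, `outputs.logits` is the safest of the three styles.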

In [34]:
# Convert the logits into a NumPy array (useful when computing evaluation metrics).
outputs.logits.detach().cpu().numpy()

array([[0.3546747 , 0.13372748, 0.4227861 ]], dtype=float32)
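
The notebook stops at raw logits. A common next step when computing metrics (an assumption here, not something shown above) is to turn each row of logits into probabilities with a softmax and take the index of the largest one as the predicted class. The sketch below does this in plain Python on the values printed above; `softmax` is a helper defined here, and since the classification head is untrained, the resulting prediction is meaningless.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Values copied from the output above (rounded as printed).
logits = [0.3547, 0.1337, 0.4228]

probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

print(probs)
print(predicted)
```

Softmax preserves the ordering of the logits, so the predicted class is the same as the argmax of the logits themselves.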

In [35]:
# Let's try another example with multiple sentences.
data=[
    "I like dogs",
    "Do you like cats too?",
]

# Incorrect: the sentences tokenize to different lengths, so they cannot be stacked into a single tensor.
model_inputs=tokenizer(data, return_tensors='pt')
model_inputs

ValueError: ignored

In [36]:
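# Correct: pad the shorter sentence to the longest one (and truncate anything longer than the model's maximum length).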
model_inputs=tokenizer(data,
                       padding=True,
                       truncation=True,
                       return_tensors='pt') 
model_inputs

{'input_ids': tensor([[ 101, 1045, 2066, 6077,  102,    0,    0,    0],
        [ 101, 2079, 2017, 2066, 8870, 2205, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}

In [37]:
model_inputs['input_ids']

tensor([[ 101, 1045, 2066, 6077,  102,    0,    0,    0],
        [ 101, 2079, 2017, 2066, 8870, 2205, 1029,  102]])

In [38]:
model_inputs['attention_mask']

tensor([[1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])
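
The padded `input_ids` and the `attention_mask` above follow a simple rule: pad every sequence on the right with the pad ID (0 in the output above) up to the longest length, and mark real tokens with 1 and padding with 0. The sketch below reproduces that rule by hand in plain Python. `pad_batch` is a hypothetical helper, not a library function, and it ignores truncation.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Unpadded IDs for "I like dogs" and "Do you like cats too?", copied from the earlier output.
batch = [
    [101, 1045, 2066, 6077, 102],
    [101, 2079, 2017, 2066, 8870, 2205, 1029, 102],
]

ids, mask = pad_batch(batch)
for row in ids:
    print(row)
for row in mask:
    print(row)
```

Once every row has the same length, the batch is rectangular and can be converted into a single tensor, which is what the unpadded call could not do.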

In [39]:
outputs=model(**model_inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.3494, -0.0104,  0.4083],
        [ 0.2775,  0.1715,  0.5198]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)