## Notebook Overview

This notebook explores three popular transformer models from the Hugging Face Transformers library: GPT-2, BERT, and T5. It demonstrates basic usage of each model on a different natural language processing task:

-   **GPT-2 (Generative Pre-trained Transformer 2):** Used for text generation, predicting the next token in a sequence.
-   **BERT (Bidirectional Encoder Representations from Transformers):** Used for masked language modeling, predicting a masked word in a sentence based on the surrounding context.
-   **T5 (Text-to-Text Transfer Transformer):** Used for sequence-to-sequence tasks: it reads an input sequence and produces an output sequence (here, a simplified example using the `summarize:` prefix).

## How the Models Differ

While all three models are based on the transformer architecture, they are pre-trained on different tasks and designed for different purposes:

-   **GPT-2:** A **decoder-only** model pre-trained with a language modeling objective (predicting the next token from the tokens before it). It is well suited to generating coherent, contextually relevant text.
-   **BERT:** An **encoder-only** model pre-trained on masked language modeling and next sentence prediction. Because it reads context on both sides of each token, it is best suited to understanding tasks such as classification, named entity recognition, and extractive question answering.
-   **T5:** An **encoder-decoder** model pre-trained within a text-to-text framework, where every NLP task is cast as mapping input text to output text. This lets one model handle summarization, translation, question answering, and more by changing the task prefix in the input.

In essence:
- **GPT-2 generates text.**
- **BERT understands text (by filling in blanks).**
- **T5 transforms text from one form to another.**
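
Much of this split comes down to which positions each token is allowed to attend to. The following is a minimal sketch of the two attention patterns in PyTorch; the sequence length and variable names are chosen only for illustration and do not come from the models' own code.

```python
import torch

seq_len = 4  # toy length, chosen only for illustration

# Decoder-only (GPT-2 style): position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Encoder-only (BERT style): every position may attend to every position.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())
print(bidirectional_mask.int())
```

An encoder-decoder model such as T5 combines both patterns: the encoder attends bidirectionally over the input, while the decoder attends causally over its own output and additionally attends to all encoder positions.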

In [1]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [2]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')


In [3]:
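# Tokenize the prompt; passing labels makes the model also return a language-modeling loss.
inputs = tokenizer("India is the fastest growing", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss      # mean cross-entropy over next-token predictions
logits = outputs.logits  # shape: (batch, sequence length, vocabulary size)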


In [4]:
inputs

{'input_ids': tensor([[21569,   318,   262, 14162,  3957]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In [5]:
tokenizer.decode(inputs['input_ids'].tolist()[0])

'India is the fastest growing'

In [6]:
loss

tensor(2.7621, grad_fn=<NllLossBackward0>)
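
This loss is a mean cross-entropy over next-token predictions: when `labels` equal the inputs, a causal LM head scores the logits at position i against the token at position i + 1. The sketch below reproduces that shift on random toy logits. The vocabulary size and token ids are made up, so the printed numbers will not match the 2.7621 above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10                               # toy vocabulary, not GPT-2's 50257
input_ids = torch.tensor([[3, 1, 4, 1, 5]])   # made-up token ids
logits = torch.randn(1, 5, vocab_size)        # stand-in for model output

# Drop the last position's logits and the first label so that
# logits[:, i] is scored against input_ids[:, i + 1].
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
perplexity = torch.exp(loss)
print(loss.item(), perplexity.item())
```

With five input tokens, four predictions are averaged. Exponentiating the loss gives perplexity, so the notebook's 2.7621 corresponds to a perplexity of roughly 15.8 on this prompt.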

In [7]:
logits.shape

torch.Size([1, 5, 50257])

In [8]:
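# Logits at the last position score every vocabulary entry as the next token.
last_token_logits = logits[0, -1, :]

# Greedy choice: take the highest-scoring token id and decode it to text.
predicted_token_id = torch.argmax(last_token_logits)
predicted_word = tokenizer.decode(predicted_token_id)

print(f"The predicted next word is: {predicted_word}")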

The predicted next word is:  country
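
`argmax` keeps only the single most likely token. To see the runner-up candidates as well, one option is to convert the logits to probabilities with softmax and take the top k. The sketch below uses a made-up five-entry vocabulary; the words and scores are illustrative and are not GPT-2 outputs.

```python
import torch

# Hypothetical candidates and scores, for illustration only.
toy_vocab = [" country", " economy", " market", " city", " nation"]
toy_logits = torch.tensor([2.0, 1.5, 0.5, -1.0, 0.0])

probs = torch.softmax(toy_logits, dim=-1)    # normalize scores to probabilities
top_probs, top_ids = torch.topk(probs, k=3)  # three most likely entries

for p, i in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{toy_vocab[i]!r}: {p:.3f}")
```

In the notebook, the same two calls could be applied to `last_token_logits`, decoding each returned id with `tokenizer.decode`.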


In [9]:
from transformers import BertTokenizer, BertForMaskedLM

In [10]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
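input_text = "Masked language [MASK] is used for pretraining."
inputs = tokenizer(input_text, return_tensors="pt")
# Caveat: the labels below are the masked input itself, so the target at the
# [MASK] position is [MASK], not the hidden word. The loss only illustrates the API.
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
logits = outputs.logits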

In [12]:
loss

tensor(4.9433, grad_fn=<NllLossBackward0>)
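
For a masked-LM loss that scores only the hidden word, the usual PyTorch convention, which the Hugging Face masked-LM heads also follow, is to set the label to -100 at every position that should be ignored. The sketch below uses random toy logits and made-up token ids. `mask_token_id = 103` matches the notebook; `original_id` and the vocabulary size are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 200                                   # toy vocabulary size
mask_token_id = 103                                # [MASK] id, as in the notebook
input_ids = torch.tensor([[101, 7, 103, 9, 102]])  # made-up ids with one [MASK]
original_id = 42                                   # hypothetical id of the hidden word

# Keep a real label only where the input holds [MASK]; ignore everything else.
labels = torch.full_like(input_ids, -100)
labels[input_ids == mask_token_id] = original_id

logits = torch.randn(1, input_ids.shape[1], vocab_size)  # stand-in for model output
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    labels.view(-1),
    ignore_index=-100,  # positions labeled -100 contribute nothing to the loss
)
print(labels, loss.item())
```

With labels built this way, the loss reflects only how well the model recovers the masked word, rather than how well it reproduces the visible tokens.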

In [13]:
inputs

{'input_ids': tensor([[  101, 16520,  2653,   103,  2003,  2109,  2005,  3653, 23654,  2075,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [14]:
logits

tensor([[[ -6.7650,  -6.7192,  -6.6993,  ...,  -6.0738,  -5.8606,  -4.3139],
         [ -7.8068,  -8.2320,  -8.0335,  ...,  -6.9457,  -6.9095,  -6.3418],
         [ -9.7934,  -9.8951,  -9.7592,  ...,  -8.0059,  -8.1619,  -8.2921],
         ...,
         [-12.3430, -12.7044, -12.4538,  ...,  -9.8909,  -9.1166,  -8.7265],
         [-11.3576, -10.9756, -11.5494,  ...,  -9.3977, -10.7534,  -6.8493],
         [-11.9643, -11.5468, -11.7930,  ...,  -8.9498,  -9.5833,  -8.1403]]],
       grad_fn=<ViewBackward0>)

In [15]:
tokenizer.mask_token_id

103

In [16]:
masked_index = inputs['input_ids'][0].tolist().index(tokenizer.mask_token_id)
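
`list.index` returns only the first match. If a sentence contained several `[MASK]` tokens, a tensor comparison would find all of them. The sketch below uses the `input_ids` printed earlier, copied in by hand so that it runs without the tokenizer.

```python
import torch

mask_token_id = 103
input_ids = torch.tensor([[101, 16520, 2653, 103, 2003, 2109, 2005,
                           3653, 23654, 2075, 1012, 102]])

# Boolean comparison, then the indices of every True entry in the first row.
mask_positions = (input_ids[0] == mask_token_id).nonzero(as_tuple=True)[0]
print(mask_positions.tolist())  # [3]
```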

In [17]:
logits.shape

torch.Size([1, 12, 30522])

In [18]:
masked_token_logits = logits[0, masked_index, :]

In [19]:
predicted_token_id = torch.argmax(masked_token_logits)

In [20]:
predicted_word = tokenizer.decode(predicted_token_id)

In [21]:
print(f"The predicted word for the [MASK] token is: {predicted_word}")

The predicted word for the [MASK] token is: training


In [22]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [23]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')


In [24]:
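# The "summarize: " prefix tells T5 which task to perform.
input_text = "summarize: PyTorch enables easy model development."
inputs = tokenizer(input_text, return_tensors="pt")
# Caveat: using the input ids as labels asks the model to reproduce its own input,
# so this loss measures copying rather than summarization quality.
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
logits = outputs.logits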

In [25]:
inputs

{'input_ids': tensor([[21603,    10, 12901,   382,   127,   524,     3,  7161,   514,   825,
           606,     5,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [26]:
loss

tensor(1.7242, grad_fn=<NllLossBackward0>)

In [27]:
logits.shape

torch.Size([1, 13, 32128])

In [28]:
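# Most likely token at each decoder position. These predictions are teacher-forced
# (conditioned on the labels), so they are not free-running generated text.
predicted_token_ids = torch.argmax(logits, dim=-1)

decoded_output = tokenizer.decode(predicted_token_ids[0], skip_special_tokens=True)

print(f"The decoded output is: {decoded_output}")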

The decoded output is: Py  Torch enables easy model development
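
The decoded output largely mirrors the input because the forward pass is teacher-forced. When `labels` are supplied, T5 builds the decoder input by shifting the labels one step to the right, so at each step the decoder sees the correct previous tokens and is scored on the next one. Since the labels here are the input sentence itself, the per-position argmax mostly echoes it. The sketch below shows that shift on the first five ids printed earlier. `shift_right` is a hypothetical helper written for illustration, and it assumes a decoder start id of 0 (T5's pad token).

```python
import torch

def shift_right(labels: torch.Tensor, start_id: int = 0) -> torch.Tensor:
    """Prepend start_id and drop the last token (hypothetical helper)."""
    start = torch.full((labels.shape[0], 1), start_id, dtype=labels.dtype)
    return torch.cat([start, labels[:, :-1]], dim=1)

labels = torch.tensor([[21603, 10, 12901, 382, 127]])  # first five ids from the notebook
decoder_input_ids = shift_right(labels)
print(decoder_input_ids.tolist())  # [[0, 21603, 10, 12901, 382]]
```

Free-running decoding, where the decoder consumes its own predictions instead of the labels, would go through `model.generate`, with the result passed to `tokenizer.decode`.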
