<a href="https://colab.research.google.com/github/ua-deti-information-retrieval/Neural-IR-hands-on/blob/main/RI_practical_tutorial_4_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install torch transformers



# Hugging Face (the hub)

🤗 Hugging Face 🤗 is a company known for its contributions to the field of natural language processing (NLP). Founded in 2016, it develops and provides open-source tools and technologies for machine learning, particularly for NLP applications.

The most notable contribution of Hugging Face is its development of the "Transformers" library. This library is a comprehensive collection of pre-trained models and tools designed for NLP tasks. It is built on deep learning frameworks like PyTorch and TensorFlow and is well-known for its user-friendly interface and flexibility.

The Transformers library includes a range of state-of-the-art models like BERT, GPT, T5, and others, which have been pre-trained on vast amounts of text data. These models can be easily adapted or fine-tuned for various NLP tasks such as text classification, question-answering, summarization, and language translation.

Currently, the Hugging Face ecosystem is vast. The following list shows its most popular services and libraries:

1. **Transformers Library**: A comprehensive library of pre-trained models for NLP tasks. [Link](https://huggingface.co/transformers/)

2. **Accelerate**: Simplifies running machine learning models on any hardware configuration. [Link](https://huggingface.co/docs/accelerate/)

3. **Optimum**: Optimization tools for improving performance and efficiency of machine learning models. [Link](https://huggingface.co/docs/optimum/index)

4. **PEFT**: Parameter-Efficient Fine-Tuning, a library of methods (such as LoRA) that adapt large pre-trained models by training only a small number of additional parameters. [Link](https://huggingface.co/docs/peft/index)

5. **Tokenizers**: Efficient and versatile tokenization for NLP preprocessing. [Link](https://huggingface.co/docs/tokenizers/index)

6. **Datasets**: A library for easy access and sharing of NLP datasets. [Link](https://huggingface.co/docs/datasets/index)

7. **Hugging Face Hub**: An online platform for sharing and collaborating on models and datasets. [Link](https://huggingface.co/docs/hub/index)

8. **Inference API**: API for running machine learning models hosted on the Hugging Face Hub. [Link](https://huggingface.co/docs/api-inference/index)

9. **AutoTrain**: Service for automating the training, evaluation, and deployment of NLP models. [Link](https://huggingface.co/docs/autotrain/index)

10. **Spaces**: Allows users to create and share machine learning applications. [Link](https://huggingface.co/spaces)


# Transformers

## Studying contextual word embeddings



In [None]:
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification
import torch
# Auto classes are "smart" classes used to load pre-trained models: given a
# checkpoint name, the Auto class automatically figures out which concrete
# model or tokenizer class should be used.
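
As a rough mental model (not the library's actual implementation), an Auto class reads the `model_type` field stored in the checkpoint's configuration and uses it to pick a concrete class. The sketch below mimics that lookup with a plain dictionary. `ToyBertModel`, `ToyT5Model`, `TOY_REGISTRY` and `toy_auto_model` are hypothetical names invented for this illustration.

```python
# Toy illustration of config-driven dispatch; every name here is hypothetical.
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyT5Model:
    def __init__(self, config):
        self.config = config


# Maps the "model_type" string found in a config to a concrete class.
TOY_REGISTRY = {"bert": ToyBertModel, "t5": ToyT5Model}


def toy_auto_model(config):
    """Pick the class registered for config["model_type"] and instantiate it."""
    model_type = config["model_type"]
    if model_type not in TOY_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return TOY_REGISTRY[model_type](config)


model = toy_auto_model({"model_type": "bert", "hidden_size": 768})
print(type(model).__name__)  # ToyBertModel
```

The caller never names `ToyBertModel` directly, which is the convenience the Auto classes provide.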

In [None]:
bert_checkpoint = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(bert_checkpoint)
bert_model = AutoModel.from_pretrained(bert_checkpoint).to("cuda")

In [None]:
bert_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [None]:
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [None]:
tokens_data = tokenizer(["I am opening a bank account",
                       "I am sitting on the river bank"], return_tensors="pt", padding=True).to("cuda")
tokens_data

{'input_ids': tensor([[ 101, 1045, 2572, 3098, 1037, 2924, 4070,  102,    0],
        [ 101, 1045, 2572, 3564, 2006, 1996, 2314, 2924,  102]],
       device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [None]:
decoded_tokens = "\n".join([" | ".join([f"{tokenizer.decode(token_id)} ({i})" for i,token_id in enumerate(_ids) ])
                  for _ids in tokens_data.input_ids])
print(decoded_tokens)

[CLS] (0) | i (1) | am (2) | opening (3) | a (4) | bank (5) | account (6) | [SEP] (7) | [PAD] (8)
[CLS] (0) | i (1) | am (2) | sitting (3) | on (4) | the (5) | river (6) | bank (7) | [SEP] (8)
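
The `[PAD]` token in the first sentence, and the trailing `0` in its `attention_mask`, come from `padding=True`: shorter sequences are padded on the right up to the length of the longest one in the batch, and the mask marks which positions hold real tokens. The sketch below reproduces that behaviour in plain Python. `pad_batch` is a hypothetical helper, and the pad id `0` is taken from the tokenizer printout above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token ids of the two sentences, copied from the tokenizer output above (padding removed).
batch = [
    [101, 1045, 2572, 3098, 1037, 2924, 4070, 102],
    [101, 1045, 2572, 3564, 2006, 1996, 2314, 2924, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[0])       # [101, 1045, 2572, 3098, 1037, 2924, 4070, 102, 0]
print(attention_mask[0])  # [1, 1, 1, 1, 1, 1, 1, 1, 0]
```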


In [None]:
out  = bert_model(**tokens_data)
print(out.last_hidden_state.shape)
print(out.pooler_output.shape)

torch.Size([2, 9, 768])
torch.Size([2, 768])


In [None]:
bank_in_first_sentence = out.last_hidden_state[0,5]
bank_in_second_sentence = out.last_hidden_state[1,7]

In [None]:
torch.nn.functional.cosine_similarity(bank_in_first_sentence, bank_in_second_sentence, dim=0)

tensor(0.3630, device='cuda:0', grad_fn=<SumBackward1>)

In [None]:
tokens_data = tokenizer(["I am sitting on the river side"], return_tensors="pt", padding=True).to("cuda")
side_embedding = bert_model(**tokens_data).last_hidden_state[0,7]

In [None]:
# Given the context, the word "side" here is much closer to the word "bank"

torch.nn.functional.cosine_similarity(side_embedding, bank_in_second_sentence, dim=0)

tensor(0.8841, device='cuda:0', grad_fn=<SumBackward1>)
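
The cosine similarity used above measures the angle between two vectors: the dot product divided by the product of their norms. A minimal plain-Python sketch of the formula follows. The 3-dimensional vectors are made-up values standing in for the 768-dimensional BERT embeddings, and the sketch omits the small epsilon that the PyTorch function uses to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (made-up values, not real embeddings).
river_bank = [1.0, 2.0, 0.0]
river_side = [2.0, 4.0, 0.0]  # same direction as river_bank
money_bank = [0.0, 0.0, 3.0]  # orthogonal to river_bank

print(cosine_similarity(river_bank, river_side))  # approximately 1.0
print(cosine_similarity(river_bank, money_bank))  # 0.0
```

Because the result depends only on direction, not magnitude, it is a convenient way to compare embeddings whose norms differ.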

## Creating an interaction-based transformer model

### Easy way (using the Transformers API)

This is the quickest option, but it leaves less freedom to customize the architecture.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:
bert_model_IR = AutoModelForSequenceClassification.from_pretrained(bert_checkpoint, num_labels=2)
bert_model_IR

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

### Medium way (extending the Transformers API)

This is the recommended approach.

In [None]:
from transformers import PreTrainedModel, BertModel, AutoConfig

# Create a class that holds the model, similar to torch.nn.Module.
# Internally, PreTrainedModel extends torch.nn.Module.

class MyReranker(PreTrainedModel):

  def __init__(self, checkpoint):
    config = AutoConfig.from_pretrained(checkpoint)
    super().__init__(config)
    # Note: you could also extend BertModel instead of wrapping it here.
    self.bert = BertModel.from_pretrained(config._name_or_path, config=config)
    self.dropout = torch.nn.Dropout(p=0.1)
    # Read the encoder width from the config (768 for bert-base-uncased).
    self.linear1 = torch.nn.Linear(config.hidden_size, 1024)
    self.act = torch.nn.GELU()
    self.linear2 = torch.nn.Linear(1024, 2)

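  def inference(self, *args, **kwargs):
    # Switch to eval mode so dropout is disabled and predictions are
    # deterministic; call .train() again before resuming training.
    self.eval()
    with torch.no_grad():
      logits = self(*args, **kwargs)
      return torch.argmax(logits, dim=-1)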

  def forward(self,
            input_ids=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            indexes=None,
            novel=None,
            head_mask=None,
            inputs_embeds=None,
            labels=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
            mask=None,
           ):

    x = self.bert(input_ids,
                  attention_mask=attention_mask,
                  token_type_ids=token_type_ids,
                  position_ids=position_ids,
                  head_mask=head_mask,
                  inputs_embeds=inputs_embeds,
                  output_attentions=output_attentions,
                  output_hidden_states=output_hidden_states,
                  return_dict=return_dict)

    x = self.dropout(x["pooler_output"])
    x = self.linear1(x)
    x = self.act(x)
    return self.linear2(x)

In [None]:
bert_model_IR = MyReranker(bert_checkpoint).to("cuda")
bert_model_IR

MyReranker(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine

## How to run/infer?

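Inference works the same way as when we extracted the contextual embeddings: tokenize the inputs, here (question, answer) pairs, and pass them to the model. Note that the classification head of `MyReranker` is newly initialized and has not been trained, so the scores below are essentially random.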

In [None]:
tokens_data = tokenizer([("Is the Earth round?", "Yes, the Earth is round. It's an oblate spheroid shape due to its rotation."),
                         ("Is the Earth round?", "No, the Earth is not perfectly round; it's slightly flattened at the poles and bulges at the equator."),
                         ("Does water boil at 100°C (212°F) at sea level?", "Yes, water boils at 100°C (212°F) at sea level under standard atmospheric pressure."),
                         ("Does water boil at 100°C (212°F) at sea level?", "No, water does not always boil at 100°C if the atmospheric pressure is different, like at high altitudes."),
                         ("Is the sun a star?", "Yes, the sun is a star. It's a massive, luminous sphere of hot plasma at the center of our solar system."),
                         ("Is the sun a star?", "No, the sun is not just any star; it is a G-type main-sequence star, which is relatively rare in the universe.")
                         ], return_tensors="pt", padding=True).to("cuda")
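
When a BERT tokenizer receives a tuple of two texts, it joins them into a single sequence of the form `[CLS] question [SEP] answer [SEP]` and uses `token_type_ids` to mark which segment each token belongs to. Feeding both texts through the encoder together is what makes the model interaction-based. The sketch below illustrates the layout in plain Python. `encode_pair` is a hypothetical helper, and splitting on whitespace is a simplification of the real WordPiece tokenization.

```python
def encode_pair(query_tokens, doc_tokens):
    """Build a BERT-style pair input: [CLS] query [SEP] doc [SEP] plus segment ids."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the query and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, token_type_ids


# Whitespace splitting stands in for the real tokenizer here.
query = "is the sun a star ?".split()
doc = "yes , the sun is a star .".split()
tokens, token_type_ids = encode_pair(query, doc)
for token, segment in zip(tokens, token_type_ids):
    print(f"{token}\t{segment}")
```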

In [None]:
bert_model_IR(**tokens_data)

tensor([[-0.1894, -0.1369],
        [-0.0130, -0.1273],
        [-0.1051, -0.1360],
        [-0.1499, -0.2182],
        [-0.1553, -0.1602],
        [-0.1029, -0.0983]], device='cuda:0', grad_fn=<AddmmBackward0>)

In [None]:
bert_model_IR.inference(**tokens_data)

tensor([0, 1, 1, 1, 0, 0], device='cuda:0')
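
To use such a classifier for reranking, one common option is to turn the two logits into probabilities with a softmax and rank candidates by the probability of the "relevant" class. The sketch below does this in plain Python for the two "Is the Earth round?" candidates, using the logits printed above. Treating label index 1 as "relevant" is an assumption, and `answer_1` / `answer_2` are placeholder names. Since the head is untrained, the resulting order carries no meaning; only the mechanics are illustrated.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the first two rows of the model output above.
candidate_logits = {
    "answer_1": [-0.1894, -0.1369],
    "answer_2": [-0.0130, -0.1273],
}

# Assumption: label index 1 means "relevant".
scores = {name: softmax(logits)[1] for name, logits in candidate_logits.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)  # ['answer_1', 'answer_2']
```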