---

## **Exercise 2: Loading a Pre-Trained Model and Tokenizer with Hugging Face**

### Retrieving a Pre-trained Model from Hugging Face

Hugging Face’s `transformers` library provides a wide range of pre-trained models for Natural Language Processing (NLP) tasks. These models, trained on massive datasets, can be easily loaded and fine-tuned for specific tasks such as text classification, translation, summarization, and more. Among these models is **BERT** (Bidirectional Encoder Representations from Transformers), one of the most popular encoder-only transformer models.

One of the core functionalities of Hugging Face is the ability to retrieve models directly from its model hub, where thousands of **pre-trained** models are available. Each model is identified by a model name or a repository path, and it comes with a matching pre-trained tokenizer. The tokenizer converts raw text into the numerical representations (token IDs) that the model can process.

The Hugging Face model hub provides models for various tasks, including:
- **Text Classification**: Sentiment analysis, topic classification, etc.
- **Question Answering**: Answering questions based on a provided context.
- **Summarization**: Generating concise summaries of input text.
- **Translation**: Translating text from one language to another.
- **Text Generation**: Generating new text based on an input prompt.

In this exercise, we will be using the **`google-bert/bert-base-uncased`** checkpoint, a pre-trained BERT model developed by Google. This model is uncased, meaning it treats uppercase and lowercase letters the same, which helps it generalize better for tasks where case sensitivity is not crucial. BERT is trained using both left and right context, making it powerful for a variety of NLP tasks such as classification, named entity recognition, and question answering. 


#### Steps to Retrieve a Pre-trained Model:
1. **Load the Pre-trained Model and Tokenizer**: The `AutoTokenizer` class and the `AutoModel` family of classes (in this exercise, `AutoModelForSequenceClassification`) allow for easy loading of any pre-trained model and its corresponding tokenizer from the Hugging Face model hub. Models are specified by their model name or repository path. For this exercise, we'll load `google-bert/bert-base-uncased`, a pre-trained BERT model designed for a wide range of natural language processing tasks.
   
2. **Tokenize Input Text**: Before feeding text into the model, it needs to be tokenized using the pre-trained **BERT tokenizer**. This step converts raw text into token IDs that the model can interpret. The BERT tokenizer uses **WordPiece tokenization** to handle out-of-vocabulary words by splitting them into subwords, where every piece after the first is prefixed with `##`. For example, "unhappiness" could be split into subword tokens like `["un", "##happi", "##ness"]`; the exact split depends on the model's vocabulary.

3. **Perform Inference**: Once the input is tokenized, it can be passed to the BERT model to perform inference. Depending on the task (e.g., classification, question answering), the model outputs logits or hidden states. For classification tasks, the raw logits can be transformed into probabilities using a softmax function, and the label with the highest probability represents the model's prediction.
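
The subword splitting described in step 2 can be sketched with a toy greedy longest-match tokenizer. This is only an illustration: `TOY_VOCAB` and `wordpiece_tokenize` are hypothetical names, and the vocabulary is made up rather than taken from `bert-base-uncased`, so the real tokenizer may split words differently.

```python
# Toy WordPiece-style tokenizer: greedy longest-match-first over a tiny,
# hypothetical vocabulary (NOT the real bert-base-uncased vocabulary).
TOY_VOCAB = {"[UNK]", "un", "happy", "##happi", "##ness", "the", "dog"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word is mapped to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_tokenize("unhappiness", TOY_VOCAB))
print(wordpiece_tokenize("dog", TOY_VOCAB))
print(wordpiece_tokenize("cat", TOY_VOCAB))
```

With this toy vocabulary, a word that is fully covered stays whole, a longer word is broken into a first piece plus `##` continuations, and a word with no matching pieces falls back to `[UNK]`.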

Below is an example showing how to retrieve and use a pre-trained BERT model from Hugging Face for various NLP tasks.


In [None]:
# Snippet to retrieve a model from Hugging Face
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Define the model name or path from Hugging Face
model_name = "google-bert/bert-base-uncased"

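# Load the pre-trained tokenizer and model.
# output_attentions=True makes the model return its attention weights.
# The classification head on top of BERT is newly initialized (not pre-trained).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, output_attentions=True)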

In this step, we define a sample sentence and tokenize it so that it can be passed to the model.

The tokenizer is then used to process the input text. The `tokenizer` converts the raw sentence into a format that the model can understand by breaking it down into tokens and converting them into numerical IDs. In this case, the call `tokenizer(text, return_tensors="pt", padding=True, truncation=True)` returns PyTorch tensors (`return_tensors="pt"`), pads the sequences in a batch to a common length (which has no effect here, since we pass a single sentence), and truncates the text if it exceeds the model's maximum sequence length. The output includes both **input IDs**, which are the numerical IDs of the tokens, and an **attention mask**, which indicates which tokens should be attended to (where `1` signifies real tokens and `0` marks padding tokens).

This ensures that only the relevant tokens are processed by the model, with the attention mask ignoring any padding that may have been added. By printing both the token IDs and the attention mask, we can inspect how the text has been prepared for model inference.

In [None]:
# Define a sample text for sentiment analysis
text = "The dog ate the food because it was hungry"

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Print the token IDs and the attention mask
print("Token IDs:", inputs["input_ids"])
print("Attention Mask:", inputs["attention_mask"])
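
Because a single sentence needs no padding, the attention mask printed above contains only ones. The following minimal sketch shows what padding and the mask look like for a batch of two sequences of different length. `pad_batch` is a hypothetical helper written for illustration, not part of `transformers`, and the token IDs are made up (apart from `101`, `102`, and `0`, which mirror BERT's `[CLS]`, `[SEP]`, and `[PAD]` IDs).

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad lists of token IDs to equal length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # padded IDs
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real, 0 = padding
    return input_ids, attention_mask


# Hypothetical token IDs for two sentences of different length
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
padded_ids, padded_mask = pad_batch(toy_batch)

print(padded_ids)   # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(padded_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```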

In [None]:
# Convert the token IDs back to their token strings
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Print the tokens
print("Tokens:", tokens)

In this step, we pass the tokenized input (which includes the attention mask) to the model for inference. The input is fed into the model using `model(**inputs)`, which performs forward propagation to generate the output. The model’s output contains multiple components, but in this case, we are specifically interested in the **logits**, which are the raw, unnormalized predictions made by the BERT model.

The **logits** are the model's raw scores for each class before any normalization. By default, `AutoModelForSequenceClassification` creates a head with two labels (for sentiment analysis, these could stand for Negative and Positive). However, because this classification head is randomly initialized and the model hasn't been fine-tuned for a specific task yet, these logits won't provide meaningful or reliable predictions at this point.

For now, the output doesn't hold significant meaning, but this will be explored in detail in the next lab when we fine-tune the model for specific tasks. **For today's lab, we will limit ourselves to inference only in order to assess the attention weights of the model**. 

In [None]:
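# Pass the tokenized input (including attention mask) to the model.
# torch.no_grad() skips gradient tracking, which is not needed for inference.
with torch.no_grad():
    outputs = model(**inputs)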

# Extract the logits (raw predictions) from the model output
logits = outputs.logits

# Apply softmax to get the predicted probabilities
probs = torch.nn.functional.softmax(logits, dim=-1)

# Print the predicted probabilities for each sentiment class
print("Softmax probabilities:", probs)
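
To make the softmax step concrete, here is a small standalone sketch that normalizes a pair of made-up logits by hand. The values in `example_logits` are assumptions chosen for illustration and are not output from the model; `softmax` is a plain-Python stand-in for `torch.nn.functional.softmax`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [0.3, -0.1]  # made-up scores for a two-class head
example_probs = softmax(example_logits)

print("Probabilities:", example_probs)
print("Sum:", sum(example_probs))
```

The probabilities always sum to 1, and the class with the largest logit receives the largest probability.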

In this step, we focus on extracting and examining the **attention weights** produced by the model during inference. Attention weights provide insights into how the model distributes its focus across different tokens in the input sequence. By examining these weights, we can understand which words the model considers important when making predictions, especially in tasks like coreference resolution or sentiment analysis.

First, we extract the attention weights from the model's output by accessing the `outputs.attentions` attribute, which is available because the model was loaded with `output_attentions=True`. These attention weights are generated by the self-attention mechanism in transformer models like BERT. Self-attention allows each token in the sequence to "attend" to other tokens, meaning it learns how much focus should be placed on surrounding words. This is crucial for capturing contextual relationships between words in a sentence.

Next, we print out the **number of attention layers** in the model using `len(attentions)`. Transformer models typically have multiple layers, each containing its own set of attention heads. For instance, `bert-base-uncased` has 12 layers with 12 attention heads each; every layer applies attention to the token representations to refine the model's understanding of the sentence structure.

We also print the **shape of the attention weights** from the first layer with `attentions[0].shape`. This shape reveals key information about how the attention is structured:
- The first dimension represents the **batch size** (1 here, as we are processing one sentence).
- The second dimension corresponds to the **number of attention heads** in that layer, which are independent mechanisms that attend to different parts of the input.
- The third and fourth dimensions both represent the **sequence length**, meaning how many tokens are in the input. Each token in the sequence attends to every other token, resulting in an attention matrix where every token has a score representing its attention to all other tokens.

By printing the shape of the attention weights, we get a clear understanding of the structure of the attention mechanism across different layers, heads, and tokens. This sets the foundation for visualizing or further analyzing how the model attends to specific words or entities in the input text.


In [None]:
# Extract the attention weights from the output
attentions = outputs.attentions

# Print the attention weights shape
print(f"Number of attention layers: {len(attentions)}")
print(f"Shape of attention weights in the first layer: {attentions[0].shape}")
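
To see where the two sequence-length dimensions come from, the sketch below computes scaled dot-product attention weights for three toy tokens in plain Python. It is a simplification under stated assumptions: the vectors are made up, the same vectors serve as queries and keys, and the learned per-head projections that BERT applies are omitted. `attention_weights` is a hypothetical helper, not a `transformers` function.

```python
import math


def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d)) as a list of rows, one row per query token."""
    d = len(queries[0])
    weights = []
    for q in queries:
        # Scaled dot product between this query and every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        shift = max(scores)
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three tokens with made-up 2-dimensional vectors (used as both queries and keys)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_attention = attention_weights(toy_vectors, toy_vectors)

# One row per token, one column per token: sequence length x sequence length
print(len(toy_attention), "x", len(toy_attention[0]))
for row in toy_attention:
    print([round(w, 3) for w in row], "row sum =", round(sum(row), 3))
```

Each row is a probability distribution over the tokens in the sequence, so every row sums to 1. The model produces one such matrix per head and per layer, which gives the `(batch size, heads, sequence length, sequence length)` shape.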

In this section, we use **Matplotlib** and **Seaborn** to visualize the attention weights extracted from the first attention head of the first layer in the transformer model. Visualizing attention weights allows us to better understand how the model distributes focus across the tokens in a sequence, showing which tokens "attend" to each other.

We create a **heatmap** to visualize the attention weights using **Seaborn's** `heatmap()` function. A heatmap is an intuitive way to display how much attention each token pays to every other token in the sequence. In the attention matrix, each row corresponds to the token that is attending, and each column corresponds to a token being attended to, so the cell at a given row and column shows how much attention the row token places on the column token. Darker shades indicate higher attention scores.

By visualizing the attention matrix, we can observe the relationships between tokens in the input sentence, such as whether certain words attend heavily to specific other words, providing insights into how the model understands the sentence contextually.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Convert attention weights to numpy array (taking the first attention head from the first layer)
attention_layer_1 = attentions[0][0, 0].detach().numpy()

# Plot the attention weights for the first layer, first attention head
plt.figure(figsize=(8, 8))
sns.heatmap(attention_layer_1, annot=False, cmap="Blues", xticklabels=tokens, yticklabels=tokens)
plt.xlabel("Attention to Token")
plt.ylabel("Token")
plt.title("Attention Weights - Layer 1, Head 1")
plt.show()
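
As a complement to the heatmap, the strongest link in each row can also be read off numerically. The sketch below does this for a made-up 4x4 attention matrix. `most_attended`, `toy_tokens`, and `toy_matrix` are hypothetical names introduced for illustration, and the weights are not real model output.

```python
def most_attended(tokens, attention):
    """For each row token, return (token, most-attended token, weight)."""
    result = []
    for token, row in zip(tokens, attention):
        best = max(range(len(row)), key=lambda j: row[j])  # column with largest weight
        result.append((token, tokens[best], row[best]))
    return result


# Made-up tokens and attention weights; each row sums to 1
toy_tokens = ["[CLS]", "dog", "hungry", "[SEP]"]
toy_matrix = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.30, 0.40, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

toy_links = most_attended(toy_tokens, toy_matrix)
for src, dst, weight in toy_links:
    print(f"{src:>8} -> {dst:<8} ({weight:.2f})")
```

In the notebook, the same function could presumably be applied to `tokens` and `attention_layer_1.tolist()` to list the most-attended token for each position in the sample sentence.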
