# Large Language Models

# Introduction to Large Language Models (LLMs)

Large Language Models (LLMs) are powerful deep learning models capable of understanding and generating human-like text. They are trained on vast amounts of text data and have demonstrated remarkable performance on a wide range of natural language processing (NLP) tasks.

## Content

In this notebook, we will:
- Explore the basics of LLMs and how they are trained.
- Understand the common applications and capabilities of LLMs.
- Experiment with using pre-trained models.

## Outline

1. **Instructions**
   - Create an account on Hugging Face
   - Packages to import
2. **Applications of LLMs**
   - Text generation, summarization, translation, and more.
3. **Getting Started with Hugging Face Transformers**
   - Installing the necessary libraries.
   - Loading and using a pre-trained model.
4. **Sample Task with LLM**
   - Choose a task: text generation, sentiment analysis, or question answering.
   - Implement the chosen task using a pre-trained model.
5. **Future Directions**
   - Challenges in LLMs.
   - Current research trends and innovations.



## INSTRUCTIONS

1. Create an account on [Hugging Face](https://huggingface.co).
2. Create an access token under Settings -> Access Tokens.
3. Run `huggingface-cli login` in your terminal and paste your token when prompted.
4. Have fun!

## Packages to import

In [1]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline, AutoModelForSequenceClassification, AutoModelForCausalLM, BartForConditionalGeneration, BartTokenizer
import sentencepiece as spm
import math
import torch.nn.functional as F
import evaluate
from datasets import load_dataset


# Using Pre-Trained Model

## Sentiment Analysis

Sentiment analysis is a common application of Large Language Models (LLMs) in which a model assigns a sentiment label or score to a piece of text. Higher scores represent more positive sentiment, while lower scores indicate negative sentiment.

We'll use a pre-trained model from Hugging Face that was fine-tuned to predict a 1 to 5 star rating. Taking the `argmax` of its output gives a class index from 0 (very negative, 1 star) to 4 (very positive, 5 stars). This example provides a hands-on look at how LLMs interpret and classify sentiment.


In [2]:
# Load the tokenizer and pre-trained model
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained(
    "nlptown/bert-base-multilingual-uncased-sentiment",
)

In [3]:
def classify_sentiment(text):
    # Tokenize the input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    
    # Perform inference
    with torch.no_grad():  # Disable gradient calculation for inference
        outputs = model(**inputs)
    
    # Get the predicted class
    predictions = torch.argmax(outputs.logits, dim=-1)
    
    return predictions.item()


In [4]:
# Testing the model with examples

test_sentence = "Can we have paella for dinner? I love it!"
predicted_class = classify_sentiment(test_sentence)

print("1st test")
print(f"Predicted class for the sentence '{test_sentence}': {predicted_class}")

test_sentence = "We have school today?"
predicted_class = classify_sentiment(test_sentence)

print("\n")
print("2nd test")
print(f"Predicted class for the sentence '{test_sentence}': {predicted_class}")

test_sentence = "I don't want to eat spinach"
predicted_class = classify_sentiment(test_sentence)

print("\n")
print("3rd test")
print(f"Predicted class for the sentence '{test_sentence}': {predicted_class}")

1st test
Predicted class for the sentence 'Can we have paella for dinner? I love it!': 4


2nd test
Predicted class for the sentence 'We have school today?': 2


3rd test
Predicted class for the sentence 'I don't want to eat spinach': 1


## Text Summarization

Text summarization is a key application of Large Language Models (LLMs) that involves condensing a longer piece of text into a shorter summary while retaining its essential meaning. This technique is particularly useful for extracting key information from articles, reports, and other lengthy documents, allowing users to quickly understand the main points without reading everything.

We’ll utilize a pre-trained model from Hugging Face to explore text summarization, transforming extensive content into concise summaries. This example will demonstrate how LLMs can effectively distill information and enhance content accessibility.


In [5]:
# Text to summarize

long_text = """In a small village, a curious boy named Tim loved exploring the woods. 
One sunny day, he found a hidden path leading to a magical garden. Bright flowers bloomed, and a sparkling pond reflected the sky. 
As Tim wandered, he met a talking rabbit named Benny. Benny told Tim that the garden was enchanted and only appeared to those with a kind heart. 
They became friends, sharing stories and laughter. When it was time to leave, Benny gifted Tim a flower that would always remind him of their adventure. 
Tim promised to return, knowing kindness would lead him back."""


model_name = "cnicu/t5-small-booksum"

# Load model
summarizer = pipeline(task="summarization", model=model_name)

outputs = summarizer(long_text, max_length=50)
print("FULL TEXT")
print(long_text)
print("\n")
print("SUMMARIZED TEXT")
print(outputs[0]["summary_text"])

FULL TEXT
In a small village, a curious boy named Tim loved exploring the woods. 
One sunny day, he found a hidden path leading to a magical garden. Bright flowers bloomed, and a sparkling pond reflected the sky. 
As Tim wandered, he met a talking rabbit named Benny. Benny told Tim that the garden was enchanted and only appeared to those with a kind heart. 
They became friends, sharing stories and laughter. When it was time to leave, Benny gifted Tim a flower that would always remind him of their adventure. 
Tim promised to return, knowing kindness would lead him back.


SUMMARIZED TEXT
a curious boy named Tim loves exploring the woods. He finds a hidden path leading to a magical garden. Benny tells Tim that the garden was enchanted and only appeared to those with a kind heart


The summary is reasonable but could be improved: it starts in lowercase, ends without a period, and leaves out the second half of the story. We will come back to text summarization and apply some techniques to make the summarized text cleaner.

## Question Answering

Question Answering (QA) is a popular application of Large Language Models (LLMs) that enables systems to provide precise answers to questions based on a given context. In this task, a model analyzes the input text to locate relevant information and extract answers, making it an essential tool for various applications such as search engines, customer support, and educational platforms.

Here we deliberately omit the model name to observe how the pipeline falls back to a default model for the given task.

In [6]:
# Load model
qa_model = pipeline(task="question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [7]:
# Context in which the model searches for the answer (the model is not trained on it)

context = """
The solar system is a vast and complex structure that consists of the Sun and various celestial bodies that orbit it. 
At the center is the Sun, a star that provides the necessary light and heat to sustain life on Earth. 
There are eight planets in our solar system, which are categorized into two groups: terrestrial planets and gas giants. 
The terrestrial planets, which are closer to the Sun, include Mercury, Venus, Earth, and Mars. 
These planets are primarily composed of rock and metal and have solid surfaces, allowing for geological processes. 
Mercury is the smallest planet and closest to the Sun, while Venus is known for its thick atmosphere and high temperatures. 
Earth is unique for its ability to support life, and Mars is often studied for its potential to harbor past life. 

In contrast, the gas giants—Jupiter, Saturn, Uranus, and Neptune—are located farther from the Sun. 
These planets are composed mainly of hydrogen and helium and do not have well-defined solid surfaces. 
Jupiter is the largest planet in our solar system and is known for its Great Red Spot, a giant storm. 
Saturn is famous for its stunning rings, while Uranus and Neptune are noted for their bluish hues due to methane in their atmospheres. 

Additionally, there are dwarf planets like Pluto, which, although no longer classified as a major planet, plays a significant role in our understanding of the solar system's formation. 
Numerous moons, comets, and asteroids further contribute to the complexity and diversity of our solar system, making it a fascinating subject for scientific exploration and study.
"""

# Example question
question = "What are the four terrestrial planets in our solar system?"


outputs = qa_model(question=question, context = context)
print(outputs['answer'])

Mercury, Venus, Earth, and Mars
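
The default model above is an extractive QA model: rather than writing an answer, it scores each context token as a possible answer start and answer end, and the answer is the span with the best combined score. The sketch below illustrates that idea in plain Python. The function `best_span`, the toy `tokens`, and the score values are hypothetical and are not produced by the real model; the actual pipeline post-processing is more involved.

```python
def best_span(start_scores, end_scores, max_answer_len=8):
    """Return (start, end) maximizing start_scores[start] + end_scores[end], with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


tokens = ["The", "terrestrial", "planets", "include",
          "Mercury", ",", "Venus", ",", "Earth", ",", "and", "Mars", "."]

# Hypothetical per-token scores (made up for illustration)
start_scores = [0.1, 0.3, 0.2, 0.1, 5.0, 0.0, 1.0, 0.0, 0.8, 0.0, 0.1, 0.9, 0.0]
end_scores = [0.0, 0.1, 0.4, 0.1, 1.2, 0.0, 1.0, 0.0, 1.1, 0.0, 0.0, 4.8, 0.2]

start, end = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))
```

Because the answer is always a slice of the context, an extractive model cannot answer questions whose answer does not appear in the text.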


## Text Translation

Text Translation is a fascinating application of Large Language Models (LLMs) that enables the conversion of text from one language to another. This process involves understanding the meaning of the original text and accurately generating the equivalent text in the target language, maintaining both context and nuance. 

We'll leverage a pre-trained model from Hugging Face to experiment with text translation, allowing us to translate sentences from Spanish to English seamlessly. This hands-on example illustrates how LLMs can bridge language barriers and facilitate communication across different cultures.


In [8]:
# Text to translate
input_text = "Me encanta jugar al futbol! Mi jugador favorito es Cristiano Ronaldo!"

model_name = "Helsinki-NLP/opus-mt-es-en"

# Load the tokenizer and pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Create a translation pipeline
translator = pipeline("translation", model=model, tokenizer=tokenizer)

output = translator(input_text)

print('Text in Spanish:')
print(input_text)
print('\n')
print('Text after translation to English:')
print(output[0]['translation_text'])



Text in Spanish:
Me encanta jugar al futbol! Mi jugador favorito es Cristiano Ronaldo!


Text after translation to English:
I love playing football! My favorite player is Cristiano Ronaldo!


## Text Generation

Text Generation is a compelling application of Large Language Models (LLMs) that enables the creation of coherent and contextually relevant text based on a given prompt. By analyzing the input text, LLMs generate responses that can range from simple sentences to complex paragraphs, making them useful for various applications like storytelling, content creation, and conversational agents.

In this example, we'll utilize a pre-trained model from Hugging Face to explore text generation, allowing us to generate creative and informative responses to specific prompts. This hands-on exercise showcases the remarkable capabilities of LLMs in producing human-like text.


In [9]:
# Load model
generator = pipeline(task="text-generation", model="gpt2")

In [10]:
response = "Dear valued customer, I am glad to hear you had a good stay with us."

customer_review = """
I recently stayed at the Grand Vista Hotel for a weekend getaway, and it was an amazing experience! 
The check-in process was smooth and the staff were incredibly friendly and accommodating. 
My room was spacious, clean, and beautifully decorated, with a stunning view of the city skyline. 
I particularly enjoyed the hotel amenities, including the rooftop pool and the complimentary breakfast that offered a wide variety of delicious options. 
The location was perfect, within walking distance of popular attractions and great restaurants. 
Overall, I had a wonderful stay and would definitely recommend the Grand Vista Hotel to anyone visiting the area!
"""

prompt = f"Customer review:\n{customer_review}\n\nHotel response to the customer:\n{response}"

outputs = generator(prompt, max_length=300, truncation=True, pad_token_id= generator.tokenizer.eos_token_id)

print(outputs[0]['generated_text'])

Customer review:

I recently stayed at the Grand Vista Hotel for a weekend getaway, and it was an amazing experience! 
The check-in process was smooth and the staff were incredibly friendly and accommodating. 
My room was spacious, clean, and beautifully decorated, with a stunning view of the city skyline. 
I particularly enjoyed the hotel amenities, including the rooftop pool and the complimentary breakfast that offered a wide variety of delicious options. 
The location was perfect, within walking distance of popular attractions and great restaurants. 
Overall, I had a wonderful stay and would definitely recommend the Grand Vista Hotel to anyone visiting the area!


Hotel response to the customer:
Dear valued customer, I am glad to hear you had a good stay with us.  Unfortunately, during the coming days of business, we ran a 2 week queue for reservations and a 3 week waiting period in some places, so that we can get a good deal after the hotel closes. Therefore, we canceled all of the

The results may not be optimal, but there's no need to worry! We're currently testing basic models. Later in the notebook, we'll go deeper into text-generation techniques to improve our results.

# Understanding Transformer Architecture

## What is a Transformer?

A **Transformer** is a deep learning model architecture introduced in the paper *"Attention is All You Need"* by Vaswani et al. (2017). This architecture has become the foundation for many state-of-the-art models in natural language processing (NLP). The Transformer leverages a mechanism known as **self-attention**, enabling the model to evaluate the importance of different words in a sequence when making predictions.

## Transformer Architecture

The Transformer consists of an **encoder-decoder** structure:

- **Encoder**: The encoder processes an input sequence to create a set of contextualized embeddings. Each encoder layer includes a multi-head self-attention mechanism and a feedforward neural network.

- **Decoder**: The decoder generates the output sequence based on the encoder's output and previously generated tokens. Each decoder layer includes masked multi-head self-attention, encoder-decoder attention, and a feedforward neural network.

### Evolving Transformer Architectures

While the original Transformer used both an encoder and a decoder, modern implementations often adopt variations based on task requirements:

1. **Encoder-Only Models**: Designed for tasks like text classification and named entity recognition, where only the input context is relevant.  
   *Examples*: BERT, RoBERTa, DistilBERT.

2. **Decoder-Only Models**: Used for tasks like text generation, where previously generated tokens are needed to predict the next token.  
   *Examples*: GPT, GPT-2, GPT-3.

3. **Encoder-Decoder Models**: Suitable for tasks like text translation, where both input and output sequences are essential.  
   *Examples*: T5, BART.

## Detailed Explanation of the Transformer Encoder Structure

### Overview of the Encoder

The encoder processes input sequences to create contextualized embeddings. It plays a crucial role in understanding the relationships between words and capturing the nuances of the input data. The encoder transforms input tokens into rich representations that can be utilized by the decoder or for various downstream tasks.

### Structure of the Encoder

The Transformer encoder consists of multiple layers, each containing several key components:

1. **Multi-Head Self-Attention**
2. **Feedforward Neural Network**
3. **Add & Norm**

Let’s explore each of these components in detail:

### 1. Multi-Head Self-Attention

- **Purpose**:  
  This mechanism allows the encoder to assess the importance of different words relative to each other, capturing dependencies and relationships regardless of their positions in the sequence.

- **How It Works**:
  - **Input**: The encoder takes input tokens.
  - **Self-Attention Calculation**: The self-attention mechanism computes attention scores for each token, determining how much focus each token should receive based on others.
  - **Multi-Head Mechanism**: The model uses multiple attention heads to learn various types of relationships, each capturing different aspects of the input sequence.
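
To make the multi-head idea concrete, the sketch below splits one embedding vector into equal slices, one per head. It is plain Python for illustration only; `split_heads` and the toy `embedding` are hypothetical names, and the sizes (a 512-dimensional model with 8 heads) are assumed. In a real model the split is applied to the projected queries, keys, and values rather than to the raw embedding.

```python
def split_heads(vector, num_heads):
    """Split a vector of length d_model into num_heads slices of length d_model // num_heads."""
    if len(vector) % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]


d_model, num_heads = 512, 8
embedding = [float(i) for i in range(d_model)]  # toy values 0.0, 1.0, ..., 511.0

heads = split_heads(embedding, num_heads)
print(f"{len(heads)} heads, each of dimension {len(heads[0])}")
```

Each head then runs attention on its own slice, and the per-head outputs are concatenated back to length `d_model`.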

### 2. Feedforward Neural Network

- **Purpose**:  
  Each position in the sequence is processed independently through a feedforward neural network, introducing non-linearity and enhancing the model's capacity to learn complex representations.

- **How It Works**:
  - **Input**: The output from the multi-head attention component.
  - **Structure**: Typically consists of two linear transformations with a ReLU activation function in between.
  - **Output**: The output is a transformed version of the input, capturing intricate relationships within the token embeddings.
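
The structure described above (linear, ReLU, linear, applied to each position separately) can be sketched in plain Python. The weights below are hand-picked toy values with `d_model = 2` and `d_ff = 3`; the helper names `linear`, `relu`, and `feed_forward` are hypothetical.

```python
def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Replace negative entries with zero."""
    return [max(0.0, v) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    """Two linear transformations with a ReLU in between."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Toy weights: expand from d_model = 2 to d_ff = 3, then project back to 2
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b2 = [0.0, 0.0]

sequence = [[1.0, -2.0], [3.0, 4.0]]  # two token vectors

# The same weights are applied to every position independently
outputs = [feed_forward(token, w1, b1, w2, b2) for token in sequence]
print(outputs)
```

Note that no information flows between positions here; mixing across tokens happens only in the attention sublayer.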

### 3. Add & Norm

- **Purpose**:  
  This step adds the input from the previous layer to the output of the current layer and normalizes the result, stabilizing the training process and improving gradient flow.

- **How It Works**:
  - **Add**: The input from the previous layer is added to the current layer's output, forming a residual connection to help mitigate the vanishing gradient problem.
  - **Layer Normalization**: The result is normalized, enhancing convergence during training.
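
A minimal sketch of the Add & Norm step for a single token vector, in plain Python. It assumes the post-norm arrangement of the original Transformer (normalize after adding) and omits the learnable scale and shift parameters that a real layer normalization includes; `layer_norm` and `add_and_norm` are hypothetical helper names.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer_output)]
    return layer_norm(residual)


x = [1.0, 2.0, 3.0, 4.0]                 # input to the sublayer
sublayer_output = [0.5, -0.5, 0.5, -0.5]  # e.g. attention or feed-forward output

normalized = add_and_norm(x, sublayer_output)
print([round(v, 3) for v in normalized])
```

The residual path lets the input pass through unchanged when the sublayer contributes little, which is what helps gradients flow through deep stacks.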

### Stacking Layers

Each of these components is stacked to form multiple encoder layers. Typically, Transformers have an equal number of encoder and decoder layers, though this is not a strict rule. The stack depth can be adjusted based on task complexity and model size.

### Output of the Encoder

After passing through all encoder layers, the final output consists of contextualized embeddings for each input token. These embeddings can then be utilized by the decoder or for various downstream tasks, such as text classification or sentiment analysis.

## Detailed Explanation of the Transformer Decoder Structure

### Overview of the Decoder

The decoder generates output sequences from the encoded representations provided by the encoder. It is particularly crucial for tasks like machine translation, text summarization, and any other text generation tasks. The decoder predicts the next token in a sequence based on previously generated tokens and the encoder’s output.

### Structure of the Decoder

The Transformer decoder consists of several layers (typically the same number as the encoder) stacked on top of each other. Each layer contains several key components:

1. **Masked Multi-Head Self-Attention**
2. **Multi-Head Attention**
3. **Feedforward Neural Network**
4. **Add & Norm**

Let’s delve deeper into each of these components:

### 1. Masked Multi-Head Self-Attention

- **Purpose**:  
  This mechanism allows the decoder to attend to previously generated tokens while generating the next token in the sequence. Masking ensures that the model cannot "cheat" by looking at future tokens.

- **How It Works**:
  - **Input**: The decoder receives previous outputs (tokens) as input.
  - **Self-Attention Calculation**: The self-attention computes attention scores for each token, determining how much focus to place on other tokens when producing the next token.
  - **Masking**: A mask is applied to ensure the model only attends to tokens that have already been generated, preventing access to later tokens in the sequence.

- **Multi-Head Mechanism**:  
  This aspect allows the model to capture different relationships and dependencies between tokens using multiple attention heads, each processing input independently.
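
The masking step can be illustrated with a small plain-Python sketch. A lower-triangular mask lets position `i` attend only to positions `0..i`; masked scores are set to negative infinity so that softmax assigns them zero weight. The names `causal_mask` and `masked_softmax` are hypothetical, and the uniform scores are toy values.

```python
import math


def causal_mask(size):
    """mask[i][j] is 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]


def masked_softmax(scores, mask_row):
    """Softmax over scores, giving zero weight to masked-out positions."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
scores = [[1.0] * 4 for _ in range(4)]  # toy scores: every pair equally relevant

weights = [masked_softmax(row, mask_row) for row, mask_row in zip(scores, mask)]
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, each row spreads its attention evenly over the tokens seen so far and gives none to later tokens.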

### 2. Multi-Head Attention (Encoder-Decoder Attention)

- **Purpose**:  
  This component allows the decoder to attend to the encoder’s output, leveraging contextual embeddings to generate more informed and contextually relevant output.

- **How It Works**:
  - **Input**: The output from the encoder and the decoder’s previous layer (after masked self-attention).
  - **Attention Calculation**: The decoder computes attention scores between the current output and all encoder outputs, focusing on relevant parts of the input sequence based on what it is currently generating.
  - **Output**: The results are combined and passed to the next layer.

### 3. Feedforward Neural Network

- **Purpose**:  
  Each position in the sequence is processed independently through a feedforward neural network, introducing non-linearity to the model.

- **How It Works**:
  - **Input**: The output from the multi-head attention component.
  - **Structure**: Typically consists of two linear transformations with a ReLU activation function in between.
  - **Output**: The output is a transformed version of the input that captures complex interactions between tokens.

### 4. Add & Norm

- **Purpose**:  
  This step adds the input from the previous layer to the output of the current layer and normalizes the result, stabilizing the training process and allowing for better gradient flow.

- **How It Works**:
  - **Add**: The input to the current layer (from the previous layer) is added to its output, forming a residual connection that helps mitigate the vanishing gradient problem.
  - **Layer Normalization**: The result is normalized, improving convergence during training.

### Stacking Layers

Each of these components is stacked to form multiple decoder layers. Typically, Transformers have an equal number of encoder and decoder layers, but this is not a strict rule. The depth of the stack can be adjusted based on task complexity and model size.

### Final Linear Layer and Softmax

After passing through all decoder layers, the final output is sent to a linear layer that projects it into the size of the vocabulary. This is followed by a softmax activation function to produce a probability distribution over the vocabulary, allowing the model to select the most likely next token in the sequence.
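
The final projection and softmax can be sketched in plain Python as follows. The four-word `vocab`, the two-dimensional `hidden` state, and the `weight` matrix are toy values chosen for illustration, and the helper names are hypothetical. Picking the highest-probability token (greedy decoding) is only one possible selection strategy.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project_to_vocab(hidden, weight, bias):
    """Linear layer mapping a d_model-sized vector to one logit per vocabulary entry."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weight, bias)]


vocab = ["cat", "dog", "sat", "mat"]
hidden = [1.0, 0.5]  # toy decoder output for the current position (d_model = 2)
weight = [[0.2, 0.1], [0.4, -0.3], [1.5, 0.8], [-0.5, 0.9]]  # one row per vocab entry
bias = [0.0, 0.0, 0.0, 0.0]

logits = project_to_vocab(hidden, weight, bias)
probs = softmax(logits)

# Greedy decoding: pick the most probable token
next_token = vocab[max(range(len(probs)), key=probs.__getitem__)]
print(next_token, [round(p, 3) for p in probs])
```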

### Transformer Architecture: Encoder-Decoder

Initially, Transformers were designed as **encoder-decoder** architectures, suitable for tasks requiring mapping one sequence to another, such as machine translation.

- **Encoder**:
  - Processes the input sequence and creates context-aware representations for each token, utilizing multiple layers that apply self-attention to understand dependencies in the input.

- **Decoder**:
  - Generates the output sequence one token at a time, attending to both the encoder's output and the previously generated tokens to ensure coherence in the output.

## Modern Variations of Transformer Architectures

While the encoder-decoder architecture is still widely used, many applications can benefit from using either only the encoder or only the decoder, allowing for more flexibility and efficiency.

### 1. Encoder-Only Models

#### Overview
Encoder-only models focus on understanding and representing input text, generating contextual embeddings for each token without performing text generation.

#### Common Use Cases
- **Text Classification**: Categorizing text into predefined classes.
- **Named Entity Recognition**: Identifying and classifying entities in text.

**Examples of Encoder-Only Models**:
- **BERT (Bidirectional Encoder Representations from Transformers)**: Designed for context-based tasks like sentiment analysis and question answering.
- **RoBERTa**: An improved version of BERT that utilizes more data and longer training times for better performance.
- **DistilBERT**: A smaller, faster, and lighter variant of BERT that maintains a significant portion of its performance.

### 2. Decoder-Only Models

#### Overview
Decoder-only models are tailored for generating text, predicting the next token in a sequence based solely on previous tokens.

#### Common Use Cases
- **Text Generation**: Producing coherent and contextually relevant text based on a given prompt.
- **Dialogue Systems**: Creating conversational agents that respond dynamically.

**Examples of Decoder-Only Models**:
- **GPT (Generative Pre-trained Transformer)**: Designed for generative tasks, allowing it to produce text that follows a given prompt.
- **GPT-2 and GPT-3**: Extensions of GPT that are larger in size, leading to more sophisticated text generation capabilities.

### 3. Encoder-Decoder Models

#### Overview
Encoder-decoder models leverage both components to process an input sequence and generate an output sequence. They are particularly useful for tasks requiring a mapping from one text to another.

#### Common Use Cases
- **Machine Translation**: Translating text from one language to another.
- **Text Summarization**: Condensing long articles into concise summaries.

**Examples of Encoder-Decoder Models**:
- **T5 (Text-to-Text Transfer Transformer)**: Treats all NLP tasks as text-to-text problems, allowing it to handle various tasks, including translation and summarization.
- **BART (Bidirectional and Auto-Regressive Transformers)**: Combines the bidirectional nature of BERT with the generative capabilities of GPT, making it effective for tasks like summarization and text generation.


## Building a Transformer
PyTorch's `nn.Transformer` class provides a full transformer architecture with pre-built encoder and decoder stacks.

In [11]:
# Define the model dimension
# The model dimension (d_model) is a fundamental hyperparameter that influences both the expressiveness and efficiency of the transformer model. 
# When designing or tuning a transformer model, it's important to balance the model dimension with 
# available resources and the complexity of the tasks being addressed.

d_model = 512

# Define the number of attention heads
n_heads = 8

# Define the number of encoder layers
num_encoder_layers = 6

# Define the number of decoder layers
num_decoder_layers = 6

model = nn.Transformer(
    d_model = d_model,
    nhead = n_heads,
    num_encoder_layers = num_encoder_layers,
    num_decoder_layers = num_decoder_layers,
    batch_first=True,  # Inputs and outputs use shape (batch, sequence, feature)
)

print(model)

Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, o
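
As a rough cross-check of the printed module tree, the sketch below counts the parameters in one encoder layer from the layer sizes alone. It assumes that `nn.MultiheadAttention` stores the query, key, and value projections as a single packed weight of shape `(3 * d_model, d_model)` plus bias, which is why only `out_proj` appears as a submodule in the printout. The helper names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def encoder_layer_params(d_model, d_ff):
    """Parameter count of one encoder layer, assuming packed Q/K/V projections."""
    attn_in_proj = linear_params(d_model, 3 * d_model)   # packed Q, K, V projections
    attn_out_proj = linear_params(d_model, d_model)      # out_proj
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)                       # norm1 and norm2: scale + shift
    return attn_in_proj + attn_out_proj + feed_forward + layer_norms


per_layer = encoder_layer_params(d_model=512, d_ff=2048)
print(f"Parameters per encoder layer: {per_layer:,}")
print(f"Parameters in 6 encoder layers: {6 * per_layer:,}")
```

Roughly two thirds of each layer's parameters sit in the feed-forward sublayer, which is why `d_ff` has a large effect on model size.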

# Attention Mechanism and Positional Encoding

## Attention Mechanism

The attention mechanism is a key component of transformer models that allows them to determine the importance of different words in a sentence when making predictions. It enables the model to focus on specific parts of the input sequence, regardless of their position, enhancing its understanding of context.

### How Attention Works

In the attention mechanism, each word in the input is transformed into three types of vectors:

- **Query**: Represents the word we are focusing on.
- **Key**: Represents other words that could be relevant.
- **Value**: Contains the actual information we want to retrieve.

The attention process involves the following steps:

1. **Score Calculation**: The model computes a score for each word by comparing the query with all the keys. This score indicates how relevant each key is to the query.
2. **Weighting**: The scores are normalized to create weights. Higher weights indicate that the corresponding word is more relevant to the query.
3. **Combining Values**: The model then takes a weighted sum of the value vectors based on the computed weights. This results in an output that emphasizes the most relevant information.
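
The three steps above can be sketched in plain Python as scaled dot-product attention, following the formulation in Vaswani et al. (2017), where scores are divided by the square root of the key dimension before the softmax. The function names and the toy query, key, and value vectors are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # 1. Score calculation: compare the query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # 2. Weighting: normalize scores with softmax
        weights = softmax(scores)
        # 3. Combining values: weighted sum of the value vectors
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]  # points in the same direction as the first key

out = attention(queries, keys, values)
print([round(v, 3) for v in out[0]])
```

Because the query aligns with the first key, the output is dominated by the first value vector, with a small contribution from the second.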

### Types of Attention

- **Self-Attention**: This type of attention evaluates how each word relates to every other word in the same sequence, allowing the model to capture contextual relationships within the input.
- **Cross-Attention**: This type of attention relates two different sequences. The queries come from one sequence (for example, the partial translation in the decoder), while the keys and values come from another (the encoder's representation of the source sentence).

## Positional Encoding

Transformers do not inherently understand the order of tokens because self-attention treats its input as an unordered set and the model uses no recurrence. To solve this, positional encoding is added to the input embeddings, providing essential information about the position of each token in the sequence.

### How Positional Encoding Works

Positional encodings are unique vectors assigned to each token based on its position in the sequence. These encodings are combined with the token embeddings, allowing the model to recognize the order of words. 

The positional encodings use a pattern based on sine and cosine functions. This pattern ensures that tokens at different positions have distinct encodings while maintaining a relationship between them. For example, nearby tokens will have similar positional encodings, which helps the model understand their proximity in context.
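
The sine and cosine pattern can be written out in plain Python. This sketch follows the sinusoidal scheme from Vaswani et al. (2017): even dimensions use sine, odd dimensions use cosine, and the frequency decreases with the dimension index. The function names `positional_encoding` and `distance`, and the choice of `d_model = 8`, are for illustration only.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


pe0 = positional_encoding(0, d_model=8)
pe1 = positional_encoding(1, d_model=8)
pe50 = positional_encoding(50, d_model=8)

print("position 0:", [round(v, 3) for v in pe0])
print("distance 0 -> 1: ", round(distance(pe0, pe1), 3))
print("distance 0 -> 50:", round(distance(pe0, pe50), 3))
```

In this example the encoding of position 1 lies closer to position 0 than the encoding of position 50 does, which matches the proximity property described above.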

### Importance of Positional Encoding

- **Order Information**: By adding positional encodings, the model gains an understanding of the sequence's structure, allowing it to interpret the meaning more accurately.
- **Context Awareness**: Positional encodings help the model maintain context by recognizing how words relate to each other based on their positions.

Combining the attention mechanism with positional encodings enables transformer models to process sequential data effectively while maintaining a strong contextual understanding of the input.


# Positional Encoder

In [12]:
class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_length):
        super(PositionalEncoder, self).__init__()
        self.d_model = d_model
        self.max_length = max_length
        # Initialize the positional encoding matrix
        pe = torch.zeros(max_length, d_model)
        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        # Frequencies 1 / 10000^(2i / d_model); the exponent must be negative
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return x

# Multi-Head Attention

In [13]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # Set number of attention heads
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads
        # Set up the linear transformations
        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        # Split the sequence embeddings in x across the attention heads
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3).contiguous().view(batch_size * self.num_heads, -1, self.head_dim)

    def compute_attention(self, query, key, mask=None):
        # Compute scaled dot-product attention scores
        scores = torch.matmul(query, key.permute(0, 2, 1)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores.masked_fill(mask==0, float("-1e20"))
        # Normalize attention scores into attention weights
        attention_weights = F.softmax(scores, dim=-1)
        return attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)  # taken from the input (was hard-coded to 8 before)
        query = self.split_heads(self.query_linear(query), batch_size)
        key = self.split_heads(self.key_linear(key), batch_size)
        value = self.split_heads(self.value_linear(value), batch_size)
        attention_weights = self.compute_attention(query, key, mask)
        # Multiply attention weights by values, concatenate the heads, and apply the output projection
        output = torch.matmul(attention_weights, value)
        output = output.view(batch_size, self.num_heads, -1, self.head_dim).permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output_linear(output)
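
As a small worked example of what each head computes, the sketch below implements scaled dot-product attention for a single query vector in plain Python. The helper names `softmax` and `attention` and the toy vectors are assumptions made for illustration; the class above does the same arithmetic with batched tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query against a list of keys/values."""
    d_k = len(query)
    # Similarity of the query to every key, scaled by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
weights, output = attention([1.0, 0.0], keys, values)
print([round(w, 3) for w in weights], [round(o, 3) for o in output])
```

The first and third keys match the query equally well, so they receive equal weight, and both outweigh the second key.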

# Encoder Transformer

In [14]:
class FeedForwardSubLayer(nn.Module):
    # Specify the two linear layers' input and output sizes
    def __init__(self, d_model, d_ff):
        super(FeedForwardSubLayer, self).__init__()
        self.fc1 = nn.Linear(d_model,d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    # Apply a forward pass
    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

 

# Complete the initialization of elements in the encoder layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output))

 

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        # Define a stack of multiple encoder layers
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

    # Complete the forward pass method
    def forward(self, x, mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super(ClassifierHead, self).__init__()
        # Add linear layer for multiple-class classification
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        logits = self.fc(x[:, 0, :])
        # Obtain log class probabilities upon raw outputs
        return F.log_softmax(logits, dim=-1)


In [15]:
# Testing the encoder transformer

# Hyperparameters for the transformer model
num_classes = 3            # Number of output classes for classification
vocab_size = 10000         # Size of the vocabulary
batch_size = 8             # Number of sequences to process in a batch
d_model = 512              # Dimension of the model (embeddings and hidden states)
num_heads = 8              # Number of attention heads in multi-head attention
num_layers = 6             # Number of encoder layers in the transformer
d_ff = 2048                # Dimension of the feedforward network
sequence_length = 256      # Maximum length of input sequences
dropout = 0.1              # Dropout rate for regularization

# Generate random input sequences (batch_size x sequence_length)
input_sequence = torch.randint(0, vocab_size, (batch_size, sequence_length))

# Create a random attention mask (sequence_length x sequence_length)
# This mask is used to prevent attending to certain positions in the sequence
mask = torch.randint(0, 2, (sequence_length, sequence_length))

# Instantiate the encoder transformer's body and the classification head
encoder = TransformerEncoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)
classifier = ClassifierHead(d_model, num_classes)

# Perform a forward pass through the encoder
output = encoder(input_sequence, mask)

# Pass the encoder's output through the classifier head to get class predictions
classification = classifier(output)

# Print the classification outputs for the current batch of sequences
print("Classification outputs for a batch of", batch_size, "sequences:")
print(classification)


Classification outputs for a batch of 8 sequences:
tensor([[-1.2599, -1.7033, -0.6269],
        [-1.6379, -0.9813, -0.8421],
        [-1.4597, -0.5489, -1.6602],
        [-1.0894, -1.4754, -0.8327],
        [-1.4466, -0.9730, -0.9502],
        [-1.2586, -1.8287, -0.5882],
        [-0.9719, -0.9390, -1.4669],
        [-1.1809, -0.7247, -1.5678]], grad_fn=<LogSoftmaxBackward0>)


Each row in the output corresponds to one input sequence and holds the log-probabilities of the three classes. The predicted class for a sequence is the one with the highest log-probability.

Because the model here is randomly initialized and the inputs are random token IDs, these predictions are arbitrary. The output only confirms that data flows through the encoder and the classifier head with the expected shapes.
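
To read these numbers more easily, the log-probabilities can be exponentiated back into probabilities. The sketch below does this in plain Python for the first three rows printed above; the variable names are illustrative.

```python
import math

# First three rows of the log-probabilities printed above
log_probs = [
    [-1.2599, -1.7033, -0.6269],
    [-1.6379, -0.9813, -0.8421],
    [-1.4597, -0.5489, -1.6602],
]

for row in log_probs:
    probs = [math.exp(v) for v in row]   # log-probability -> probability
    predicted = row.index(max(row))      # same index as the argmax of probs
    print([round(p, 3) for p in probs], "-> class", predicted)

predictions = [row.index(max(row)) for row in log_probs]
```

Each row of probabilities sums to approximately 1, up to the rounding of the printed values.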

# Decoder Transformer

In [16]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        # Initialize the causal (masked) self-attention and cross-attention
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, causal_mask, encoder_output, cross_mask):
        # Pass the necessary arguments to the causal self-attention and cross-attention
        self_attn_output = self.self_attn(x, x, x, causal_mask)
        x = self.norm1(x + self.dropout(self_attn_output))
        cross_attn_output = self.cross_attn(x, encoder_output, encoder_output, cross_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))

        return x
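
The causal mask passed to the self-attention above must follow the convention used in `MultiHeadAttention.compute_attention`, where positions with `mask == 0` are blocked. The plain-Python sketch below illustrates that convention on a 4-token sequence; `causal_mask` and `masked_softmax` are hypothetical helper names used only for this illustration.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: row i has 1s at columns 0..i (allowed), 0s after."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; positions with mask 0 get weight 0."""
    masked = [s if m == 1 else -1e20 for s, m in zip(scores, mask_row)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# With equal scores, each token spreads its attention evenly over itself and the past
weights = [masked_softmax([0.0, 0.0, 0.0, 0.0], row) for row in mask]
for row in weights:
    print(row)
```

Every weight above the diagonal is zero, so no token attends to a later position.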

In [17]:
class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerDecoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        # NOTE: num_classes is not a constructor argument; it is read from the notebook's global scope
        self.classifier = ClassifierHead(d_model, num_classes)

    def forward(self, x, causal_mask, encoder_output, cross_mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, causal_mask, encoder_output, cross_mask)
        return self.classifier(x)  # Log-probabilities over classes, computed from the first position

# Encoder-Decoder Transformer

In [18]:
# Testing the Decoder Transformer

# Hyperparameters for the transformer model
num_classes = 3          # Number of output classes
vocab_size = 10000       # Size of the vocabulary
batch_size = 8           # Number of sequences in a batch
d_model = 512            # Dimensionality of the model's output space
num_heads = 8            # Number of attention heads in the multi-head attention mechanism
num_layers = 6           # Number of layers in the encoder and decoder
d_ff = 2048              # Dimensionality of the feedforward layer
sequence_length = 256    # Length of input sequences
dropout = 0.1            # Dropout rate for regularization

# Create a batch of random input sequences (tokenized)
input_sequence = torch.randint(0, vocab_size, (batch_size, sequence_length))

# Create a random 0/1 mask as a stand-in for a padding mask (positions with 0 are ignored)
padding_mask = torch.randint(0, 2, (sequence_length, sequence_length))

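# Create a causal mask to prevent the decoder from attending to future tokens.
# MultiHeadAttention blocks positions where mask == 0, so the allowed positions
# (the current and earlier tokens) must be 1: use the lower triangle.
causal_mask = torch.tril(torch.ones(sequence_length, sequence_length))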

# Instantiate the transformer encoder
encoder = TransformerEncoder(
    vocab_size, 
    d_model, 
    num_layers, 
    num_heads, 
    d_ff, 
    dropout, 
    max_sequence_length=sequence_length
)

# Instantiate the transformer decoder
decoder = TransformerDecoder(
    vocab_size, 
    d_model, 
    num_layers, 
    num_heads, 
    d_ff, 
    dropout, 
    max_sequence_length=sequence_length
)

# Pass the input sequences and masks through the encoder
encoder_output = encoder(input_sequence, padding_mask)

# Pass the encoder's output and input sequences through the decoder
decoder_output = decoder(input_sequence, causal_mask, encoder_output, padding_mask)

# Print input sequences for verification
print("Input sequences:\n", input_sequence)

# Print encoder output shape
print("Encoder output shape:", encoder_output.shape)

# Print encoder output for inspection
print("Encoder output:\n", encoder_output)

# Print the shape of the decoder's output to verify the dimensions
print("Decoder output shape:", decoder_output.shape)

# Print decoder output for inspection
print("Decoder output:\n", decoder_output)


Input sequences:
 tensor([[9182, 5134, 2795,  ..., 2877, 1863, 7910],
        [2538, 6845, 5967,  ..., 4464,  938, 5013],
        [6363, 1339,  385,  ..., 7246, 4582, 3041],
        ...,
        [5635, 4395, 4278,  ..., 1486, 5838, 7712],
        [5388, 9552,  606,  ..., 2421,  767, 4178],
        [ 187, 1044, 6659,  ..., 1696, 8874, 7728]])
Encoder output shape: torch.Size([8, 256, 512])
Encoder output:
 tensor([[[-7.2958e-01, -9.3359e-01,  1.0381e+00,  ...,  4.7121e-01,
           4.9682e-01,  3.1380e-01],
         [-8.5936e-02,  1.2037e+00, -4.9680e-01,  ..., -4.9237e-01,
          -3.5736e-01, -3.7516e-01],
         [ 1.2578e+00,  2.3459e-01,  1.5557e+00,  ..., -2.5342e-01,
          -1.4801e+00, -1.0275e+00],
         ...,
         [ 1.4393e+00,  8.5162e-01, -2.0367e+00,  ..., -5.0844e-01,
          -5.3889e-01,  5.9611e-01],
         [-5.1589e-01, -4.2206e-01, -4.5397e-01,  ..., -9.6882e-01,
          -1.6140e+00,  1.4337e+00],
         [-3.7493e-01, -3.8852e-01,  1.5543e+00,  ..

# More Complex Problem Solving Using Pre-Trained Model

## Classifying two movie opinions

We have seen how to pass one example sequence to a pre-trained text classification LLM for inference. In this exercise, you will pass two example sequences at once, expressing opposing opinions of a movie.
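
Passing two sequences in one batch requires them to have the same length, which is what padding provides. The sketch below shows the idea in plain Python, roughly mirroring what a tokenizer does when padding is enabled. The helper `pad_batch`, the pad ID of 0, and the token IDs are hypothetical values chosen for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token ID lists to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Hypothetical token IDs for two sentences of different lengths
batch = [[101, 7592, 102], [101, 2054, 2019, 9643, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The attention mask lets the model distinguish real tokens from padding, so the shorter sentence is not affected by the filler positions.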

In [19]:
model_name = "textattack/distilbert-base-uncased-SST-2"

# Load the tokenizer and pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

In [20]:
text = ["The best movie I've ever watched!", "What an awful movie. I regret watching it."]

# Tokenize the input text and convert them to tensor format for the model
# The `return_tensors="pt"` argument specifies that the output should be in PyTorch tensor format
# `padding=True` ensures that sequences are padded to the same length for batch processing
inputs = tokenizer(text, return_tensors="pt", padding=True)

# Pass the tokenized inputs to the model for inference
# The model returns output logits, which are the raw predictions before applying softmax
outputs = model(**inputs)

# Extract the logits from the model's output
logits = outputs.logits

# Use argmax to determine the predicted class for each input
# The `dim=1` argument specifies that we want to find the index of the maximum logit along the class dimension
predicted_classes = torch.argmax(logits, dim=1).tolist()

# Loop through each input text and its predicted class to display the results
for idx, predicted_class in enumerate(predicted_classes):
    print(f"Predicted class for \"{text[idx]}\": {predicted_class}")  # Print the predicted class for each input


Predicted class for "The best movie I've ever watched!": 1
Predicted class for "What an awful movie. I regret watching it.": 0
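
The class indices become easier to read once they are mapped to labels and confidences. The sketch below assumes the usual SST-2 convention (0 = negative, 1 = positive) and uses made-up logits, not the model's real outputs, to show how a softmax turns logits into probabilities.

```python
import math

# Assumed SST-2 label convention: 0 = negative, 1 = positive
id2label = {0: "negative", 1: "positive"}

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two inputs (illustrative values only)
example_logits = [[-2.0, 2.5], [3.0, -1.5]]

labels = []
for logits in example_logits:
    probs = softmax(logits)
    class_id = probs.index(max(probs))
    labels.append(id2label[class_id])
    print(id2label[class_id], round(probs[class_id], 3))
```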


## Text-Generation

In this exercise, we’ll explore how to generate text using the pre-trained GPT-2 model. We will start by loading a dataset from Stanford NLP to extract a sample prompt. After that, we’ll use the GPT-2 tokenizer to prepare this prompt for the model. Once our input is ready, we'll generate some text based on the prompt and decode the output back into a readable format. Let’s dive in and see how it works!


In [21]:
# Load dataset
dataset = load_dataset("stanfordnlp/shp", "default")
train_data = dataset["train"]

# Prepare prompt from dataset
prompt = train_data[0]["history"][:60]

# Load the tokenizer and pre-trained model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 has no pad token by default; reuse the EOS token as the pad token to avoid a padding error
tokenizer.pad_token = tokenizer.eos_token

# Tokenize input and get input_ids and attention_mask
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding=True,
    truncation=True
)

# Generate text, using pad_token_id and attention_mask from tokenizer
output = model.generate(
    inputs["input_ids"],
    max_length=50,
    pad_token_id=tokenizer.eos_token_id,
    attention_mask=inputs["attention_mask"]
)

# Decode generated output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)


In an interview right before receiving the 2013 Nobel prize  for his work on the "The Great Gatsby" , he said: "I think that the best way to understand the world is to look at the world in a different way.


## Text Summarizer
In this exercise, let's summarize a post from the dataset using `facebook/bart-large-cnn`, a BART model fine-tuned for summarization on CNN/Daily Mail news articles.

In [22]:
# Load a BART model for summarization
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")


**Attention:** the text feature is sometimes named "text" and sometimes "history", and it is not clear why. If you get an error, change the feature name in the code or restart your kernel. (Results are much better when the feature is "text" than when it is "history".)

In [23]:
# List the column names to find the feature that contains the text we want to summarize
# In this case we will be summarizing the "history" feature of an example
print(f"Feature names: {dataset['train'].column_names}")

Feature names: ['post_id', 'domain', 'upvote_ratio', 'history', 'c_root_id_A', 'c_root_id_B', 'created_at_utc_A', 'created_at_utc_B', 'score_A', 'score_B', 'human_ref_A', 'human_ref_B', 'labels', 'seconds_difference', 'score_ratio']


In [24]:
# Define a minimum length for the review (e.g., 200 characters)
min_length = 200

# Filter for longer reviews
long_reviews = dataset['train'].filter(lambda example: len(example['history']) >= min_length)

# Access an example
example = long_reviews[-5]['history']

# Generate a summary using the BART model loaded above
# Note: the "summarize: " prefix is a T5-style convention; BART does not require it
input_ids = tokenizer.encode("summarize: " + example, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(input_ids, max_length=200, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\nOriginal Text (first 1000 characters): \n", example[:1000])
print("\nGenerated Summary: \n", summary)



Original Text (first 1000 characters): 
 Can I get in trouble for giving my neighbor his leaves back? I live in Louisiana and every year around this time I have to deal with this crap. My neighborhood is small and quiet and we generally love it, except that my neighbor is an asshole. My yard has no big trees in it and the small Crepe Myrtles I have make very little mess. My neighbor, on the other hand has a huge Live Oak and some other tree that has enormous fricken leaves.  First off, he doesn’t take good care of his own yard so the leaves just pile up and eventually come into my yard, which is something I have always dealt with without making a big deal about it until a month or so ago. I came home a little early to eat lunch and he happened to be blowing off his driveway, straight into my yard. I didn’t say anything but I waited until he went inside and got my backpack blower(I own lawn service) and gave him his leaves back.   Since then we have had a sort of silent war of leaves a

Let's evaluate our model! For text summarization we will use ROUGE as a metric. ROUGE measures the overlap between the generated summaries and reference summaries by calculating the number of matching n-grams (e.g., unigrams, bigrams) and the longest common subsequences. Higher ROUGE scores indicate better similarity and relevance of the generated text to the reference content.
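
To show what the n-gram overlap means in practice, here is a simplified plain-Python sketch of ROUGE-N. It splits on whitespace and skips the tokenization and stemming that a full ROUGE implementation applies, so its values will not exactly match those of a library. The function names and example sentences are illustrative assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, F1) of n-gram overlap between two texts."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped count of shared n-grams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
r1 = rouge_n(prediction, reference, n=1)
r2 = rouge_n(prediction, reference, n=2)
print("ROUGE-1:", [round(v, 3) for v in r1])
print("ROUGE-2:", [round(v, 3) for v in r2])
```

Five of the six unigrams match but only three of the five bigrams do, which is why ROUGE-2 is usually lower than ROUGE-1 for the same pair of texts.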

In [25]:
# Load the ROUGE metric
rouge = evaluate.load("rouge")

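# No human-written reference summary is available here, so the original text serves as the reference.
# The scores therefore measure how much of the summary overlaps with the source text.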
results = rouge.compute(predictions=[summary], references=[example])
print(results)

{'rouge1': np.float64(0.3173277661795407), 'rouge2': np.float64(0.30607966457023067), 'rougeL': np.float64(0.3173277661795407), 'rougeLsum': np.float64(0.3173277661795407)}


#### ROUGE Score Ranges

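ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores range from 0 to 1. The ranges below are informal rules of thumb rather than standardized thresholds, and typical values depend heavily on the dataset and the references used: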

##### ROUGE-1
- **0.0 - 0.2**: Poor performance; little to no overlap with the reference.
- **0.2 - 0.4**: Fair performance; some overlap, but significant gaps.
- **0.4 - 0.6**: Good performance; reasonable overlap indicating solid content retention.
- **0.6 - 0.8**: Very good performance; high overlap suggesting effective summarization.
- **0.8 - 1.0**: Excellent performance; strong overlap with the reference, indicating nearly identical word usage.

##### ROUGE-2
- **0.0 - 0.1**: Poor performance; minimal bigram overlap.
- **0.1 - 0.3**: Fair performance; some bigram matching, but lacks depth.
- **0.3 - 0.5**: Good performance; adequate bigram retention showing content quality.
- **0.5 - 0.7**: Very good performance; substantial bigram overlap reflecting coherent summarization.
- **0.7 - 1.0**: Excellent performance; high bigram match indicating detailed and nuanced summaries.

##### ROUGE-L
- **0.0 - 0.2**: Poor performance; weak sentence structure preservation.
- **0.2 - 0.4**: Fair performance; some preservation of important sequences, but inconsistent.
- **0.4 - 0.6**: Good performance; maintains significant structure and coherence in summarization.
- **0.6 - 0.8**: Very good performance; effectively preserves order and important sequences.
- **0.8 - 1.0**: Excellent performance; exceptional retention of the original structure and meaning.

##### ROUGE-Lsum
- **0.0 - 0.2**: Poor performance; minimal content retention across the summary.
- **0.2 - 0.4**: Fair performance; some key content retained, but lacks consistency.
- **0.4 - 0.6**: Good performance; maintains a solid grasp of the main ideas in the summary.
- **0.6 - 0.8**: Very good performance; strong retention of content and coherent summarization.
- **0.8 - 1.0**: Excellent performance; exceptional summarization that captures the essence of the content.


## Text Translation
Some experimentation with text translation.

In [26]:
model_name = "Helsinki-NLP/opus-mt-en-es"

# Load the tokenizer and the model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

english_inputs = ["Hello", "Thank you", "How are you?", "Sorry", "Goodbye"]

# Encode the inputs, generate translations, decode, and print them

for english_input in english_inputs:
    input_ids = tokenizer.encode(english_input, return_tensors="pt")
    translated_ids = model.generate(input_ids)
    translated_text = tokenizer.decode(translated_ids[0], skip_special_tokens=True)

    print(f"English: {english_input} | Spanish: {translated_text}")

English: Hello | Spanish: Hola.
English: Thank you | Spanish: Gracias.
English: How are you? | Spanish: ¿Cómo estás?
English: Sorry | Spanish: Lo siento.
English: Goodbye | Spanish: Adiós.


In [27]:
# Model setup
model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# List of sentences to translate
english_inputs = [
    "Hello, how are you today?",
    "Thank you for your help!",
    "This is an advanced translation model example.",
    "Sorry for the inconvenience, please try again.",
    "Goodbye! Have a great day ahead."
]

# Translation function with advanced features
def translate_batch(texts, max_length=50, num_beams=5):
    # Encode batch of sentences and translate
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    translated_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True
    )
    # Decode translations
    translations = [tokenizer.decode(ids, skip_special_tokens=True) for ids in translated_ids]
    return translations

# Translate and display
translated_texts = translate_batch(english_inputs)
for eng, esp in zip(english_inputs, translated_texts):
    print(f"English: {eng} | Spanish: {esp}")


English: Hello, how are you today? | Spanish: Hola, ¿cómo estás hoy?
English: Thank you for your help! | Spanish: ¡Gracias por su ayuda!
English: This is an advanced translation model example. | Spanish: Este es un ejemplo avanzado de modelo de traducción.
English: Sorry for the inconvenience, please try again. | Spanish: Siento las molestias, por favor, inténtalo de nuevo.
English: Goodbye! Have a great day ahead. | Spanish: Que tengas un gran día por delante.
