# Transformers
## IMD1107 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Keypoints
- Transformers have revolutionized NLP due to their ability to handle long-range dependencies and parallelize computations, overcoming limitations of RNNs and LSTMs.

- The attention mechanism is central to transformers, allowing models to focus on the most relevant parts of the input sequence when making predictions.

- Transformers consist of an encoder and decoder, each with multiple layers including self-attention, feed-forward networks, and positional encoding.

- The quadratic complexity of transformers poses challenges for processing longer texts, impacting training time and memory consumption.

- Transformers have been successfully applied beyond NLP in domains like computer vision, music generation, speech recognition, and video processing.

- Common transformer architectures for NLP include BERT, GPT, RoBERTa, T5, and XLNet, each with unique strengths.

- Transformers can be used as feature extractors, using their ability to capture rich syntactical and contextual information from text data.

- Key steps for using transformers involve starting with a pretrained model, optional domain-specific fine-tuning, task-specific training, and potentially using the model as a feature extractor.

- The 512-token limit of many transformer models (such as BERT), motivated by the quadratic cost of attention, can hurt accuracy when important information is lost while truncating longer texts.

- A simple workaround to handle the 512-token limit is to focus on the most relevant information, such as using the last 512 tokens in legal documents where the decision is often at the end.


## Learning Goals

By the end of this class, you will be able to:

1.  Explain the key advantages of Transformer models over Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) in handling Natural Language Processing tasks, particularly in terms of long-range dependencies and parallelization.

2.  Describe the fundamental principles of the attention mechanism, specifically self-attention, and articulate its role in enabling Transformers to focus on relevant parts of input sequences for effective context understanding and information processing.

3.  Outline the architecture of a Transformer model, differentiating between the encoder and decoder components, and identify the function of key layers such as self-attention layers, feed-forward networks, and positional encoding within the overall architecture.

4.  Apply pre-trained Transformer models from the Hugging Face Transformers library to perform practical Natural Language Processing tasks, such as masked language modeling and text classification, through fine-tuning techniques on domain-specific datasets.

5.  Discuss the computational implications of the quadratic complexity inherent in Transformer models, especially when processing long text sequences, and recognize common strategies, like truncation and focusing on relevant text segments, used to mitigate these challenges in real-world applications.


## Transformers in Natural Language Processing

Transformers have fundamentally transformed NLP since their introduction in 2017. They excel at handling long-range dependencies and support tasks ranging from question answering and text summarization to machine translation. Unlike sequential models (RNNs, LSTMs) that struggle with vanishing gradients and limited parallelism, transformers process all tokens simultaneously. This simultaneous processing enables them to capture both local and global context with a single, powerful mechanism: **self-attention**.

### Transition from Sequential Models

Before the advent of transformers, recurrent networks such as RNNs and LSTMs dominated the landscape. These models processed data one time step at a time, which imposed inherent limitations:
- **Limited Long-Range Dependency Capture:** Distant words or tokens were less likely to influence the current prediction.
- **Sequential Computation:** The inability to perform computations in parallel hindered efficiency.

Transformers overcame these challenges by processing the entire sequence in parallel, allowing the model to capture relationships between tokens regardless of their distance.

### Attention Mechanism

At the core of the Transformer is **attention**. Intuitively, attention lets the model decide which parts of the input are most relevant when updating a token's representation. Consider translating the English word "judge" into Portuguese: English does not mark gender on this word, so the model must attend to other words in the sentence that carry gender cues in order to choose between "juiz" and "juíza".

The standard attention computation is given by:

$$
\text{Attention}(Q, K, V) = \text{softmax}\Bigg(\frac{QK^T}{\sqrt{d_k}}\Bigg)V
$$

Where:
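- **Q (Query):** A learned projection of the current token that expresses what it is looking for.
- **K (Key):** A learned projection of each token that is compared with the query to determine its “importance”.
- **V (Value):** A learned projection of each token that carries the content passed along, weighted by the attention scores.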
- $ d_k $ is the dimensionality of the key vectors.

The process for computing attention is as follows:
1. **Transformation:** Each token is projected into three vectors (query, key, and value) via learned linear transformations.
2. **Score Calculation:** Compute the similarity between the query and all keys.
3. **Weighted Sum:** Normalize these similarities via softmax and then compute a weighted sum of the value vectors.

> **Note:** This mechanism enables parallel processing and dynamically focuses the model’s attention on the most influential parts of the input.
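
The score calculation and weighted sum described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: it skips the learned projections of step 1 and takes `Q`, `K`, and `V` as given, and the function names and toy values are assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    output, weights = [], []
    for q in Q:
        # Score calculation: similarity of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Weighted sum: softmax weights applied to the value vectors.
        w = softmax(scores)
        weights.append(w)
        output.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return output, weights

# Toy inputs: 3 tokens, d_k = 2 (values chosen only for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, attn = attention(Q, K, V)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of `attn` sums to 1, so every output vector is a weighted average of the value vectors.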


## Architecture of a Transformer

A Transformer is divided into two main blocks: the **encoder** and the **decoder**. Each block is formed by stacking identical layers.


An illustration from the original transformer paper is shown below:

<p align="center">
  <img src="images/transformers_basic.png" alt="Basic Transformer Architecture" style="width: 40%; height: 40%"/>
</p>


### Encoder

- **Input Processing:** Converts raw tokens into embeddings.
- **Self-Attention:** Each token interacts with every other token to learn contextual relationships.
- **Feed-Forward Processing:** Applies position-wise feed-forward networks to transform the representations.

**Usage:**  
Models that require a deep understanding of the input without needing to generate new sequences primarily use **only an encoder**. An example is **BERT (Bidirectional Encoder Representations from Transformers)**, which is designed for understanding language tasks such as classification, question answering, and named entity recognition. Here, the focus is on creating high-quality contextual representations of the input text.

### Decoder

- **Context Incorporation:** Leverages encoder outputs through cross-attention (in the full Transformer) or operates independently for auto-regressive tasks.
- **Sequential Generation:** Generates one token at a time while attending to previously generated tokens.

**Usage:**  
Models tasked with generating text, such as language models, frequently use **only a decoder**. For example, **GPT (Generative Pre-trained Transformer)** models use a stack of decoder layers optimized for predicting the next token in a sequence. These decoders employ causal (masked) self-attention to ensure that the prediction for a token depends only on preceding tokens.
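
To make causal masking concrete, the sketch below sets the scores of future positions to negative infinity before the softmax, so their attention weights become exactly zero. The function name and the all-zero toy scores are assumptions for illustration; real implementations apply the same idea to batched tensors.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to an n x n score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only look at positions 0..i; later ones get -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With identical scores, each token spreads its attention evenly over
# itself and the tokens before it.
scores = [[0.0] * 4 for _ in range(4)]
w = causal_attention_weights(scores)
for row in w:
    print([round(x, 2) for x in row])
```

The printed matrix is lower-triangular: the first token attends only to itself, and no row places weight on a later position.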

### Why Some Models Use Only One Component

- **Encoder-Only Models:**  
  - **Task Focus:** These models are designed for understanding and interpreting input data.
  - **Application Areas:** They work well for tasks like sentiment analysis, classification, or extracting semantic features from text.
  - **Architecture Simplification:** Removing the decoder simplifies the architecture by focusing solely on creating rich, bidirectional contextual embeddings.

- **Decoder-Only Models:**  
  - **Task Focus:** These models are geared towards generating text.
  - **Sequential Generation:** They generate output one token at a time and require masking in the self-attention layers to prevent the model from “seeing” future tokens during training.
  - **Application Areas:** Ideal for tasks such as story completion, code generation, or any scenario where text generation is the end goal.

- **Encoder-Decoder Models (Full Transformers):**  
  - **Task Focus:** These models excel in tasks that require a transformation from one sequence to another.
  - **Application Areas:** They are commonly used in machine translation, summarization, and other sequence-to-sequence problems where an input sequence is transformed into a different output sequence.
  - **Architecture Benefit:** The encoder processes and understands the input context, while the decoder generates the corresponding output using both the encoder's context and the sequential output it builds.

So, the choice between using an encoder, a decoder, or both hinges on the nature of the task at hand: understanding versus generating text. Tasks focused on language understanding benefit from encoder-only architectures, while tasks focused on language generation are better served by decoder-only architectures. When both understanding and generation are required (as in translation), a full encoder-decoder architecture is used.

## Key Components of Transformer Layers

Transformers rely on a series of key components:

### 1. Self-Attention Layer
- **Parallel Processing:** Every token in the sequence attends to every other token at the same time.
- **Contextual Awareness:** Self-attention updates each token’s representation based on the full context.

**Steps:**
1. Linearly transform tokens into Query, Key, Value vectors.
2. Compute similarity scores via dot products.
3. Generate a weighted output via a softmax-normalized weighted sum over values.


### 2. Position-Wise Feed-Forward Neural Networks (FFNN)
- Applies a two-layer MLP on each token’s representation with a non-linear activation (typically ReLU).
- Uses shared parameters across tokens, further refining each token’s feature set.
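
A minimal sketch of the position-wise feed-forward network follows. The weights are small hand-picked numbers rather than learned parameters, and the names `linear` and `ffn` are assumptions for this example. The point it illustrates is that the same two-layer MLP is applied to every token independently.

```python
def linear(x, W, b):
    """Compute x W + b for one vector x; W has shape (len(x), len(b))."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """Two-layer MLP with a ReLU in between, applied to a single token."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

# Toy sizes: d_model = 2, inner dimension = 4 (illustrative values only).
W1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [1.0, 2.0], [-1.0, 0.0]]
# The same parameters are reused for every position in the sequence.
outputs = [ffn(t, W1, b1, W2, b2) for t in tokens]
print(outputs)
```

Because the parameters are shared and no token looks at another one here, identical input tokens always produce identical outputs; mixing information across positions is left to self-attention.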


### 3. Positional Encoding
Since self-attention does not consider order, positional encodings provide the token order by incorporating sine and cosine functions of varying frequencies:

$$
\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right), \quad
\text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$

> **Important:** Positional encoding is crucial for downstream tasks that depend on the sequential order of tokens, such as language modeling.
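
The sinusoidal formula above translates almost directly into code. The sketch below is one straightforward way to build the table; the function name and the small sizes are assumptions for illustration. Note that the loop variable `i` already steps over the even indices, so it plays the role of $2i$ in the formula.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index (2i in the formula).
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=4, d_model=8)
for row in pe:
    print([round(v, 3) for v in row])
```

Position 0 yields alternating 0s and 1s, and lower dimensions oscillate faster than higher ones, which gives every position a distinct pattern.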

## Understanding Quadratic Complexity in Transformers

Transformers have revolutionized various fields of AI, particularly natural language processing. However, it's crucial to understand their computational characteristics, especially when dealing with long sequences. A key aspect is the **quadratic time and space complexity**, denoted as $ O(n^2) $, where $ n $ is the length of the input sequence. This complexity stems from the **self-attention mechanism**, a fundamental component of Transformer architecture.

At the heart of the Transformer model is the self-attention layer. In this layer, every token in the input sequence attends to every other token to compute context-aware representations. For each token, the attention mechanism calculates a score reflecting its relationship with all other tokens in the sequence, which amounts to pairwise comparisons between all tokens.

Let's consider an input sequence of length $ n $. The first token needs to attend to $ n $ tokens (including itself). The second token also needs to attend to $ n $ tokens, and so on, for all $ n $ tokens in the sequence. Therefore, the total number of attention computations scales proportionally to $ n \times n = n^2 $. This quadratic relationship is why Transformers are said to have $ O(n^2) $ complexity.

This quadratic complexity impacts both computational time and memory usage.  Not only does the number of operations increase quadratically with sequence length, but the intermediate attention weights, which represent the relationships between all pairs of tokens, also need to be stored. For longer input sequences, this leads to significant memory demands.

<p align="center">
<img src="images/transformer_quadratic.webp" alt="" style="width: 70%; height: 70%"/>
</p>

As illustrated, for a sequence of length 9, we observe $ 9^2 = 81 $ attention computations.  Critically, if we double the sequence length to 18, the computations quadruple to $ 18^2 = 324 $. This rapid growth presents challenges when processing long documents or sequences.
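
The arithmetic can be checked with a short script. The memory estimate is only a rough sketch under stated assumptions: 12 attention heads and 4-byte (float32) weights per layer are hypothetical defaults chosen for illustration, not values taken from a specific model.

```python
def attention_pairs(n):
    """Number of query-key pairs for a sequence of length n."""
    return n * n

def attention_matrix_bytes(n, num_heads=12, bytes_per_value=4):
    """Memory for one layer's attention weights: one n x n matrix per head."""
    return num_heads * n * n * bytes_per_value

for n in (9, 18, 512, 4096):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>5}  pairs={attention_pairs(n):>10,}  weights approx {mib:,.1f} MiB per layer")
```

Under these assumptions, going from 512 to 4,096 tokens (8 times longer) raises the per-layer attention-weight memory from 12 MiB to 768 MiB (64 times larger).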


### Ramifications of Quadratic Scaling

- **Memory Footprint:** The self-attention mechanism necessitates storing or computing attention scores for every pair of tokens. For a sequence length $ n $, storing the attention matrix alone requires $ O(n^2) $ memory per attention head, while computing it for hidden dimension $ d $ costs $ O(n^2 \times d) $ operations. As sequence length increases, the memory demands become a bottleneck, especially when training or deploying models with limited GPU memory.

- **Inference Time:** The quadratic increase in operations directly translates to longer processing times.  For tasks involving long documents, such as document summarization, question answering over extensive texts, or code analysis, the inference time can become prohibitively slow.  This limits the applicability of standard Transformers in scenarios requiring efficient processing of lengthy sequences.


### Strategies to Alleviate Quadratic Overhead

To address the limitations imposed by quadratic complexity, various techniques have been developed to enhance Transformer architectures. Many of these methods are incorporated in models like ModernBERT, aiming to improve computational and memory efficiency and extend the manageable sequence length.

#### 1. Alternating Global and Local Attention Patterns

- **Concept:** The key idea is to strategically reduce the scope of attention computation in certain layers.  Instead of applying computationally expensive global self-attention in every layer, we can alternate between layers with global attention and layers with more efficient local attention.

- **Implementation:** A common approach involves using global attention layers sparingly, for example, every third layer. In the intervening layers, we employ local attention mechanisms, such as **sliding-window attention**. In sliding-window attention, each token only attends to a fixed-size window of tokens around it.  For instance, a token might only attend to the $ w $ tokens immediately preceding and succeeding it, where $ w $ is the window size (e.g., 128 tokens).

- **Benefits:**
    - **Reduced Computational Cost:** Local attention significantly reduces the computational burden. For a window size $ w $, local attention reduces the complexity from $ O(n^2) $ in global attention to approximately $ O(n \times w) $. This is because for each of the $ n $ tokens, we only perform attention computations within a window of size $ w $. If $ w \ll n $, this represents a substantial saving.
    - **Preservation of Long-Range Context:** By incorporating occasional global attention layers, the model retains the ability to capture long-range dependencies within the sequence.  These global layers act as a bridge, allowing information to propagate across the entire sequence, even with local attention in other layers.
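
A sliding-window pattern can be expressed as a boolean mask over token pairs. The sketch below follows the description above, where each token sees the $ w $ tokens on either side of it; the function names and the tiny sizes are assumptions for illustration only.

```python
def sliding_window_mask(n, w):
    """mask[i][j] is True when token i may attend to token j (|i - j| <= w)."""
    return [[abs(i - j) <= w for j in range(n)] for i in range(n)]

def count_allowed(mask):
    """Number of token pairs for which attention is actually computed."""
    return sum(sum(row) for row in mask)

n, w = 8, 1
local_mask = sliding_window_mask(n, w)
for row in local_mask:
    print("".join("x" if allowed else "." for allowed in row))

local_pairs = count_allowed(local_mask)
global_pairs = n * n
print(f"local: {local_pairs} pairs, global: {global_pairs} pairs")
```

For `n = 8` and `w = 1` the mask allows 22 pairs instead of 64. The local count is bounded by $ n \times (2w + 1) $, so it grows linearly with $ n $ for a fixed window.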

#### 2. Unpadding Methods

- **Concept:** Padding is a common practice in batch processing of sequences. To process sequences of varying lengths in batches, shorter sequences are padded with special "padding" tokens to match the length of the longest sequence in the batch. However, these padding tokens are semantically meaningless and contribute unnecessary computations in the attention mechanism. Unpadding techniques aim to eliminate these redundant computations.

- **Implementation:**  Instead of processing the padded sequences directly, unpadding methods first identify and remove the padding tokens from each sequence within a batch. The remaining meaningful tokens are then concatenated into a single, continuous sequence. This concatenated sequence is processed using an efficient attention mechanism that is aware of the original sequence boundaries.  Memory-efficient attention kernels are often employed to handle these unpadded sequences effectively.

- **Benefits:**
    - **Enhanced Efficiency:** By removing computations associated with padding tokens, unpadding reduces computational overhead during both training and inference.  The model focuses its resources on processing actual content.
    - **Increased Throughput:** Processing only meaningful tokens leads to more efficient resource utilization.  This results in higher throughput, meaning more tokens can be processed per unit of time.
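
The bookkeeping behind unpadding can be sketched as follows. This only shows how padding tokens are dropped and how sequence boundaries are remembered; the padding id `0`, the function name `unpad`, and the variable name `cu_seqlens` (cumulative sequence lengths) are assumptions chosen for this example, and the boundary-aware attention kernel itself is not shown.

```python
PAD = 0  # hypothetical padding token id

def unpad(batch, pad_id=PAD):
    """Concatenate the non-padding tokens and record where each sequence ends."""
    flat, cu_seqlens = [], [0]
    for seq in batch:
        flat.extend(t for t in seq if t != pad_id)
        cu_seqlens.append(len(flat))
    return flat, cu_seqlens

# A padded batch of three sequences with lengths 3, 2 and 6.
batch = [
    [5, 6, 7, 0, 0, 0],
    [8, 9, 0, 0, 0, 0],
    [1, 2, 3, 4, 5, 6],
]

flat, cu_seqlens = unpad(batch)
padded_tokens = sum(len(seq) for seq in batch)
print(flat, cu_seqlens)
print(f"{len(flat)} real tokens instead of {padded_tokens} padded positions")

# The boundaries are enough to recover the original sequences.
recovered = [flat[a:b] for a, b in zip(cu_seqlens, cu_seqlens[1:])]
```

In this toy batch, 7 of the 18 padded positions carry no content, so more than a third of the token-level work would have been spent on padding.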

#### 3. Flash Attention Kernels

- **Concept:** Flash Attention is not an architectural change but rather a set of optimized implementations of the attention mechanism. These optimized kernels are designed to maximize memory and compute efficiency, especially when dealing with long sequences, by reordering computations.

- **Implementation:** Flash Attention re-arranges the standard attention computation to better align with the characteristics of modern GPU hardware. Traditional attention implementations often involve multiple memory reads and writes, which can be a bottleneck, especially with long sequences. Flash Attention uses techniques like tiling and kernel fusion to reduce memory bandwidth requirements. For instance, ModernBERT uses Flash Attention 3 for its global attention layers and Flash Attention 2 for its local attention layers, matching the kernel to the attention pattern.

- **Benefits:**
    - **Reduced Memory Bandwidth Demands:** By minimizing data movement between GPU memory and compute units, Flash Attention significantly lowers memory bandwidth usage. This is crucial for processing long sequences, where memory bandwidth can become a limiting factor.
    - **Accelerated Computation:** Optimized kernels in Flash Attention lead to faster attention computations without compromising accuracy.  This results in overall speed improvements in model training and inference, particularly for long sequences.
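
The tiling idea can be illustrated without any GPU code. The sketch below is not the Flash Attention kernel; it is a plain-Python illustration, with assumed function names, of the underlying trick: keys and values are visited block by block while a running maximum, a running normaliser, and a running weighted sum are maintained, so the full row of scores never has to be stored. The result matches ordinary softmax attention.

```python
import math
import random

def attention_row(q, K, V):
    """Reference: full scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, V)) / total for c in range(len(V[0]))]

def attention_row_tiled(q, K, V, block_size):
    """Same result, computed one block of keys/values at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            e = math.exp(s - new_max)
            running_sum += e
            acc = [a + e * vc for a, vc in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

rng = random.Random(0)
n, d = 10, 4
q = [rng.uniform(-1, 1) for _ in range(d)]
K = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

full = attention_row(q, K, V)
tiled = attention_row_tiled(q, K, V, block_size=3)
print(max(abs(a - b) for a, b in zip(full, tiled)))
```

The two results agree up to floating-point rounding, which is why this kind of reordering changes speed and memory traffic but not the model's output.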

#### 4. Positional Embeddings: Rotary Positional Embeddings (RoPE)

- **Concept:** Positional embeddings are crucial for Transformers to understand the order of tokens in a sequence, as self-attention on its own has no notion of token order (permuting the input simply permutes the output). Rotary Positional Embeddings (RoPE) offer an alternative to traditional absolute positional embeddings. Unlike absolute positional embeddings, which add a fixed vector to the token embeddings based on their position, RoPE directly incorporates positional information into the attention computation itself by modifying the query and key vectors based on their positions through rotation matrices.

- **Implementation:** RoPE encodes positional information by applying rotations in the embedding space as a function of token position.  Specifically, when calculating attention scores between tokens at positions $ i $ and $ j $, RoPE rotates the query vector of the $ i $-th token and the key vector of the $ j $-th token using rotation matrices dependent on their respective positions.  A parameter called “RoPE theta” controls the frequency of rotation and allows for context length extension. By adjusting “RoPE theta,” the model can maintain performance even when extrapolating to context lengths beyond what it was originally trained on (e.g., extending from 1024 tokens to 8192 tokens).

- **Benefits:**
    - **Scalability to Longer Sequences:** RoPE's approach to positional encoding is more amenable to handling longer sequences compared to absolute positional embeddings. It can be extended to longer context lengths without a proportional increase in complexity or performance degradation.
    - **Flexibility and Combination:** RoPE can be seamlessly integrated with other efficiency techniques like unpadding and alternating attention mechanisms, providing a versatile solution for long-context Transformers.
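
The rotation at the heart of RoPE fits in a few lines. The sketch below is a simplified illustration with assumed names: it rotates consecutive pairs of dimensions (one common convention; some implementations pair dimensions differently) and uses `theta = 10000.0` as a default base. It also checks the property that makes RoPE attractive: the query-key dot product depends only on the distance between the two positions.

```python
import math

def rope(vec, position, theta=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Assumes len(vec) is even. Larger `theta` gives slower rotations.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (2 positions), different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
print(score_near, score_far)
```

Both scores are equal up to rounding, because rotating the query by position 3 and the key by position 1 has the same net effect as rotating them by positions 10 and 8.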

#### 5. Refinements in Activation and Normalization

- **Activation Functions:** The original Transformer used ReLU, and later models such as BERT and GPT adopted GELU (Gaussian Error Linear Unit). More recent work explores gated alternatives like GeGLU, a Gated Linear Unit variant that uses GELU as its gate activation. The gating mechanism can enhance model expressiveness without a significant increase in computational cost: it lets the model selectively control the flow of information through the network, potentially improving representation learning.

- **Normalization:** Normalization techniques like LayerNorm are critical for stabilizing training in deep networks. Pre-normalization, where LayerNorm is applied before the attention and feed-forward layers (instead of after), has been shown to improve training stability, especially in architectures that combine global and local attention. Pre-normalization helps keep gradients well-behaved, which supports more effective training, particularly when employing complex attention patterns.
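
As a rough sketch of the gating idea, GeGLU multiplies a GELU-activated projection of the input element-wise with a second, purely linear projection. The weight matrices below are small hand-picked numbers and the helper names are assumptions for this example; biases are omitted for brevity.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(x, W):
    """Compute x W for one vector x; W has shape (len(x), out_dim)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def geglu(x, W_gate, W_value):
    """GeGLU(x) = GELU(x W_gate) * (x W_value), element-wise."""
    gate = [gelu(g) for g in matvec(x, W_gate)]
    value = matvec(x, W_value)
    return [g * v for g, v in zip(gate, value)]

x = [1.0, 0.0]
W_gate = [[0.0, 1.0, -10.0], [0.0, 0.0, 0.0]]
W_value = [[2.0, 2.0, 2.0], [0.0, 0.0, 0.0]]

out = geglu(x, W_gate, W_value)
print([round(v, 4) for v in out])
```

All three value channels carry the same signal here, but the gate lets the second channel through almost fully and closes the first and third, which is the selective control described above.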

#### 6. Hardware-Aware Model Optimizations

- **Tensor Tiling:** Modern GPUs and other hardware accelerators perform optimally when matrix dimensions are aligned with their internal architecture. Tensor tiling involves ensuring that the dimensions of weight matrices in the model are multiples of a hardware-friendly number such as 64. This alignment can significantly improve computational efficiency on GPUs by improving memory access patterns and parallel processing.

- **Weight Initialization Strategies:** When scaling models to handle longer contexts or larger vocabularies, efficient weight initialization becomes crucial. Center tiling is a weight initialization method that expands the parameter matrices of a smaller, pre-trained model to initialize a larger one, so the larger model reuses what the smaller one has already learned while maintaining or even improving performance. This enables transfer learning and reduces the need to train large models from scratch.
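
The divisibility requirement behind tensor tiling comes down to rounding a dimension up to the next multiple of a chosen number. The helper below is a hypothetical illustration; the name `round_up_to_multiple` and the example sizes are assumptions, and the right multiple depends on the hardware.

```python
def round_up_to_multiple(value, multiple=64):
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    return ((value + multiple - 1) // multiple) * multiple

for size in (768, 1000, 50257):
    padded = round_up_to_multiple(size)
    print(f"{size} -> {padded} (extra {padded - size})")
```

A dimension that is already aligned, such as 768, stays unchanged, while 50,257 is padded to 50,304 at the cost of 47 unused entries.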


## Applications of Transformers Beyond NLP

Originally developed for Natural Language Processing (NLP), Transformers have an architecture well suited to identifying complex patterns and dependencies within input data. This capability has led to significant exploration and application of Transformers in diverse fields beyond text processing, with remarkable results. Let's explore several domains where Transformers have made substantial contributions.

### Computer Vision

In computer vision, Transformers offer a distinct advantage in capturing **long-range dependencies** between different parts of an image. Unlike Convolutional Neural Networks (CNNs), whose receptive field in any single layer is limited, Transformers can relate any two regions in the image directly through their attention mechanism. This is crucial because understanding a visual scene often requires considering relationships between objects and regions that are far apart in the image.

Consider an image as a grid of pixels. Traditional CNNs process images by applying convolutional filters, which are effective at capturing local patterns. However, to understand the context of a pixel in relation to distant pixels, multiple convolutional layers are needed, potentially making it computationally expensive and less efficient at capturing truly long-range dependencies. Transformers address this by using **self-attention**.

#### Image Transformer

The Image Transformer adapts the original Transformer model for image processing at a fine-grained level. It conceptualizes each pixel in an image as a sequential token, analogous to words in a sentence. This allows the model to analyze images pixel by pixel, capturing subtle details and sophisticated relationships throughout the image.

For an image of size $H \times W$ with $C$ channels, we can consider each pixel as a feature vector. Let $P_{i,j} \in \mathbb{R}^C$ be the pixel at row $i$ and column $j$. The Image Transformer processes the sequence of these pixel vectors, effectively linearizing the 2D image into a 1D sequence.  The self-attention mechanism can then compute relationships between any pair of pixels $(P_{i,j}, P_{k,l})$ regardless of their spatial distance in the image.

#### Vision Transformer (ViT)

The Vision Transformer (ViT) takes a different approach to image processing, aiming for computational efficiency while retaining the benefits of the Transformer architecture. ViT processes images by dividing them into **patches**.  An input image is split into fixed-size patches, and each patch is then linearly embedded to form a token. These patch tokens are fed into a standard Transformer encoder.

For an image of size $H \times W$, ViT divides it into $N = \frac{HW}{P^2}$ patches of size $P \times P$. Each patch is flattened and linearly projected into a $D$-dimensional embedding space. Let $x \in \mathbb{R}^{H \times W \times C}$ be the input image. It is reshaped into a sequence of flattened patches $x_p \in \mathbb{R}^{N \times (P^2C)}$.  A linear projection is applied to each patch $E \in \mathbb{R}^{(P^2C) \times D}$ to get patch embeddings $z_0 = [x_{p}^1E; x_{p}^2E; ...; x_{p}^NE] + E_{pos}$, where $E_{pos} \in \mathbb{R}^{N \times D}$ are positional embeddings. These patch embeddings are then processed by the Transformer encoder.

By treating image patches as tokens, ViT significantly reduces the sequence length compared to processing individual pixels, making it computationally more tractable, especially for high-resolution images. ViT has shown that this patch-based approach can achieve performance competitive with state-of-the-art CNNs on image classification tasks, while using the global context understanding of Transformers.
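
The patch-splitting step can be sketched with nested lists. The code below only produces the flattened patches $x_p$; the linear projection $E$ and the positional embeddings $E_{pos}$ are omitted, and the function name and the tiny 4 × 4 image are assumptions for illustration.

```python
def patchify(image, P):
    """Split an H x W x C image (nested lists) into flattened P x P patches."""
    H, W = len(image), len(image[0])
    if H % P != 0 or W % P != 0:
        raise ValueError("image height and width must be divisible by P")
    patches = []
    for top in range(0, H, P):
        for left in range(0, W, P):
            patch = []
            for r in range(top, top + P):
                for c in range(left, left + P):
                    patch.extend(image[r][c])  # append the C channel values
            patches.append(patch)
    return patches

# Toy 4 x 4 image with 3 channels; pixel (r, c) holds [r, c, r + c].
image = [[[r, c, r + c] for c in range(4)] for r in range(4)]
patches = patchify(image, P=2)
print(len(patches), len(patches[0]))

# Sequence length for a 224 x 224 image: patches versus individual pixels.
num_patches = (224 * 224) // (16 * 16)
num_pixels = 224 * 224
print(num_patches, num_pixels)
```

With $P = 16$, a 224 × 224 image becomes a sequence of 196 tokens instead of 50,176 pixels, which is what keeps the quadratic attention cost manageable.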

### Music Generation

Transformers have also proven remarkably effective in music generation. The self-attention mechanism's capacity to maintain a **long context** is highly advantageous for music composition. Musical pieces are inherently structured with long-term dependencies; whether a note fits coherently often depends on notes played much earlier in the piece. Transformers are naturally suited to model these temporal dependencies.

Consider the generation of a melody. The choice of a note is not only influenced by the immediately preceding notes but also by the overall musical phrase, key, and style established earlier in the composition. Recurrent Neural Networks (RNNs), while historically used in music generation, can struggle with very long-range dependencies due to issues like vanishing gradients. Transformers, with their attention mechanism, can directly access information from anywhere in the generated musical sequence, regardless of distance.

#### MuseNet

MuseNet is a compelling example of Transformer application in music generation. It employs Transformers to compose musical pieces up to several minutes long, orchestrating across as many as ten different instruments.  MuseNet's capability extends to style transfer, allowing it to generate music that blends various musical styles, from classical composers like Mozart to genres like country music and pop artists such as The Beatles. This demonstrates the Transformer's ability to learn and combine complex musical structures and styles.

The generation process in MuseNet involves training a Transformer on a large dataset of musical scores. The model learns to predict the next musical token (note, duration, instrument, etc.) based on the preceding sequence of tokens. The long context window of the Transformer allows it to maintain musical coherence over extended compositions and to incorporate stylistic elements learned from diverse musical traditions.

### Speech Recognition

Transformers have demonstrated exceptional efficacy in speech recognition. Their self-attention mechanism is particularly well-suited for modeling the **temporal dynamics of speech**. Speech is a sequential signal where the meaning of a phoneme or word can depend on the context provided by preceding and succeeding sounds, often across considerable time spans.

Traditional Automatic Speech Recognition (ASR) systems often relied on complex architectures combining acoustic models, language models, and alignment algorithms like Hidden Markov Models (HMMs) and Connectionist Temporal Classification (CTC).  HMMs are used to model the temporal structure of speech, and CTC is used to handle the alignment between the acoustic signal and the transcribed text.

#### Speech-Transformer

The Speech-Transformer represents a simplification in ASR system design by reducing the reliance on these complex, hand-engineered components. It effectively operates without requiring explicit HMMs or CTC for alignment. Despite this architectural simplification, Speech-Transformers achieve competitive performance in speech recognition tasks.

The Speech-Transformer processes the audio signal, often converted into spectrograms or filter bank features. These acoustic features are then treated as input sequences, similar to words in NLP. The Transformer's self-attention mechanism directly learns to map these acoustic sequences to text transcriptions. The elimination of CTC and HMMs makes the system conceptually simpler and potentially easier to train and deploy, while maintaining or even improving accuracy due to the powerful sequence modeling capabilities of Transformers.

### Video Processing

Similar to computer vision, Transformers have found significant utility in video processing.  In video, each frame can be considered a spatial token, and the sequence of frames over time provides the temporal dimension.  Understanding video content necessitates capturing both spatial patterns within each frame and temporal patterns across frames.

#### Video Transformer

The Video Transformer leverages the strengths of Transformers to extract complex spatial and temporal patterns from video sequences. By analyzing the relationships between frames and their sequential order, the Video Transformer effectively understands and processes video data.

Consider video understanding tasks such as action recognition or video captioning.  To identify actions or describe video content, the model must not only recognize objects and scenes within individual frames (spatial understanding) but also track how these elements change and interact over time (temporal understanding).  Video Transformers can model these spatio-temporal interactions effectively by attending to both spatial features within frames and temporal relationships across frames.

Different approaches exist for applying Transformers to video. Some methods treat each frame as a token and process the sequence of frames directly. Others may first extract features from each frame using CNNs and then apply Transformers to the sequence of frame-level features to model temporal dependencies. Regardless of the specific approach, the central principle is to utilize the Transformer's attention mechanism to integrate spatial and temporal information for video understanding.



> Transformers excel wherever data exhibits structure and complex interdependencies. Their ability to model detailed relationships within input data renders them remarkably versatile and applicable across a wide array of domains. Although originating in NLP, ongoing research continues to broaden their potential, extending the limits of their capabilities in diverse applications.
>
> As demonstrated, Transformers have successfully expanded beyond text processing, making substantial contributions to fields like computer vision, music generation, speech recognition, and video processing. Their unique capacity to capture long-range dependencies and learn from structured data has opened up new avenues for innovation and progress across these disciplines.

## Common Transformer Architectures for NLP

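Transformers have greatly impacted Natural Language Processing by modeling dependencies in textual data with self-attention mechanisms. The most widely used models differ mainly in which part of the architecture they keep (encoder only, decoder only, or both) and in their pre-training objective, which in turn determines the tasks they suit best.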

### 1. BERT (Bidirectional Encoder Representations from Transformers)

- **Architecture and Directionality:**  
  BERT is built using multiple stacked transformer encoder layers. Its unique feature is bidirectional processing, which considers both preceding and following words when determining the context of a word. This approach enhances the understanding of semantic meaning.

- **Pre-training Tasks:**  
  BERT is initially trained on large unlabeled datasets using two unsupervised tasks:
  
  - **Masked Language Modeling (MLM):**  
    A fraction of the input tokens (15% in the original BERT) is selected at random, and most of the selected tokens are replaced with a `[MASK]` token. The model learns to predict the original token from the surrounding context on both sides. The prediction probability can be written as:  
    $$
    P(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)
    $$
  
  - **Next Sentence Prediction (NSP):**  
    The model is given two sentences and must decide whether the second sentence logically follows the first in the original text.

- **Usage in Downstream Tasks:**  
  Once pre-training is complete, the model’s learned representations are fine-tuned for specific applications such as text classification, question answering, or named entity recognition with minimal architectural changes.
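
To make the MLM objective concrete, here is a minimal sketch of the masking step. It assumes the 15% selection rate and the 80/10/10 replacement rule from the original BERT recipe. The toy vocabulary, the `mask_tokens` helper, and the `-100` ignore label (the value PyTorch's cross-entropy loss skips by default) are illustrative choices, not code taken from any library.

```python
import random

MASK = "[MASK]"
IGNORE = -100  # label value skipped by the loss (PyTorch convention)


def mask_tokens(tokens, vocab, rng, mlm_probability=0.15):
    """Return (corrupted_tokens, labels) for one tokenized sequence."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: left unchanged
        else:
            labels.append(IGNORE)  # not selected: contributes no loss
            corrupted.append(tok)
    return corrupted, labels


vocab = ["o", "juiz", "negou", "recurso", "do", "autor"]  # toy vocabulary
tokens = ["o", "juiz", "negou", "o", "recurso", "do", "autor"] * 20
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
n_selected = sum(label != IGNORE for label in labels)
print(f"{n_selected} of {len(tokens)} tokens selected for prediction")
```

Only the selected positions carry a label, so the loss is computed on a small part of each sequence.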

### 2. GPT (Generative Pretrained Transformer)

- **Architecture and Directionality:**  
  GPT is made up of transformer decoder layers arranged in a stack. It uses a unidirectional (left-to-right) approach, meaning the prediction for each token only depends on previous tokens. This makes GPT especially good at generating coherent text.

- **Pre-training Objective:**  
  GPT is pre-trained using a language modeling objective. Essentially, it is trained to predict the next token in a sequence:
  $$
  P(x_{i+1} \mid x_1, \dots, x_i)
  $$
  
- **Text Generation:**  
  Due to its unidirectional setup, GPT excels in applications like story generation, dialogue systems, and text continuation where context is provided by a prompt.
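
The left-to-right constraint is usually enforced with a causal attention mask. The sketch below is a NumPy illustration rather than GPT's actual implementation: the helper name `causal_attention_weights` is made up for this example, and the scores are set to zero so the pattern is easy to read.

```python
import numpy as np


def causal_attention_weights(scores):
    """Softmax over attention scores after hiding future positions."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


scores = np.zeros((4, 4))  # uniform scores for a 4-token sequence
weights = causal_attention_weights(scores)
print(weights.round(2))
```

Row `i` holds the attention weights of token `i`: everything to the right of the diagonal is zero, so each token can only draw on itself and the tokens before it.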

### 3. RoBERTa (Robustly Optimized BERT Approach)

- **Enhancements over BERT:**  
  RoBERTa retains the basic architecture of BERT (stacked encoder layers) but includes key changes:
  
  - Uses **dynamic masking** rather than a fixed mask setup, meaning a new masking pattern is sampled every time a sequence is fed to the model.
  - Drops the Next Sentence Prediction objective.
  - Trains with larger batch sizes, for longer, and on more extensive datasets.

- **Performance Considerations:**  
  These improved training procedures often let RoBERTa outperform BERT on benchmarks by learning more robust representations.

### 4. T5 (Text-to-Text Transfer Transformer)

- **Unified Framework:**  
  T5 reconceptualizes all NLP tasks as converting input text to output text. Whether the task is translation, summarization, or classification, both input and output are treated as text sequences.
  
- **Training Objective:**  
  The model is pre-trained with a denoising objective: spans of the input are corrupted and the model learns to restore the original text. Writing $\tilde{x}$ for the corrupted version of $x$, the objective can be described as minimizing:
  $$
  \min_{\theta} \, E\left[d\left(g(f(\tilde{x}; \theta)), x\right)\right]
  $$
  where $ f $ and $ g $ represent the encoder and decoder functions respectively, and $ d(\cdot) $ is a suitable difference measure (cross-entropy in practice).
  
- **Advantages:**  
  This unified approach avoids the need for task-specific architectures and simplifies the learning process across various applications.
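
The sketch below illustrates what a denoising training pair can look like in the text-to-text format. It assumes the sentinel-token convention (`<extra_id_0>`, `<extra_id_1>`, ...) used by T5 checkpoints. The `span_corrupt` helper is hypothetical, and the spans are fixed by hand here, whereas real pre-training samples them at random.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    inputs, targets, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # the span collapses to one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the target spells out the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Both sides of the pair are plain text, which is what lets the same model and loss serve translation, summarization, and classification.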

### 5. XLNet

- **Combining Strengths from BERT and GPT:**  
  XLNet addresses some of the limits of BERT while incorporating the sequential prediction method of GPT. 

- **Permutation-Based Objective:**  
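  Instead of masking tokens, XLNet samples a random factorization order over the positions of the sequence and trains the model to predict each token from the tokens that precede it in that order. The input itself is not shuffled; the sampled order is enforced through attention masks. This method is known as Permutation Language Modeling (PLM).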

- **Two-Stream Self-Attention:**  
  XLNet employs a method where both content-based and query-based streams of self-attention are used. This helps the model capture long-range dependencies with greater precision.

- **Comparison to BERT:**  
  XLNet often shows improved performance in tasks where capturing long-range dependencies is essential.
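
A small sketch can make the permutation objective less abstract. It builds the visibility mask implied by one sampled factorization order; the `permutation_mask` helper and the example order are made up for illustration and leave out XLNet's two-stream machinery.

```python
def permutation_mask(order):
    """mask[i][j] is True when position i may attend to position j.

    `order` lists the positions in the sampled factorization order; a
    position sees only the positions that come before it in that order.
    """
    rank = {pos: step for step, pos in enumerate(order)}
    n = len(order)
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


tokens = ["New", "York", "is", "a", "city"]
order = [2, 0, 4, 1, 3]  # one sampled order; training draws a new one each time
mask = permutation_mask(order)
for i, tok in enumerate(tokens):
    context = [tokens[j] for j in range(len(tokens)) if mask[i][j]]
    print(f"predict {tok!r} from {context}")
```

With this order, "York" is predicted from "New", "is", and "city", so it uses context from both sides even though prediction is still autoregressive.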

### Model Sizes: Base versus Large

- **Base Models:**  
  These models offer a balance between performance and computational efficiency. For example, **BERT-base** consists of 12 layers with 768-dimensional hidden states, totaling roughly 110 million parameters.

- **Large Models:**  
  Larger versions have more layers and/or wider hidden states. **BERT-large** uses 24 layers with 1024-dimensional hidden states, which results in approximately 340 million parameters.

- **Trade-Offs:**  
  While larger models generally offer improved performance due to their ability to encode more complex patterns, they demand significantly more computation and memory. Overfitting is also a risk when training on small datasets, making base models sometimes a more practical choice.




## Using Transformers as Feature Extractors

Transformers can provide dense vector representations of text, capturing both syntactic and contextual features. These vectors are powerful features for many downstream tasks.

### What Are Features?

- **Definition:**  
  Features are numerical representations that capture key properties of the input data. In NLP, these might include:
  - Word frequency
  - Sentence structure
  - Contextual semantics

- **Role in Machine Learning:**  
  These features serve as inputs to further processing steps, allowing models to perform tasks like classification or regression.

### How Transformers Extract Features

- **Process:**  
  1. An input text is tokenized.
  2. Tokens are converted into vectors (embeddings).
  3. The transformer processes these embeddings through layers of self-attention, computing contextualized representations.
  
  A simplified representation calculation in a transformer layer is:
  $$
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
  $$
  where $ Q $, $ K $, and $ V $ represent queries, keys, and values, and $ d_k $ is the dimension of the key vectors.

- **Benefits:**  
  - **Rich Representations:** Captures local and global context of the text.
  - **Flexibility:** Works on both labeled and unlabeled data.
  
- **Challenges:**  
  - **Computational Demands:** Requires significant processing power.
  - **Memory Footprint:** The large number of parameters demands greater memory resources.

> **Note:** While there is a high computational cost, the quality of features extracted by transformers justifies the use in many high-level NLP applications.
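
The attention formula above translates almost line by line into NumPy. This is a single-head sketch on random inputs, assuming no masking and no learned projections; the sizes are arbitrary.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Return (contextualized vectors, attention weights) for one head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n, n) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights


rng = np.random.default_rng(0)
n_tokens, d_k = 5, 8
Q, K, V = (rng.normal(size=(n_tokens, d_k)) for _ in range(3))
context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, weights.shape)  # (5, 8) (5, 5)
```

The `weights` matrix has one row and one column per token, which is where the quadratic cost in sequence length comes from.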


## General Steps for Using Transformers

To use transformers on a specific task, we need to follow these steps:

### Step 1: Start with a Pretrained Model

The first step is to select a pretrained model from the [Hugging Face Transformers library](https://huggingface.co/transformers/pretrained_models.html). These pretrained models have been trained on large amounts of text data and have learned general language representations. Using a pretrained model provides a strong foundation for your specific task.

*Note:* Training a model from scratch is an advanced topic and rarely necessary. In most cases, you can warm-start your model from a pretrained model. If you're interested in learning more about training your own model from scratch, refer to [this resource](https://huggingface.co/blog/how-to-train).

### Step 2: Fine-tune the Model on Domain-Specific Text (Optional)

Fine-tuning involves adapting a pretrained model to a new domain by training it on domain-specific text. This step is optional but can enhance the model's performance on your specific task. By exposing the model to text that is similar to your target domain, it can learn domain-specific language patterns and representations.

### Step 3: Train the Model for Your Task

Once you have a fine-tuned model (or a pretrained model if you skipped step 2), you can train it for your specific task. This typically involves adding a classification or regression head on top of the model and training it using your task-specific data.

Alternatively, you can use the model as a feature extractor for your task, which is a more advanced approach.
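
One common way to turn token-level outputs into a single feature vector per text is masked mean pooling. The sketch below uses a small NumPy array as a stand-in for a model's final hidden states (what Hugging Face models expose as `last_hidden_state`); the `mean_pool` helper is an illustrative assumption, and other pooling choices such as taking the `[CLS]` vector are equally valid.

```python
import numpy as np


def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Stand-in for model outputs: 2 texts, 4 token slots, hidden size 3
hidden_states = np.arange(24, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])  # second text has 2 padding slots
features = mean_pool(hidden_states, attention_mask)
print(features)
```

The resulting `features` array has one row per text and can be passed to any standard classifier, for example logistic regression or gradient boosting.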

## Example: Classifying Court Decision Labels

Let's explore these steps using a subset of the [BrCAD-5](https://www.kaggle.com/datasets/eliasjacob/brcad5) dataset, which contains information on over 765,000 legal cases from Brazilian Federal Courts. Our goal is to train a model to predict the label for a court decision based on its text.

1. **Select a Pretrained Model**: We'll choose a suitable pretrained model from the Hugging Face Transformers library that aligns with our task requirements, such as language support and model architecture.

2. **Fine-tune the Model (Optional)**: If we have a sufficient amount of domain-specific text (legal case information in this case), we can fine-tune the pretrained model on this data to capture domain-specific language patterns.

3. **Train the Model for Label Prediction**: We'll add a classification head on top of the model and train it using the labeled court decision data from BrCAD-5. The model will learn to predict the appropriate label based on the text of the court decision.

<br>

> Remember, the key is to start with a strong pretrained model and adapt it to your specific task through fine-tuning and task-specific training.

## Load pretrained models

In [2]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Define the model checkpoints for the base and large versions of the BERT model
model_checkpoint_base = "neuralmind/bert-base-portuguese-cased"
model_checkpoint_large = "neuralmind/bert-large-portuguese-cased"

# Load the tokenizer for the base BERT model
# The tokenizer is responsible for converting text into tokens that the model can understand
tokenizer_base = AutoTokenizer.from_pretrained(model_checkpoint_base)

# Load the masked language model (MLM) for the base BERT model
# The MLM is used for tasks like predicting masked words in a sentence
model_mlm_base = AutoModelForMaskedLM.from_pretrained(model_checkpoint_base)

# Load the tokenizer for the large BERT model
# This tokenizer works similarly to the base tokenizer but is tailored for the large model
tokenizer_large = AutoTokenizer.from_pretrained(model_checkpoint_large)

# Load the masked language model (MLM) for the large BERT model
# This MLM is used for tasks like predicting masked words in a sentence, similar to the base model but with more parameters
model_mlm_large = AutoModelForMaskedLM.from_pretrained(model_checkpoint_large)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at neuralmind/bert-large-portuguese-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertFor

In [3]:
tokenizer_base.is_fast  # A fast tokenizer from HF Transformers uses Rust under the hood for faster tokenization

True

In [4]:
tokenizer_large.is_fast

True

In [5]:
# Define a function to count the number of trainable parameters in a model
def count_parameters(model):
    # Sum the number of elements (numel) for each parameter in the model
    # Only include parameters that require gradients (i.e., are trainable)
    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    # Print the number of trainable parameters in a human-readable format with commas
    print(f"The model has {n_parameters:,} trainable parameters")


# Count and print the number of trainable parameters for the base BERT model
count_parameters(model_mlm_base)

# Count and print the number of trainable parameters for the large BERT model
count_parameters(model_mlm_large)

The model has 108,954,466 trainable parameters
The model has 334,428,258 trainable parameters


In [6]:
model_mlm_base

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

Above, you can see the model architecture summary of a BERT (Bidirectional Encoder Representations from Transformers) model specifically designed for masked language modeling (MLM) tasks. Let's break it down:

### 1. Overall Architecture: `BertForMaskedLM`

- **Goal:**  
  The model combines a pretrained BERT backbone with an additional output head designed for MLM tasks. This head produces vocabulary logits for each token, especially focusing on the tokens that were masked in the input.


### 2. Base Model: `BertModel`

`BertModel` forms the structural backbone and consists of two major components: token embeddings and stacked encoder layers. 

#### A. BertEmbeddings

This module converts input tokens into dense vectors by learning several types of embeddings:

- **Word Embeddings:**  
  Each token is mapped to a vector using a learned embedding matrix with shape:  
  $$
  (29794,\,768)
  $$
  Here, 29,794 represents the vocabulary size and 768 is the hidden size.

- **Position Embeddings:**  
  The model uses position embeddings to encode the order of tokens. This is achieved with a learned embedding matrix of shape:  
  $$
  (512,\,768)
  $$

- **Token Type Embeddings:**  
  When processing sequence pairs (such as question-answer pairs), token type embeddings help the model distinguish the first segment from the second. The corresponding matrix has the shape:  
  $$
  (2,\,768)
  $$

- **Combination Process:**  
  All three embeddings are summed element-wise for each token:
  $$
  h^{(i)}_0 = e_{\text{word}}^{(i)} + e_{\text{pos}}^{(i)} + e_{\text{type}}^{(i)}
  $$
  This sum is then normalized using layer normalization (with an epsilon value of $1 \times 10^{-12}$) and regularized using dropout with a probability of 0.1.
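
The embedding step can be sketched in a few lines of NumPy. The tables below are random and the vocabulary is shrunk to 100 entries to keep the example light; the learned LayerNorm scale and shift and the dropout step are omitted, so this shows the data flow rather than the real module.

```python
import numpy as np


def layer_norm(x, eps=1e-12):
    """Normalize each vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # learned scale/shift omitted


rng = np.random.default_rng(0)
vocab_size, max_positions, n_types, hidden = 100, 512, 2, 768  # toy vocabulary
word_emb = rng.normal(size=(vocab_size, hidden))
pos_emb = rng.normal(size=(max_positions, hidden))
type_emb = rng.normal(size=(n_types, hidden))

input_ids = np.array([5, 17, 42, 8])
token_type_ids = np.array([0, 0, 1, 1])    # two segments
position_ids = np.arange(len(input_ids))   # 0, 1, 2, 3

# Look up the three embeddings per token, sum them, then normalize
h0 = layer_norm(
    word_emb[input_ids] + pos_emb[position_ids] + type_emb[token_type_ids]
)
print(h0.shape)  # (4, 768)
```

Because the position table has only 512 rows, a position index of 512 or more has nothing to look up. This is one concrete reason the model cannot accept longer inputs.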

#### B. BertEncoder

The encoder consists of a stack of 12 identical layers (numbered 0 to 11 in the base model). Each layer (or **BertLayer**) has the following subcomponents:

- **BertAttention:**  
  This implements multi-head self-attention and comprises two main parts:
  
  - **Self-Attention Mechanism:**  
    For every token, the model computes query (Q), key (K), and value (V) vectors using linear transformations. Each transformation maps an input vector of dimension 768 to an output of the same size. Dropout is applied to the attention weights to reduce overfitting.

  - **Output Processing:**  
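    After the attention weights are applied to the value vectors, the outputs of all heads are concatenated and passed through a linear layer that projects them back to the 768-dimensional space. Dropout is then applied, the sublayer's input is added back through a residual connection, and the result goes through layer normalization.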

- **BertIntermediate:**  
  This component applies a dense layer that increases the dimensionality from 768 to 3072. The non-linear activation function used is GELU. The transformation can be expressed as:
  $$
  z_i = \text{GELU}(h_i \cdot W_1 + b_1)
  $$
  where $h_i$ is the output from the attention sublayer.

- **BertOutput:**  
  The intermediate representation is then projected back to the hidden size (768) using another linear transformation, this time without an activation function. Dropout is applied to the projection, the sublayer's input $h_i$ is added back through a residual connection, and the sum is layer-normalized:
  $$
  h^{\text{new}}_i = \text{LayerNorm}\left(h_i + \text{Dropout}(z_i \cdot W_2 + b_2)\right)
  $$
  The residual connection helps keep training stable in deep stacks of layers.
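
As a rough illustration of the feed-forward half of a layer, the sketch below chains the expansion to 3072 dimensions, the GELU activation, the projection back to 768, the residual connection, and layer normalization. It assumes the second projection has no activation, as in the standard Transformer feed-forward block. Weights are random, and dropout and the learned LayerNorm parameters are left out.

```python
import math

import numpy as np


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))


def layer_norm(x, eps=1e-12):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def feed_forward_sublayer(h, W1, b1, W2, b2):
    z = gelu(h @ W1 + b1)             # expand: 768 -> 3072
    projected = z @ W2 + b2           # project back: 3072 -> 768
    return layer_norm(h + projected)  # residual connection, then LayerNorm


rng = np.random.default_rng(0)
hidden, intermediate = 768, 3072
h = rng.normal(size=(4, hidden))  # stand-in for attention sublayer output
W1 = rng.normal(scale=0.02, size=(hidden, intermediate))
b1 = np.zeros(intermediate)
W2 = rng.normal(scale=0.02, size=(intermediate, hidden))
b2 = np.zeros(hidden)

out = feed_forward_sublayer(h, W1, b1, W2, b2)
print(out.shape)  # (4, 768)
```

The input and output shapes match, which is what allows 12 (or 24) such layers to be stacked one after another.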

---

### 3. Masked Language Modeling Head: `BertOnlyMLMHead`

The MLM head is designed to transform the encoder's output into predictions over the vocabulary for each masked token.

- **BertLMPredictionHead:**  
  This component orchestrates the final steps in transforming the encoder output into prediction logits.

- **Prediction Head Transformation:**  
  The output of the final encoder layer undergoes an additional dense transformation followed by a GELU activation and layer normalization:
  $$
  t_i = \text{LayerNorm}\left(\text{GELU}(h_L^{(i)} \cdot W_3 + b_3)\right)
  $$
  Here, $h_L^{(i)}$ is the representation of the $i$-th token from the final encoder layer.

- **Decoder:**  
  A final linear layer converts the transformed representation $t_i$ into logits over the vocabulary:
  $$
  \text{logits}_i = t_i \cdot W_{E}^T + b_4
  $$
  The decoder maps from the hidden size (768) to the vocabulary size (29794). In BERT, its weight matrix $W_E$ is shared with the word-embedding matrix, so only the bias $b_4$ adds new parameters.



In [7]:
model_mlm_large

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-

> The main differences between the two BERT models are in the model size and architecture:
>
> 1. Embedding dimensions:
>       - In the first model, the word embeddings, position embeddings, and token type embeddings have a dimension of 768.
>       - In the second model, these embeddings have a dimension of 1024, indicating a larger embedding size.
>
> 2. Number of encoder layers:
>       - The first model has 12 encoder layers (`(0-11): 12 x BertLayer`).
>       - The second model has 24 encoder layers (`(0-23): 24 x BertLayer`)
>
> 3. Intermediate layer dimensions:
>       - In the first model, the intermediate layer (`BertIntermediate`) has an output dimension of 3072.
>       - In the second model, the intermediate layer has an output dimension of 4096, which is larger than the first model.
>
> 4. Hidden state dimensions:
>       - The first model uses hidden states with a dimension of 768 throughout the architecture, including the self-attention layers, intermediate layers, and output layers.
>       - The second model uses hidden states with a dimension of 1024 throughout the architecture.
>
> The rest of the architecture, including the self-attention mechanism, layer normalization, dropout, and the MLM head, remains the same between the two models.
>
> The large model has higher-dimensional embeddings, more encoder layers, and larger intermediate layer dimensions. This suggests that the large model has a higher capacity and can potentially capture more complex patterns and representations from the input data. However, the larger model size also means increased computational requirements and longer training times.
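
The parameter counts printed by `count_parameters` earlier can be reproduced by hand from these dimensions. The function below is a back-of-the-envelope sketch, not part of any library. It assumes the MLM head reuses the word-embedding matrix for its output weights and that the model has no pooler layer, which is consistent with the totals reported above.

```python
def bert_mlm_parameter_count(vocab_size, hidden, layers, intermediate,
                             max_positions=512, type_vocab=2):
    """Count trainable parameters of a BERT masked language model."""
    layer_norm = 2 * hidden  # scale + shift
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output
    feed_forward = (hidden * intermediate + intermediate   # expand
                    + intermediate * hidden + hidden       # project back
                    + layer_norm)
    # MLM head: dense transform + LayerNorm + output bias. The output weight
    # matrix is assumed to be shared with the word embeddings (not recounted).
    mlm_head = hidden * hidden + hidden + layer_norm + vocab_size
    return embeddings + layers * (attention + feed_forward) + mlm_head


base = bert_mlm_parameter_count(29794, hidden=768, layers=12, intermediate=3072)
large = bert_mlm_parameter_count(29794, hidden=1024, layers=24, intermediate=4096)
print(f"{base:,}")   # 108,954,466
print(f"{large:,}")  # 334,428,258
```

Most of the difference comes from the encoder stack: each large layer holds about 12.6 million parameters against about 7.1 million for a base layer, and there are twice as many of them.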

## Load the dataset and create a train/validation split

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the unlabeled dataset from a Parquet file
# Only the 'text' column is read from the file
df_unlabeled = pd.read_parquet("data/legal/unlabeled_texts.parquet", columns=["text"])

# Split the unlabeled dataset into training and validation sets
# 10% of the data is used for validation, and the split is reproducible with a fixed random state
df_unlabeled_train, df_unlabeled_valid = train_test_split(
    df_unlabeled, test_size=0.10, random_state=271828
)

# Display the shapes of the training and validation sets
# This shows the number of rows and columns in each set
df_unlabeled_train.shape, df_unlabeled_valid.shape

((58529, 1), (6504, 1))

In [9]:
import datasets

# Convert the pandas DataFrame containing the unlabeled training data into a Hugging Face Dataset
# This allows for easier manipulation and integration with Hugging Face's tools and models
dataset_unlabeled_train = datasets.Dataset.from_pandas(df_unlabeled_train)

# Convert the pandas DataFrame containing the unlabeled validation data into a Hugging Face Dataset
# This allows for easier manipulation and integration with Hugging Face's tools and models
dataset_unlabeled_valid = datasets.Dataset.from_pandas(df_unlabeled_valid)

In [10]:
dataset_unlabeled_train

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 58529
})

In [11]:
dataset_unlabeled_valid

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 6504
})

In [12]:
from pathlib import Path

# Define the path to save the outputs of the base BERT masked language model
path_to_save_lm_base = Path("./outputs/transformers_basics/bert_masked_lm_base")
# Create the directory (and any necessary parent directories) if it doesn't already exist
path_to_save_lm_base.mkdir(parents=True, exist_ok=True)

# Define the path to save the outputs of the large BERT masked language model
path_to_save_lm_large = Path("./outputs/transformers_basics/bert_masked_lm_large")
# Create the directory (and any necessary parent directories) if it doesn't already exist
path_to_save_lm_large.mkdir(parents=True, exist_ok=True)

## Finetune the Language Model on the domain text

Remember our transfer learning class. During this stage, the general-domain language model adapts itself to the idiosyncrasies of the domain-specific text. This is done by training the model on the domain-specific text. This step is optional, but it can improve the performance of the model on your task.

In [13]:
from functools import partial
from multiprocessing import cpu_count


def tokenize_function(examples, tokenizer):
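    """
    Tokenizes the input text in the given examples using the tokenizer object.

    Args:
    - examples: A dictionary containing the input text to be tokenized under the "text" key.
    - tokenizer: The Hugging Face tokenizer used to tokenize the text.

    Returns:
    - A dictionary containing the tokenized input text (plus "word_ids" for fast tokenizers).
    """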
    result = tokenizer(
        examples["text"], truncation=False, padding=False
    )  # Tokenize the input text
    if tokenizer.is_fast:
        # If the tokenizer is a fast tokenizer, add word IDs to the result
        result["word_ids"] = [
            result.word_ids(i) for i in range(len(result["input_ids"]))
        ]
    return result


# Create partial functions for tokenizing using the base and large tokenizers
# This allows us to pass the tokenizer as a fixed argument to the tokenize_function
tokenize_function_base = partial(tokenize_function, tokenizer=tokenizer_base)
tokenize_function_large = partial(tokenize_function, tokenizer=tokenizer_large)

# Tokenize the training dataset using the base tokenizer
# The map function applies the tokenize_function_base to each example in the dataset
# The batched=True argument processes the examples in batches for efficiency
# The remove_columns argument removes the specified columns from the dataset after tokenization
dataset_train_tokenized_mlm_base = dataset_unlabeled_train.map(
    tokenize_function_base, batched=True, remove_columns=["text", "__index_level_0__"]
)

# Tokenize the validation dataset using the base tokenizer
dataset_valid_tokenized_mlm_base = dataset_unlabeled_valid.map(
    tokenize_function_base, batched=True, remove_columns=["text", "__index_level_0__"]
)

# Tokenize the training dataset using the large tokenizer
dataset_train_tokenized_mlm_large = dataset_unlabeled_train.map(
    tokenize_function_large, batched=True, remove_columns=["text", "__index_level_0__"]
)

# Tokenize the validation dataset using the large tokenizer
dataset_valid_tokenized_mlm_large = dataset_unlabeled_valid.map(
    tokenize_function_large, batched=True, remove_columns=["text", "__index_level_0__"]
)


In [14]:
import numpy as np


def group_texts(examples):
    """
    This function groups together a set of texts as contiguous text of fixed length (chunk_size).
    It's useful for training masked language models.

    Args:
    - examples: A dictionary containing the examples to group. Each key corresponds to a feature,
                and each value is a list of lists of tokens.

    Returns:
    - A dictionary containing the grouped examples. Each key corresponds to a feature,
      and each value is a list of lists of tokens.
    """
    # Concatenate all texts for each feature
    concatenated_examples = {k: np.concatenate(examples[k]) for k in examples.keys()}

    # Compute the total length of the concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # Adjust the total length to be a multiple of chunk_size, dropping the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size

    # Split the concatenated texts into chunks of size chunk_size using NumPy
    result = {
        k: np.split(t[:total_length], total_length // chunk_size)
        for k, t in concatenated_examples.items()
    }

    # Create a new 'labels' column that is a copy of the 'input_ids' column
    result["labels"] = result["input_ids"].copy()

    return result


# Define the chunk size for grouping texts
chunk_size = 512

# Apply the group_texts function to the tokenized training dataset for the base BERT model
dataset_train_tokenized_mlm_base = dataset_train_tokenized_mlm_base.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

# Apply the group_texts function to the tokenized validation dataset for the base BERT model
dataset_valid_tokenized_mlm_base = dataset_valid_tokenized_mlm_base.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

# Apply the group_texts function to the tokenized training dataset for the large BERT model
dataset_train_tokenized_mlm_large = dataset_train_tokenized_mlm_large.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

# Apply the group_texts function to the tokenized validation dataset for the large BERT model
dataset_valid_tokenized_mlm_large = dataset_valid_tokenized_mlm_large.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

print(
    f"Number of training examples for base model: {len(dataset_train_tokenized_mlm_base)}"
)
print(
    f"Number of validation examples for base model: {len(dataset_valid_tokenized_mlm_base)}"
)


Number of training examples for base model: 271519
Number of validation examples for base model: 29823
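
The grouping logic is easier to follow on a toy input. The pure-Python sketch below mirrors what `group_texts` does for a single feature; the helper name `concat_and_chunk` and the tiny `chunk_size` are chosen only for illustration.

```python
def concat_and_chunk(token_lists, chunk_size):
    """Concatenate tokenized documents and cut them into fixed-size chunks."""
    flat = [tok for tokens in token_lists for tok in tokens]
    usable = (len(flat) // chunk_size) * chunk_size  # drop the short remainder
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]


docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]  # three tokenized "documents"
chunks = concat_and_chunk(docs, chunk_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Chunks may span document boundaries, and the leftover tokens (9 and 10 here) are discarded. This also explains the counts above: 58,529 documents turn into 271,519 chunks of 512 tokens because most documents are far longer than a single chunk.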


In [15]:
from transformers import DataCollatorForLanguageModeling

# Create a data collator for masked language modeling (MLM) using the base BERT tokenizer
# The data collator will dynamically mask tokens in the input text with a probability of 0.15
data_collator_mlm_base = DataCollatorForLanguageModeling(
    tokenizer=tokenizer_base, mlm_probability=0.15
)

# Create a data collator for masked language modeling (MLM) using the large BERT tokenizer
# The data collator will dynamically mask tokens in the input text with a probability of 0.15
data_collator_mlm_large = DataCollatorForLanguageModeling(
    tokenizer=tokenizer_large, mlm_probability=0.15
)

In [16]:
from transformers import TrainingArguments

# Define the batch size for training and evaluation using the base BERT model
batch_size_base = 20

# Extract the model name from the model checkpoint path for the base BERT model
model_name_base = model_checkpoint_base.split("/")[-1]

# Set up the training arguments for fine-tuning the base BERT model on a masked language modeling task
training_args_mlm_base = TrainingArguments(
    output_dir=path_to_save_lm_base
    / f"{model_name_base}-finetuned-mlm",  # Directory to save the model checkpoints
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
    learning_rate=5e-5,  # Learning rate for the optimizer
    weight_decay=0.01,  # Weight decay for regularization
    per_device_train_batch_size=batch_size_base,  # Batch size for training
    per_device_eval_batch_size=batch_size_base,  # Batch size for evaluation
    bf16=True,  # Use bfloat16 precision (on GPUs without bfloat16 support, such as most free-tier ones, set fp16=True instead)
    num_train_epochs=3,  # Number of training epochs
    save_total_limit=1,  # Limit the total number of saved checkpoints
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
    logging_steps=1,  # Log the training loss at every optimizer step (logging_strategy defaults to "steps")
    eval_steps=1,  # Ignored here: only used when eval_strategy="steps"
    save_steps=1,  # Ignored here: only used when save_strategy="steps"
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model="eval_loss",  # Metric to use for selecting the best model
    greater_is_better=False,  # Lower evaluation loss is better
    gradient_accumulation_steps=3,  # Number of gradient accumulation steps
    seed=271828,  # Random seed for reproducibility
)

In [17]:
from transformers import Trainer

# Initialize the Trainer for the base BERT model
# The Trainer class provides an easy-to-use API for training and evaluating models
trainer_mlm_base = Trainer(
    model=model_mlm_base,  # The model to be trained (base BERT masked language model)
    args=training_args_mlm_base,  # Training arguments defined earlier
    train_dataset=dataset_train_tokenized_mlm_base,  # Tokenized training dataset
    eval_dataset=dataset_valid_tokenized_mlm_base,  # Tokenized validation dataset
    data_collator=data_collator_mlm_base,  # Data collator for dynamically masking tokens
    processing_class=tokenizer_base,  # Tokenizer for processing the input text
)

In [None]:
# This took around 5 hours to train on 2 x NVIDIA RTX 3090 GPUs
trainer_mlm_base.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: eliasjacob. Use `wandb login --relogin` to force relogin




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | 0.5797 | 0.473443 |
| 1 | 0.4326 | 0.408905 |
| 2 | 0.3782 | 0.3882 |


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


TrainOutput(global_step=6786, training_loss=0.5313287478850508, metrics={'train_runtime': 17450.1629, 'train_samples_per_second': 46.679, 'train_steps_per_second': 0.389, 'total_flos': 2.143305955958661e+17, 'train_loss': 0.5313287478850508, 'epoch': 2.9991160872127285})

In [None]:
# Save the trained model
trainer_mlm_base.save_model(path_to_save_lm_base / f"{model_name_base}-finetuned-mlm")
tokenizer_base.save_pretrained(
    path_to_save_lm_base / f"{model_name_base}-finetuned-mlm"
)

trainer_mlm_base.evaluate()



{'eval_loss': 0.38791826367378235,
 'eval_runtime': 293.0877,
 'eval_samples_per_second': 101.755,
 'eval_steps_per_second': 2.545,
 'epoch': 2.9991160872127285}

In [None]:
print(path_to_save_lm_base / f"{model_name_base}-finetuned-mlm")

outputs/transformers_basics/bert_masked_lm_base/bert-base-portuguese-cased-finetuned-mlm


In [1]:
import gc
import torch

# Set the trainer, model, and tokenizer for the base BERT model to None
# This helps free up memory by removing references to these objects
trainer_mlm_base = None
model_mlm_base = None
tokenizer_base = None

# Force garbage collection to free up memory
gc.collect()

# Clear the CUDA memory cache to free up GPU memory
torch.cuda.empty_cache()

In [None]:
from transformers import TrainingArguments

# Define the batch size for training and evaluation
batch_size_large = 14

# Extract the model name from the model checkpoint path
# This will be used to name the output directory for the trained model
model_name_large = model_checkpoint_large.split("/")[-1]

# Define the training arguments for the large masked language model (MLM)
training_args_mlm_large = TrainingArguments(
    output_dir=path_to_save_lm_large
    / f"{model_name_large}-finetuned-mlm",  # Output directory for the trained model
    overwrite_output_dir=True,  # Overwrite the output directory if it already exists
    learning_rate=5e-5,  # Learning rate for the optimizer
    weight_decay=0.01,  # Weight decay for regularization
    per_device_train_batch_size=batch_size_large,  # Batch size for training
    per_device_eval_batch_size=batch_size_large,  # Batch size for evaluation
    bf16=True,  # Use bfloat16 precision (on a free-tier GPU without bf16 support, set fp16=True instead)
    num_train_epochs=3,  # Number of training epochs
    save_total_limit=1,  # Limit the total amount of checkpoints and delete the older ones
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
    logging_steps=1,  # Log the training loss after every 1 step
    eval_steps=1,  # Ignored here: only used when eval_strategy="steps"
    save_steps=1,  # Ignored here: only used when save_strategy="steps"
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model="eval_loss",  # Use the evaluation loss to determine the best model
    greater_is_better=False,  # Lower evaluation loss is better
    gradient_accumulation_steps=4,  # Number of steps to accumulate gradients before updating the model parameters
    seed=271828,  # Random seed for reproducibility
)
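
Gradient accumulation matters because it determines how many examples contribute to each weight update. As a rough sketch, the effective batch size is the product of the per-device batch size, the number of devices, and the accumulation steps. The helper `effective_batch_size` is hypothetical, and the device count of 2 is an assumption based on the two GPUs mentioned in the training notes:

```python
def effective_batch_size(per_device_batch_size, num_devices, gradient_accumulation_steps):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch_size * num_devices * gradient_accumulation_steps


# Values for the large model: batch size 14 and 4 accumulation steps,
# with 2 GPUs (an assumption; adjust to your own hardware)
large_effective = effective_batch_size(
    14, num_devices=2, gradient_accumulation_steps=4
)
print(large_effective)  # 112
```

Accumulating gradients lets a memory-constrained GPU emulate a larger batch at the cost of more forward and backward passes per update.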

In [None]:
from transformers import Trainer

# Initialize the Trainer for the large masked language model (MLM)
trainer_mlm_large = Trainer(
    model=model_mlm_large,  # The pre-trained large BERT model for masked language modeling
    args=training_args_mlm_large,  # The training arguments defined earlier for the large model
    train_dataset=dataset_train_tokenized_mlm_large,  # The tokenized training dataset for the large model
    eval_dataset=dataset_valid_tokenized_mlm_large,  # The tokenized validation dataset for the large model
    data_collator=data_collator_mlm_large,  # The data collator for dynamic masking during training
    processing_class=tokenizer_large,  # The tokenizer used to process the input text for the large model
)

In [None]:
# Train the large masked language model (MLM)
# This process involves multiple epochs of training on the training dataset
# Note: This training process took almost 14 hours on 2 x NVIDIA RTX 3090 GPUs
trainer_mlm_large.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | 0.3995 | 0.381371 |
| 2 | 0.3323 | 0.309468 |


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


TrainOutput(global_step=7272, training_loss=0.4201359052510217, metrics={'train_runtime': 48363.8901, 'train_samples_per_second': 16.842, 'train_steps_per_second': 0.15, 'total_flos': 7.590524853366497e+17, 'train_loss': 0.4201359052510217, 'epoch': 2.999381315735203})

In [None]:
# Save the trained large masked language model (MLM) to the specified directory
trainer_mlm_large.save_model(
    path_to_save_lm_large / f"{model_name_large}-finetuned-mlm"
)

# Save the tokenizer used for the large MLM to the same directory
tokenizer_large.save_pretrained(
    path_to_save_lm_large / f"{model_name_large}-finetuned-mlm"
)

# Evaluate the trained large MLM on the validation dataset
# This will return a dictionary containing the evaluation metrics
trainer_mlm_large.evaluate()



{'eval_loss': 0.30810099840164185,
 'eval_runtime': 709.2879,
 'eval_samples_per_second': 42.046,
 'eval_steps_per_second': 1.503,
 'epoch': 2.999381315735203}

In [None]:
print(path_to_save_lm_large / f"{model_name_large}-finetuned-mlm")

outputs/transformers_basics/bert_masked_lm_large/bert-large-portuguese-cased-finetuned-mlm


## Assessing a Language Model

To ensure that a language model is effective and reliable, we need to assess its performance. This is usually done by evaluating how well the model can predict a word in a sentence. The primary metric used for this purpose is known as 'Perplexity'.

### Understanding Perplexity

Perplexity is a quantitative measure of how well a probability model predicts a sample. For language models, it gauges how surprised the model is when it encounters new data: the more probability the model assigns to the tokens that actually occur, the lower its perplexity.

A lower perplexity indicates that the model was less surprised by the new data, which suggests that it has learned the language patterns in that data well. All else being equal, lower perplexity on held-out data indicates a better language model.

### Calculating Perplexity

Perplexity is defined as the exponential of the entropy, a measure of the uncertainty associated with a random variable. Since the language model is trained with the cross-entropy loss, we can compute the perplexity directly from the loss value. For a masked language model such as BERT, the loss is averaged over the masked tokens only, so the result is best read as a perplexity over masked-token predictions. The formula for perplexity is:

$$\text{Perplexity} = e^{\text{loss}}$$

Where:
- $e$ is the base of the natural logarithm (Euler's number, approximately 2.71828)
- $\text{loss}$ is the cross-entropy loss
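
As a minimal sketch of the formula above, the cross-entropy loss is the average negative log-probability the model assigned to the correct tokens, and perplexity is its exponential. The probabilities below are made-up illustrative values, not outputs of the models trained here, and the helper names are hypothetical:

```python
import math


def cross_entropy(probs_of_correct_tokens):
    """Average negative natural-log probability assigned to the correct tokens."""
    total = sum(math.log(p) for p in probs_of_correct_tokens)
    return -total / len(probs_of_correct_tokens)


def perplexity(loss):
    """Perplexity is the exponential of the cross-entropy loss (in nats)."""
    return math.exp(loss)


# Hypothetical probabilities a model assigned to the true token at four positions
confident = [0.9, 0.8, 0.95, 0.85]
unsure = [0.2, 0.1, 0.3, 0.25]

ppl_confident = perplexity(cross_entropy(confident))
ppl_unsure = perplexity(cross_entropy(unsure))

# A model that guesses uniformly over V tokens has perplexity V
vocab_size = 1000
ppl_uniform = perplexity(cross_entropy([1 / vocab_size] * 4))

print(ppl_confident, ppl_unsure, ppl_uniform)
```

Read this way, a perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly among $k$ equally likely tokens.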

### Choice of Logarithm Base

The choice of base for the logarithm in calculating perplexity or entropy often depends on the context or the historical convention of the field.

- In information theory, the base of the logarithm is typically 2, resulting in units of bits (binary digits). This is because information was originally conceptualized in terms of binary decisions (yes/no, true/false, 0/1), so a base-2 logarithm is natural: each message in a space of $2^n$ equally likely messages carries $n$ bits of information.

- In machine learning, the natural logarithm (base $e$) is the usual choice. Much of the underlying mathematics, such as calculus and optimization, is simpler with natural logarithms, and deep learning frameworks compute the cross-entropy loss with them. Because our loss is therefore expressed in nats, we exponentiate with base $e$ to obtain the perplexity.

> The base of the logarithm doesn't change the interpretation of entropy: it only rescales the value by a constant factor. Base-2 logarithms give a measure in bits, while natural logarithms give a measure in nats (natural units of information). Perplexity itself is unaffected, provided the exponentiation uses the same base as the logarithm.
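
The point about bases can be checked numerically. The sketch below uses made-up probabilities and computes the same average surprise in nats and in bits; the two differ by the constant factor $\ln 2$, while the resulting perplexity is identical:

```python
import math

# Hypothetical probabilities assigned to the correct tokens
probs = [0.5, 0.25, 0.125, 0.5]

entropy_nats = -sum(math.log(p) for p in probs) / len(probs)   # natural log -> nats
entropy_bits = -sum(math.log2(p) for p in probs) / len(probs)  # base-2 log -> bits

# The two values differ only by the constant factor ln(2)
scaled_bits = entropy_bits * math.log(2)

# Perplexity matches as long as the exponent base matches the log base
ppl_from_nats = math.exp(entropy_nats)
ppl_from_bits = 2 ** entropy_bits

print(entropy_nats, entropy_bits, ppl_from_nats, ppl_from_bits)
```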

In [None]:
import gc
import torch

# Set the trainer for the large masked language model (MLM) to None to free up memory
trainer_mlm_large = None

# Set the large masked language model (MLM) to None to free up memory
model_mlm_large = None

# Set the tokenizer for the large MLM to None to free up memory
tokenizer_large = None

# Collect garbage to free up memory
gc.collect()

# Empty the CUDA cache to free up GPU memory
torch.cuda.empty_cache()

In [None]:
import math

print(f"The perplexity for the base model is {math.exp(0.38791826367378235)}")
print(f"The perplexity for the large model is {math.exp(0.30810099840164185)}")

The perplexity for the base model is 1.4739093074325733
The perplexity for the large model is 1.3608384245024145


In [None]:
import os
from pathlib import Path
from transformers import pipeline, AutoModelForMaskedLM, AutoTokenizer

# Define the path to save the base masked language model (MLM)
# This path points to the directory where the base MLM model will be saved
path_to_save_lm_base = Path("./outputs/transformers_basics/bert_masked_lm_base")

# Define the path to save the large masked language model (MLM)
# This path points to the directory where the large MLM model will be saved
path_to_save_lm_large = Path("./outputs/transformers_basics/bert_masked_lm_large")

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the fine-tuned base masked language model (MLM) from the specified directory
# This model is a BERT base model fine-tuned on a Portuguese dataset
model_base = AutoModelForMaskedLM.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Load the tokenizer for the fine-tuned base MLM from the same directory
# The tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Load the fine-tuned large masked language model (MLM) from the specified directory
# This model is a BERT large model fine-tuned on a Portuguese dataset
model_large = AutoModelForMaskedLM.from_pretrained(
    path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm"
)

# Load the tokenizer for the fine-tuned large MLM from the same directory
# The tokenizer is used to preprocess the input text for the large model
tokenizer_large = AutoTokenizer.from_pretrained(
    path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm"
)

In [None]:
from transformers import pipeline

# Create a pipeline for the base masked language model (MLM)
# The pipeline is used to fill in the masked tokens in the input text
# 'fill-mask' specifies the task type for the pipeline
# model_base is the fine-tuned base MLM model
# tokenizer_base is the tokenizer for the base MLM model
# top_k=5 specifies that the top 5 predictions for the masked token will be returned
pipe_base = pipeline("fill-mask", model=model_base, tokenizer=tokenizer_base, top_k=5)

# Create a pipeline for the large masked language model (MLM)
pipe_large = pipeline(
    "fill-mask", model=model_large, tokenizer=tokenizer_large, top_k=5
)

In [None]:
pipe_base("O artigo 121 do Código Penal prevê o crime de [MASK]")

[{'score': 0.9281540513038635,
  'token': 131,
  'token_str': ':',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de :'},
 {'score': 0.012005731463432312,
  'token': 21982,
  'token_str': 'homicídio',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de homicídio'},
 {'score': 0.0050378949381411076,
  'token': 18144,
  'token_str': 'roubo',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de roubo'},
 {'score': 0.0032502533867955208,
  'token': 1112,
  'token_str': '“',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de “'},
 {'score': 0.0027919497806578875,
  'token': 184,
  'token_str': 're',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de re'}]

In [None]:
pipe_large("O artigo 121 do Código Penal prevê o crime de [MASK]")

[{'score': 0.9837086796760559,
  'token': 131,
  'token_str': ':',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de :'},
 {'score': 0.006448815111070871,
  'token': 21982,
  'token_str': 'homicídio',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de homicídio'},
 {'score': 0.0011397271882742643,
  'token': 119,
  'token_str': '.',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de.'},
 {'score': 0.0007863400387577713,
  'token': 1386,
  'token_str': 'morte',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de morte'},
 {'score': 0.0007723842863924801,
  'token': 9566,
  'token_str': 'corrupção',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de corrupção'}]

In [None]:
pipe_base(
    "O Código de Processo Civil prevê prazo em [MASK] para interposição de recurso pela Fazenda Pública"
)

[{'score': 0.3354406952857971,
  'token': 17225,
  'token_str': 'julgado',
  'sequence': 'O Código de Processo Civil prevê prazo em julgado para interposição de recurso pela Fazenda Pública'},
 {'score': 0.2519117295742035,
  'token': 5370,
  'token_str': 'aberto',
  'sequence': 'O Código de Processo Civil prevê prazo em aberto para interposição de recurso pela Fazenda Pública'},
 {'score': 0.2214008867740631,
  'token': 21244,
  'token_str': 'dobro',
  'sequence': 'O Código de Processo Civil prevê prazo em dobro para interposição de recurso pela Fazenda Pública'},
 {'score': 0.06615797430276871,
  'token': 3418,
  'token_str': 'curso',
  'sequence': 'O Código de Processo Civil prevê prazo em curso para interposição de recurso pela Fazenda Pública'},
 {'score': 0.03829769790172577,
  'token': 4712,
  'token_str': 'branco',
  'sequence': 'O Código de Processo Civil prevê prazo em branco para interposição de recurso pela Fazenda Pública'}]

In [None]:
pipe_large(
    "O Código de Processo Civil prevê prazo em [MASK] para interposição de recurso pela Fazenda Pública"
)

[{'score': 0.5983186960220337,
  'token': 2241,
  'token_str': 'lei',
  'sequence': 'O Código de Processo Civil prevê prazo em lei para interposição de recurso pela Fazenda Pública'},
 {'score': 0.31673961877822876,
  'token': 21244,
  'token_str': 'dobro',
  'sequence': 'O Código de Processo Civil prevê prazo em dobro para interposição de recurso pela Fazenda Pública'},
 {'score': 0.015298635698854923,
  'token': 2502,
  'token_str': 'Lei',
  'sequence': 'O Código de Processo Civil prevê prazo em Lei para interposição de recurso pela Fazenda Pública'},
 {'score': 0.01301574520766735,
  'token': 20554,
  'token_str': 'razoável',
  'sequence': 'O Código de Processo Civil prevê prazo em razoável para interposição de recurso pela Fazenda Pública'},
 {'score': 0.009656175971031189,
  'token': 5370,
  'token_str': 'aberto',
  'sequence': 'O Código de Processo Civil prevê prazo em aberto para interposição de recurso pela Fazenda Pública'}]

## Train our Document Classifier Using Our Fine-Tuned Language Model

### Understanding the Language Model Output Structure

Before diving into the details of document classification, it's essential to grasp the structure of the output from the language model. For each input text, the output is a matrix of shape `max_tokens` x `embedding_dimension`. In BERT-base, the embedding dimension is 768 (1024 in BERT-large), so each token in the input text has a corresponding vector of size 768.

In practice, feeding the entire matrix to the classifier is unwieldy: with 512 tokens it holds 512 x 768 = 393,216 values per document. Instead, we use only the vector corresponding to the `[CLS]` token.

### The Significance of the `[CLS]` Token

The `[CLS]` token is a special token that BERT models place at the start of the input text; its output vector is used as a summary representation of the entire input. In BERT-base this vector has size 768, which is far more manageable than the full matrix. `[CLS]` stands for *classification*, and the token is specifically designed for classification tasks.

Here's an example to illustrate the usage of the `[CLS]` token:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
outputs = tokenizer('Eu gosto muito de farofa')
tokenizer.decode(outputs['input_ids'])
```

Resulting output: `'[CLS] Eu gosto muito de farofa [SEP]'`

In the above output, you'll notice that the `[CLS]` token is added to the start of the input text, while the `[SEP]` token is appended to the end. For classification purposes, we only need the vector of the `[CLS]` token. The `[SEP]` token exists to separate two sentences in paired inputs; since our input contains a single text, it merely marks the end of the input here.

### Carrying Out Classification Using the `[CLS]` Token

Now that we know how to extract the vector for the `[CLS]` token, we can use it as input for our classifier. The classifier's output will be a vector of size `num_labels`, where `num_labels` refers to the number of labels present in our dataset. For example, if we have 4 labels, the classifier would output a vector of size 4.

This output vector will be crucial in calculating the model's loss and updating its weights during the training process. By comparing the predicted label probabilities with the actual labels, we can measure the model's performance and make necessary adjustments to improve its accuracy.

### Putting It All Together

To summarize, the process of document classification using a fine-tuned language model involves the following steps:

1. Tokenize the input text and add the `[CLS]` token at the beginning.
2. Pass the tokenized input through the language model to obtain the output vector.
3. Extract the vector corresponding to the `[CLS]` token.
4. Use the `[CLS]` token vector as input for the classifier.
5. Obtain the classifier's output vector, which holds one score (logit) per label; a softmax turns these scores into predicted label probabilities.
6. Calculate the loss by comparing the predicted labels with the actual labels.
7. Update the model's weights based on the calculated loss to improve its performance.
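
Steps 3 to 6 can be sketched with NumPy. This is an illustration under stated assumptions, not the Hugging Face implementation: the hidden states are random stand-ins for real BERT outputs, and `W`, `b`, and `labels` are hypothetical. In `BertForSequenceClassification`, the `[CLS]` vector additionally passes through a pooler layer before the classifier, and step 7 (the weight update) is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, max_tokens, hidden_size, num_labels = 2, 8, 768, 4

# Stand-in for the language model output: one 768-dim vector per token.
# Real values would come from BERT; random numbers are used for illustration.
hidden_states = rng.normal(size=(batch_size, max_tokens, hidden_size))

# Step 3: the [CLS] token is always at position 0
cls_vectors = hidden_states[:, 0, :]  # shape: (batch_size, hidden_size)

# Steps 4-5: a linear classifier maps each [CLS] vector to one logit per label
W = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b = np.zeros(num_labels)
logits = cls_vectors @ W + b  # shape: (batch_size, num_labels)

# Softmax turns logits into probabilities (subtract the max for stability)
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Step 6: cross-entropy loss against hypothetical true labels
labels = np.array([1, 3])
loss = -np.log(probs[np.arange(batch_size), labels]).mean()

print(cls_vectors.shape, logits.shape, loss)
```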

In [None]:
import os
from pathlib import Path
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoConfig,
)

In [None]:
import pandas as pd
from pathlib import Path

# Load the training dataset from a Parquet file
# Only the 'text' and 'label' columns are read from the file
df_train = pd.read_parquet("data/legal/train.parquet", columns=["text", "label"])

# Load the validation dataset from a Parquet file
# Only the 'text' and 'label' columns are read from the file
df_valid = pd.read_parquet("data/legal/valid.parquet", columns=["text", "label"])

# Define the path to save the base masked language model (MLM)
# This path points to the directory where the base MLM model will be saved
path_to_save_lm_base = Path("./outputs/transformers_basics/bert_masked_lm_base")

# Define the path to save the large masked language model (MLM)
# This path points to the directory where the large MLM model will be saved
path_to_save_lm_large = Path("./outputs/transformers_basics/bert_masked_lm_large")

# Display the shapes of the training and validation datasets
# This shows the number of rows and columns in each dataset
df_train.shape, df_valid.shape

((52026, 2), (13007, 2))

In [None]:
# Create a dictionary to map each unique label in the training dataset to a unique ID
# df_train.label.unique() returns the unique labels in order of first appearance
# enumerate() pairs each label with its position, which becomes its ID
label2id = {label: i for i, label in enumerate(df_train.label.unique())}

# Create a dictionary to map each unique ID back to its corresponding label
# This is the reverse mapping of the label2id dictionary
# The dictionary comprehension iterates over the items in label2id and swaps the keys and values
id2label = {v: k for k, v in label2id.items()}

# Display the label-to-ID and ID-to-label mappings
label2id, id2label

({'IMPROCEDENTE': 0,
  'PROCEDENTE': 1,
  'PARCIALMENTE PROCEDENTE': 2,
  'EXTINTO SEM MÉRITO': 3},
 {0: 'IMPROCEDENTE',
  1: 'PROCEDENTE',
  2: 'PARCIALMENTE PROCEDENTE',
  3: 'EXTINTO SEM MÉRITO'})

In [None]:
# Map the labels in the training dataset to their corresponding IDs
# This replaces the label names with their respective IDs using the label2id dictionary
df_train["label"] = df_train["label"].map(label2id)

# Map the labels in the validation dataset to their corresponding IDs
# This replaces the label names with their respective IDs using the label2id dictionary
df_valid["label"] = df_valid["label"].map(label2id)

# Display the first few rows of the training dataset
# This shows the updated training dataset with labels replaced by their corresponding IDs
df_train.head()

| | text | label |
|---|---|---|
| 1387 | SENTENÇA Vistos etc. Dispensado o relatório, a... | 0 |
| 17972 | SENTENÇA Relatório dispensado. No caso, não há... | 0 |
| 34527 | SENTENÇA Vistos etc. Trata-se de pedido de res... | 1 |
| 58381 | TERMO DE AUDIÊNCIA DE INSTRUÇÃO Ação Especial ... | 1 |
| 56474 | SENTENÇA Trata-se de ação em que a parte autor... | 2 |


In [None]:
import datasets

# Convert the training DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for training and evaluation
dataset_labeled_train = datasets.Dataset.from_pandas(df_train)

# Convert the validation DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for validation and evaluation
dataset_labeled_valid = datasets.Dataset.from_pandas(df_valid)

In [None]:
from transformers import AutoTokenizer
from functools import partial

# Load the tokenizer for the fine-tuned base masked language model (MLM)
# This tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Load the tokenizer for the fine-tuned large masked language model (MLM)
# This tokenizer is used to preprocess the input text for the large model
tokenizer_large = AutoTokenizer.from_pretrained(
    path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm"
)


# Define a function to preprocess the input examples using a specified tokenizer
# The function tokenizes the input text, truncates it to a maximum length of 512 tokens,
# and pads each batch it receives to the length of that batch's longest sequence
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)


# Create a partial function for preprocessing using the base tokenizer
preprocess_function_base = partial(preprocess_function, tokenizer=tokenizer_base)

# Create a partial function for preprocessing using the large tokenizer
preprocess_function_large = partial(preprocess_function, tokenizer=tokenizer_large)

In [None]:
# Tokenize the training dataset using the base tokenizer
# The preprocess_function_base tokenizes the text, truncates it to 512 tokens, and pads the sequences
# The batched=True argument processes the dataset in batches for efficiency
dataset_labeled_train_tokenized_base = dataset_labeled_train.map(
    preprocess_function_base, batched=True
)

# Tokenize the validation dataset using the base tokenizer
dataset_labeled_valid_tokenized_base = dataset_labeled_valid.map(
    preprocess_function_base, batched=True
)

# Tokenize the training dataset using the large tokenizer
dataset_labeled_train_tokenized_large = dataset_labeled_train.map(
    preprocess_function_large, batched=True
)

# Tokenize the validation dataset using the large tokenizer
dataset_labeled_valid_tokenized_large = dataset_labeled_valid.map(
    preprocess_function_large, batched=True
)


In [None]:
from transformers import DataCollatorWithPadding

# Create a data collator for the base tokenizer
# The data collator dynamically pads the input sequences to the maximum length in the batch
# This ensures that all sequences in a batch have the same length, which is required for efficient processing
data_collator_base = DataCollatorWithPadding(tokenizer=tokenizer_base)

# Create a data collator for the large tokenizer
data_collator_large = DataCollatorWithPadding(tokenizer=tokenizer_large)
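
To make dynamic padding concrete, here is a minimal pure-Python sketch of what padding a batch to its longest sequence means. The helper `pad_batch` is hypothetical and only mirrors the idea; the token IDs are placeholders, and `pad_token_id=0` is an assumption for illustration.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for three short inputs
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is chosen per batch, a batch of short texts is never padded up to the global maximum of 512 tokens.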

In [None]:
# Import the evaluate module from the Hugging Face library
import evaluate

# Load the accuracy metric from the evaluate module
# This metric will be used to evaluate the performance of the model
accuracy = evaluate.load("accuracy")

In [None]:
import numpy as np


# Define a function to compute evaluation metrics
# This function will be used to evaluate the performance of the model during training and validation
def compute_metrics(eval_pred):
    # Unpack the predictions and labels from the evaluation tuple
    predictions, labels = eval_pred

    # Convert the model's output logits to predicted class labels
    # np.argmax(predictions, axis=1) selects the index of the maximum logit for each prediction
    predictions = np.argmax(predictions, axis=1)

    # Compute the accuracy metric using the predicted and true labels
    # accuracy.compute() calculates the accuracy of the predictions
    return accuracy.compute(predictions=predictions, references=labels)
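
To see what this function computes, the same calculation can be done by hand with NumPy. The logits and labels below are made-up values for four documents and four labels:

```python
import numpy as np

# Hypothetical logits: one row per document, one column per label
logits = np.array(
    [
        [2.1, 0.3, -1.0, 0.5],
        [0.2, 1.7, 0.1, -0.4],
        [-0.5, 0.9, 1.2, 0.0],
        [1.5, 1.4, -2.0, 0.3],
    ]
)
labels = np.array([0, 1, 1, 0])

# Index of the largest logit in each row is the predicted label
predicted = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the true labels
manual_accuracy = (predicted == labels).mean()
print(predicted, manual_accuracy)
```

Only the third document is misclassified (predicted label 2, true label 1), so three of the four predictions are correct.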

In [None]:
# Determine the number of unique labels in the training dataset
# This will be used to configure the classification model
n_labels = df_train.label.nunique()

# Load the configuration for the base masked language model (MLM) and modify it for sequence classification
# The configuration is loaded from the specified directory and the number of labels is set to n_labels
config_base = AutoConfig.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm",
    num_labels=n_labels,
)

# Load the base masked language model (MLM) and modify it for sequence classification
# The model is loaded from the specified directory and the configuration is set to config_base
classifier_base = AutoModelForSequenceClassification.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm",
    config=config_base,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at outputs/transformers_basics/bert_masked_lm_base/bert-base-portuguese-cased-finetuned-mlm and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import Trainer, TrainingArguments

# Define the training arguments for the base classifier
# These arguments configure various aspects of the training process
training_args_base = TrainingArguments(
    output_dir=path_to_save_lm_base
    / "base_classifier_legal",  # Directory to save the model and other outputs
    learning_rate=2e-5,  # Learning rate for the optimizer
    per_device_train_batch_size=48,  # Batch size for training (adjust based on GPU memory)
    per_device_eval_batch_size=64,  # Batch size for evaluation (adjust based on GPU memory)
    num_train_epochs=5,  # Number of training epochs
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients before updating
    weight_decay=0.01,  # Weight decay for regularization
    bf16=True,  # Use bfloat16 mixed precision (set fp16=True instead if the GPU lacks bf16 support)
    eval_strategy="epoch",  # Evaluate the model after each epoch
    logging_strategy="steps",  # Log training progress every `logging_steps` steps
    save_strategy="epoch",  # Save the model after each epoch
    eval_steps=1,  # Ignored here: only used when eval_strategy="steps"
    save_steps=1,  # Ignored here: only used when save_strategy="steps"
    logging_steps=10,  # Log the training progress after every 10 steps
    load_best_model_at_end=True,  # Load the best model at the end of training
    seed=271828,  # Seed for reproducibility
)

# Create a Trainer instance for the base classifier
# The Trainer handles the training and evaluation of the model
trainer_base = Trainer(
    model=classifier_base,  # The model to be trained
    args=training_args_base,  # Training arguments
    train_dataset=dataset_labeled_train_tokenized_base,  # Training dataset
    eval_dataset=dataset_labeled_valid_tokenized_base,  # Evaluation dataset
    processing_class=tokenizer_base,  # Tokenizer for preprocessing the input text
    data_collator=data_collator_base,  # Data collator for dynamic padding
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics
)

# Train the model using the Trainer
trainer_base.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6011 | 0.608936 | 0.744599 |
| 2 | 0.5274 | 0.584451 | 0.759899 |
| 3 | 0.485 | 0.558055 | 0.775967 |
| 4 | 0.4569 | 0.565196 | 0.777043 |
| 5 | 0.4197 | 0.580136 | 0.779657 |




TrainOutput(global_step=2710, training_loss=0.5146869243291031, metrics={'train_runtime': 3837.1369, 'train_samples_per_second': 67.793, 'train_steps_per_second': 0.706, 'total_flos': 6.844430787637248e+16, 'train_loss': 0.5146869243291031, 'epoch': 5.0})

In [None]:
trainer_base.evaluate()



{'eval_loss': 0.5580551624298096,
 'eval_accuracy': 0.7759667871146306,
 'eval_runtime': 60.6451,
 'eval_samples_per_second': 214.477,
 'eval_steps_per_second': 1.682,
 'epoch': 5.0}

`Can you guess why the accuracy is so low?`


## Understanding Low Accuracy: The Limitation of 512 Tokens

When working with transformer models, it's essential to be aware of a key limitation: many of them, including the BERT models used here, can process at most **512 tokens** per input. This restriction can significantly hurt prediction accuracy, especially when dealing with longer texts.

### The Self-Attention Mechanism and Quadratic Complexity

The 512-token limit is a design choice motivated by the *quadratic complexity* of the **self-attention mechanism**, which is a fundamental component of transformer models. Self-attention allows the model to weigh the importance of each token in relation to others, enabling it to capture context and dependencies within the input text.

However, the computational cost of self-attention grows quadratically with the number of tokens: every token attends to every other token, so doubling the input length roughly quadruples the memory and computation needed for attention. To keep training tractable, BERT was pretrained with position embeddings for at most 512 positions, and models derived from it inherit that limit.
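
A back-of-the-envelope sketch makes the quadratic growth tangible. It counts only the attention score matrices of a single layer (one `n x n` matrix per head) for a single example; the default of 12 heads matches BERT-base, and the helper name is hypothetical:

```python
def attention_score_count(num_tokens, num_heads=12):
    """Entries in one layer's attention score matrices: one n x n matrix per head."""
    return num_heads * num_tokens * num_tokens


for n in (128, 512, 2048, 4096):
    scores = attention_score_count(n)
    # 4 bytes per entry if the scores are stored as 32-bit floats
    mib = scores * 4 / 2**20
    print(f"{n:>5} tokens -> {scores:>12,} scores per layer (~{mib:,.0f} MiB in fp32)")

# An input 8x longer needs 64x as many attention scores
ratio = attention_score_count(4096) / attention_score_count(512)
print(ratio)
```

These figures apply per layer and per example, so the real footprint also scales with the number of layers and the batch size.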

### The Impact of Truncation on Accuracy

When an input text exceeds 512 tokens, it must be truncated to fit within the limit. In this notebook, the tokenizer does this (`truncation=True, max_length=512`), keeping the first 512 tokens and discarding the rest. This truncation can have a detrimental effect on the model's accuracy.

Important information, such as key context or relevant details, may be lost during truncation. The model is forced to make predictions based on an incomplete representation of the original text, leading to lower accuracy scores.

### Strategies for Handling Longer Texts

While the 512-token limit can be challenging, there are several approaches to mitigate its impact:

1. **Sliding Window Approach**:
    - Divide the long text into smaller, overlapping chunks (windows).
    - Process each window individually and aggregate the results.
    - This approach can help capture local context, but it may struggle with long-range dependencies.

2. **Alternative Neural Network Architectures**:
    - Consider using other architectures, such as Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs).
    - These architectures can handle longer sequences without the same token limit constraints.
    - However, they may not capture long-range dependencies as effectively as transformers.

3. **Transformer Variants for Longer Sequences**:
    - Explore transformer-based models specifically designed for handling longer texts, such as Longformer and BigBird.
    - These models introduce modifications to the self-attention mechanism to reduce computational complexity.
    - Keep in mind that these models are relatively new and may have limitations or trade-offs compared to standard transformers.
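The sliding window approach from item 1 can be sketched in plain Python. This is a minimal illustration, not the method used in this class: the function name `sliding_windows` and the default `window_size` and `stride` values are assumptions (510 leaves room for `[CLS]` and `[SEP]` in a 512-token model).

```python
def sliding_windows(token_ids, window_size=510, stride=255):
    """Split a list of token ids into overlapping windows.

    Consecutive windows start `stride` tokens apart, so they overlap by
    window_size - stride tokens. The last window may be shorter.
    """
    if not 0 < stride <= window_size:
        raise ValueError("stride must be between 1 and window_size")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break  # this window already reaches the end of the text
        start += stride
    return windows


ids = list(range(1200))  # stand-in for the token ids of a long document
chunks = sliding_windows(ids)
print(len(chunks), [(c[0], c[-1]) for c in chunks])
```

Each window would then be classified separately, and the per-window predictions combined (for example by averaging the logits). The aggregation rule is a design choice that needs to be validated on your own data.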


To make informed decisions about handling longer texts, you need to understand the characteristics of your dataset. Analyze the average number of tokens per text and the distribution of text lengths.

If a significant portion of your texts exceeds the 512-token limit, consider applying one of the strategies mentioned above. Experiment with different approaches and evaluate their impact on accuracy and computational efficiency.
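One quick way to judge whether "a significant portion" of the texts is affected is to compute the share of documents above the limit. The sketch below assumes you already have a list of token counts per text. The helper name `share_over_limit` and the example numbers are hypothetical.

```python
def share_over_limit(lengths, limit=512):
    """Return the fraction of texts whose token count exceeds `limit`."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > limit) / len(lengths)


# Hypothetical token counts, only for illustration
example_lengths = [151, 480, 512, 1133, 1799, 3031]
print(f"{share_over_limit(example_lengths):.0%} of texts exceed 512 tokens")
```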

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer for the fine-tuned base masked language model (MLM)
# This tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Initialize an empty list to store the sizes of tokenized input sequences
sizes = []

# Iterate over each text in the training dataset
for txt in df_train.text:
    # Tokenize the text without truncation and get the length of the tokenized input sequence
    # Append the length of the tokenized input sequence to the sizes list
    sizes.append(len(tokenizer_base(txt, truncation=False)["input_ids"]))

# Convert the sizes list to a Pandas Series and display descriptive statistics
# This provides an overview of the distribution of tokenized input sequence lengths
pd.Series(sizes).describe()

count    52026.000000
mean      2373.339407
std       1822.717847
min        151.000000
25%       1133.000000
50%       1799.000000
75%       3031.000000
max      11434.000000
dtype: float64

`As we can see above, the average number of tokens in our dataset is 2,373. This is significantly higher than the 512 token limit. Therefore, we need to employ a workaround to handle this limitation. We won't cover more complex approaches in this class, but we can use a simple and effective workaround - understanding our data! Let's see how we can do this.`

In [None]:
df_train.sample(10, random_state=271828)["text"].iloc[0]

"SENTENÇA Tipo A RELATÓRIO Trata-se de ação declaratória de inexistência de débito e indenizatória por danos morais, com pedido de repetição de indébito, ajuizada por Lúcia Matias de Souza em face do Instituto Nacional do Seguro Social – INSS e do Banco Bradesco S/A, em razão da existência de contrato de empréstimo consignado celebrado perante a aludida instituição financeira que, segundo diz a autora, não foi por ela contratado. É o que importa relatar. Passo a decidir. FUNDAMENTAÇÃO Das preliminares arguidas Quanto à preliminar de ilegitimidade passiva alegada pelo INSS (anexo 11), entendo que a Autarquia ré detém legitimidade para figurar no pólo passivo da ação, tendo em vista que é responsável pelo gerenciamento e pagamento dos descontos realizados nos benefícios previdenciários em decorrência de empréstimo consignado. Assim, a partir do momento em que opera o desconto nos valores tem interesse e legitimidade para figurar no pólo passivo da presente demanda. Ademais, só o INSS tem


`Can you notice that the really relevant information for our classification task is not in the beginning of the text, but in the end?`

> (....)
>
> DISPOSITIVO Isso posto, `julgo PROCEDENTE` o pedido para determinar que o INSS cesse os descontos das parcelas do Contratono 808431996. Condeno, também, a título de danos materiais, o Banco Bradesco a devolver os valores descontados com relação aos citados contratos de empréstimo, em dobro, nos termos do art. 42, parágrafo único, do CDC, devendo tais valores serem acrescidos de juros de mora de 1% ao mês desde o evento danoso (súmula 54 – STJ) e correção monetária com base no IPCA-E desde o efetivo prejuízo (súmula 43 – STJ). Condeno, ainda, o bancoréua pagar, a título de indenização por danos morais, a quantia de R$ 5.000,00 (cinco mil reais), valor este que deve ser atualizado exclusivamente pela taxa SELIC desde a publicação desta sentença. Declaro a inexistência do contrato no808431996. Declaro extinto o processo com resolução do mérito, nos termos do art. 487, I, do Código de Processo Civil. Custas e honorários advocatícios indevidos em primeiro grau de jurisdição (art. 55 da Lei no 9.099/95, c/c art. 1o da Lei no 10.259/01). Registre-se. Intimem-se as partes (Lei no 10.259/01, art. 8o). Campina Grande-PB, data supra. JUIZ FEDERAL
>

This is very common in this kind of document. The judge starts with a thorough description of the case and only then states the decision. So, we can use the last 512 tokens of the text to train our model. We just need to set the tokenizer's `truncation_side` attribute to `'left'`.

Let's see how we can do this.

In [None]:
from transformers import AutoTokenizer
from functools import partial

# Load the tokenizer for the fine-tuned base masked language model (MLM)
# This tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Load the tokenizer for the fine-tuned large masked language model (MLM)
# This tokenizer is used to preprocess the input text for the large model
tokenizer_large = AutoTokenizer.from_pretrained(
    path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm"
)

In [None]:
tokenizer_base.truncation_side

'right'

In [None]:
# Tokenize the input text using the base tokenizer
# The padding=True argument pads to the longest sequence in the batch (with a single input, no padding is added)
# The truncation=True argument ensures that the sequence is truncated to the maximum length if it exceeds it
# The max_length=5 argument sets the maximum length of the tokenized sequence to 5 tokens
out_len5 = tokenizer_base(
    "Eu gosto muito de farofa com banana", padding=True, truncation=True, max_length=5
)  # This is to simulate the truncation

# Decode the tokenized input IDs back to a string
# This converts the token IDs back to the corresponding text
# The decoded text will be truncated to the first 5 tokens
tokenizer_base.decode(out_len5["input_ids"])

'[CLS] Eu gosto muito [SEP]'

In [None]:
# Set the truncation side for the base tokenizer to 'left'
# This means that if the input text needs to be truncated, tokens will be removed from the beginning (left side) of the sequence
# This setting is useful when the most important information is at the end of the sequence
tokenizer_base.truncation_side = "left"

In [None]:
# Tokenize the input text using the base tokenizer
# The padding=True argument pads to the longest sequence in the batch (with a single input, no padding is added)
# The truncation=True argument ensures that the sequence is truncated to the maximum length if it exceeds it
# The max_length=5 argument sets the maximum length of the tokenized sequence to 5 tokens
out_len5 = tokenizer_base(
    "Eu gosto muito de farofa com banana", padding=True, truncation=True, max_length=5
)

# Decode the tokenized input IDs back to a string
# This converts the token IDs back to the corresponding text
# With truncation_side="left", the decoded text keeps the last tokens of the sentence instead of the first ones
tokenizer_base.decode(out_len5["input_ids"])

'[CLS] com banana [SEP]'
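To see the logic behind the two outputs above without loading a tokenizer, here is a simplified sketch. It is an illustration under stated assumptions, not the tokenizer's actual implementation: it splits on whitespace instead of using WordPiece, so the kept words differ slightly from the real output, and the function name `truncate` is made up for this example.

```python
def truncate(tokens, max_length, side="right", cls="[CLS]", sep="[SEP]"):
    """Mimic single-sequence truncation when two special tokens are added."""
    budget = max_length - 2  # room left after adding [CLS] and [SEP]
    if budget < 0:
        raise ValueError("max_length must be at least 2")
    if len(tokens) > budget:
        if side == "right":
            tokens = tokens[:budget]  # keep the beginning, drop the end
        else:
            tokens = tokens[len(tokens) - budget:]  # keep the end, drop the beginning
    return [cls, *tokens, sep]


words = "Eu gosto muito de farofa com banana".split()
print(truncate(words, 5, side="right"))
print(truncate(words, 5, side="left"))
```

In both cases the special tokens are preserved, and only the side from which the content is cut changes.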

In [None]:
import datasets

# Convert the training DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for training and evaluation
dataset_labeled_train = datasets.Dataset.from_pandas(df_train)

# Convert the validation DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for validation and evaluation
dataset_labeled_valid = datasets.Dataset.from_pandas(df_valid)

In [None]:
from functools import partial


# Define a function to preprocess the input examples using a specified tokenizer
# The function tokenizes the input text, truncates it to a maximum length of 512 tokens
# (from the side given by tokenizer.truncation_side), and pads to the longest sequence in each batch
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)


# Create a partial function for preprocessing using the base tokenizer
# This partial function allows us to call preprocess_function with only the examples argument,
# as the tokenizer argument is already set to tokenizer_base
preprocess_function_base = partial(preprocess_function, tokenizer=tokenizer_base)

# Create a partial function for preprocessing using the large tokenizer
preprocess_function_large = partial(preprocess_function, tokenizer=tokenizer_large)

In [None]:
# Tokenize the training dataset using the base tokenizer
# The preprocess_function_base tokenizes the text, truncates it to 512 tokens, and pads the sequences
# The batched=True argument processes the dataset in batches for efficiency
dataset_labeled_train_tokenized_base = dataset_labeled_train.map(
    preprocess_function_base, batched=True
)

# Tokenize the validation dataset using the base tokenizer
dataset_labeled_valid_tokenized_base = dataset_labeled_valid.map(
    preprocess_function_base, batched=True
)

In [None]:
from transformers import DataCollatorWithPadding

# Create a data collator for the base tokenizer
# The data collator dynamically pads the input sequences to the maximum length in the batch
# This ensures that all sequences in a batch have the same length, which is required for efficient processing
data_collator_base = DataCollatorWithPadding(tokenizer=tokenizer_base)
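Conceptually, dynamic padding does something like the sketch below: each batch is padded only up to its own longest sequence, and an attention mask marks which positions are real tokens. This is a simplified pure-Python illustration, not the library's implementation. The function name `pad_batch`, the token ids, and `pad_id=0` are assumptions.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative token ids, not produced by a real tokenizer
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```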

In [None]:
# Import the evaluate module from the Hugging Face library
import evaluate

# Load the accuracy metric from the evaluate module
# This metric will be used to evaluate the performance of the model
accuracy = evaluate.load("accuracy")

In [None]:
import numpy as np


# Define a function to compute evaluation metrics
# This function will be used to evaluate the performance of the model during training and validation
def compute_metrics(eval_pred):
    # Unpack the predictions and labels from the evaluation tuple
    predictions, labels = eval_pred

    # Convert the model's output logits to predicted class labels
    # np.argmax(predictions, axis=1) selects the index of the maximum logit for each prediction
    predictions = np.argmax(predictions, axis=1)

    # Compute the accuracy metric using the predicted and true labels
    # accuracy.compute() calculates the accuracy of the predictions
    return accuracy.compute(predictions=predictions, references=labels)
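The same argmax-then-compare logic can be checked by hand on a toy example. The sketch below is written in plain Python so it runs without any dependencies. The function name `argmax_accuracy` and the toy logits and labels are made up for illustration.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row and compare it with the true labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


toy_logits = [[2.0, 0.1], [0.3, 1.2], [0.9, 0.8], [-1.0, 0.5]]
toy_labels = [0, 1, 1, 1]
print(argmax_accuracy(toy_logits, toy_labels))
```

Here the predicted classes are 0, 1, 0, 1, so three of the four examples are correct.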

In [None]:
from transformers import AutoConfig, AutoModelForSequenceClassification

# Determine the number of unique labels in the training dataset
# This will be used to configure the classification model
n_labels = df_train.label.nunique()

# Load the configuration for the base masked language model (MLM) and modify it for sequence classification
# The configuration is loaded from the specified directory and the number of labels is set to n_labels
config_base = AutoConfig.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm",
    num_labels=n_labels,
)

# Load the base masked language model (MLM) and modify it for sequence classification
classifier_base = AutoModelForSequenceClassification.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm",
    config=config_base,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at outputs/transformers_basics/bert_masked_lm_base/bert-base-portuguese-cased-finetuned-mlm and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import Trainer, TrainingArguments

# Define the training arguments for the base classifier
# These arguments configure various aspects of the training process
training_args_base = TrainingArguments(
    output_dir=path_to_save_lm_base / "base_classifier_legal",  # Directory to save the model and other outputs
    learning_rate=2e-5,  # Learning rate for the optimizer
    per_device_train_batch_size=48,  # Batch size per device for training (adjust based on GPU memory)
    per_device_eval_batch_size=64,  # Batch size per device for evaluation (adjust based on GPU memory)
    num_train_epochs=5,  # Number of training epochs
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients before updating
    weight_decay=0.01,  # Weight decay for regularization
    bf16=True,  # Use bfloat16 mixed precision for training (requires hardware support)
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    logging_strategy="steps",  # Log the training progress every `logging_steps` steps
    save_strategy="epoch",  # Save a checkpoint at the end of each epoch
    eval_steps=1,  # Ignored here: only used when eval_strategy="steps"
    save_steps=1,  # Ignored here: only used when save_strategy="steps"
    logging_steps=10,  # Log the training progress every 10 steps
    load_best_model_at_end=True,  # Reload the best checkpoint (lowest validation loss by default) at the end
    seed=271828,  # Seed for reproducibility
)

# Create a Trainer instance for the base classifier
# The Trainer handles the training and evaluation of the model
trainer_base = Trainer(
    model=classifier_base,  # The model to be trained
    args=training_args_base,  # Training arguments
    train_dataset=dataset_labeled_train_tokenized_base,  # Training dataset
    eval_dataset=dataset_labeled_valid_tokenized_base,  # Evaluation dataset
    processing_class=tokenizer_base,  # Tokenizer for preprocessing the input text
    data_collator=data_collator_base,  # Data collator for dynamic padding
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics
)

# Train the model using the Trainer
trainer_base.train()



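| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.1213 | 0.125238 | 0.955178 |
| 2 | 0.0769 | 0.121206 | 0.957408 |
| 3 | 0.0936 | 0.12133 | 0.960483 |
| 4 | 0.0874 | 0.118434 | 0.961252 |
| 5 | 0.0613 | 0.12168 | 0.96179 |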




TrainOutput(global_step=2710, training_loss=0.11554007523614102, metrics={'train_runtime': 3846.022, 'train_samples_per_second': 67.636, 'train_steps_per_second': 0.705, 'total_flos': 6.844430787637248e+16, 'train_loss': 0.11554007523614102, 'epoch': 5.0})
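The `global_step` value can be sanity-checked with a little arithmetic. The sketch below is a simplified estimate of how the number of optimizer steps follows from the dataset size and batch settings. The helper name `total_optimizer_steps` is made up, and the number of devices is an assumption: the notebook does not state the hardware, but the reported 2,710 steps match what two devices would produce.

```python
import math


def total_optimizer_steps(n_examples, per_device_batch, n_devices, grad_accum, epochs):
    """Estimate optimizer steps: examples are split into effective batches each epoch."""
    effective_batch = per_device_batch * n_devices * grad_accum
    return math.ceil(n_examples / effective_batch) * epochs


# 52,026 training examples, batch size 48 per device, 5 epochs
print(total_optimizer_steps(52026, 48, n_devices=1, grad_accum=1, epochs=5))
print(total_optimizer_steps(52026, 48, n_devices=2, grad_accum=1, epochs=5))
```

With a single device the estimate is 5,420 steps, while two devices give the 2,710 steps reported above.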

In [None]:
trainer_base.evaluate()



{'eval_loss': 0.11843354254961014,
 'eval_accuracy': 0.9612516337356808,
 'eval_runtime': 60.1118,
 'eval_samples_per_second': 216.38,
 'eval_steps_per_second': 1.697,
 'epoch': 5.0}

Our model's accuracy rose from 77.6% to 96.1%, a substantial improvement.

The error rate makes this gain easier to appreciate. It is simply calculated as (1 - accuracy), so our initial error rate was 22.4% and our improved error rate is 3.9%.

To put this into perspective, we've reduced the error rate by a factor of nearly six. In other words, the model now makes roughly one mistake for every six it made before.
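The error-rate comparison can be reproduced directly from the two `eval_accuracy` values reported by `trainer.evaluate()` before and after changing the truncation side. The variable names below are chosen only for this illustration.

```python
acc_before = 0.7759667871146306  # eval_accuracy with the default right-side truncation
acc_after = 0.9612516337356808   # eval_accuracy with left-side truncation

err_before = 1 - acc_before
err_after = 1 - acc_after
reduction_factor = err_before / err_after

print(f"error rate before: {err_before:.1%}")
print(f"error rate after:  {err_after:.1%}")
print(f"reduction factor:  {reduction_factor:.1f}x")
```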

By using the last 512 tokens in the text data, we were able to direct the focus of our model towards the most relevant information. This approach is a simple yet effective workaround to overcome the 512 token limitation in transformers.

This method may seem simple, but it has proven to be an effective way to work around the limit when the relevant information sits in a predictable part of the text. `Remember, sometimes simplicity is the key to mastering complex challenges!`

# External resources

- [Large Language Models for the curious beginner](https://www.youtube.com/watch?v=LPZh9BOjkQs)
- [How I Finally Understood Self-Attention (With PyTorch)](https://www.youtube.com/watch?v=FepOyFtYQ6I)
- [Visualizing transformers and attention | Talk for TNG Big Tech Day '24](https://www.youtube.com/watch?v=KJtZARuO3JY)
- [Transformers (how LLMs work) explained visually](https://www.youtube.com/watch?v=wjZofJX0v4M)
- [Attention in transformers, step-by-step](https://www.youtube.com/watch?v=eMlx5fFNoYc)
- [How might LLMs store facts](https://www.youtube.com/watch?v=9-Jl0dxWQs8&t=354s)
- [2024 in Post-Transformer Architectures: State Space Models, RWKV [Latent Space LIVE! @ NeurIPS 2024]](https://www.youtube.com/watch?v=LPe6iC73lrc)
- [Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference - ModernBERT Paper](https://arxiv.org/abs/2412.13663)


## Takeaways
- Transformers have become a fundamental tool in NLP, enabling more effective handling of long-range dependencies and parallelization compared to traditional sequential models.

- The attention mechanism allows transformers to capture complex relationships and focus on the most relevant information, leading to improved performance on various NLP tasks.

- While the quadratic complexity of transformers poses challenges for longer texts, ongoing research aims to develop more efficient variants and attention mechanisms to overcome these limitations.

- The successful application of transformers beyond NLP highlights their versatility in capturing patterns and dependencies in structured data across different domains.

- Understanding the architecture and components of transformers, such as the encoder-decoder structure, self-attention, and positional encoding, is crucial for effectively employing their capabilities.

- Familiarity with common transformer architectures like BERT, GPT, RoBERTa, T5, and XLNet allows practitioners to choose the most suitable model for their specific NLP task.

- Transformers can be powerful feature extractors, providing rich representations that capture syntactical and contextual information for downstream tasks.

- Following best practices, such as starting with pretrained models, fine-tuning on domain-specific data, and task-specific training, can help achieve optimal results when using transformers.

- Awareness of the 512-token limit and developing strategies to handle longer texts, such as focusing on the most relevant information or using sliding window approaches, is essential for maintaining accuracy in real-world applications.

# Questions

1. What is the key advantage of transformers compared to traditional sequential models like RNNs and LSTMs?

2. What is the role of the attention mechanism in transformers?

3. What are the main components of a typical transformer architecture?

4. What is the impact of quadratic complexity on the performance of transformers for longer texts?

5. How have transformers been applied beyond natural language processing (NLP)?

6. What are some common transformer architectures used for NLP tasks?

7. How can transformers be used as feature extractors?

8. What are the key steps for using transformers on a specific task?

9. What is the limitation of the 512-token limit in transformers, and how does it impact accuracy?

10. What is a simple workaround to handle the 512-token limit and improve accuracy?

`Answers are commented inside this cell`
<!--
1. Transformers can handle long-range dependencies effectively and parallelize computations, avoiding the vanishing gradient problem that affects RNNs and, to a lesser extent, LSTMs.

2. The attention mechanism allows the model to focus on the most relevant parts of the input sequence when predicting a specific output, enabling it to capture complex relationships and dependencies between words.

3. A typical transformer consists of an encoder and a decoder, each composed of multiple identical layers. The key components include self-attention layers, feed-forward neural networks, and positional encoding.

4. The quadratic complexity of transformers results in slower training times and high memory consumption when dealing with longer sequences, hindering their practicability in scenarios involving extensive texts.

5. Transformers have been successfully applied in various domains, including computer vision (e.g., Image Transformer, Vision Transformer), music generation (e.g., MuseNet), speech recognition (e.g., Speech-Transformer), and video processing (e.g., Video Transformer).

6. Common transformer architectures for NLP include BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pretrained Transformer), RoBERTa (Robustly Optimized BERT Approach), T5 (Text-to-Text Transfer Transformer), and XLNet.

7. Transformers can be used as feature extractors by utilizing their ability to capture rich syntactical and contextual information from text data. The extracted features, typically represented by the vector corresponding to the [CLS] token, can be used as input for downstream tasks like classification or regression.

8. The key steps for using transformers include starting with a pretrained model, optionally fine-tuning the model on domain-specific text, training the model for the specific task using task-specific data, and using the model as a feature extractor if needed.

9. Most transformer models can only process a maximum of 512 tokens due to the quadratic complexity of the self-attention mechanism. When an input text exceeds this limit, it is truncated, potentially losing important information and leading to lower accuracy in predictions.

10. A simple workaround is to focus on the most relevant information in the text data. For example, in legal documents where the decision is often at the end, using the last 512 tokens of the text can significantly improve accuracy by directing the model's attention to the most important part of the document. -->