# Transformers
## Deep Learning and Generative AI
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

# Summary
## Key Points

1. **Transformers revolutionize NLP by enabling parallel processing and capturing long-range dependencies** through self-attention mechanisms, overcoming the limitations of sequential models like RNNs and LSTMs.

2. **Self-attention allows transformers to weigh the importance of each token relative to all others**, capturing contextual relationships across the entire input sequence.

3. **Transformer architecture consists of encoder and decoder components** built with layers of self-attention, feed-forward networks, residual connections, and layer normalization.

4. **Positional encoding integrates sequence order into transformers**, typically using sine and cosine functions, ensuring the model recognizes the position of tokens within the input.

5. **The quadratic complexity of self-attention makes long sequences expensive to process**; many pretrained models (e.g., BERT) therefore cap their input at 512 tokens, posing computational challenges for tasks involving longer texts.

6. **Transformers have been successfully adapted beyond NLP**, applied in domains such as computer vision (Vision Transformers), music generation, speech recognition, and video processing.

7. **Key transformer architectures include** BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), T5 (Text-to-Text Transfer Transformer), and Transformer-XL, each tailored to specific tasks and data types.

8. **Transformers can be used as feature extractors**, providing rich contextual embeddings (e.g., from the `[CLS]` token) for downstream tasks like classification.

9. **Key steps for using transformers include selecting a pretrained model**, optionally fine-tuning it on domain-specific data, and conducting task-specific training by adding appropriate heads or layers.

10. **Strategies are needed to handle the input-length limit (512 tokens for BERT-style models)**, such as adjusting the truncation side to retain important information or using models designed to process longer sequences.

11. **Fine-tuning transformers on domain-specific text enhances performance** by adapting the model to the specific language patterns and styles of the domain.

12. **Perplexity is a crucial metric for assessing language models**, with lower perplexity indicating better predictive performance and understanding of the language data.

## Takeaways

1. **Transformers are foundational in modern NLP**, offering powerful capabilities to model complex language patterns and dependencies through their self-attention mechanisms.

2. **Understanding transformer architecture and components is essential** for effectively applying them to various tasks, including awareness of their limitations and computational considerations.

3. **Adapting transformers to domain-specific data through fine-tuning significantly improves their effectiveness**, enabling models to handle specialized vocabularies and styles.

4. **Practitioners must address the input-length limit (512 tokens for BERT-style models)** by employing strategies to handle long texts, such as focusing on the most relevant information to prevent loss due to truncation.

5. **The versatility of transformers across different domains underscores their power**, demonstrating applicability beyond NLP to areas like computer vision and speech processing.

6. **Selecting the appropriate transformer architecture is crucial**, as different models have unique strengths suited to specific tasks, whether understanding context, generating text, or handling sequence-to-sequence transformations.

7. **Using transformers as feature extractors provides rich embeddings for downstream tasks**, leveraging the deep contextual understanding captured by the models.

8. **Effective use of transformers involves leveraging pretrained models**, considering domain adaptation through fine-tuning, and being mindful of limitations such as sequence length and computational resources.


# Transformers

Transformers have significantly advanced the field of Natural Language Processing (NLP) by enabling models to capture long-range dependencies and complex contextual relationships within text data. Since their introduction by Vaswani et al. in 2017, transformers have become the foundational architecture for a wide array of NLP tasks such as machine translation, text summarization, and question answering.

## Moving Beyond Sequential Processing

Traditional models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) process sequences token by token, maintaining an internal state that captures previous information. This sequential processing limits their ability to capture long-range dependencies due to issues like the vanishing gradient problem and inhibits parallelization during training, leading to increased computational time.

Transformers address these limitations by processing entire sequences simultaneously, allowing them to model relationships between all tokens in a sequence regardless of their positions. This parallel processing capability not only improves computational efficiency but also enhances the model's ability to understand complex dependencies within the data.

## The Attention Mechanism: Focusing on What Matters

At the core of the transformer architecture is the **attention mechanism**. Attention allows the model to weigh the relevance of different parts of the input data when generating each part of the output. This mechanism enables the model to focus on the most pertinent elements, enhancing its ability to capture context and relationships between tokens.

For instance, consider machine translation from English to French. When translating the English pronoun "it," the correct French translation (*"il"* or *"elle"*) depends on the grammatical gender of the noun it refers to. The attention mechanism helps the model focus on the relevant noun, ensuring correct gender agreement in the translation.

## Anatomy of a Transformer

A typical transformer model consists of two main components:

- **Encoder:** Processes the input sequence and generates a contextualized representation for each token.
- **Decoder:** Takes the encoder's output and generates the output sequence, one token at a time, while attending to both the encoder's output and the previously generated tokens.

<p align="center">
<img src="images/transformers_basic.png" alt="" style="width: 40%; height: 40%"/>
</p>

Each of these components is composed of multiple layers that include critical sub-components:

### 1. Self-Attention Mechanism

The **self-attention** mechanism allows the model to consider the relationship between a token and all other tokens in the sequence. This is crucial for understanding context, as the meaning of a word often depends on the surrounding words.

#### How Self-Attention Works

1. **Input Transformations:**

   - For each token in the sequence, the model computes three vectors:

     - **Query (Q) Vector**
     - **Key (K) Vector**
     - **Value (V) Vector**

     These vectors are linear transformations of the input embeddings, capturing different aspects of the token's representation.

2. **Calculating Attention Scores:**

   - The attention score between two tokens is computed by taking the dot product of their Query and Key vectors:

     $$
     \text{Attention Score} = Q \cdot K^T
     $$

     This score reflects how much attention the model should pay to one token when processing another.

3. **Scaling and Normalization:**

   - The attention scores are scaled by dividing by the square root of the dimensionality of the Key vectors to mitigate issues with large dot product values:

     $$
     \text{Scaled Attention Score} = \frac{Q \cdot K^T}{\sqrt{d_k}}
     $$

   - A softmax function is then applied to obtain normalized attention weights that sum to one.

4. **Weighted Sum of Values:**

   - The attention weights are used to compute a weighted sum of the Value vectors, producing an output vector that captures aggregated information from the entire sequence.

This process allows the model to dynamically weight the influence of each token based on its relevance to others, effectively capturing context.
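
The four steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the toy sizes, the random projection matrices `W_q`, `W_k`, `W_v`, and the helper names are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # step 1: linear projections
    scores = Q @ K.T                            # step 2: (n, n) raw scores
    scaled = scores / np.sqrt(K.shape[-1])      # step 3: scale by sqrt(d_k)
    weights = softmax(scaled, axis=-1)          # step 3: each row sums to 1
    return weights @ V, weights                 # step 4: weighted sum of values

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                       # toy sizes (assumed)
X = rng.normal(size=(n, d_model))               # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)                 # (4, 8) (4, 4)
```

Row `i` of `weights` holds the attention that token `i` pays to every token in the sequence, which is why the matrix is `n × n`.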

#### Multi-Head Attention

To enhance the model's ability to capture different types of relationships, transformers employ **multi-head attention**. This involves running multiple self-attention mechanisms, or "heads," in parallel. Each head learns to focus on different aspects or positions in the sequence.

- The outputs from all heads are concatenated and linearly transformed to form the final output of the self-attention layer.

> **Note:** Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, enriching the model's capacity to understand complex patterns.
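
As a rough sketch of the idea, the code below splits the projected vectors into `num_heads` slices, runs scaled dot-product attention on each slice in parallel, concatenates the results, and applies a final linear map `W_o`. The sizes and random weights are assumptions for illustration, and `d_model` is assumed to be divisible by `num_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (heads, n, n)
    heads = softmax(scores) @ V                              # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # concatenate heads
    return concat @ W_o                                      # final linear map

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2                              # toy sizes (assumed)
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(Y.shape)                                               # (5, 8)
```

Each head works with only `d_model / num_heads` features, so the total cost stays comparable to one full-width head.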

### 2. Position-Wise Feed-Forward Networks

After the self-attention layer, each position in the sequence is passed through a fully connected feed-forward network (FFN). This network consists of two linear transformations with a non-linear activation function (usually ReLU) in between.

- **Operations:**

  $$
  \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
  $$

- **Characteristics:**

  - The same feed-forward network is applied independently to each position.
  - Adds depth and non-linearity to the model, enabling it to learn more complex functions.
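
The formula translates almost directly into NumPy. The sketch below uses assumed toy sizes and random weights; it also checks the "applied independently to each position" property by processing one position on its own.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (position) of x.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 32                      # toy sizes (assumed)
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

y = position_wise_ffn(x, W1, b1, W2, b2)
# Feeding a single position alone gives the same vector as in the full batch.
row0 = position_wise_ffn(x[:1], W1, b1, W2, b2)
print(y.shape, np.allclose(y[:1], row0))
```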

### 3. Positional Encoding: Injecting Order into the Model

Since transformers process all tokens simultaneously without built-in regard to their position, they need a method to incorporate information about the order of the sequence. **Positional encoding** achieves this by adding a unique positional vector to each token's embedding.

#### Implementation of Positional Encoding

A common approach uses sine and cosine functions of different frequencies:

- For position $ pos $ and dimension $ i $:

  $$
  \text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
  $$
  $$
  \text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
  $$

- **Advantages:**

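  - Provides the model with a sense of relative positioning: for any fixed offset $ k $, $ \text{PE}_{pos+k} $ is a linear function of $ \text{PE}_{pos} $.
  - May allow the model to extrapolate to sequences longer than those seen during training (a hypothesis stated in the original paper, not a guarantee).
  - Requires no learned parameters, since the encoding is computed directly from the position.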

> **Example Analogy:** Think of positional encoding as adding a unique rhythm to each word based on its position, allowing the model to distinguish between "the cat sat on the mat" and "on the mat sat the cat," despite having the same words.
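
The two formulas above can be computed for all positions at once. This is a minimal sketch that assumes an even `d_model`; the function name and the sizes in the usage lines are illustrative choices.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix; d_model is assumed to be even."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]           # the values 2i: 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

The matrix is added element-wise to the token embeddings, so it must share their dimensionality `d_model`.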

### 4. Residual Connections and Layer Normalization

To facilitate training deeper models, transformers use **residual connections** around the sub-layers, followed by **layer normalization**.

- **Residual Connections:**

  - Add the input of a sub-layer to its output:

    $$
    \text{Output} = \text{LayerNorm}(x + \text{SubLayer}(x))
    $$

  - Helps in mitigating the vanishing gradient problem by allowing gradients to flow directly through the network.

- **Layer Normalization:**

  - Normalizes the output across the features for each position, stabilizing and speeding up training.
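
A small sketch of the "add and normalize" step follows. The `tanh` sub-layer is only a stand-in for attention or the feed-forward network, and the learnable scale `gamma` and shift `beta` are set to their usual initial values of one and zero; all names and sizes here are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature axis separately for each position.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # Output = LayerNorm(x + SubLayer(x))
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 positions, 8 features
gamma, beta = np.ones(8), np.zeros(8)
out = add_and_norm(x, lambda h: np.tanh(h), gamma, beta)
# Each position now has approximately zero mean and unit variance.
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(3))
```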

## Advantages of Transformers

- **Parallelization:** By processing entire sequences at once, transformers make efficient use of modern hardware, reducing training times significantly compared to sequential models.

- **Handling Long-Range Dependencies:** The attention mechanism allows direct connections between any two tokens in a sequence, regardless of their distance, effectively capturing long-range dependencies.

- **Flexibility in Modeling Relationships:** Multi-head attention enables the model to learn nuanced relationships from different subspaces.

## FAQ

- **Do Transformers Ignore Sequence Order?**

  - **No.** While transformers process tokens in parallel, positional encoding ensures that the model is aware of the order of tokens, allowing it to capture sequential information.

- **Are Transformers Only for NLP?**

  - **Not anymore.** While initially developed for language tasks, the transformer architecture has been adapted for various domains, including computer vision (e.g., Vision Transformers) and reinforcement learning.

- **Is Attention All You Need?**

  - The catchphrase from the original paper suggests the power of attention mechanisms, but transformers also rely on other components like feed-forward networks, residual connections, and proper regularization to perform effectively.

---

## Quadratic Complexity in Transformers

Transformers have dramatically advanced the field of natural language processing (NLP), achieving state-of-the-art performance in a variety of tasks. A key innovation in Transformers is the self-attention mechanism, which allows the model to capture contextual relationships between all elements in an input sequence. However, despite their successes, Transformers face a significant limitation: when processing long sequences, their computational complexity becomes prohibitive due to its quadratic nature with respect to the sequence length.

### Understanding the Quadratic Complexity

The computational complexity of the self-attention mechanism in Transformers is **O(n²)**, where *n* is the length of the input sequence. This quadratic complexity arises because, in the self-attention layer, each token (word or sub-word unit) attends to every token in the sequence, itself included. Specifically, for each token, the model computes attention scores (dot products) against all *n* tokens to determine how relevant each one is to it.

As a result, for a sequence of length *n*, the number of attention score computations is *n* × *n* = *n²*. This is illustrated in the following figure:

<p align="center">
<img src="images/transformer_quadratic.webp" alt="Quadratic complexity in self-attention" style="width: 50%; height: 50%"/>
</p>

*Figure: For a sequence of length 9, the self-attention mechanism results in 9² = 81 computations.*

This quadratic scaling affects both:

- **Computational Time**: As the sequence length *n* increases, the time required to compute all attention scores grows quadratically, leading to longer processing times.
  
- **Memory Usage**: The model must store the *n × n* attention matrix, which holds all the attention scores between tokens. This storage requirement can quickly exceed available memory for large *n*.

An analogy to understand this complexity is to consider a meeting where every person needs to directly communicate with every other person. In small meetings, this is manageable, but as the number of people increases, the number of direct interactions grows rapidly, making the meeting chaotic and unmanageable.

#### Mathematical Explanation

The self-attention operation involves computing a matrix of attention scores, **A**, where each element **A<sub>i,j</sub>** represents the attention from token *i* to token *j*. Computing **A** requires **O(n²)** dot products (each costing O(*d<sub>k</sub>*) operations, so O(*n²d<sub>k</sub>*) in total), and storing **A** requires **O(n²)** memory.
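
To make the growth concrete, the short calculation below counts the entries of **A** and the memory needed to store them for a single head in a single layer. It assumes 4 bytes per score (32-bit floats); real models multiply this by the number of heads, layers, and batch elements.

```python
def attention_matrix_bytes(seq_len, bytes_per_score=4):
    # One score per (query, key) pair: n * n entries in total.
    return seq_len * seq_len * bytes_per_score

for n in (512, 2048, 8192, 32768):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {n * n:>13,} scores, {mib:>7.1f} MiB per head per layer")
```

Under these assumptions, 512 tokens need 1 MiB while 32,768 tokens need 4,096 MiB: a 64-fold increase in length costs 4,096 times more memory.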

### Impact on Real-World Applications

The quadratic complexity imposes practical limitations when applying Transformers to tasks involving long sequences. Some real-world applications affected include:

1. **Question Answering**

   In question answering, especially open-domain or extractive tasks, the model needs to consider a lengthy context to find the correct answer within large documents or multiple passages. Processing these long contexts with standard Transformers is computationally intensive.

2. **Machine Translation**

   Translating long sentences or entire documents requires the model to capture dependencies across the entire text. The quadratic complexity makes it challenging to efficiently process long input or output sequences, potentially limiting the translation of lengthy texts.

3. **Summarization**

   Summarizing long articles or reports requires understanding and condensing information from extensive input. The computational demands of processing long sequences can impede the use of Transformers for such tasks.

4. **Language Modeling**

   In language modeling and generation tasks, the model often needs to consider long-range dependencies to generate coherent text. The quadratic scaling limits the feasible context length that the model can handle.

5. **Text Classification**

   Classifying documents like legal briefs or research articles requires processing the entire text to capture nuanced information that determines the correct class. The memory and computational constraints can hinder model performance on these tasks.

In all these scenarios, the quadratic complexity leads to:

- **Higher Computational Costs**: Training and inference times increase significantly with sequence length.
  
- **Increased Memory Consumption**: Memory requirements may exceed hardware limitations, leading to the need for model adjustments or specialized hardware.

These limitations constrain the applicability of Transformers in processing long texts, which is a significant drawback given the importance of understanding long-range dependencies in many NLP tasks.

### Addressing the Quadratic Complexity

To mitigate the challenges posed by the quadratic complexity in Transformers, researchers have proposed several approaches:

1. **Sparse Attention Mechanisms**

   Sparse attention mechanisms limit the number of tokens each token attends to, reducing the computational burden. Rather than computing attention scores with all tokens, each token attends to a subset of relevant tokens.

   Examples include:

   - **Local Attention**: Tokens attend only to a fixed window of neighboring tokens. This approach assumes that relevant information is often located nearby in the sequence.
   
   - **Strided Attention**: Tokens attend to a subset of tokens at regular intervals, capturing long-range dependencies without exhaustive computations.
   
   - **Adaptive Sparsity**: The model dynamically selects which tokens to attend to based on learned importance scores.

   By reducing the number of attention calculations, sparse attention mechanisms lower both computation and memory requirements from **O(n²)** to sub-quadratic costs such as **O(n)**, **O(n log n)**, or **O(n√n)**, depending on the method.

2. **Efficient Attention Algorithms**

   Several algorithms have been developed to compute attention more efficiently:

   - **Linformer**: Projects the sequence into a lower-dimensional space, reducing the attention matrix size.
   
   - **Reformer**: Uses locality-sensitive hashing (LSH) to approximate attention computations, focusing on similar tokens.
   
   - **Performer**: Employs random feature methods to approximate the softmax function in attention, achieving linear complexity.

3. **Segmented or Chunked Processing**

   The sequence is divided into smaller segments or chunks processed separately. While this reduces computational load, it may limit the model's ability to capture dependencies across segments. Techniques like windowed attention combined with cross-chunk connections aim to mitigate this issue.

4. **Memory-Augmented Models**

   Models incorporate external memory structures to store and retrieve information without processing the entire sequence at each step. This allows the model to access important information from previous tokens without the need for full self-attention over long sequences.

5. **Long Range Arena (LRA) Benchmark**

   The **Long Range Arena** is a benchmark designed to evaluate models' ability to handle long-context sequences efficiently. It includes tasks that require capturing long-range dependencies, providing a standardized way to assess and compare the performance of various efficient Transformer models.

   Researchers use LRA to test new architectures and attention mechanisms, pushing the development of models that can process longer sequences without prohibitive computational costs.
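
Returning to the local attention pattern listed under sparse attention mechanisms, the sketch below builds a fixed-window mask and applies it before the softmax. It is only meant to show the pattern: the dense `n × n` mask is built here for clarity, whereas efficient implementations avoid materializing the full matrix. The window size and helper names are assumptions.

```python
import numpy as np

def local_attention_mask(n, window):
    """Boolean (n, n) mask that is True where |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

n, window = 8, 2
mask = local_attention_mask(n, window)
rng = np.random.default_rng(0)
weights = masked_softmax(rng.normal(size=(n, n)), mask)   # random stand-in scores
print(mask.sum(), "of", n * n, "scores kept")              # 34 of 64 scores kept
```

With a fixed window `w`, each token keeps at most `2w + 1` scores, so the number of retained scores grows linearly with `n` rather than quadratically.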

### Ongoing Research and Future Directions

The challenge of quadratic complexity in Transformers is an active area of research. Key goals include:

- **Scaling to Longer Sequences**: Developing models that can handle sequences of tens of thousands of tokens efficiently, enabling applications like full-document understanding.

- **Balancing Efficiency and Performance**: Designing attention mechanisms that reduce computational demands while maintaining or improving model accuracy.

- **Hardware Optimizations**: Leveraging specialized hardware, such as GPUs and TPUs, and optimization techniques to improve computational efficiency.

- **Theoretical Advances**: Understanding the fundamental limits of attention mechanisms and exploring alternative architectures that capture long-range dependencies with lower complexity.

> 
> **Note**: While quadratic complexity poses challenges, Transformers continue to be highly effective in capturing contextual information in sequences. The development of more efficient variants and attention mechanisms is crucial for expanding their applicability to tasks involving longer texts. As research progresses, we can expect Transformers to become more capable of handling long sequences efficiently, enhancing their impact in natural language processing and beyond.

## Applications of Transformers Beyond NLP

Transformers, initially introduced for Natural Language Processing (NLP) tasks, have revolutionized how we model sequential data. Their core innovation, the self-attention mechanism, allows the modeling of complex patterns and long-range dependencies without relying on recurrence. This capability extends beyond text, enabling Transformers to excel in various domains. In this section, we explore how Transformers have been adapted for tasks in computer vision, music generation, speech recognition, and video processing.

### Computer Vision

Understanding images requires capturing both local and global patterns. Traditional Convolutional Neural Networks (CNNs) are effective at modeling local features through convolutions but can struggle with global dependencies due to limited receptive fields.

#### Vision Transformer (ViT)

The **Vision Transformer (ViT)** adapts the Transformer architecture for image classification. Instead of using convolutions, ViT divides an image into a sequence of patches, treating each patch as a token:

1. **Image Patchification**: An image $ \mathbf{I} \in \mathbb{R}^{H \times W \times C} $ is split into $ N $ patches $ \{ \mathbf{x}_p^i \} $, each of size $ P \times P $, where $ H $ and $ W $ are image dimensions, and $ C $ is the number of channels.

2. **Patch Embedding**: Each patch $ \mathbf{x}_p^i $ is flattened and projected to a latent vector $ \mathbf{z}_0^i $ using a learnable linear transformation:

   $$
   \mathbf{z}_0^i = \mathbf{E} \cdot \mathrm{flatten}(\mathbf{x}_p^i) + \mathbf{E}_{pos}^i,
   $$

   where $ \mathbf{E} $ is the embedding matrix and $ \mathbf{E}_{pos}^i $ is the position embedding for patch $ i $.

3. **Transformer Encoding**: The sequence $ \{\mathbf{z}_0^i\} $ is input to standard Transformer encoder layers to model relationships between patches.

4. **Classification**: A special classification token $ \mathbf{z}_0^0 $ is prepended to the sequence, whose final representation $ \mathbf{z}_L^0 $ is used for classification.

By using self-attention, ViT captures both local and global dependencies across all patches, enabling effective image recognition without convolutions. ViT has achieved state-of-the-art results on image classification benchmarks, especially when pre-trained on large datasets.
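
The patchification and embedding steps can be sketched with array reshapes. The image size, patch size, embedding width, and random matrices below are toy assumptions; in a trained ViT, `E`, the class token, and the position embeddings are learned parameters.

```python
import numpy as np

def patchify(image, P):
    """Split an (H, W, C) image into N = (H/P) * (W/P) flattened patches."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)                  # (N, P*P*C)

rng = np.random.default_rng(0)
H, W, C, P, D = 32, 32, 3, 8, 16                           # toy sizes (assumed)
image = rng.normal(size=(H, W, C))
patches = patchify(image, P)                               # (16, 192)

E = rng.normal(size=(P * P * C, D))                        # patch embedding matrix
cls_token = rng.normal(size=(1, D))                        # classification token
E_pos = rng.normal(size=(patches.shape[0] + 1, D))         # one row per token
# Prepend the class token, then add position embeddings to every token.
z0 = np.concatenate([cls_token, patches @ E], axis=0) + E_pos
print(patches.shape, z0.shape)                             # (16, 192) (17, 16)
```

The resulting `z0` has `N + 1` rows: one for the classification token and one per patch, ready to be fed to the encoder layers.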

#### Image Transformer

The **Image Transformer** extends Transformers to image generation and manipulation tasks. It treats image pixels or regions as sequences and models the conditional probability of each pixel given previous ones:

$$
P(\mathbf{x}) = \prod_{i=1}^{N} P(x_i \mid x_1, x_2, \dots, x_{i-1}),
$$

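where $ x_i $ represents the $ i $-th pixel or region. Self-attention lets each prediction draw on the previously generated pixels; to keep the cost manageable, the original Image Transformer restricts attention to local neighborhoods rather than the entire image, which still allows for high-quality image synthesis and inpainting.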

### Music Generation

Music is inherently sequential and hierarchical, with long-range dependencies such as repeating melodies and harmonic progressions.

#### MuseNet

**MuseNet** applies Transformers to music composition by treating musical elements as tokens:

- **Tokens**: Notes, durations, instruments, and other musical attributes are encoded as tokens.
- **Sequence Modeling**: The model learns the probability distribution over sequences of tokens, capturing musical structure.

The self-attention mechanism allows MuseNet to consider the entire context of the composition when generating each note, enabling:

- **Polyphony**: Handling multiple simultaneous notes (chords) across different instruments.
- **Style Blending**: Combining elements from various musical genres.

MuseNet can generate complex compositions up to four minutes long, showcasing the Transformer’s ability to model complex, long-range dependencies in music.

### Speech Recognition

Speech recognition involves mapping acoustic signals to text, requiring models to capture temporal interactions over variable-length sequences.

#### Speech-Transformer

The **Speech-Transformer** brings the Transformer architecture to Automatic Speech Recognition (ASR):

1. **Input Representation**: The raw audio waveform is converted into a sequence of acoustic feature vectors $ \{\mathbf{x}_t\} $, such as Mel-frequency cepstral coefficients (MFCCs).

2. **Positional Encoding**: Since Transformers lack built-in sequence ordering, positional encodings $ \mathbf{E}_{pos}^t $ are added to the input features:

   $$
   \mathbf{z}_0^t = \mathbf{x}_t + \mathbf{E}_{pos}^t.
   $$

3. **Encoder-Decoder Architecture**: The model uses a Transformer encoder to process the input sequence and a decoder to generate the corresponding text.

4. **Attention Mechanism**: Self-attention in the encoder captures temporal relationships in speech, while encoder-decoder attention aligns acoustic features with linguistic outputs.

The Speech-Transformer simplifies ASR by eliminating the need for components like Hidden Markov Models (HMMs) or Connectionist Temporal Classification (CTC), while achieving competitive performance.

### Video Processing

Videos present spatiotemporal data, requiring models to capture dependencies both within and across frames.

The **Video Transformer** applies Transformers to video understanding tasks by extending self-attention to spatiotemporal tokens:

1. **Spatiotemporal Tokenization**:

   - **Spatial Patches**: Each frame is divided into patches, similar to ViT.
   - **Temporal Segmentation**: Sequences of patches over time are considered.

2. **Embedding and Position Encoding**:

   $$
   \mathbf{z}_0^{(t,i)} = \mathbf{E} \cdot \mathrm{flatten}(\mathbf{x}_{p}^{(t,i)}) + \mathbf{E}_{pos}^{(t,i)},
   $$

   where $ \mathbf{x}_{p}^{(t,i)} $ is patch $ i $ at time $ t $.

3. **Attention Across Space and Time**: Self-attention computes relationships across all patches and frames:

   $$
   \mathrm{Attention(Q,K,V)} = \mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V,
   $$

   where queries $ Q $, keys $ K $, and values $ V $ aggregate information from both spatial and temporal dimensions.

This approach allows the model to understand motions, actions, and interactions in videos, making it effective for tasks like action recognition and video captioning.
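
A quick calculation shows why joint space-time attention is costly. The clip length, resolution, and patch size below are assumed example values, not settings from a specific model.

```python
def video_token_count(num_frames, height, width, patch):
    # Each frame contributes (height / patch) * (width / patch) patch tokens.
    return num_frames * (height // patch) * (width // patch)

frames, side, patch = 8, 224, 16                  # assumed example clip
tokens = video_token_count(frames, side, side, patch)
per_frame = video_token_count(1, side, side, patch)

joint_scores = tokens ** 2                        # attention over all patches and frames
framewise_scores = frames * per_frame ** 2        # attention within each frame only
print(tokens, joint_scores, framewise_scores)     # 1568 2458624 307328
```

Attending jointly over space and time needs `frames` times as many scores as attending within each frame separately (8 times here).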

### Other Domains

Transformers have been adapted for various other fields:

- **Protein Structure Prediction**: Modeling amino acid sequences to predict 3D structures.
- **Reinforcement Learning**: Treating trajectories of states, actions, and rewards as sequences to be modeled.
- **Multi-Modal Processing**: Combining text, vision, and audio data for tasks like image captioning and audiovisual speech recognition.

> **Key Insight**: Transformers excel in any domain requiring the modeling of complex, long-range dependencies in structured data. Their flexibility and scalability have made them a cornerstone in advancing machine learning across diverse fields.

### FAQ

- **Why use Transformers over traditional models in these domains?**
  
  - Transformers handle long-range dependencies more effectively than models like RNNs or CNNs, especially in scenarios where context from distant parts of the input is crucial.

- **How do positional encodings impact performance?**
  
  - Positional encodings provide sequence order information to the model. Different domains may use absolute or relative positional encodings to better capture the structure of the data.

- **Are Transformers computationally intensive?**
  
  - While self-attention has $ \mathcal{O}(n^2) $ complexity with respect to sequence length $ n $, techniques like sparse attention and patching reduce computational load.



# Common Transformer Architectures for NLP

Transformers have transformed (sorry, pun intended) Natural Language Processing (NLP) with their ability to effectively capture dependencies in sequence data. This has led to the development of several powerful transformer architectures tailored for various NLP tasks. In this section, we will explore five commonly used transformer architectures and discuss their unique characteristics and strengths.

### 1. BERT (Bidirectional Encoder Representations from Transformers)

**BERT**, developed by Google, is a pre-trained transformer model that has significantly advanced NLP. Its key innovation lies in its bidirectional approach to language understanding. Unlike traditional models that process text in a unidirectional (left-to-right or right-to-left) manner, BERT considers both the preceding and following context simultaneously. This bidirectional context allows BERT to capture a more nuanced understanding of language semantics, leading to improved performance on a wide range of NLP tasks.

#### Architecture

BERT's architecture consists of multiple transformer **encoder** layers stacked together. Specifically, BERT employs only the encoder part of the original transformer model, focusing on understanding the input text.

#### Pre-training Tasks

During pre-training, BERT is trained on large amounts of unlabeled text data using two unsupervised learning tasks:

1. **Masked Language Modeling (MLM)**:
   
   - A certain percentage of input tokens are randomly masked, and the model is tasked with predicting the original tokens based on the surrounding context.
   - Formally, given an input sequence $ X = \{x_1, x_2, \dots, x_n\} $, where some tokens are replaced with a special `[MASK]` token, the objective is to minimize the cross-entropy loss:

     $$
     \mathcal{L}_{\text{MLM}} = - \sum_{i \in M} \log P(x_i \mid X_{\setminus i}),
     $$

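     where $ M $ is the set of masked positions, and $ X_{\setminus i} $ denotes the corrupted input sequence, in which the tokens at the masked positions (including position $ i $) have been replaced, so the model must recover $ x_i $ from the remaining context.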

2. **Next Sentence Prediction (NSP)**:

   - The model is trained to predict whether a given pair of sentences $ (A, B) $ are consecutive in the original text.
   - The objective is to minimize the binary classification loss:

     $$
     \mathcal{L}_{\text{NSP}} = - \left[ y \log P(\text{IsNext} \mid A, B) + (1 - y) \log P(\text{NotNext} \mid A, B) \right],
     $$

     where $ y = 1 $ if $ B $ follows $ A $ in the original text, and $ y = 0 $ otherwise.

These pre-training tasks enable BERT to learn deep bidirectional representations, capturing both syntactic and semantic relationships in language.
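
The MLM objective can be sketched as follows. This is a simplified illustration: it replaces every selected token with the mask id, whereas BERT's actual procedure selects 15% of tokens and sometimes substitutes a random token or leaves the token unchanged. The vocabulary size, `mask_id`, and the random `logits` standing in for model outputs are all hypothetical.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob, rng):
    """Replace a random subset of tokens with mask_id; return inputs and mask."""
    token_ids = np.asarray(token_ids)
    is_masked = rng.random(token_ids.shape) < mask_prob
    inputs = np.where(is_masked, mask_id, token_ids)
    return inputs, is_masked

def mlm_loss(logits, targets, is_masked):
    """Sum of -log P(x_i | corrupted input) over the masked positions only."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_log_probs = log_probs[np.arange(len(targets)), targets]
    return -token_log_probs[is_masked].sum()

rng = np.random.default_rng(0)
vocab_size, mask_id = 10, 0                        # hypothetical vocabulary
tokens = np.array([3, 7, 2, 9, 4, 5, 1, 8])
inputs, is_masked = mask_tokens(tokens, mask_id, mask_prob=0.15, rng=rng)
logits = rng.normal(size=(len(tokens), vocab_size))  # stand-in for model output
print(inputs, mlm_loss(logits, tokens, is_masked))
```

Unmasked positions contribute nothing to the loss, which matches the sum over $ i \in M $ in the formula.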

#### Fine-tuning

After pre-training, BERT can be fine-tuned for specific downstream tasks (e.g., question answering, named entity recognition) by adding a simple classification layer on top and training on task-specific data.

### 2. GPT (Generative Pre-trained Transformer)

**GPT**, introduced by OpenAI, is another influential transformer architecture in NLP. Unlike BERT, GPT adopts a unidirectional (autoregressive) approach: it processes text from left to right, so each token can only draw on the tokens that precede it. This makes GPT particularly well-suited for tasks that require text generation, such as language modeling, text completion, and conversational AI.

#### Architecture

GPT's architecture consists of multiple transformer **decoder** layers stacked together. It uses only the decoder part of the original transformer model and includes masked self-attention to prevent the model from seeing future tokens during training.

#### Pre-training Objective

GPT is pre-trained using a standard language modeling objective:

- **Language Modeling (LM)**:

  - The model learns to predict the next token $ x_{t} $ given the previous tokens $ x_{1:t-1} $.
  - The objective is to minimize the negative log-likelihood:

    $$
    \mathcal{L}_{\text{LM}} = - \sum_{t=1}^{n} \log P(x_t \mid x_{1:t-1}).
    $$

This training enables GPT to capture the statistical properties of language, including syntax and semantics.
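The two ingredients above (masked self-attention and the next-token objective) can be sketched in a few lines. This is an illustration under simplifying assumptions: the mask is a plain boolean matrix, and `step_probs` holds hypothetical values of $ P(x_t \mid x_{1:t-1}) $ rather than real model outputs.

```python
import math


def causal_mask(n):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(n)] for t in range(n)]


def sequence_nll(step_probs):
    """Negative log-likelihood: -sum_t log P(x_t | x_{1:t-1})."""
    return -sum(math.log(p) for p in step_probs)


mask = causal_mask(4)
for row in mask:
    # "x" marks an allowed attention link, "." a blocked (future) position
    print(" ".join("x" if allowed else "." for allowed in row))

# Hypothetical next-token probabilities for a 4-token sequence
step_probs = [0.5, 0.25, 0.5, 0.5]
nll = sequence_nll(step_probs)
print(f"NLL = {nll:.4f}")  # log 2 + log 4 + log 2 + log 2 = log 32
```

The lower-triangular pattern printed by the loop is what prevents the model from seeing future tokens during training.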

#### Strengths

- **Text Generation**: GPT excels at generating coherent and fluent text that closely resembles human writing.
- **Few-Shot Learning**: By conditioning on a prompt or a few examples, GPT can perform tasks without task-specific fine-tuning, demonstrating emergent few-shot capabilities.
  
#### Limitations

- GPT relies on left-to-right context, which may limit its understanding of information that appears later in the text.

### 3. RoBERTa (A Robustly Optimized BERT Pretraining Approach)

**RoBERTa**, developed by Facebook AI, is a variant of BERT that aims to improve upon the original by optimizing the training procedure.

#### Key Modifications

1. **Dynamic Masking**:

   - Instead of using a fixed mask for each training instance, RoBERTa applies masking dynamically during training. Each epoch sees a new masking pattern.
   - This exposes the model to more varied contexts, enhancing generalization.

2. **Removal of NSP Task**:

   - RoBERTa eliminates the Next Sentence Prediction task, as studies showed it did not benefit performance significantly.

3. **Training with Larger Batches and More Data**:

   - RoBERTa is trained on a larger corpus and with bigger batch sizes, improving the robustness of the learned representations.

4. **Longer Training Duration**:

   - Extended training allows the model to converge to better optima.
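The difference between static and dynamic masking can be illustrated with a toy sketch. It is deliberately simplified: each token is masked independently with probability `mask_prob` (set to 0.3 here only so that masks show up on a short sentence), and the helper `mask_tokens` is hypothetical, not RoBERTa's actual preprocessing code.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with [MASK] independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


tokens = "o tribunal julgou o recurso improcedente por unanimidade".split()

# Static masking: the pattern is drawn once during preprocessing and reused every epoch
static_view = mask_tokens(tokens, random.Random(0), mask_prob=0.3)
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn every time the example is served
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng, mask_prob=0.3) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} | static:  {' '.join(s)}")
    print(f"epoch {epoch} | dynamic: {' '.join(d)}")
```

With static masking the model always has to recover the same tokens of a given sentence; with dynamic masking the set of hidden tokens changes from one pass to the next.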

#### Impact

These enhancements result in RoBERTa achieving state-of-the-art results on various NLP benchmarks, often outperforming BERT. This highlights the importance of training strategies and hyperparameter optimization in transformer models.

### 4. T5 (Text-to-Text Transfer Transformer)

**T5**, introduced by Google, presents a unified framework by framing every NLP task as a text-to-text problem. This means both the input and output are text strings.

#### Unified Text-to-Text Framework

- **Task Formulation**:

  - All tasks are converted into text-to-text formats.
  - For example:
    - **Translation**: Input: `"translate English to Portuguese: That is good."`, Output: `"Isso é bom."`
    - **Summarization**: Input: `"summarize: O rato branco roedor roeu a roupa do rei de Roma"`, Output: `"O rato roeu a roupa do rei de Roma."`
    - **Sentiment Analysis**: Input: `"sst2 sentence: Essa aula é maravilhosa"`, Output: `"positivo"`

#### Architecture

- T5 uses the standard encoder-decoder Transformer architecture.
- Both the encoder and decoder are composed of several transformer layers.

#### Pre-training Objective

- **Span Corruption (Denoising Objective)**:

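  - Random spans of text are dropped, and each span is replaced with its own unique sentinel token.
  - The model is trained to generate the dropped spans, each introduced by its sentinel token, rather than the full original text.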
  - This generalizes the MLM task by masking contiguous spans rather than individual tokens.
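A small sketch helps to visualize what the encoder input and the decoder target look like under span corruption. Here the spans are chosen by hand instead of being sampled at random, tokens are whitespace-separated words rather than subword pieces, and `span_corrupt` is a hypothetical helper. The `<extra_id_k>` naming follows the sentinel tokens used by the released T5 vocabularies.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    corrupted, target = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # one sentinel stands in for the span
        target.append(sentinel)                 # the target repeats the sentinel...
        target.extend(tokens[start:end])        # ...followed by the dropped tokens
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep the text after the last span
    target.append(f"<extra_id_{len(spans)}>")   # a final sentinel closes the target
    return " ".join(corrupted), " ".join(target)


tokens = "Thank you for inviting me to your party last week .".split()
model_input, model_target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])

print(model_input)   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(model_target)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because the target contains only the dropped spans, it is much shorter than the input, which keeps decoding cheap during pre-training.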

#### Advantages

- **Flexibility**: The text-to-text approach allows T5 to be applied universally across tasks without changing the architecture.
- **Transfer Learning**: Pre-training on multiple tasks improves performance on individual tasks due to shared representations.

### 5. XLNet

**XLNet**, jointly developed by Google Brain and Carnegie Mellon University, combines the strengths of BERT and auto-regressive models like GPT.

#### Key Innovations

1. **Permutation Language Modeling (PLM)**:

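   - XLNet uses a training objective defined over permutations of the factorization order of the input sequence, modeling bidirectional context while maintaining the auto-regressive property. The tokens themselves are not shuffled; only the order in which they are predicted changes.
   - For a sequence $ X = \{x_1, x_2, \dots, x_n\} $, the model maximizes the expected log-likelihood over the set $ \mathcal{Z}_n $ of all permutations of the indices $ \{1, \dots, n\} $ (in practice, factorization orders are sampled rather than enumerated):

     $$
     \mathcal{L}_{\text{PLM}} = \mathbb{E}_{z \sim \mathcal{Z}_n} \left[ \sum_{t=1}^{n} \log P(x_{z_t} \mid x_{z_{<t}}) \right],
     $$

     where $ z $ is a permutation of indices, $ z_t $ is its $ t $-th element, and $ x_{z_{<t}} $ are the tokens preceding $ x_{z_t} $ in the permutation.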

2. **Two-Stream Self-Attention**:

   - Introduces a content stream and a query stream to handle the dependency on target words during training.
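To see why permuting the factorization order yields bidirectional context, the sketch below enumerates every order for a three-token sequence and records which tokens are visible when each target is predicted. It is only an illustration: a real XLNet samples orders instead of enumerating them, and `factorization_contexts` is a hypothetical helper.

```python
from itertools import permutations


def factorization_contexts(tokens, order):
    """For one factorization order z, list (target, context) pairs.

    At step t the model predicts tokens[z_t] given the tokens at positions z_1..z_{t-1}.
    """
    pairs = []
    for t, position in enumerate(order):
        context_positions = sorted(order[:t])  # tokens keep their original positions
        pairs.append((tokens[position], [tokens[p] for p in context_positions]))
    return pairs


tokens = ["New", "York", "is"]
orders = list(permutations(range(len(tokens))))

for order in orders:
    print(order, factorization_contexts(tokens, order))

# All distinct contexts from which "York" (position 1) gets predicted
york_contexts = {
    tuple(context)
    for order in orders
    for target, context in factorization_contexts(tokens, order)
    if target == "York"
}
print(york_contexts)
```

Across the six orders, "York" is predicted from no context, from its left neighbor only, from its right neighbor only, and from both, which is how the model sees both past and future tokens without any `[MASK]` symbol.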

#### Benefits

- **Bidirectional Context**: Captures context from both past and future tokens.
- **Auto-regressive Property**: Retains the benefits of autoregressive modeling, improving the ability to generate coherent sequences.

#### Performance

XLNet has demonstrated improved performance over BERT on several NLP benchmarks, particularly in tasks that benefit from modeling long-range dependencies.


### Model Sizes: Base vs. Large

Transformer architectures often come in different sizes, commonly referred to as **Base** and **Large**, indicating the model's depth and number of parameters.

#### Base Models

- **Architecture**:

  - Moderate depth, e.g., 12 layers.
  - Hidden size (number of units per layer), e.g., 768.
  - Number of attention heads, e.g., 12.

- **Parameters**:

  - Typically around 110 million parameters (e.g., BERT-base).

- **Usage**:

  - Suitable for tasks with limited computational resources.
  - Faster training and inference.

#### Large Models

- **Architecture**:

  - Greater depth, e.g., 24 layers.
  - Larger hidden size, e.g., 1024.
  - More attention heads, e.g., 16.

- **Parameters**:

  - Significantly more parameters, e.g., around 340 million (BERT-large).

- **Usage**:

  - Achieve better performance on complex tasks due to increased capacity.
  - Require more computational resources and memory.

#### Considerations

- **Performance vs. Efficiency Trade-off**:

  - Larger models generally perform better due to their higher capacity to capture complex patterns.
  - However, they are more computationally intensive and may overfit on small datasets.

- **Overfitting Risk**:

  - With limited training data, large models can memorize noise, harming generalization.
  - Techniques like regularization and data augmentation can mitigate overfitting.

- **Practical Approach**:

  - Start with a base model to establish a baseline.
  - Scale up to a large model if higher performance is needed and resources allow.

### FAQ

- **Why are different pre-training tasks important?**

  - Different pre-training tasks help the model learn varied aspects of language:
    - **MLM** captures contextual relationships.
    - **NSP** helps with sentence-level understanding.
    - **PLM** allows modeling of bidirectional context without masking tokens.

- **What is the impact of bidirectionality in models like BERT and XLNet?**

  - Bidirectional models can use both past and future context, leading to better understanding of language semantics.
  - This is particularly beneficial for tasks requiring comprehension of the entire context.

- **How does the choice of model size affect performance?**

  - Larger models can capture more complex patterns but require more data and computational power.
  - Smaller models are faster and require fewer resources but may underperform on complex tasks.

- **Why is T5's text-to-text framework advantageous?**

  - It provides a unified approach to various tasks, enabling the same architecture to be applied universally.
  - Simplifies the learning process and reduces the need for task-specific adjustments.

> **Note:** Understanding the nuances of these transformer architectures empowers practitioners to select appropriate models for their specific NLP tasks. As the field continues to develop, staying informed about these advancements is crucial for leveraging the full potential of transformer-based models in natural language understanding and generation.

## Using Transformers as Feature Extractors

Transformers have revolutionized the field of Natural Language Processing (NLP) by providing a powerful tool for extracting rich syntactical and contextual information from text data. These extracted features can be leveraged for a wide range of downstream tasks, such as classification, regression, and more. In this section, we will explore the concept of features and how transformers can be effectively applied as feature extractors.

### Understanding Features and Feature Extraction

Features are the distinctive properties or characteristics that are extracted from a dataset to capture its essential information. In the context of NLP, features can represent various aspects of text data, such as word frequencies, sentence structure, or semantic relationships. These features serve as the foundation for building accurate and insightful models for various tasks.

Feature extractors are sophisticated algorithms designed to automatically derive meaningful features from raw data, regardless of its modality (e.g., text, images, or audio). They are capable of identifying and capturing relevant patterns, structures, and relationships within the data, enabling more effective analysis and prediction.

### Transformers as Feature Extractors

Transformers are a class of neural networks that have proven to be exceptionally well-suited for extracting features from textual data. At their core, transformers operate by taking a sequence of words as input and generating a numeric vector representation that encapsulates the semantic meaning of the entire text.

When using transformers as feature extractors, the process involves feeding a word sequence into the transformer model and obtaining a dense numeric vector that captures the essential semantic information of the input text. This vector representation can then be used as input to other machine learning algorithms or neural networks, enabling them to perform various prediction and analysis tasks effectively.
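The step from per-token outputs to a single feature vector is usually a pooling operation. The sketch below shows two common choices, taking the first (`[CLS]`) position or averaging over the non-padding tokens. The `hidden_states` values are made up and have a hidden size of 3 for readability; in a real model they would come from the encoder's last layer.

```python
def cls_pooling(hidden_states):
    """Use the first token's vector (the [CLS] position in BERT-style models)."""
    return hidden_states[0]


def mean_pooling(hidden_states, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(hidden_states[0])
    n_real = sum(attention_mask)
    return [
        sum(vec[d] for vec, m in zip(hidden_states, attention_mask) if m) / n_real
        for d in range(dim)
    ]


# Hypothetical encoder output: [CLS], two real tokens, one [PAD]
hidden_states = [
    [0.5, 0.0, 1.0],  # [CLS]
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],  # [PAD]: must not influence the mean
]
attention_mask = [1, 1, 1, 0]

features_cls = cls_pooling(hidden_states)
features_mean = mean_pooling(hidden_states, attention_mask)

print(features_cls)   # [0.5, 0.0, 1.0]
print(features_mean)  # [1.5, 2.0, 3.0]
```

Either vector can then be passed to a separate classifier or regressor, which is the "feature extractor" usage described above.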

### Benefits and Limitations of Transformers as Feature Extractors

One of the key advantages of using transformers as feature extractors is their ability to handle textual data effectively, making them particularly useful for tasks such as text classification or regression. Transformers also excel at unsupervised learning, meaning they can extract meaningful features from unlabeled text data, reducing the reliance on labeled datasets.

Also, transformers have the capacity to capture complex semantic relationships and contextual information within the text, enabling them to generate rich and informative feature representations. This ability to capture elaborate patterns and dependencies makes transformers a powerful tool for various NLP applications.

However, it is important to note that transformers also have some limitations. They are computationally intensive and require substantial amounts of data for training. Additionally, transformers have a significant memory footprint due to the need to store the weights of the neural network. This can make them challenging to deploy on resource-constrained devices such as mobile phones or embedded systems with limited memory.

<br>

> While transformers have their limitations in terms of computational complexity and memory requirements, their benefits in terms of feature extraction and unsupervised learning make them an indispensable tool in the NLP toolkit. As research in this area continues to advance, we can expect to see further improvements and innovations in the use of transformers as feature extractors, pushing the boundaries of what is possible in natural language understanding and processing.

# General Steps for Using Transformers

To use transformers on a specific task, we need to follow these steps:

### Step 1: Start with a Pretrained Model

The first step is to select a pretrained model from the [Hugging Face Transformers library](https://huggingface.co/transformers/pretrained_models.html). These pretrained models have been trained on large amounts of text data and have learned general language representations. Using a pretrained model provides a strong foundation for your specific task.

*Note:* Training a model from scratch is an advanced topic and rarely necessary. In most cases, you can warm-start your model from a pretrained model. If you're interested in learning more about training your own model from scratch, refer to [this resource](https://huggingface.co/blog/how-to-train).

### Step 2: Fine-tune the Model on Domain-Specific Text (Optional)

Fine-tuning involves adapting a pretrained model to a new domain by training it on domain-specific text. This step is optional but can enhance the model's performance on your specific task. By exposing the model to text that is similar to your target domain, it can learn domain-specific language patterns and representations.

### Step 3: Train the Model for Your Task

Once you have a fine-tuned model (or a pretrained model if you skipped step 2), you can train it for your specific task. This typically involves adding a classification or regression head on top of the model and training it using your task-specific data.

Alternatively, you can use the model as a feature extractor for your task, which is a more advanced approach.

## Example: Classifying Court Decision Labels

Let's explore these steps using a subset of the [BrCAD-5](https://www.kaggle.com/datasets/eliasjacob/brcad5) dataset, which contains information on over 765,000 legal cases from Brazilian Federal Courts. Our goal is to train a model to predict the label for a court decision based on its text.

1. **Select a Pretrained Model**: We'll choose a suitable pretrained model from the Hugging Face Transformers library that aligns with our task requirements, such as language support and model architecture.

2. **Fine-tune the Model (Optional)**: If we have a sufficient amount of domain-specific text (legal case information in this case), we can fine-tune the pretrained model on this data to capture domain-specific language patterns.

3. **Train the Model for Label Prediction**: We'll add a classification head on top of the model and train it using the labeled court decision data from BrCAD-5. The model will learn to predict the appropriate label based on the text of the court decision.

<br>

> Remember, the key is to start with a strong pretrained model and adapt it to your specific task through fine-tuning and task-specific training.

## Load pretrained models

In [1]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Define the model checkpoints for the base and large versions of the BERT model
model_checkpoint_base = "neuralmind/bert-base-portuguese-cased"
model_checkpoint_large = "neuralmind/bert-large-portuguese-cased"

# Load the tokenizer for the base BERT model
# The tokenizer is responsible for converting text into tokens that the model can understand
tokenizer_base = AutoTokenizer.from_pretrained(model_checkpoint_base)

# Load the masked language model (MLM) for the base BERT model
# The MLM is used for tasks like predicting masked words in a sentence
model_mlm_base = AutoModelForMaskedLM.from_pretrained(model_checkpoint_base)

# Load the tokenizer for the large BERT model
# This tokenizer works similarly to the base tokenizer but is tailored for the large model
tokenizer_large = AutoTokenizer.from_pretrained(model_checkpoint_large)

# Load the masked language model (MLM) for the large BERT model
# This MLM is used for tasks like predicting masked words in a sentence, similar to the base model but with more parameters
model_mlm_large = AutoModelForMaskedLM.from_pretrained(model_checkpoint_large)


In [2]:
tokenizer_base.is_fast  # A fast tokenizer from HF Transformers uses Rust under the hood for faster tokenization

True

In [3]:
tokenizer_large.is_fast

True

In [4]:
# Define a function to count the number of trainable parameters in a model
def count_parameters(model):
    # Sum the number of elements (numel) for each parameter in the model
    # Only include parameters that require gradients (i.e., are trainable)
    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    # Print the number of trainable parameters in a human-readable format with commas
    print(f"The model has {n_parameters:,} trainable parameters")


# Count and print the number of trainable parameters for the base BERT model
count_parameters(model_mlm_base)

# Count and print the number of trainable parameters for the large BERT model
count_parameters(model_mlm_large)

The model has 108,954,466 trainable parameters
The model has 334,428,258 trainable parameters
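These totals can be reproduced by hand from the layer dimensions, which is a useful sanity check on where the parameters live. The sketch below assumes two things about `BertForMaskedLM` that are not visible in the numbers themselves: there is no pooler layer, and the MLM decoder shares its weight matrix with the word embeddings (so only its bias is counted). The function `bert_mlm_parameter_count` is our own helper, and the intermediate sizes 3072 and 4096 are those of the base and large checkpoints.

```python
def bert_mlm_parameter_count(vocab, hidden, layers, intermediate, max_pos=512, type_vocab=2):
    """Count the parameters of a BERT encoder plus its MLM head.

    Assumes no pooler and a decoder weight tied to the word embeddings.
    """
    layer_norm = 2 * hidden  # scale and bias

    # Word, position, and token-type embeddings, followed by a LayerNorm
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Query, key, value, and output projections (weights + biases), then LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (hidden -> intermediate -> hidden), then LayerNorm
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    encoder = layers * (attention + feed_forward)

    # Dense transform, LayerNorm, and the decoder bias (decoder weight is tied)
    mlm_head = (hidden * hidden + hidden) + layer_norm + vocab

    return embeddings + encoder + mlm_head


base = bert_mlm_parameter_count(vocab=29794, hidden=768, layers=12, intermediate=3072)
large = bert_mlm_parameter_count(vocab=29794, hidden=1024, layers=24, intermediate=4096)

print(f"{base:,}")   # 108,954,466
print(f"{large:,}")  # 334,428,258
```

In the base model, the encoder layers account for about 85 million parameters and the embeddings for about 23 million, so most of the growth from base to large comes from the deeper and wider encoder stack.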


In [5]:
model_mlm_base

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

> Above, you can see the model architecture summary of a BERT (Bidirectional Encoder Representations from Transformers) model specifically designed for masked language modeling (MLM) tasks. Let's break it down:
>
> 1. `BertForMaskedLM`: This is the main class that represents the BERT model for masked language modeling.
>
> 2. `BertModel`: This is the fundamental BERT model that consists of the following components:
>       - `BertEmbeddings`: This module handles the input embeddings, including word embeddings, position embeddings, and token type embeddings. It also applies layer normalization and dropout.
>       - `BertEncoder`: This is the main encoder component of BERT, which consists of a stack of `BertLayer` modules.
>       - `BertLayer`: Each layer in the encoder consists of a self-attention mechanism (`BertAttention`), an intermediate feed-forward network (`BertIntermediate`), and an output projection (`BertOutput`).
>       - `BertAttention`: This module performs self-attention on the input representations using query, key, and value linear transformations, followed by dropout.
>       - `BertIntermediate`: This is a feed-forward network with a GELU activation function.
>       - `BertOutput`: This module applies a dense linear transformation, layer normalization, and dropout to the output of the intermediate layer.
>
> 3. `BertOnlyMLMHead`: This module is specific to the masked language modeling task and consists of the following components:
>       - `BertLMPredictionHead`: This module performs the final prediction for the masked tokens.
>       - `BertPredictionHeadTransform`: This module applies a dense linear transformation, GELU activation, and layer normalization to the output of the BERT encoder.
>       - `decoder`: This is a linear layer that maps the transformed representations to the vocabulary size for predicting the masked tokens.
>
> The model architecture summary provides details about the dimensions of the embeddings, the number of layers in the encoder, and the sizes of the intermediate and output layers.
>

In [6]:
model_mlm_large

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-

> The main differences between the two BERT models are in the model size and architecture:
>
> 1. Embedding dimensions:
>       - In the first model, the word embeddings, position embeddings, and token type embeddings have a dimension of 768.
>       - In the second model, these embeddings have a dimension of 1024, indicating a larger embedding size.
>
> 2. Number of encoder layers:
>       - The first model has 12 encoder layers (`(0-11): 12 x BertLayer`).
>       - The second model has 24 encoder layers (`(0-23): 24 x BertLayer`)
>
> 3. Intermediate layer dimensions:
>       - In the first model, the intermediate layer (`BertIntermediate`) has an output dimension of 3072.
>       - In the second model, the intermediate layer has an output dimension of 4096, which is larger than the first model.
>
> 4. Hidden state dimensions:
>       - The first model uses hidden states with a dimension of 768 throughout the architecture, including the self-attention layers, intermediate layers, and output layers.
>       - The second model uses hidden states with a dimension of 1024 throughout the architecture.
>
> The rest of the architecture, including the self-attention mechanism, layer normalization, dropout, and the MLM head, remains the same between the two models.
>
> The large model has higher-dimensional embeddings, more encoder layers, and larger intermediate layer dimensions. This suggests that the large model has a higher capacity and can potentially capture more complex patterns and representations from the input data. However, the larger model size also means increased computational requirements and longer training times.

## Load the dataset and create a train/validation split

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the unlabeled dataset from a Parquet file
# Only the 'text' column is read from the file
df_unlabeled = pd.read_parquet("data/legal/unlabeled_texts.parquet", columns=["text"])

# Split the unlabeled dataset into training and validation sets
# 10% of the data is used for validation, and the split is reproducible with a fixed random state
df_unlabeled_train, df_unlabeled_valid = train_test_split(
    df_unlabeled, test_size=0.10, random_state=271828
)

# Display the shapes of the training and validation sets
# This shows the number of rows and columns in each set
df_unlabeled_train.shape, df_unlabeled_valid.shape

((58529, 1), (6504, 1))

In [8]:
import datasets

# Convert the pandas DataFrame containing the unlabeled training data into a Hugging Face Dataset
# This allows for easier manipulation and integration with Hugging Face's tools and models
dataset_unlabeled_train = datasets.Dataset.from_pandas(df_unlabeled_train)

# Convert the pandas DataFrame containing the unlabeled validation data into a Hugging Face Dataset
# This allows for easier manipulation and integration with Hugging Face's tools and models
dataset_unlabeled_valid = datasets.Dataset.from_pandas(df_unlabeled_valid)

In [9]:
dataset_unlabeled_train

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 58529
})

In [10]:
dataset_unlabeled_valid

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 6504
})

In [11]:
from pathlib import Path

# Define the path to save the outputs of the base BERT masked language model
path_to_save_lm_base = Path("./outputs/transformers_basics/bert_masked_lm_base")
# Create the directory (and any necessary parent directories) if it doesn't already exist
path_to_save_lm_base.mkdir(parents=True, exist_ok=True)

# Define the path to save the outputs of the large BERT masked language model
path_to_save_lm_large = Path("./outputs/transformers_basics/bert_masked_lm_large")
# Create the directory (and any necessary parent directories) if it doesn't already exist
path_to_save_lm_large.mkdir(parents=True, exist_ok=True)

## Fine-tune the Language Model on the Domain Text

Remember our transfer learning class. During this stage, the general-domain language model adapts itself to the idiosyncrasies of the domain-specific text. This is done by training the model on the domain-specific text. This step is optional, but it can improve the performance of the model on your task.

In [12]:
from functools import partial
from multiprocessing import cpu_count


def tokenize_function(examples, tokenizer):
    """
    Tokenizes the input text in the given examples using the tokenizer object.

    Args:
    - examples: A dictionary containing the input text to be tokenized.
    - tokenizer: The Hugging Face tokenizer used to tokenize the text.

    Returns:
    - A dictionary containing the tokenized input text.
    """
    result = tokenizer(examples["text"])  # Tokenize the input text
    if tokenizer.is_fast:
        # If the tokenizer is a fast tokenizer, add word IDs to the result
        result["word_ids"] = [
            result.word_ids(i) for i in range(len(result["input_ids"]))
        ]
    return result


# Create partial functions for tokenizing using the base and large tokenizers
# This allows us to pass the tokenizer as a fixed argument to the tokenize_function
tokenize_function_base = partial(tokenize_function, tokenizer=tokenizer_base)
tokenize_function_large = partial(tokenize_function, tokenizer=tokenizer_large)

# Tokenize the training dataset using the base tokenizer
# The map function applies the tokenize_function_base to each example in the dataset
# The batched=True argument processes the examples in batches for efficiency
# The remove_columns argument removes the specified columns from the dataset after tokenization
dataset_train_tokenized_mlm_base = dataset_unlabeled_train.map(
    tokenize_function_base, batched=True, remove_columns=["text", "__index_level_0__"]
)

# Tokenize the validation dataset using the base tokenizer
dataset_valid_tokenized_mlm_base = dataset_unlabeled_valid.map(
    tokenize_function_base, batched=True, remove_columns=["text", "__index_level_0__"]
)

# Tokenize the training dataset using the large tokenizer
dataset_train_tokenized_mlm_large = dataset_unlabeled_train.map(
    tokenize_function_large, batched=True, remove_columns=["text", "__index_level_0__"]
)

# Tokenize the validation dataset using the large tokenizer
dataset_valid_tokenized_mlm_large = dataset_unlabeled_valid.map(
    tokenize_function_large, batched=True, remove_columns=["text", "__index_level_0__"]
)

Map:   0%|          | 0/58529 [00:00<?, ? examples/s]

Map:   0%|          | 0/6504 [00:00<?, ? examples/s]

Map:   0%|          | 0/58529 [00:00<?, ? examples/s]

Map:   0%|          | 0/6504 [00:00<?, ? examples/s]

In [13]:
import numpy as np


def group_texts(examples):
    """
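    This function concatenates a batch of tokenized texts and splits the result into
    contiguous chunks of fixed length (the global `chunk_size`, which must be defined
    before this function is called). It's useful for training masked language models.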

    Args:
    - examples: A dictionary containing the examples to group. Each key corresponds to a feature,
                and each value is a list of lists of tokens.

    Returns:
    - A dictionary containing the grouped examples. Each key corresponds to a feature,
      and each value is a list of lists of tokens.
    """
    # Concatenate all texts for each feature
    concatenated_examples = {k: np.concatenate(examples[k]) for k in examples.keys()}

    # Compute the total length of the concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # Adjust the total length to be a multiple of chunk_size, dropping the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size

    # Split the concatenated texts into chunks of size chunk_size using NumPy
    result = {
        k: np.split(t[:total_length], total_length // chunk_size)
        for k, t in concatenated_examples.items()
    }

    # Create a new 'labels' column that is a copy of the 'input_ids' column
    result["labels"] = result["input_ids"].copy()

    return result


# Define the chunk size for grouping texts
chunk_size = 512

# Apply the group_texts function to the tokenized training dataset for the base BERT model
dataset_train_tokenized_mlm_base = dataset_train_tokenized_mlm_base.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

# Apply the group_texts function to the tokenized validation dataset for the base BERT model
dataset_valid_tokenized_mlm_base = dataset_valid_tokenized_mlm_base.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

# Apply the group_texts function to the tokenized training dataset for the large BERT model
dataset_train_tokenized_mlm_large = dataset_train_tokenized_mlm_large.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

# Apply the group_texts function to the tokenized validation dataset for the large BERT model
dataset_valid_tokenized_mlm_large = dataset_valid_tokenized_mlm_large.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

Map:   0%|          | 0/58529 [00:00<?, ? examples/s]

Map:   0%|          | 0/6504 [00:00<?, ? examples/s]

Map:   0%|          | 0/58529 [00:00<?, ? examples/s]

Map:   0%|          | 0/6504 [00:00<?, ? examples/s]
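The grouping step is easier to follow on a tiny example. The sketch below is a pure-Python analogue of `group_texts` for a single feature, with a chunk size of 4 instead of 512; `group_into_chunks` is a hypothetical helper written for illustration and is not the function used in the cell above.

```python
def group_into_chunks(token_lists, chunk_size):
    """Concatenate tokenized documents and cut them into fixed-size chunks.

    The trailing remainder shorter than chunk_size is dropped.
    """
    flat = [tok for doc in token_lists for tok in doc]
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i : i + chunk_size] for i in range(0, usable, chunk_size)]


docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]  # three "documents" of token ids
chunks = group_into_chunks(docs, chunk_size=4)

print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Two things are worth noticing: a chunk can straddle document boundaries (the first chunk ends with a token from the second document), and the leftover tokens `9, 10` are discarded. Since `map` is called with `batched=True`, this happens once per batch of examples, so a small remainder is lost for every batch rather than once for the whole dataset.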

In [14]:
from transformers import DataCollatorForLanguageModeling

# Create a data collator for masked language modeling (MLM) using the base BERT tokenizer
# The data collator will dynamically mask tokens in the input text with a probability of 0.15
data_collator_mlm_base = DataCollatorForLanguageModeling(
    tokenizer=tokenizer_base, mlm_probability=0.15
)

# Create a data collator for masked language modeling (MLM) using the large BERT tokenizer
# The data collator will dynamically mask tokens in the input text with a probability of 0.15
data_collator_mlm_large = DataCollatorForLanguageModeling(
    tokenizer=tokenizer_large, mlm_probability=0.15
)
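To get an intuition for what the collator does to each batch, here is a simplified pure-Python sketch of the masking recipe: about 15% of the tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. Unselected positions receive the label `-100`, which the loss ignores. The function `mlm_mask`, the token ids, and `mask_id=103` are illustrative assumptions, and the real collator additionally avoids masking special tokens.

```python
import random


def mlm_mask(input_ids, mask_id, vocab_size, rng, mlm_probability=0.15):
    """Return (masked_inputs, labels) following the 80/10/10 MLM recipe."""
    inputs, labels = list(input_ids), []
    for i, token_id in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            labels.append(-100)  # not selected: ignored by the loss
            continue
        labels.append(token_id)  # selected: the model must recover this id
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token as input
    return inputs, labels


rng = random.Random(0)
input_ids = list(range(1000, 1020))  # hypothetical token ids
masked, labels = mlm_mask(input_ids, mask_id=103, vocab_size=29794, rng=rng)

print(masked)
print(labels)
```

Because the masking is applied when a batch is assembled, the same chunk gets a different mask each time it is seen, which is the dynamic masking strategy discussed for RoBERTa.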

In [15]:
from transformers import TrainingArguments

# Define the batch size for training and evaluation using the base BERT model
batch_size_base = 20

# Extract the model name from the model checkpoint path for the base BERT model
model_name_base = model_checkpoint_base.split("/")[-1]

# Set up the training arguments for fine-tuning the base BERT model on a masked language modeling task
training_args_mlm_base = TrainingArguments(
    output_dir=path_to_save_lm_base
    / f"{model_name_base}-finetuned-mlm",  # Directory to save the model checkpoints
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
    learning_rate=5e-5,  # Learning rate for the optimizer
    weight_decay=0.01,  # Weight decay for regularization
    per_device_train_batch_size=batch_size_base,  # Batch size for training
    per_device_eval_batch_size=batch_size_base,  # Batch size for evaluation
    bf16=True,  # Use bfloat16 mixed precision (set fp16=True instead on GPUs without bfloat16 support, e.g., free-tier GPUs)
    num_train_epochs=3,  # Number of training epochs
    save_total_limit=1,  # Limit the total number of saved checkpoints
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
    logging_steps=1,  # Log the training loss after every optimizer step
    eval_steps=1,  # Ignored here: only applies when eval_strategy="steps"
    save_steps=1,  # Ignored here: only applies when save_strategy="steps"
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model="eval_loss",  # Metric to use for selecting the best model
    greater_is_better=False,  # Lower evaluation loss is better
    gradient_accumulation_steps=3,  # Number of gradient accumulation steps
    seed=271828,  # Random seed for reproducibility
)

In [16]:
from transformers import Trainer

# Initialize the Trainer for the base BERT model
# The Trainer class provides an easy-to-use API for training and evaluating models
trainer_mlm_base = Trainer(
    model=model_mlm_base,  # The model to be trained (base BERT masked language model)
    args=training_args_mlm_base,  # Training arguments defined earlier
    train_dataset=dataset_train_tokenized_mlm_base,  # Tokenized training dataset
    eval_dataset=dataset_valid_tokenized_mlm_base,  # Tokenized validation dataset
    data_collator=data_collator_mlm_base,  # Data collator for dynamically masking tokens
    tokenizer=tokenizer_base,  # Tokenizer for processing the input text
)

In [17]:
# This took around 5 hours to train on 2 x NVIDIA RTX 3090 GPUs
trainer_mlm_base.train()

There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


TrainOutput(global_step=6786, training_loss=1.7849110120035188e+29, metrics={'train_runtime': 19426.6794, 'train_samples_per_second': 41.93, 'train_steps_per_second': 0.349, 'total_flos': 2.1433112200101888e+17, 'train_loss': 1.7849110120035188e+29, 'epoch': 2.9994107248084854})

In [18]:
# Save the trained model
trainer_mlm_base.save_model(path_to_save_lm_base / f"{model_name_base}-finetuned-mlm")
tokenizer_base.save_pretrained(
    path_to_save_lm_base / f"{model_name_base}-finetuned-mlm"
)

trainer_mlm_base.evaluate()



{'eval_loss': 0.38383054733276367,
 'eval_runtime': 297.876,
 'eval_samples_per_second': 100.119,
 'eval_steps_per_second': 2.504,
 'epoch': 2.9994107248084854}

In [19]:
print(path_to_save_lm_base / f"{model_name_base}-finetuned-mlm")

outputs/transformers_basics/bert_masked_lm_base/bert-base-portuguese-cased-finetuned-mlm


In [20]:
import gc
import torch

# Set the trainer, model, and tokenizer for the base BERT model to None
# This helps free up memory by removing references to these objects
trainer_mlm_base = None
model_mlm_base = None
tokenizer_base = None

# Force garbage collection to free up memory
gc.collect()

# Clear the CUDA memory cache to free up GPU memory
torch.cuda.empty_cache()

In [21]:
from transformers import TrainingArguments

# Define the batch size for training and evaluation
batch_size_large = 14

# Extract the model name from the model checkpoint path
# This will be used to name the output directory for the trained model
model_name_large = model_checkpoint_large.split("/")[-1]

# Define the training arguments for the large masked language model (MLM)
training_args_mlm_large = TrainingArguments(
    output_dir=path_to_save_lm_large
    / f"{model_name_large}-finetuned-mlm",  # Output directory for the trained model
    overwrite_output_dir=True,  # Overwrite the output directory if it already exists
    learning_rate=5e-5,  # Learning rate for the optimizer
    weight_decay=0.01,  # Weight decay for regularization
    per_device_train_batch_size=batch_size_large,  # Batch size for training
    per_device_eval_batch_size=batch_size_large,  # Batch size for evaluation
    bf16=True,  # Use bfloat16 mixed precision (set fp16=True instead on GPUs without bfloat16 support, e.g., free-tier GPUs)
    num_train_epochs=3,  # Number of training epochs
    save_total_limit=1,  # Limit the total number of checkpoints and delete the older ones
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
    logging_steps=1,  # Log the training loss after every optimizer step
    eval_steps=1,  # Ignored here: only applies when eval_strategy="steps"
    save_steps=1,  # Ignored here: only applies when save_strategy="steps"
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model="eval_loss",  # Use the evaluation loss to determine the best model
    greater_is_better=False,  # Lower evaluation loss is better
    gradient_accumulation_steps=4,  # Number of steps to accumulate gradients before updating the model parameters
    seed=271828,  # Random seed for reproducibility
)
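
The base and large configurations use different per-device batch sizes and accumulation steps, so it helps to work out how many samples feed each optimizer update. The sketch below assumes that both of the 2 GPUs mentioned in the training comments are used; `N_GPUS` and `effective_batch_size` are names introduced here for illustration.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, n_devices: int) -> int:
    """Number of samples that contribute to a single optimizer update."""
    return per_device * accumulation_steps * n_devices


N_GPUS = 2  # assumption: both GPUs are used by the Trainer

# Values taken from the two TrainingArguments definitions in this notebook
base_effective = effective_batch_size(per_device=20, accumulation_steps=3, n_devices=N_GPUS)
large_effective = effective_batch_size(per_device=14, accumulation_steps=4, n_devices=N_GPUS)

print(base_effective)   # 120
print(large_effective)  # 112
```

The smaller per-device batch of the large model is offset by one extra accumulation step, which keeps the two effective batch sizes close.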

In [22]:
from transformers import Trainer

# Initialize the Trainer for the large masked language model (MLM)
trainer_mlm_large = Trainer(
    model=model_mlm_large,  # The pre-trained large BERT model for masked language modeling
    args=training_args_mlm_large,  # The training arguments defined earlier for the large model
    train_dataset=dataset_train_tokenized_mlm_large,  # The tokenized training dataset for the large model
    eval_dataset=dataset_valid_tokenized_mlm_large,  # The tokenized validation dataset for the large model
    data_collator=data_collator_mlm_large,  # The data collator for dynamic masking during training
    tokenizer=tokenizer_large,  # The tokenizer used to process the input text for the large model
)

In [23]:
# Train the large masked language model (MLM)
# This process involves multiple epochs of training on the training dataset
# Note: This training process took almost 14 hours on 2 x NVIDIA RTX 3090 GPUs
trainer_mlm_large.train()




There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


TrainOutput(global_step=7272, training_loss=0.41708476434495584, metrics={'train_runtime': 49389.7092, 'train_samples_per_second': 16.492, 'train_steps_per_second': 0.147, 'total_flos': 7.590524853366497e+17, 'train_loss': 0.41708476434495584, 'epoch': 2.999381315735203})

In [24]:
# Save the trained large masked language model (MLM) to the specified directory
trainer_mlm_large.save_model(
    path_to_save_lm_large / f"{model_name_large}-finetuned-mlm"
)

# Save the tokenizer used for the large MLM to the same directory
tokenizer_large.save_pretrained(
    path_to_save_lm_large / f"{model_name_large}-finetuned-mlm"
)

# Evaluate the trained large MLM on the validation dataset
# This will return a dictionary containing the evaluation metrics
trainer_mlm_large.evaluate()



{'eval_loss': 0.30418580770492554,
 'eval_runtime': 714.0408,
 'eval_samples_per_second': 41.767,
 'eval_steps_per_second': 1.493,
 'epoch': 2.999381315735203}

In [25]:
print(path_to_save_lm_large / f"{model_name_large}-finetuned-mlm")

outputs/transformers_basics/bert_masked_lm_large/bert-large-portuguese-cased-finetuned-mlm


## Assessing Language Models with Perplexity

To evaluate the performance of a language model, it's essential to measure how effectively it predicts words in a sentence. One key metric for this purpose is **perplexity**, which provides insight into the model's ability to understand and generate language.

### Understanding Perplexity

**Perplexity** is a quantitative measure of how well a probability model predicts a sample of data. In the context of language models, it assesses how "surprised" the model is when encountering new data. Essentially, perplexity measures the model's uncertainty when predicting a word from its context: the next word for an autoregressive model, or a masked word for a masked language model such as BERT.

- **Lower Perplexity**: Indicates that the model is less surprised by the test data, suggesting it has a good grasp of the language patterns.
- **Higher Perplexity**: Implies that the model is more surprised by the test data, indicating less effective learning.

> **Note:** Perplexity can be thought of as the effective number of equally likely choices the model is weighing when it predicts a word.

### Mathematical Definition of Perplexity

Perplexity is defined as the exponential of the average cross-entropy loss, the same loss that is minimized during training. For a language model that assigns a probability $ p(w_i) $ to each word $ w_i $ in a test set of $ N $ words, perplexity is calculated as:

$$ \text{Perplexity} = e^{\text{loss}} $$

Where:

- $ e $ is Euler's number (approximately 2.71828).
- $ \text{loss} $ is the average cross-entropy loss over the test set.

Expanding the cross-entropy loss, we have:

$$ \text{loss} = -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i) $$

Therefore, perplexity becomes:

$$ \text{Perplexity} = e^{ -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i) } $$
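
As a small worked example of the formula above, the sketch below computes perplexity from the probabilities a model assigned to the correct words. The four probabilities are made up for illustration and do not come from the models trained in this notebook.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the correct tokens."""
    n = len(token_probs)
    # Average cross-entropy loss in nats: -(1/N) * sum(log p(w_i))
    loss = -sum(math.log(p) for p in token_probs) / n
    return math.exp(loss)


# Hypothetical probabilities the model gave to four correct words
probs = [0.5, 0.25, 0.5, 0.25]
ppl = perplexity(probs)
print(round(ppl, 4))  # 2.8284
```

The result equals the geometric mean of the inverse probabilities $ 1/p(w_i) $, here $ \sqrt{8} \approx 2.83 $, so a single very unlikely word raises perplexity noticeably.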

#### Relationship to Entropy

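Entropy $ H $ measures the average uncertainty inherent in the possible outcomes of a random variable. For a probability distribution $ p $ over a vocabulary $ V $:

$$ H(p) = -\sum_{w \in V} p(w) \log p(w) $$

The perplexity of the distribution $ p $ is the exponential of its entropy:

$$ \text{Perplexity} = e^{H(p)} $$

This formulation shows that perplexity reflects uncertainty: the higher the entropy, the higher the perplexity. The loss-based definition applies the same idea to the cross-entropy between the data and the model, which is never smaller than the entropy of the data and equals it only when the model's predictions match the true distribution.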

### Interpreting Perplexity

Perplexity provides an interpretable metric:

- If a language model has a perplexity of **10**, it is as uncertain as if it had to choose uniformly among **10** possible words at each step.
- Lower perplexity values mean the model's predictions are closer to the actual data distribution.
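
The "10 possible words" reading can be checked numerically. In this minimal sketch, both distributions are hypothetical: one spreads probability uniformly over 10 words, the other puts 90% of the mass on a single word.

```python
import math


def entropy_nats(dist):
    """Entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)


uniform_10 = [0.1] * 10          # ten equally likely words
peaked = [0.9] + [0.1 / 9] * 9   # one dominant word, nine unlikely ones

ppl_uniform = math.exp(entropy_nats(uniform_10))
ppl_peaked = math.exp(entropy_nats(peaked))

print(round(ppl_uniform, 4))  # 10.0
print(round(ppl_peaked, 4))   # about 1.72
```

Both distributions cover the same 10 words, yet the peaked one behaves as if fewer than 2 options were in play.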

#### Example Analogy

Consider a language model trying to predict the next word in a sentence:

- **High Perplexity Scenario**: The model considers many plausible words (e.g., it could be any one of 100 words), indicating high uncertainty.
- **Low Perplexity Scenario**: The model confidently narrows down the next word to a few options (e.g., between 2 or 3 words), showing low uncertainty.

### Choice of Logarithm Base

The base of the logarithm used in entropy and perplexity calculations affects the units of measurement:

- **Base-2 Logarithm ($ \log_2 $)**: Measures entropy in **bits**. Common in information theory, reflecting binary decisions.
  
  $$ H_2(p) = -\sum_{w \in V} p(w) \log_2 p(w) $$

- **Natural Logarithm ($ \ln $ or $ \log_e $)**: Measures entropy in **nats**, leveraging mathematical properties of $ e $.

  $$ H_e(p) = -\sum_{w \in V} p(w) \ln p(w) $$

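> **Important:** Changing the logarithm base only rescales entropy by a constant factor (1 nat ≈ 1.443 bits). Perplexity itself is unaffected as long as the exponentiation uses the same base as the logarithm, since $ 2^{H_2(p)} = e^{H_e(p)} $. Keep the base consistent when comparing entropy or loss values across models.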
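
A quick numerical check of this point, using a made-up four-word distribution: the entropy values differ by a constant factor between bases, while the perplexity comes out the same provided the exponentiation matches the logarithm.

```python
import math

dist = [0.5, 0.25, 0.125, 0.125]  # hypothetical distribution over four words

h_bits = -sum(p * math.log2(p) for p in dist)  # entropy in bits
h_nats = -sum(p * math.log(p) for p in dist)   # entropy in nats

ppl_from_bits = 2 ** h_bits        # base 2 throughout
ppl_from_nats = math.exp(h_nats)   # base e throughout

print(h_bits)                    # 1.75
print(round(h_nats, 4))          # 1.213
print(round(ppl_from_bits, 4))   # 3.3636
print(round(ppl_from_nats, 4))   # 3.3636
```

Mixing the bases, for example computing $ e^{H_2(p)} $, would give a different and incorrect value.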

### Addressing Potential Misconceptions

- **Perplexity vs. Accuracy**: Lower perplexity does not necessarily translate into higher accuracy in practical applications. Perplexity measures how well the model's probability distribution fits the data, not the correctness of specific predictions.
  
- **Perplexity Values**: There is no "ideal" perplexity value universally applicable. It's context-dependent and should be compared relative to other models or baselines on the same dataset.

- **Impact of Vocabulary Size**: A larger vocabulary can lead to higher perplexity simply because there are more possible words to predict. Techniques like subword tokenization can help mitigate this effect. For the same reason, perplexities computed with different tokenizers are not directly comparable.

### Practical Considerations

- **Model Comparison**: When evaluating different language models, perplexity allows for quantifiable comparisons of their predictive capabilities.
  
- **Training and Evaluation**: Monitoring perplexity during training helps identify overfitting:
  - **Decreasing Training Perplexity with Increasing Validation Perplexity**: May indicate overfitting to the training data.
  - **Consistent Perplexity Across Datasets**: Suggests the model generalizes well.

- **Units Consistency**: Report entropy or loss in a consistent unit (bits vs. nats), and exponentiate with the matching base when converting to perplexity.
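
The overfitting pattern described above can be detected with a few lines of code. This is a minimal sketch: `overfitting_epochs` is a helper introduced here, and the per-epoch loss values are invented for illustration rather than taken from the training runs in this notebook.

```python
import math


def overfitting_epochs(train_losses, valid_losses):
    """Return the 1-based epochs where training perplexity fell while validation perplexity rose."""
    train_ppl = [math.exp(loss) for loss in train_losses]
    valid_ppl = [math.exp(loss) for loss in valid_losses]
    flagged = []
    for i in range(1, len(train_ppl)):
        train_improved = train_ppl[i] < train_ppl[i - 1]
        valid_worsened = valid_ppl[i] > valid_ppl[i - 1]
        if train_improved and valid_worsened:
            flagged.append(i + 1)
    return flagged


# Hypothetical average cross-entropy losses (in nats) for four epochs
train = [0.60, 0.45, 0.35, 0.28]
valid = [0.55, 0.48, 0.50, 0.56]

flagged = overfitting_epochs(train, valid)
print(flagged)  # [3, 4]
```

Since the exponential is monotonic, comparing the losses directly flags the same epochs; converting to perplexity only makes the numbers easier to interpret.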

In [26]:
import gc
import torch

# Set the trainer for the large masked language model (MLM) to None to free up memory
trainer_mlm_large = None

# Set the large masked language model (MLM) to None to free up memory
model_mlm_large = None

# Set the tokenizer for the large MLM to None to free up memory
tokenizer_large = None

# Collect garbage to free up memory
gc.collect()

# Empty the CUDA cache to free up GPU memory
torch.cuda.empty_cache()

In [64]:
import math

print(f"The perplexity for the base model is {math.exp(0.38383054733276367)}")
print(f"The perplexity for the large model is {math.exp(0.30418580770492554)}")

The perplexity for the base model is 1.4678966815981307
The perplexity for the large model is 1.3555208989189849


In [28]:
import os
from pathlib import Path
from transformers import pipeline, AutoModelForMaskedLM, AutoTokenizer

# Directory where the fine-tuned base masked language model (MLM) was saved
path_to_save_lm_base = Path("./outputs/transformers_basics/bert_masked_lm_base")

# Directory where the fine-tuned large masked language model (MLM) was saved
path_to_save_lm_large = Path("./outputs/transformers_basics/bert_masked_lm_large")

In [29]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the fine-tuned base masked language model (MLM) from the specified directory
# This model is a BERT base model fine-tuned on a Portuguese dataset
model_base = AutoModelForMaskedLM.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Load the tokenizer for the fine-tuned base MLM from the same directory
# The tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Load the fine-tuned large masked language model (MLM) from the specified directory
# This model is a BERT large model fine-tuned on a Portuguese dataset
model_large = AutoModelForMaskedLM.from_pretrained(
    path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm"
)

# Load the tokenizer for the fine-tuned large MLM from the same directory
# The tokenizer is used to preprocess the input text for the large model
tokenizer_large = AutoTokenizer.from_pretrained(
    path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm"
)

In [30]:
from transformers import pipeline

# Create a pipeline for the base masked language model (MLM)
# The pipeline is used to fill in the masked tokens in the input text
# 'fill-mask' specifies the task type for the pipeline
# model_base is the fine-tuned base MLM model
# tokenizer_base is the tokenizer for the base MLM model
# top_k=5 specifies that the top 5 predictions for the masked token will be returned
pipe_base = pipeline("fill-mask", model=model_base, tokenizer=tokenizer_base, top_k=5)

# Create a pipeline for the large masked language model (MLM)
pipe_large = pipeline(
    "fill-mask", model=model_large, tokenizer=tokenizer_large, top_k=5
)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [31]:
pipe_base("O artigo 121 do Código Penal prevê o crime de [MASK]")

[{'score': 0.9297154545783997,
  'token': 131,
  'token_str': ':',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de :'},
 {'score': 0.01481659710407257,
  'token': 21982,
  'token_str': 'homicídio',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de homicídio'},
 {'score': 0.006103144492954016,
  'token': 18144,
  'token_str': 'roubo',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de roubo'},
 {'score': 0.002901835599914193,
  'token': 1112,
  'token_str': '“',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de “'},
 {'score': 0.0020643649622797966,
  'token': 12244,
  'token_str': 'assalto',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de assalto'}]

In [32]:
pipe_large("O artigo 121 do Código Penal prevê o crime de [MASK]")

[{'score': 0.9842374920845032,
  'token': 131,
  'token_str': ':',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de :'},
 {'score': 0.006213649641722441,
  'token': 21982,
  'token_str': 'homicídio',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de homicídio'},
 {'score': 0.0011775112943723798,
  'token': 119,
  'token_str': '.',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de.'},
 {'score': 0.0009778104722499847,
  'token': 1386,
  'token_str': 'morte',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de morte'},
 {'score': 0.0006957394652999938,
  'token': 9566,
  'token_str': 'corrupção',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de corrupção'}]

In [33]:
pipe_base(
    "O Código de Processo Civil prevê prazo em [MASK] para interposição de recurso pela Fazenda Pública"
)

[{'score': 0.31187960505485535,
  'token': 17225,
  'token_str': 'julgado',
  'sequence': 'O Código de Processo Civil prevê prazo em julgado para interposição de recurso pela Fazenda Pública'},
 {'score': 0.282763808965683,
  'token': 21244,
  'token_str': 'dobro',
  'sequence': 'O Código de Processo Civil prevê prazo em dobro para interposição de recurso pela Fazenda Pública'},
 {'score': 0.17378923296928406,
  'token': 5370,
  'token_str': 'aberto',
  'sequence': 'O Código de Processo Civil prevê prazo em aberto para interposição de recurso pela Fazenda Pública'},
 {'score': 0.09018026292324066,
  'token': 3418,
  'token_str': 'curso',
  'sequence': 'O Código de Processo Civil prevê prazo em curso para interposição de recurso pela Fazenda Pública'},
 {'score': 0.03861454129219055,
  'token': 4712,
  'token_str': 'branco',
  'sequence': 'O Código de Processo Civil prevê prazo em branco para interposição de recurso pela Fazenda Pública'}]

In [34]:
pipe_large(
    "O Código de Processo Civil prevê prazo em [MASK] para interposição de recurso pela Fazenda Pública"
)

[{'score': 0.6611300110816956,
  'token': 2241,
  'token_str': 'lei',
  'sequence': 'O Código de Processo Civil prevê prazo em lei para interposição de recurso pela Fazenda Pública'},
 {'score': 0.27480894327163696,
  'token': 21244,
  'token_str': 'dobro',
  'sequence': 'O Código de Processo Civil prevê prazo em dobro para interposição de recurso pela Fazenda Pública'},
 {'score': 0.016465405002236366,
  'token': 2502,
  'token_str': 'Lei',
  'sequence': 'O Código de Processo Civil prevê prazo em Lei para interposição de recurso pela Fazenda Pública'},
 {'score': 0.011941857635974884,
  'token': 5370,
  'token_str': 'aberto',
  'sequence': 'O Código de Processo Civil prevê prazo em aberto para interposição de recurso pela Fazenda Pública'},
 {'score': 0.00545860268175602,
  'token': 20554,
  'token_str': 'razoável',
  'sequence': 'O Código de Processo Civil prevê prazo em razoável para interposição de recurso pela Fazenda Pública'}]

## Train our Document Classifier Using Our Fine-Tuned Language Model

### Understanding the Language Model Output Structure

Before diving into the details of document classification, it's essential to grasp the structure of the output from the language model. For each input sequence, the output is a matrix of shape `max_tokens` x `embedding_dimension`. Taking BERT-base as an example, the embedding dimension is 768, so every token in the input text is represented by a vector of size 768.

Feeding the entire matrix to a classifier is usually impractical: with 512 tokens and 768 dimensions, that is 393,216 values per document. Instead, we use the vector corresponding to the `[CLS]` token.

### The Significance of the `[CLS]` Token

The `[CLS]` token is a special token placed before the input text; in BERT models, its final hidden state serves as a summary representation of the entire input. For BERT-base, this is a single vector of size 768, which is far more manageable than the full matrix of token vectors. `[CLS]` is short for "classification", and the token is designed specifically for classification tasks.

Here's an example to illustrate the usage of the `[CLS]` token:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
outputs = tokenizer('Eu gosto muito de farofa')
tokenizer.decode(outputs['input_ids'])
```

Resulting output: `'[CLS] Eu gosto muito de farofa [SEP]'`

In the above output, the `[CLS]` token is added to the start of the input text, while the `[SEP]` token is appended to the end. For classification purposes, we only need the `[CLS]` token and can ignore `[SEP]`. The role of `[SEP]` in BERT is to separate two segments (for example, a pair of sentences); since our input is a single text, it merely marks the end of the sequence here.

### Implementing Classification Using the `[CLS]` Token

Now that we know where the `[CLS]` token sits, we can use its vector as input for our classifier. The classifier outputs a vector of size `num_labels` (one score, or logit, per label), where `num_labels` is the number of labels in our dataset. For example, with 4 labels, the classifier outputs a vector of size 4.

The loss is computed from this output vector: a softmax turns the scores into predicted label probabilities, the cross-entropy loss compares them with the actual labels, and the resulting gradients update the model's weights during training.

### Putting It All Together

To summarize, the process of document classification using a fine-tuned language model involves the following steps:

1. Tokenize the input text and add the `[CLS]` token at the beginning.
2. Pass the tokenized input through the language model to obtain the output vector.
3. Extract the vector corresponding to the `[CLS]` token.
4. Use the `[CLS]` token vector as input for the classifier.
5. Obtain the classifier's output vector and convert it into predicted label probabilities with a softmax.
6. Calculate the loss by comparing the predicted probabilities with the actual labels.
7. Update the model's weights based on the calculated loss to improve its performance.
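
Steps 3 to 6 can be sketched in plain Python. This is a toy illustration under stated assumptions, not the code path of the model loaded later in the notebook: the hidden size is 2 instead of 768, the hidden states, weights, and bias are invented, and the pooler layer that `BertForSequenceClassification` applies to the `[CLS]` vector before the classifier is omitted.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(hidden_states, weights, bias):
    """Apply a linear head to the [CLS] vector (position 0) and return label probabilities."""
    cls_vector = hidden_states[0]
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def cross_entropy(probs, true_label):
    """Loss for one example: negative log-probability of the correct label."""
    return -math.log(probs[true_label])


# Toy model output: 3 tokens, hidden size 2, with [CLS] first
hidden_states = [[1.0, -1.0], [0.3, 0.2], [0.0, 0.5]]
# Toy classification head for 4 labels (one weight row per label)
weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]

probs = classify_from_cls(hidden_states, weights, bias)
predicted = max(range(len(probs)), key=probs.__getitem__)
loss = cross_entropy(probs, true_label=0)

print(predicted)        # 0
print(round(loss, 4))   # 0.1852
```

Only the first row of `hidden_states` is used; the other token vectors influence the prediction solely through self-attention inside the language model.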

In [35]:
import os
from pathlib import Path
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoConfig,
)

In [36]:
import pandas as pd
from pathlib import Path

# Load the training dataset from a Parquet file
# Only the 'text' and 'label' columns are read from the file
df_train = pd.read_parquet("data/legal/train.parquet", columns=["text", "label"])

# Load the validation dataset from a Parquet file
# Only the 'text' and 'label' columns are read from the file
df_valid = pd.read_parquet("data/legal/valid.parquet", columns=["text", "label"])

# Directory where the fine-tuned base masked language model (MLM) was saved
path_to_save_lm_base = Path("./outputs/transformers_basics/bert_masked_lm_base")

# Directory where the fine-tuned large masked language model (MLM) was saved
path_to_save_lm_large = Path("./outputs/transformers_basics/bert_masked_lm_large")

# Display the shapes of the training and validation datasets
# This shows the number of rows and columns in each dataset
df_train.shape, df_valid.shape

((52026, 2), (13007, 2))

In [37]:
# Create a dictionary that maps each unique label in the training dataset to an integer ID
# df_train.label.unique() returns the labels in order of first appearance,
# and enumerate() pairs each label with its position
label2id = {label: i for i, label in enumerate(df_train.label.unique())}

# Create a dictionary to map each unique ID back to its corresponding label
# This is the reverse mapping of the label2id dictionary
# The dictionary comprehension iterates over the items in label2id and swaps the keys and values
id2label = {v: k for k, v in label2id.items()}

# Display the label-to-ID and ID-to-label mappings
label2id, id2label

({'IMPROCEDENTE': 0,
  'PROCEDENTE': 1,
  'PARCIALMENTE PROCEDENTE': 2,
  'EXTINTO SEM MÉRITO': 3},
 {0: 'IMPROCEDENTE',
  1: 'PROCEDENTE',
  2: 'PARCIALMENTE PROCEDENTE',
  3: 'EXTINTO SEM MÉRITO'})

In [38]:
# Map the labels in the training dataset to their corresponding IDs
# This replaces the label names with their respective IDs using the label2id dictionary
df_train["label"] = df_train["label"].map(label2id)

# Map the labels in the validation dataset to their corresponding IDs
# This replaces the label names with their respective IDs using the label2id dictionary
df_valid["label"] = df_valid["label"].map(label2id)

# Display the first few rows of the training dataset
# This shows the updated training dataset with labels replaced by their corresponding IDs
df_train.head()

|       | text | label |
|-------|------|-------|
| 1387  | SENTENÇA Vistos etc. Dispensado o relatório, a... | 0 |
| 17972 | SENTENÇA Relatório dispensado. No caso, não há... | 0 |
| 34527 | SENTENÇA Vistos etc. Trata-se de pedido de res... | 1 |
| 58381 | TERMO DE AUDIÊNCIA DE INSTRUÇÃO Ação Especial ... | 1 |
| 56474 | SENTENÇA Trata-se de ação em que a parte autor... | 2 |


In [39]:
import datasets

# Convert the training DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for training and evaluation
dataset_labeled_train = datasets.Dataset.from_pandas(df_train)

# Convert the validation DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for validation and evaluation
dataset_labeled_valid = datasets.Dataset.from_pandas(df_valid)

In [40]:
from transformers import AutoTokenizer
from functools import partial

# Load the tokenizer for the fine-tuned base masked language model (MLM)
# This tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Load the tokenizer for the fine-tuned large masked language model (MLM)
# This tokenizer is used to preprocess the input text for the large model
tokenizer_large = AutoTokenizer.from_pretrained(
    path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm"
)


# Define a function to preprocess the input examples using a specified tokenizer
# The function tokenizes the input text, truncates it to a maximum length of 512 tokens,
# and pads the sequences to ensure they are of equal length
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)


# Create a partial function for preprocessing using the base tokenizer
preprocess_function_base = partial(preprocess_function, tokenizer=tokenizer_base)

# Create a partial function for preprocessing using the large tokenizer
preprocess_function_large = partial(preprocess_function, tokenizer=tokenizer_large)
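
With `truncation=True` and the tokenizer's default settings, tokens beyond `max_length` are cut from the end of the text. Whether the beginning or the end of a document carries the signal is a property of the data, so it can be worth comparing both. The sketch below illustrates the two options on a plain list; `truncate` is a hypothetical helper, and it ignores the slots a real tokenizer reserves for `[CLS]` and `[SEP]`.

```python
def truncate(tokens, max_length, side="right"):
    """Keep at most max_length tokens, dropping the excess from the chosen side."""
    if side not in ("right", "left"):
        raise ValueError("side must be 'right' or 'left'")
    if len(tokens) <= max_length:
        return list(tokens)
    if side == "right":
        return list(tokens[:max_length])  # drop the end, keep the beginning
    return list(tokens[-max_length:])     # drop the beginning, keep the end


tokens = list(range(10))  # stand-in for a tokenized document

kept_right = truncate(tokens, 4, side="right")
kept_left = truncate(tokens, 4, side="left")

print(kept_right)  # [0, 1, 2, 3]
print(kept_left)   # [6, 7, 8, 9]
```

Hugging Face tokenizers expose this choice through the `truncation_side` attribute, which can be set to `"left"` before tokenizing.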

In [41]:
# Tokenize the training dataset using the base tokenizer
# The preprocess_function_base tokenizes the text, truncates it to 512 tokens, and pads the sequences
# The batched=True argument processes the dataset in batches for efficiency
dataset_labeled_train_tokenized_base = dataset_labeled_train.map(
    preprocess_function_base, batched=True
)

# Tokenize the validation dataset using the base tokenizer
dataset_labeled_valid_tokenized_base = dataset_labeled_valid.map(
    preprocess_function_base, batched=True
)

# Tokenize the training dataset using the large tokenizer
dataset_labeled_train_tokenized_large = dataset_labeled_train.map(
    preprocess_function_large, batched=True
)

# Tokenize the validation dataset using the large tokenizer
dataset_labeled_valid_tokenized_large = dataset_labeled_valid.map(
    preprocess_function_large, batched=True
)


In [42]:
from transformers import DataCollatorWithPadding

# Create a data collator for the base tokenizer
# The data collator dynamically pads the input sequences to the maximum length in the batch
# This ensures that all sequences in a batch have the same length, which is required for efficient processing
data_collator_base = DataCollatorWithPadding(tokenizer=tokenizer_base)

# Create a data collator for the large tokenizer
data_collator_large = DataCollatorWithPadding(tokenizer=tokenizer_large)

In [43]:
# Import the evaluate module from the Hugging Face library
import evaluate

# Load the accuracy metric from the evaluate module
# This metric will be used to evaluate the performance of the model
accuracy = evaluate.load("accuracy")

In [44]:
import numpy as np


# Define a function to compute evaluation metrics
# This function will be used to evaluate the performance of the model during training and validation
def compute_metrics(eval_pred):
    # Unpack the predictions and labels from the evaluation tuple
    predictions, labels = eval_pred

    # Convert the model's output logits to predicted class labels
    # np.argmax(predictions, axis=1) selects the index of the maximum logit for each prediction
    predictions = np.argmax(predictions, axis=1)

    # Compute the accuracy metric using the predicted and true labels
    # accuracy.compute() calculates the accuracy of the predictions
    return accuracy.compute(predictions=predictions, references=labels)
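
To see what `compute_metrics` returns, the following dependency-free sketch reproduces its logic on a tiny, made-up batch. `argmax_accuracy` is a stand-in written for this example; it mirrors the argmax step and the dictionary format returned by the `accuracy` metric, without calling `numpy` or `evaluate`.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row and compare it with the true labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for 3 documents and 4 classes
logits = [
    [2.1, 0.3, -1.0, 0.0],   # predicted class 0
    [0.1, 0.2, 3.0, -0.5],   # predicted class 2
    [0.0, 1.5, 0.2, 0.1],    # predicted class 1
]
labels = [0, 2, 3]  # the third prediction is wrong

result = argmax_accuracy(logits, labels)
print(result)  # accuracy is 2/3
```

Note that the logits do not need to pass through a softmax first, because the largest logit always corresponds to the largest probability.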

In [45]:
# Determine the number of unique labels in the training dataset
# This will be used to configure the classification model
n_labels = df_train.label.nunique()

# Load the configuration for the base masked language model (MLM) and modify it for sequence classification
# The configuration is loaded from the specified directory and the number of labels is set to n_labels
config_base = AutoConfig.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm",
    num_labels=n_labels,
)

# Load the base masked language model (MLM) and modify it for sequence classification
# The model is loaded from the specified directory and the configuration is set to config_base
classifier_base = AutoModelForSequenceClassification.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm",
    config=config_base,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at outputs/transformers_basics/bert_masked_lm_base/bert-base-portuguese-cased-finetuned-mlm and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [46]:
from transformers import Trainer, TrainingArguments

# Define the training arguments for the base classifier
# These arguments configure various aspects of the training process
training_args_base = TrainingArguments(
    output_dir=path_to_save_lm_base
    / "base_classifier_legal",  # Directory to save the model and other outputs
    learning_rate=2e-5,  # Learning rate for the optimizer
    per_device_train_batch_size=48,  # Batch size for training (adjust based on GPU memory)
    per_device_eval_batch_size=64,  # Batch size for evaluation (adjust based on GPU memory)
    num_train_epochs=5,  # Number of training epochs
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients before updating
    weight_decay=0.01,  # Weight decay for regularization
    bf16=True,  # Use bfloat16 mixed precision for training (adjust based on GPU support)
    eval_strategy="epoch",  # Evaluate the model after each epoch
    logging_strategy="steps",  # Log the training progress every `logging_steps` steps
    save_strategy="epoch",  # Save the model after each epoch
    eval_steps=1,  # Only used when eval_strategy="steps"; ignored with "epoch"
    save_steps=1,  # Only used when save_strategy="steps"; ignored with "epoch"
    logging_steps=10,  # Log the training progress after every 10 steps
    load_best_model_at_end=True,  # Reload the best checkpoint (lowest validation loss by default) at the end
    seed=271828,  # Seed for reproducibility
)

# Create a Trainer instance for the base classifier
# The Trainer handles the training and evaluation of the model
trainer_base = Trainer(
    model=classifier_base,  # The model to be trained
    args=training_args_base,  # Training arguments
    train_dataset=dataset_labeled_train_tokenized_base,  # Training dataset
    eval_dataset=dataset_labeled_valid_tokenized_base,  # Evaluation dataset
    tokenizer=tokenizer_base,  # Tokenizer for preprocessing the input text
    data_collator=data_collator_base,  # Data collator for dynamic padding
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics
)

# Train the model using the Trainer
trainer_base.train()



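| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.6032        | 0.615268        | 0.743369 |
| 2     | 0.5252        | 0.587841        | 0.759437 |
| 3     | 0.4957        | 0.557939        | 0.773199 |
| 4     | 0.4458        | 0.571086        | 0.769816 |
| 5     | 0.4054        | 0.575862        | 0.775505 |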




TrainOutput(global_step=2710, training_loss=0.5300090861056563, metrics={'train_runtime': 4108.5228, 'train_samples_per_second': 63.315, 'train_steps_per_second': 0.66, 'total_flos': 6.844430787637248e+16, 'train_loss': 0.5300090861056563, 'epoch': 5.0})
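
As a sanity check on the `TrainOutput` above, the step count can be reproduced with a little arithmetic. The sketch below assumes training ran on two devices, which the notebook does not state; that value is chosen only because it makes the numbers match `global_step=2710`.

```python
import math

n_train_examples = 52026      # training-set size reported elsewhere in this notebook
per_device_batch_size = 48    # per_device_train_batch_size from TrainingArguments
n_devices = 2                 # ASSUMPTION: not stated in the notebook
grad_accum_steps = 1          # gradient_accumulation_steps from TrainingArguments
n_epochs = 5                  # num_train_epochs from TrainingArguments

# Number of examples consumed by one optimizer update
effective_batch_size = per_device_batch_size * n_devices * grad_accum_steps

# The last, partially filled batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(n_train_examples / effective_batch_size)
total_steps = steps_per_epoch * n_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```

Under that assumption the effective batch size is 96, giving 542 steps per epoch and 2710 steps in total. With a single device, the same settings would give 1084 steps per epoch (5420 in total).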

In [47]:
trainer_base.evaluate()



{'eval_loss': 0.5579385161399841,
 'eval_accuracy': 0.7731990466671792,
 'eval_runtime': 62.6599,
 'eval_samples_per_second': 207.581,
 'eval_steps_per_second': 1.628,
 'epoch': 5.0}

`Can you guess why the accuracy is so low?`


## Understanding Low Accuracy: The Limitation of 512 Tokens

When working with transformer models, it's essential to be aware of a key limitation: many models, including the BERT-style model used in this notebook, can process at most **512 tokens** per input. This restriction can significantly reduce prediction accuracy, especially when dealing with longer texts.

### The Self-Attention Mechanism and Quadratic Complexity

The 512-token limit is largely a consequence of the *quadratic complexity* of the **self-attention mechanism**, which is a fundamental component of transformer models. Self-attention allows the model to weigh the importance of each token in relation to all others, enabling it to capture context and dependencies within the input text.

However, the computational cost of self-attention grows quadratically with the number of tokens: a sequence of length *n* requires *n²* attention scores per head. As the input length increases, the memory and compute requirements quickly become prohibitive. To keep training tractable, BERT-style models are pretrained with a fixed maximum length, commonly 512 tokens, and their learned position embeddings do not extend beyond it.
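
To get a feel for the quadratic growth, the sketch below counts the entries of the attention score matrix for a few sequence lengths. It considers a single attention head in a single layer and ignores everything else the model stores, so the numbers are only meaningful relative to each other; `attention_matrix_entries` is a hypothetical helper.

```python
def attention_matrix_entries(seq_len):
    """Number of pairwise attention scores for one head: every token attends to every token."""
    return seq_len * seq_len


baseline = attention_matrix_entries(512)

for seq_len in (512, 1024, 2048, 4096):
    entries = attention_matrix_entries(seq_len)
    # Doubling the sequence length quadruples the number of scores
    print(f"{seq_len:>5} tokens -> {entries:>10,} scores ({entries // baseline}x the 512-token case)")
```

Going from 512 to 4,096 tokens multiplies the sequence length by 8 but the number of attention scores by 64.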

### The Impact of Truncation on Accuracy

When an input text exceeds 512 tokens and truncation is enabled (`truncation=True`), the tokenizer removes tokens until the text fits within the limit. By default, tokens are dropped from the end (right side) of the text. This truncation can have a detrimental effect on the model's accuracy.

Important information, such as key context or relevant details, may be lost during truncation. The model is forced to make predictions based on an incomplete representation of the original text, leading to lower accuracy scores.

### Strategies for Handling Longer Texts

While the 512-token limit can be challenging, there are several approaches to mitigate its impact:

1. **Sliding Window Approach**:
    - Divide the long text into smaller, overlapping chunks (windows).
    - Process each window individually and aggregate the results.
    - This approach can help capture local context, but it may struggle with long-range dependencies.

2. **Alternative Neural Network Architectures**:
    - Consider using other architectures, such as Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs).
    - These architectures can handle longer sequences without the same token limit constraints.
    - However, they may not capture long-range dependencies as effectively as transformers.

3. **Transformer Variants for Longer Sequences**:
    - Explore transformer-based models specifically designed for handling longer texts, such as Longformer and BigBird.
    - These models introduce modifications to the self-attention mechanism to reduce computational complexity.
    - Keep in mind that these models are relatively new and may have limitations or trade-offs compared to standard transformers.
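
The sliding window approach from the list above can be sketched in plain Python. This is a minimal illustration, not the notebook's method: `sliding_windows` is a hypothetical helper, it operates on a plain list of token IDs, and it leaves out special tokens such as `[CLS]` and `[SEP]` as well as the aggregation step.

```python
def sliding_windows(token_ids, window_size=512, stride=256):
    """Split a token sequence into overlapping windows of at most `window_size` tokens."""
    if not 0 < stride <= window_size:
        raise ValueError("stride must be in the range 1..window_size")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        # Stop once the current window reaches the end of the sequence
        if start + window_size >= len(token_ids):
            break
        start += stride
    return windows


# Toy example: 10 "tokens", windows of 4 with a stride of 2 (overlap of 2 tokens)
toy_tokens = list(range(10))
toy_windows = sliding_windows(toy_tokens, window_size=4, stride=2)
print(toy_windows)
```

Each window would then be classified separately and the per-window predictions combined, for example by averaging the logits; which aggregation works best depends on the task. Hugging Face tokenizers can also produce overlapping chunks directly through the `return_overflowing_tokens` and `stride` arguments.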


To make informed decisions about handling longer texts, it's crucial to understand the characteristics of your dataset. Analyze the average number of tokens per text and the distribution of text lengths.

If a significant portion of your texts exceeds the 512-token limit, consider applying one of the strategies mentioned above. Experiment with different approaches and evaluate their impact on accuracy and computational efficiency.

In [48]:
from transformers import AutoTokenizer

# Load the tokenizer for the fine-tuned base masked language model (MLM)
# This tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Initialize an empty list to store the sizes of tokenized input sequences
sizes = []

# Iterate over each text in the training dataset
for txt in df_train.text:
    # Tokenize the text without truncation and get the length of the tokenized input sequence
    # Append the length of the tokenized input sequence to the sizes list
    sizes.append(len(tokenizer_base(txt, truncation=False)["input_ids"]))

# Convert the sizes list to a Pandas Series and display descriptive statistics
# This provides an overview of the distribution of tokenized input sequence lengths
pd.Series(sizes).describe()

count    52026.000000
mean      2373.339407
std       1822.717847
min        151.000000
25%       1133.000000
50%       1799.000000
75%       3031.000000
max      11434.000000
dtype: float64

`As we can see above, the average number of tokens in our dataset is 2,373, and even the 25th percentile (1,133 tokens) is well above the 512-token limit, so more than three quarters of our texts get truncated. Therefore, we need a workaround to handle this limitation. We won't cover more complex approaches in this class, but we can use a simple and effective workaround: understanding our data! Let's see how we can do this.`
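
A quick way to quantify the problem is to compute the share of texts that exceed the limit. The sketch below uses a handful of made-up token counts rather than the real dataset, and `share_over_limit` is a hypothetical helper; in the notebook, the `sizes` list computed above could be passed in instead.

```python
def share_over_limit(lengths, limit=512):
    """Fraction of sequences whose token count exceeds `limit`."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > limit) / len(lengths)


# Hypothetical token counts, NOT the real dataset (in the notebook, pass the `sizes` list)
toy_sizes = [151, 480, 1133, 1799, 3031, 11434]
print(f"{share_over_limit(toy_sizes):.1%} of the toy texts exceed 512 tokens")
```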

In [49]:
df_train.sample(10, random_state=271828)["text"].iloc[0]

"SENTENÇA Tipo A RELATÓRIO Trata-se de ação declaratória de inexistência de débito e indenizatória por danos morais, com pedido de repetição de indébito, ajuizada por Lúcia Matias de Souza em face do Instituto Nacional do Seguro Social – INSS e do Banco Bradesco S/A, em razão da existência de contrato de empréstimo consignado celebrado perante a aludida instituição financeira que, segundo diz a autora, não foi por ela contratado. É o que importa relatar. Passo a decidir. FUNDAMENTAÇÃO Das preliminares arguidas Quanto à preliminar de ilegitimidade passiva alegada pelo INSS (anexo 11), entendo que a Autarquia ré detém legitimidade para figurar no pólo passivo da ação, tendo em vista que é responsável pelo gerenciamento e pagamento dos descontos realizados nos benefícios previdenciários em decorrência de empréstimo consignado. Assim, a partir do momento em que opera o desconto nos valores tem interesse e legitimidade para figurar no pólo passivo da presente demanda. Ademais, só o INSS tem


`Did you notice that the really relevant information for our classification task is not at the beginning of the text, but at the end?`

> (....)
>
> DISPOSITIVO Isso posto, `julgo PROCEDENTE` o pedido para determinar que o INSS cesse os descontos das parcelas do Contratono 808431996. Condeno, também, a título de danos materiais, o Banco Bradesco a devolver os valores descontados com relação aos citados contratos de empréstimo, em dobro, nos termos do art. 42, parágrafo único, do CDC, devendo tais valores serem acrescidos de juros de mora de 1% ao mês desde o evento danoso (súmula 54 – STJ) e correção monetária com base no IPCA-E desde o efetivo prejuízo (súmula 43 – STJ). Condeno, ainda, o bancoréua pagar, a título de indenização por danos morais, a quantia de R$ 5.000,00 (cinco mil reais), valor este que deve ser atualizado exclusivamente pela taxa SELIC desde a publicação desta sentença. Declaro a inexistência do contrato no808431996. Declaro extinto o processo com resolução do mérito, nos termos do art. 487, I, do Código de Processo Civil. Custas e honorários advocatícios indevidos em primeiro grau de jurisdição (art. 55 da Lei no 9.099/95, c/c art. 1o da Lei no 10.259/01). Registre-se. Intimem-se as partes (Lei no 10.259/01, art. 8o). Campina Grande-PB, data supra. JUIZ FEDERAL
>

This is very common in this kind of document. The judge starts with a thorough description of the case and only then states the decision. So, we can use the last 512 tokens of the text to train our model. We just need to set the tokenizer's `truncation_side` attribute to `'left'`.

Let's see how we can do this.

In [50]:
from transformers import AutoTokenizer
from functools import partial

# Load the tokenizer for the fine-tuned base masked language model (MLM)
# This tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm"
)

# Load the tokenizer for the fine-tuned large masked language model (MLM)
# This tokenizer is used to preprocess the input text for the large model
tokenizer_large = AutoTokenizer.from_pretrained(
    path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm"
)

In [51]:
tokenizer_base.truncation_side

'right'

In [52]:
# Tokenize the input text using the base tokenizer
# padding=True pads to the longest sequence in the batch (no effect here, since there is a single sequence)
# truncation=True truncates the sequence to max_length if it exceeds it
# max_length=5 sets the maximum length to 5 tokens, including the special tokens [CLS] and [SEP]
out_len5 = tokenizer_base(
    "Eu gosto muito de farofa com banana", padding=True, truncation=True, max_length=5
)  # This is to simulate the truncation

# Decode the tokenized input IDs back to a string
# This converts the token IDs back to the corresponding text
# With truncation_side='right', the beginning of the text is kept
tokenizer_base.decode(out_len5["input_ids"])

'[CLS] Eu gosto muito [SEP]'

In [53]:
# Set the truncation side for the base tokenizer to 'left'
# This means that if the input text needs to be truncated, tokens will be removed from the beginning (left side) of the sequence
# This setting is useful when the most important information is at the end of the sequence
tokenizer_base.truncation_side = "left"

In [54]:
# Tokenize the input text using the base tokenizer
# padding=True pads to the longest sequence in the batch (no effect here, since there is a single sequence)
# truncation=True truncates the sequence to max_length if it exceeds it
# max_length=5 sets the maximum length to 5 tokens, including the special tokens [CLS] and [SEP]
out_len5 = tokenizer_base(
    "Eu gosto muito de farofa com banana", padding=True, truncation=True, max_length=5
)

# Decode the tokenized input IDs back to a string
# This converts the token IDs back to the corresponding text
# With truncation_side='left', the end of the text is kept
tokenizer_base.decode(out_len5["input_ids"])

'[CLS] com banana [SEP]'
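
The effect of the truncation side can also be illustrated without a tokenizer. The sketch below is a simplification: it splits on whitespace instead of using WordPiece tokens and ignores the special tokens, so the kept words will not match the tokenizer output above exactly. `truncate` is a hypothetical helper, not a `transformers` function.

```python
def truncate(tokens, max_tokens, side="right"):
    """Keep at most `max_tokens` tokens, dropping the excess from the given side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if max_tokens <= 0:
        return []
    if len(tokens) <= max_tokens:
        return list(tokens)
    # 'right' drops tokens from the end (keeps the beginning);
    # 'left' drops tokens from the beginning (keeps the end)
    return list(tokens[:max_tokens]) if side == "right" else list(tokens[-max_tokens:])


words = "Eu gosto muito de farofa com banana".split()
kept_right = truncate(words, 3, side="right")
kept_left = truncate(words, 3, side="left")
print(kept_right)
print(kept_left)
```

Right truncation keeps the first three words, while left truncation keeps the last three, which is the behavior we want when the decisive content sits at the end of the text.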

In [55]:
import datasets

# Convert the training DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for training and evaluation
dataset_labeled_train = datasets.Dataset.from_pandas(df_train)

# Convert the validation DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for validation and evaluation
dataset_labeled_valid = datasets.Dataset.from_pandas(df_valid)

In [56]:
from functools import partial


# Define a function to preprocess the input examples using a specified tokenizer
# The function tokenizes the input text, truncates it to a maximum length of 512 tokens
# (from the side given by tokenizer.truncation_side), and pads each sequence to the
# longest one in the batch being processed
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)


# Create a partial function for preprocessing using the base tokenizer
# This partial function allows us to call preprocess_function with only the examples argument,
# as the tokenizer argument is already set to tokenizer_base
preprocess_function_base = partial(preprocess_function, tokenizer=tokenizer_base)

# Create a partial function for preprocessing using the large tokenizer
preprocess_function_large = partial(preprocess_function, tokenizer=tokenizer_large)

In [57]:
# Tokenize the training dataset using the base tokenizer
# The preprocess_function_base tokenizes the text, truncates it to 512 tokens, and pads the sequences
# The batched=True argument processes the dataset in batches for efficiency
dataset_labeled_train_tokenized_base = dataset_labeled_train.map(
    preprocess_function_base, batched=True
)

# Tokenize the validation dataset using the base tokenizer
dataset_labeled_valid_tokenized_base = dataset_labeled_valid.map(
    preprocess_function_base, batched=True
)

Map:   0%|          | 0/52026 [00:00<?, ? examples/s]

Map:   0%|          | 0/13007 [00:00<?, ? examples/s]

In [58]:
from transformers import DataCollatorWithPadding

# Create a data collator for the base tokenizer
# The data collator dynamically pads the input sequences to the maximum length in the batch
# This ensures that all sequences in a batch have the same length, which is required for efficient processing
data_collator_base = DataCollatorWithPadding(tokenizer=tokenizer_base)

In [59]:
# Import the evaluate module from the Hugging Face library
import evaluate

# Load the accuracy metric from the evaluate module
# This metric will be used to evaluate the performance of the model
accuracy = evaluate.load("accuracy")

In [60]:
import numpy as np


# Define a function to compute evaluation metrics
# This function will be used to evaluate the performance of the model during training and validation
def compute_metrics(eval_pred):
    # Unpack the predictions and labels from the evaluation tuple
    predictions, labels = eval_pred

    # Convert the model's output logits to predicted class labels
    # np.argmax(predictions, axis=1) selects the index of the maximum logit for each prediction
    predictions = np.argmax(predictions, axis=1)

    # Compute the accuracy metric using the predicted and true labels
    # accuracy.compute() calculates the accuracy of the predictions
    return accuracy.compute(predictions=predictions, references=labels)

In [61]:
# Determine the number of unique labels in the training dataset
# This will be used to configure the classification model
n_labels = df_train.label.nunique()

# Load the configuration for the base masked language model (MLM) and modify it for sequence classification
# The configuration is loaded from the specified directory and the number of labels is set to n_labels
config_base = AutoConfig.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm",
    num_labels=n_labels,
)

# Load the base masked language model (MLM) and modify it for sequence classification
classifier_base = AutoModelForSequenceClassification.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm",
    config=config_base,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at outputs/transformers_basics/bert_masked_lm_base/bert-base-portuguese-cased-finetuned-mlm and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [62]:
from transformers import Trainer, TrainingArguments

# Define the training arguments for the base classifier
# These arguments configure various aspects of the training process
training_args_base = TrainingArguments(
    output_dir=path_to_save_lm_base
    / "base_classifier_legal",  # Directory to save the model and other outputs
    learning_rate=2e-5,  # Learning rate for the optimizer
    per_device_train_batch_size=48,  # Batch size for training (adjust based on GPU memory)
    per_device_eval_batch_size=64,  # Batch size for evaluation (adjust based on GPU memory)
    num_train_epochs=5,  # Number of training epochs
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients before updating
    weight_decay=0.01,  # Weight decay for regularization
    bf16=True,  # Use bfloat16 mixed precision for training (adjust based on GPU support)
    eval_strategy="epoch",  # Evaluate the model after each epoch
    logging_strategy="steps",  # Log the training progress every `logging_steps` steps
    save_strategy="epoch",  # Save the model after each epoch
    eval_steps=1,  # Only used when eval_strategy="steps"; ignored with "epoch"
    save_steps=1,  # Only used when save_strategy="steps"; ignored with "epoch"
    logging_steps=10,  # Log the training progress after every 10 steps
    load_best_model_at_end=True,  # Reload the best checkpoint (lowest validation loss by default) at the end
    seed=271828,  # Seed for reproducibility
)

# Create a Trainer instance for the base classifier
# The Trainer handles the training and evaluation of the model
trainer_base = Trainer(
    model=classifier_base,  # The model to be trained
    args=training_args_base,  # Training arguments
    train_dataset=dataset_labeled_train_tokenized_base,  # Training dataset
    eval_dataset=dataset_labeled_valid_tokenized_base,  # Evaluation dataset
    tokenizer=tokenizer_base,  # Tokenizer for preprocessing the input text
    data_collator=data_collator_base,  # Data collator for dynamic padding
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics
)

# Train the model using the Trainer
trainer_base.train()





TrainOutput(global_step=2710, training_loss=0.11587148178987397, metrics={'train_runtime': 4188.4617, 'train_samples_per_second': 62.106, 'train_steps_per_second': 0.647, 'total_flos': 6.844430787637248e+16, 'train_loss': 0.11587148178987397, 'epoch': 5.0})

In [63]:
trainer_base.evaluate()



{'eval_loss': 0.1189282238483429,
 'eval_accuracy': 0.9610209886983931,
 'eval_runtime': 64.6572,
 'eval_samples_per_second': 201.169,
 'eval_steps_per_second': 1.578,
 'epoch': 5.0}

We've achieved a significant improvement in our model's accuracy, which rose from 77.3% to 96.1%.

Let's gain a better understanding of this improvement by examining it in terms of the error rate. The error rate is simply calculated as (1 - accuracy). With this formula, our initial error rate was 22.7%, and our improved error rate dropped to 3.9%.

To put this into perspective, we've cut the error rate by a factor of almost six. In other words, our model now makes far fewer mistakes than before.
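
The error-rate comparison can be checked directly from the `eval_accuracy` values printed by the two `trainer_base.evaluate()` calls. The variable names below are chosen for illustration only.

```python
# Accuracies reported by trainer_base.evaluate() before and after changing truncation_side
accuracy_right = 0.7731990466671792  # default truncation (keeps the first 512 tokens)
accuracy_left = 0.9610209886983931   # truncation_side="left" (keeps the last 512 tokens)

# Error rate is the complement of accuracy
error_right = 1 - accuracy_right
error_left = 1 - accuracy_left

# How many times smaller the error rate became
reduction_factor = error_right / error_left

print(f"error rate (right truncation): {error_right:.1%}")
print(f"error rate (left truncation):  {error_left:.1%}")
print(f"reduction factor: {reduction_factor:.2f}x")
```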

By using the last 512 tokens of each text, we directed the model's focus towards the most relevant information. This is a simple yet effective workaround for the 512-token limit on this dataset.

The method works here because we understood where the decisive information sits in our documents. `Remember, sometimes simplicity is the key to mastering complex challenges!`

# Questions

1. What are the key advantages of transformers compared to traditional sequential models like RNNs and LSTMs?

2. How does the self-attention mechanism in transformers function, and why is it important?

3. What role does positional encoding play in transformer architectures, and how is it typically implemented?

4. Why do transformers have quadratic complexity with respect to sequence length, and what challenges does this present when processing long texts?

5. In what ways have transformers been applied beyond natural language processing, and can you provide examples?

6. Can you describe the differences between BERT, GPT, and T5 transformer architectures?

7. How can transformers be utilized as feature extractors in NLP tasks?

8. What is the impact of the 512-token limit in transformers, and how can it affect the performance of models on longer texts?

9. What workaround was used in the notebook to handle texts longer than 512 tokens in the document classification task?

10. What steps were taken in the notebook to fine-tune a transformer model on domain-specific text and use it for a classification task?


`Answers are commented inside this cell`
<!-- 

1. Transformers offer key advantages over traditional sequential models like RNNs and LSTMs in that they can handle long-range dependencies and parallelize computations. Unlike RNNs and LSTMs, which process tokens sequentially and can struggle with vanishing gradients, transformers process entire sequences simultaneously using self-attention, allowing for efficient computation and better handling of dependencies across distant tokens.

2. The self-attention mechanism allows transformers to weigh the relevance of different tokens in the input sequence when generating the output for a specific token. It functions by computing Query, Key, and Value vectors for each token. The attention scores are calculated as the dot product of the Query vector of a token with the Key vectors of all tokens, resulting in weights that determine how much attention to pay to each token. These weights are then applied to the Value vectors to produce a new representation, capturing contextual relationships across the sequence.

3. Positional encoding addresses the fact that transformers process input tokens simultaneously without innate sequence order. It injects information about the position of each token in the sequence into the model. Typically, positional encoding is implemented by adding positional vectors to the input embeddings, often using sine and cosine functions of different frequencies to generate unique encodings for each position. This enables the transformer to capture the order of tokens and recognize sequence patterns.

4. Transformers have quadratic complexity with respect to sequence length because the self-attention mechanism requires calculating attention scores between every pair of tokens in the sequence. For a sequence of length _n_, this results in _n²_ computations. This quadratic scaling leads to high computational costs and memory usage when processing long texts, posing challenges in terms of efficiency and feasibility for longer sequences.

5. Transformers have been applied beyond natural language processing in various domains by adapting the self-attention mechanism to other types of sequential data. Examples include:
   - **Computer Vision**: Vision Transformers (ViT) treat images as sequences of patches (tokens) and capture spatial relationships using self-attention.
   - **Music Generation**: Models like MuseNet use transformers to generate complex musical compositions by modeling sequences of musical notes and attributes.
   - **Speech Recognition**: Speech-Transformers process sequences of acoustic feature vectors to transcribe speech to text.
   - **Video Processing**: Video Transformers apply self-attention across spatiotemporal tokens to analyze and understand video content.

6. **BERT** is a bidirectional encoder transformer that uses masked language modeling and next sentence prediction during pre-training, focusing on understanding the context from both directions. **GPT** is a unidirectional (left-to-right) decoder-only transformer that uses language modeling to predict the next word, excelling at text generation. **T5** is a text-to-text transfer transformer that frames all NLP tasks as text generation problems, using an encoder-decoder architecture and trained with a span-corruption objective.

7. Transformers can be used as feature extractors by using the contextualized representations generated by the model's hidden layers. Specifically, the vector corresponding to the `[CLS]` token, which represents the entire input sequence, can be extracted and used as a feature vector for downstream tasks like classification or regression. This approach leverages the rich syntactic and semantic information captured by the transformer without necessarily fine-tuning the whole model.

8. The 512-token limit in transformers arises because processing longer sequences would require excessive computational resources due to the quadratic complexity of self-attention. This limit means that when input texts exceed 512 tokens, they are truncated, potentially removing important information, especially if critical content is beyond this limit. This truncation can negatively affect the model's performance and accuracy on tasks involving long documents.

9. In the notebook, to handle texts longer than 512 tokens, the tokenizer's truncation side was adjusted to `'left'`, so that the model keeps the last 512 tokens of the text instead of the first 512. This approach ensures that in documents like legal texts, where important information like the judgment is often at the end, critical content is retained for the model to process, thereby improving accuracy.

10. The notebook followed these steps:
    - **Step 1**: Started with a pretrained transformer model.
    - **Step 2**: Optionally fine-tuned the model on domain-specific legal text to adapt it to the language and style of the domain.
    - **Step 3**: Tokenized and prepared the dataset, adjusting the tokenizer to truncate the beginning of texts (keep the end).
    - **Step 4**: Trained the model on the specific classification task (predicting court decision labels) using the adjusted data.
    - The process involved loading datasets, preprocessing, configuring the model for classification, and training with appropriate settings to handle the 512-token limit. -->