# Transformers
## IMD1107 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

# Summary

## Keypoints
- Transformers have revolutionized NLP due to their ability to handle long-range dependencies and parallelize computations, overcoming limitations of RNNs and LSTMs.

- The attention mechanism is central to transformers, allowing models to focus on the most relevant parts of the input sequence when making predictions.

- Transformers consist of an encoder and decoder, each with multiple layers including self-attention, feed-forward networks, and positional encoding.

- The quadratic complexity of transformers poses challenges for processing longer texts, impacting training time and memory consumption.

- Transformers have been successfully applied beyond NLP in domains like computer vision, music generation, speech recognition, and video processing.

- Common transformer architectures for NLP include BERT, GPT, RoBERTa, T5, and XLNet, each with unique strengths.

- Transformers can be used as feature extractors, leveraging their ability to capture rich syntactical and contextual information from text data.

- Key steps for using transformers involve starting with a pretrained model, optional domain-specific fine-tuning, task-specific training, and potentially using the model as a feature extractor.

- Many transformer models (BERT, for example) accept at most 512 tokens, a limit tied to the quadratic cost of attention. Accuracy can suffer when important information is lost as longer texts are truncated.

- A simple workaround to handle the 512-token limit is to focus on the most relevant information, such as using the last 512 tokens in legal documents where the decision is often at the end.

## Takeaways
- Transformers have become a fundamental tool in NLP, enabling more effective handling of long-range dependencies and parallelization compared to traditional sequential models.

- The attention mechanism allows transformers to capture complex relationships and focus on the most relevant information, leading to improved performance on various NLP tasks.

- While the quadratic complexity of transformers poses challenges for longer texts, ongoing research aims to develop more efficient variants and attention mechanisms to overcome these limitations.

- The successful application of transformers beyond NLP highlights their versatility in capturing patterns and dependencies in structured data across different domains.

- Understanding the architecture and components of transformers, such as the encoder-decoder structure, self-attention, and positional encoding, is crucial for effectively leveraging their capabilities.

- Familiarity with common transformer architectures like BERT, GPT, RoBERTa, T5, and XLNet allows practitioners to choose the most suitable model for their specific NLP task.

- Transformers can be powerful feature extractors, providing rich representations that capture syntactical and contextual information for downstream tasks.

- Following best practices, such as starting with pretrained models, fine-tuning on domain-specific data, and task-specific training, can help achieve optimal results when using transformers.

- Awareness of the 512-token limit and developing strategies to handle longer texts, such as focusing on the most relevant information or using sliding window approaches, is essential for maintaining accuracy in real-world applications.
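
The two strategies for long texts mentioned above can be sketched in a few lines. This is a minimal illustration, not a tokenizer API: the function names, the stride of 256, and the integer "token ids" are all assumptions made for the example, and real tokenizers also reserve a few positions for special tokens.

```python
def keep_last_tokens(tokens, max_len=512):
    """Tail truncation: keep only the final max_len tokens."""
    return tokens[-max_len:]


def sliding_windows(tokens, max_len=512, stride=256):
    """Split a long token list into overlapping windows of at most max_len tokens."""
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
        start += stride
    return windows


# Hypothetical document of 1,200 integer "token ids" standing in for tokenizer output.
document = list(range(1200))
tail = keep_last_tokens(document)
windows = sliding_windows(document)
print(len(tail), len(windows))
```

Tail truncation is cheap but discards everything before the last window. The sliding-window variant keeps all of the text, at the cost of running the model once per window and then combining the per-window predictions.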

# Transformers

Transformers have revolutionized the field of Natural Language Processing (NLP) since their introduction by Vaswani et al. in 2017. These architectures have become the foundation for tackling a wide array of NLP tasks, including question answering, text summarization, and machine translation. The key to their success lies in their ability to handle long-range dependencies effectively, avoiding the vanishing gradient problem that plagues traditional models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs).

## The Shift from Sequential Models

Prior to the advent of transformers, sequential model architectures such as RNNs and LSTMs dominated the NLP landscape. These models processed data sequentially, one piece at a time, which limited their ability to capture long-range dependencies and parallelize computations. Transformers introduced a paradigm shift by considering all the tokens in a sequence simultaneously, enabling them to capture complex relationships and dependencies between words more effectively.

## The Power of Attention Mechanism

At the heart of transformers lies the attention mechanism, a process that determines the influence of various inputs on the output. In essence, attention allows the model to focus on the most relevant parts of the input sequence when predicting a specific output. This is particularly useful in tasks like machine translation, where the correct translation of a word often depends on its context and relationships with other words in the sentence.

For example, when translating the word "it" from English to French, the attention mechanism would assign higher weights to the words that provide gender context, ensuring the correct gender agreement in the translated output. By capturing these dependencies, transformers can generate more accurate and contextually relevant predictions.

## A Closer Look

A typical transformer model consists of an encoder and a decoder, each composed of multiple identical layers. The encoder processes the input sequence, generating a vector representation for each token, while the decoder takes these vectors as inputs and predicts the output sequence one token at a time.

<p align="center">
<img src="images/transformers_basic.png" alt="" style="width: 40%; height: 40%"/>
</p>


<br>
<br>

The key components that make up the transformer architecture are:

### 1. Self-Attention Layer

Self-attention, also known as intra-attention, is a mechanism that relates each word in the input sequence to every other word. It calculates the similarity between words and assigns higher weights to words that are more closely related. This allows the model to capture the contextual relationships between words, regardless of their position in the sequence.

The self-attention process involves three main steps:

1. **Transformation:** Each input vector (word) is transformed into three different vectors: a Query vector, a Key vector, and a Value vector.
2. **Attention Scores:** The model calculates attention scores by measuring the compatibility between the Query and Key vectors. These scores determine the importance of each word in relation to the others.
3. **Output Generation:** The attention scores are used to weight the Value vectors, which are then summed up to produce the output for a single time step. This process allows each output to consider the entire input sequence, enabling the model to capture long-range dependencies effectively.

By processing each word and its context simultaneously, self-attention makes transformers highly parallelizable, resulting in faster training times compared to RNNs and LSTMs.
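
The three steps above can be sketched in plain Python. This is a minimal single-head example under stated assumptions: the scaling by the square root of the key dimension and the softmax normalization follow the original Transformer paper, while the helper names, the random weights, and the tiny sizes are purely illustrative.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x holds n token vectors of size d_model; w_q, w_k, w_v are d_model x d_k.
    Returns (outputs, attention_weights).
    """
    # Step 1: project every input vector into Query, Key, and Value vectors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Step 2: score each Query against every Key, scale, and normalize per row.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Step 3: each output is the attention-weighted sum of the Value vectors.
    return matmul(weights, v), weights


def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_tokens, d_model, d_k = 4, 8, 8  # illustrative sizes only
x = random_matrix(n_tokens, d_model)
outputs, weights = self_attention(
    x, random_matrix(d_model, d_k), random_matrix(d_model, d_k), random_matrix(d_model, d_k)
)
```

Each row of `weights` sums to 1 and says how much one token draws from every other token. Nothing in the computation of one row depends on another row, which is why all positions can be processed in parallel.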

### 2. Feed Forward Neural Networks (FFNN)

Following the self-attention layer, transformers employ a position-wise feed-forward network. This network consists of two linear transformations with a ReLU activation function in between. It operates on each position separately but shares the same parameters across all positions. While this layer does not change the dimensionality of the input, it introduces additional non-linearity and flexibility to the model, allowing it to learn more complex representations.
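
A small sketch can make the "position-wise" idea concrete. The code below assumes random weights and toy dimensions (4 for the model, 16 for the inner layer); only the structure, two linear maps with a ReLU in between and the same parameters at every position, comes from the description above.

```python
import random


def linear(vec, weight, bias):
    """Compute vec @ weight + bias for one vector (weight is in_dim x out_dim)."""
    return [sum(v * w for v, w in zip(vec, col)) + b
            for col, b in zip(zip(*weight), bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def position_wise_ffn(sequence, w1, b1, w2, b2):
    """Apply the same two-layer network to every position independently."""
    return [linear(relu(linear(token, w1, b1)), w2, b2) for token in sequence]


random.seed(0)
d_model, d_ff = 4, 16  # illustrative sizes only
w1 = [[random.gauss(0, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

# Positions 0 and 2 hold the same vector, so they must receive the same output.
sequence = [[1.0, 0.0, -1.0, 2.0], [0.5, 0.5, 0.5, 0.5], [1.0, 0.0, -1.0, 2.0]]
outputs = position_wise_ffn(sequence, w1, b1, w2, b2)
```

The output keeps the input dimensionality, and identical vectors at different positions produce identical results, since no information is exchanged between positions in this layer.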

### 3. Positional Encoding

Since the self-attention layer does not inherently consider the order of words in a sequence, transformers need a mechanism to incorporate positional information. This is achieved through positional encoding, which injects information about a token's position directly into its vector representation.

One common approach is to use sine and cosine functions to generate positional encodings. These functions are applied to different elements of the token vectors, effectively encoding the position as angles in a high-dimensional space. The periodic nature of these functions may allow the model to extrapolate to sequences longer than those seen during training, making it more robust to variations in sequence length.
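
As a rough illustration, the sinusoidal scheme from the original Transformer paper can be written as follows. The function name and the sizes are assumptions for the example; the formula (sine on even dimensions, cosine on odd ones, with wavelengths controlled by the constant 10000) follows that paper.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=50, d_model=16)
print([round(value, 3) for value in pe[1][:4]])
```

The resulting row for a position is added element-wise to the embedding of the token at that position, so the same word receives a slightly different vector depending on where it appears.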

> The combination of self-attention, feed-forward networks, and positional encoding enables transformers to capture complex relationships between words, adapt to different contexts, and maintain the order of the sequence. These components work together to create a powerful and flexible architecture that outperforms traditional sequential models in various NLP tasks.
>
> Transformers have transformed the NLP landscape, offering a more effective and efficient approach to handling long-range dependencies and capturing contextual relationships between words. By leveraging the attention mechanism, feed-forward networks, and positional encoding, transformers have become the go-to architecture for a wide range of NLP tasks, from question answering to machine translation.
>
> As you continue to explore the world of transformers, keep in mind the key concepts discussed in this overview. Understanding the inner workings of self-attention, the role of feed-forward networks, and the importance of positional encoding will provide you with a solid foundation for working with these powerful architectures.

## Quadratic Complexity in Transformers

Transformers have revolutionized the field of natural language processing (NLP) and achieved remarkable success in various tasks. However, despite their numerous advantages, Transformers face a significant limitation when dealing with longer texts due to their computational complexity.

### Understanding the Quadratic Complexity

The time and space complexity of Transformers is `O(n^2)`, where `n` represents the length of the input sequence. This quadratic complexity arises from the attention mechanism, which is a central component of Transformers.

In the self-attention layer, each token in the input sequence needs to calculate its relationship with every other token. As a result, for an input sequence of length `n`, the model performs `n^2` computations for each layer of the Transformer. This means that as the input text grows longer, the number of computations required increases quadratically.

The quadratic complexity not only affects the computational time but also the memory usage. The model needs to store the computed relationships between all pairs of tokens, leading to high memory consumption for longer sequences.

<p align="center">
<img src="images/transformer_quadratic.webp" alt="" style="width: 50%; height: 50%"/>
</p>

The image above illustrates the computational complexity for a sequence of length 9. In this case, the model performs 81 (or `9^2`) computations. Notably, as the sequence length doubles, the number of computations quadruples.
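
The same arithmetic can be checked with a tiny script. The helper name is made up for this example, and it counts only the query-key scores of a single attention head in a single layer.

```python
def attention_score_count(seq_len):
    """Number of query-key pairs scored by full self-attention (one head, one layer)."""
    return seq_len * seq_len


for n in (9, 18, 512, 1024):
    print(f"n={n:>5}  pairwise scores={attention_score_count(n):>9,}")
```

Going from 9 to 18 tokens raises the count from 81 to 324, and going from 512 to 1,024 tokens raises it from 262,144 to 1,048,576, a fourfold increase each time.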

### Impact on Real-World Applications

The quadratic complexity of Transformers poses challenges when applying them to tasks involving long documents or large-scale language modeling. Some real-world applications affected by this limitation include:

1. **Question Answering**: In question answering tasks, the input typically combines a question with a context passage that contains the answer. The context can be quite lengthy, and the model needs to consider all of it to provide the correct answer.

2. **Machine Translation**: Machine translation involves translating a sentence from one language to another. The input sentence can be long, and the model needs to take into account the complete context to generate an accurate translation.

3. **Summarization**: Summarization tasks involve generating a concise summary of a long document. The model needs to process the entire input text to capture the key information and produce a coherent summary.

4. **Language Modeling**: Language modeling aims to predict the next word or sequence of words given a context. When dealing with long documents, the model needs to consider the entire context to make accurate predictions.

5. **Text Classification**: Text classification involves assigning predefined categories or labels to input texts. Long documents pose challenges for Transformers in capturing the relevant information for accurate classification.

The quadratic complexity of Transformers results in slower training times and high memory consumption when dealing with longer sequences. This limitation hinders the practicability of Transformers in scenarios involving extensive texts.

### Addressing the Quadratic Complexity

To alleviate the quadratic complexity problem and make Transformers more efficient for longer sequences, several approaches have been proposed:

1. **Sparse Attention Mechanisms**: Instead of attending to all tokens in the input sequence, sparse attention mechanisms focus on a subset of relevant tokens. By selectively attending to fewer tokens, these mechanisms reduce the computational complexity.

2. **Long Range Arena (LRA) Benchmark**: The LRA benchmark is not a modeling technique in itself. It was introduced to evaluate efficient Transformer variants, including those based on sparse attention, and provides a standardized framework for comparing their performance on longer sequences.

3. **Memory-Efficient Transformers**: Researchers have developed techniques to reduce the memory usage of Transformers. For example, locality-sensitive hashing has been used (as in the Reformer) to group similar tokens so that attention is computed only within each group instead of over all pairs.

Ongoing research efforts aim to further improve the efficiency and scalability of Transformers, enabling their application to a wider range of text lengths and real-world scenarios.
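
To give a feel for how sparse attention reduces the cost, here is a sketch of one simple pattern, a local window in which each token attends only to its nearest neighbors. The function names and the window size of 4 are assumptions for the example, and this is not the implementation of any particular model.

```python
def local_attention_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j (|i - j| <= window)."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def count_attended_pairs(mask):
    """Count the (query, key) pairs that are actually scored."""
    return sum(sum(row) for row in mask)


for n in (64, 128):
    dense = n * n
    sparse = count_attended_pairs(local_attention_mask(n, window=4))
    print(f"n={n:>4}  full attention={dense:>6,}  local window={sparse:>6,}")
```

With a fixed window, doubling the sequence length roughly doubles the number of scored pairs (556 versus 1,132 here) instead of quadrupling it. The trade-off is that distant tokens no longer interact directly within a single layer.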

> It's important to note that while the quadratic complexity of Transformers presents challenges, their effectiveness in capturing contextual information and achieving state-of-the-art performance in various NLP tasks cannot be overlooked. The development of more efficient Transformer variants and attention mechanisms is an active area of research, with the goal of overcoming the limitations posed by longer sequences.
>
> As advancements continue to be made, Transformers are expected to become more capable of handling longer texts efficiently, expanding their applicability and impact in the field of natural language processing.

## Applications of Transformers Beyond NLP

Transformers, originally designed for Natural Language Processing (NLP) tasks, have a unique architecture that allows them to identify complex patterns and dependencies in input data. This powerful capability has led to their application in various domains beyond text processing, yielding promising results. Let's explore some of these areas where Transformers have made significant contributions.

### Computer Vision

In the field of computer vision, Transformers offer a significant advantage in recognizing long-range dependencies between pixels. By treating the pixels of an image as a sequence, similar to how they handle words in a sentence, Transformers can effectively process and analyze images at a granular level.

#### Image Transformer

The Image Transformer is an advanced version of the original transformer model specifically designed for image processing. It treats each pixel in an image as a sequence token, enabling it to capture fine-grained details and relationships within the image.

#### Vision Transformer (ViT)

The Vision Transformer (ViT) takes a unique approach to image processing by using a single transformer encoder to process patches of an image. Instead of relying on traditional convolutional layers, ViT treats these image patches as tokens within a sentence, allowing it to learn and extract meaningful features from the visual data.
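
The "patches as tokens" idea can be sketched without any deep learning library. The example below assumes a toy 8x8 single-channel image and a patch size of 4; the function name is hypothetical. In an actual ViT, each flattened patch would then be linearly projected and combined with a position embedding before entering the encoder.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Toy 8x8 "image" whose pixel values encode their own position (row * 8 + column).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patches = image_to_patches(image, patch_size=4)
print(len(patches), len(patches[0]))
```

The 64 pixels become a sequence of just 4 tokens of 16 values each, which keeps the sequence length (and therefore the attention cost) far below what a pixel-level sequence would require.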

### Music Generation

Transformers have also found fascinating applications in the realm of music generation. The self-attention mechanism of Transformers provides a long context, which is particularly beneficial for music composition since musical notes often have dependencies on previous notes.

#### MuseNet

MuseNet is a notable example of how Transformers can be used for music generation. It utilizes Transformers to generate musical compositions up to four minutes in length, incorporating ten different instruments. Moreover, MuseNet has the ability to combine various musical styles, ranging from country music and classical Mozart to the Beatles, showcasing its versatility and creative potential.

### Speech Recognition

Transformers have shown remarkable effectiveness in the field of speech recognition. Their self-attention mechanism excels at modeling the temporal dynamics of speech, making them highly suitable for Automatic Speech Recognition (ASR) systems.

#### Speech-Transformer

The Speech-Transformer simplifies ASR systems by eliminating the need for complex components such as Hidden Markov Models or Connectionist Temporal Classification (CTC). Despite this simplification, the Speech-Transformer maintains its effectiveness in accurately recognizing and transcribing speech.

### Video Processing

Just as in computer vision, Transformers have found applications in video processing. Each frame of a video can be treated as a token, with sequence information indicating its position within the video.

#### Video Transformer

The Video Transformer leverages the power of Transformers to extract complex spatial and temporal patterns from video sequences. By analyzing the relationships between frames and their temporal order, the Video Transformer offers an effective way to understand and process video data.


<br>

> Transformers can offer significant benefits anywhere there is structured data with complex dependencies.
>
> The ability of Transformers to model detailed relationships between inputs makes them highly versatile and applicable across diverse domains. While their origins lie in NLP, researchers continue to explore and expand their potential, pushing the boundaries of what's possible with this powerful architecture.
>
> As we have seen, Transformers have successfully ventured beyond the realm of text processing, making significant contributions in areas such as computer vision, music generation, speech recognition, and video processing. Their unique ability to capture long-range dependencies and learn from structured data has opened up new avenues for innovation and advancement in these fields.

# Common Transformer Architectures for NLP

Transformers have transformed (sorry, pun intended) Natural Language Processing (NLP) with their ability to effectively capture dependencies in sequence data. This has led to the development of several powerful transformer architectures tailored for various NLP tasks. In this section, we will explore five commonly used transformer architectures and discuss their unique characteristics and strengths.

## 1. BERT (Bidirectional Encoder Representations from Transformers)

BERT, developed by Google, is a pre-trained transformer model that has significantly advanced the field of NLP. One of the key innovations of BERT is its bidirectional approach to language understanding. Unlike traditional models that only consider the words that come before a given word, BERT takes into account the full context by looking at both the preceding and following words. This bidirectional context enables BERT to capture a more nuanced understanding of language semantics, leading to improved performance on a wide range of NLP tasks.

BERT's architecture consists of multiple transformer encoder layers stacked on top of each other. During the pre-training phase, BERT is trained on large amounts of unlabeled text data using two novel unsupervised learning tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). MLM involves randomly masking a percentage of input tokens and training the model to predict the original tokens based on the surrounding context. NSP, on the other hand, trains the model to determine whether two given sentences follow each other in the original text. These pre-training tasks allow BERT to learn rich linguistic representations that can be fine-tuned for specific downstream tasks with minimal modifications.
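
A simplified sketch of the masking step in MLM is shown below. It assumes a 15% masking rate (the rate reported for BERT) and always substitutes a `[MASK]` token; the real procedure also sometimes keeps the original token or swaps in a random one, which is omitted here. The function and variable names are illustrative.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so the loss is computed only where a token was hidden.
    """
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


tokens = "the court granted the appeal filed by the plaintiff".split()
masked, labels = mask_tokens(tokens, rng=random.Random(0))
print(masked)
print(labels)
```

During pre-training, the model receives `masked` as input and is trained to recover the entries of `labels` that are not `None`, using the words on both sides of each gap.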

## 2. GPT (Generative Pretrained Transformer)

GPT, introduced by OpenAI, is another influential transformer architecture in NLP. Unlike BERT, GPT adopts a unidirectional approach, where only the previous words in a sentence are used for prediction. This unidirectional nature makes GPT particularly well-suited for tasks that require generating human-like text, such as language modeling, text completion, and conversational AI.

The GPT architecture consists of multiple transformer decoder layers stacked together. During pre-training, GPT is trained on a large corpus of text data using a language modeling objective, where the model learns to predict the next word given the previous words in a sequence. This allows GPT to capture the syntactic structures and patterns present in natural language.
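
The unidirectional constraint and the next-word objective can be illustrated with two small helpers. Both function names are made up for this sketch; the point is only that position `i` may look at positions up to `i`, and that every prefix of a sentence yields one training example.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def next_token_pairs(tokens):
    """Build (context, target) training pairs for next-word prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])

for context, target in next_token_pairs(["the", "court", "ruled", "today"]):
    print(context, "->", target)
```

The lower-triangular mask is what prevents a decoder from peeking at future words, which is also what makes left-to-right text generation possible at inference time.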

One of the strengths of GPT is its ability to generate coherent and fluent text that closely resembles human writing. By conditioning the model on a prompt or a few examples, GPT can generate contextually relevant and grammatically correct continuations. This has led to impressive results in tasks like story generation, dialogue systems, and content creation.

## 3. RoBERTa (Robustly Optimized BERT Approach)

RoBERTa, developed by Facebook, is a variant of BERT that aims to improve upon the original architecture. While sharing the same fundamental structure as BERT, RoBERTa introduces several key modifications to enhance performance and robustness.

One notable change in RoBERTa is the use of dynamic masking during the pre-training phase. Unlike the original BERT, which fixes the masked positions once during data preprocessing, RoBERTa generates a new masking pattern every time a sequence is fed to the model. This dynamic masking strategy exposes the model to a more diverse set of masked positions, leading to better generalization.

Additionally, RoBERTa employs larger batch sizes and trains on significantly more data compared to BERT. The increased training data and batch sizes help the model learn more robust representations and improve its performance on downstream tasks.

RoBERTa has demonstrated state-of-the-art results on various NLP benchmarks, often outperforming the original BERT model. Its success highlights the importance of careful optimization and training strategies in achieving optimal performance with transformer architectures.

## 4. T5 (Text-to-Text Transfer Transformer)

T5, introduced by Google, takes a unique approach to NLP by framing every task as a text-to-text problem. Instead of designing task-specific architectures, T5 proposes a unified framework where all tasks are treated as sequence-to-sequence problems, with the input and output being text strings.

For example, in a translation task, the input would be the source language text, and the output would be the target language text. Similarly, for a summarization task, the input would be the original text, and the output would be the summary. This consistent problem formulation allows T5 to be applied to a wide range of NLP tasks without requiring task-specific modifications.
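
In code, the text-to-text framing amounts to little more than string formatting. The sketch below assumes task prefixes in the style reported for T5 ("translate English to German", "summarize"); the helper name and the example strings are illustrative only.

```python
def to_text_to_text(task_prefix, text):
    """Cast a task as text-in, text-out by prepending a task prefix to the input."""
    return f"{task_prefix}: {text}"


# Each example is an (input string, target string) pair, whatever the task.
examples = [
    (to_text_to_text("translate English to German", "The house is wonderful."),
     "Das Haus ist wunderbar."),
    (to_text_to_text("summarize", "<long article text>"),
     "<short summary>"),
]

for source, target in examples:
    print(repr(source), "->", repr(target))
```

Because every task reduces to the same (input string, target string) format, a single model, loss function, and decoding procedure can serve all of them.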

T5 is pre-trained on a massive corpus of web pages using a denoising objective, where the model learns to reconstruct the original text from corrupted input sequences. This pre-training enables T5 to learn rich linguistic representations that can be fine-tuned for various downstream tasks.

One of the advantages of T5's text-to-text approach is its flexibility and simplicity. By treating all tasks as sequence-to-sequence problems, T5 can exploit the same architecture and training procedure across different tasks, reducing the need for task-specific engineering. This has led to impressive performance on a wide range of NLP benchmarks, making T5 a versatile and powerful tool in the NLP toolkit.

## 5. XLNet

XLNet, jointly developed by Google Brain and Carnegie Mellon University, combines the strengths of both BERT and GPT architectures. It aims to address some of the limitations of BERT while incorporating the auto-regressive nature of GPT.

One of the key differences between XLNet and BERT is the way they handle pre-training. While BERT corrupts the input with masked tokens, XLNet employs a novel permutation-based training objective called "Permutation Language Modeling" (PLM). In PLM, the factorization order of the sequence is randomly permuted (the tokens themselves keep their original positions), and the model is trained to predict each target token from the tokens that precede it in that order. This allows XLNet to capture bidirectional context while preserving the auto-regressive property of language modeling.
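
A toy sketch may help show why a permuted prediction order gives access to context on both sides. It assumes the usual reading of PLM, in which tokens keep their original positions and only the order of prediction is shuffled; the function name is hypothetical and no model is involved.

```python
import random


def permutation_contexts(tokens, rng=random):
    """Sample a prediction order and list, for each target position,
    the positions it is allowed to condition on."""
    order = list(range(len(tokens)))
    rng.shuffle(order)
    contexts = {}
    for step, target in enumerate(order):
        # A target may only see positions that come earlier in the sampled order.
        contexts[target] = sorted(order[:step])
    return order, contexts


tokens = ["the", "appeal", "was", "denied"]
order, contexts = permutation_contexts(tokens, rng=random.Random(1))
for target in order:
    visible = [tokens[p] for p in contexts[target]]
    print(f"predict {tokens[target]!r} from {visible}")
```

Depending on the sampled order, a word may be predicted from words to its left, to its right, or both. Averaged over many orders, every token is trained with context from both directions while each individual prediction remains auto-regressive.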

XLNet also introduces the concept of "two-stream self-attention," where the model uses both content-based and query-based attention mechanisms. This enables XLNet to better capture long-range dependencies and model more complex relationships between tokens.

Compared to BERT, XLNet has demonstrated improved performance on various NLP tasks, particularly in scenarios where long-range dependencies and auto-regressive modeling are crucial. Its ability to combine the strengths of both BERT and GPT has made XLNet a popular choice for many NLP applications.

## Model Sizes: Base vs. Large

When working with transformer architectures, you may encounter terms like "base" and "large" to describe the size of the model. These terms refer to the number of parameters and the depth of the architecture.

- **Base Models**: Base models are the standard size for a given transformer architecture. They typically have a moderate number of parameters and are designed to balance performance and computational efficiency. For example, BERT-base consists of 12 transformer encoder layers with 768 hidden units each, resulting in approximately 110 million parameters.

- **Large Models**: Large models are expanded versions of the base architecture, with increased depth and/or width. They have a significantly higher number of parameters compared to their base counterparts. For instance, BERT-large has 24 transformer encoder layers with 1024 hidden units each, amounting to around 340 million parameters.
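
The parameter counts quoted above can be approximated from the layer count and hidden size. The sketch below makes several assumptions that are not stated in this section: a vocabulary of 30,522 word pieces, 512 positions, 2 segment types, a feed-forward width of four times the hidden size, and a final pooling layer. The function name is illustrative.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Rough parameter count for a BERT-style encoder (assumed sizes, see text)."""
    ffn = 4 * hidden                      # assumed feed-forward width
    layer_norm = 2 * hidden               # one scale and one shift per dimension
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)
print(f"base:  {base / 1e6:.1f}M parameters")
print(f"large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the commonly cited figures of about 110 million and 340 million parameters. It also shows that most of the growth comes from the per-layer terms, which scale with the square of the hidden size.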

The choice between base and large models depends on several factors, such as the complexity of the task, the available computational resources, and the size of the training data. Large models generally achieve better performance due to their increased capacity to capture complex patterns and relationships in the data. However, they also require more computational resources and longer training times.

It's important to note that large models are more prone to overfitting, especially when trained on smaller datasets. They have a higher capacity to memorize noise and irrelevant patterns, which can hinder generalization to unseen data. Therefore, when working with limited training data, base models may be a more suitable choice to mitigate overfitting risks.

In practice, the decision between base and large models often involves a trade-off between performance and computational efficiency. It's common to start with a base model and scale up to a large model if the task demands higher performance and sufficient resources are available.

> As the field of NLP continues to evolve, we can expect further advancements and refinements in transformer architectures, pushing the boundaries of natural language understanding and generation. By leveraging these powerful tools and techniques, researchers and practitioners can unlock new possibilities in a wide range of NLP applications, from sentiment analysis and machine translation to question answering and content generation.

## Using Transformers as Feature Extractors

Transformers have revolutionized the field of Natural Language Processing (NLP) by providing a powerful tool for extracting rich syntactical and contextual information from text data. These extracted features can be leveraged for a wide range of downstream tasks, such as classification, regression, and more. In this section, we will explore the concept of features and how transformers can be effectively utilized as feature extractors.

### Understanding Features and Feature Extraction

Features are the distinctive properties or characteristics that are extracted from a dataset to capture its essential information. In the context of NLP, features can represent various aspects of text data, such as word frequencies, sentence structure, or semantic relationships. These features serve as the foundation for building accurate and insightful models for various tasks.

Feature extractors are sophisticated algorithms designed to automatically derive meaningful features from raw data, regardless of its modality (e.g., text, images, or audio). They are capable of identifying and capturing relevant patterns, structures, and relationships within the data, enabling more effective analysis and prediction.

### Transformers as Feature Extractors

Transformers are a class of neural networks that have proven to be exceptionally well-suited for extracting features from textual data. At their core, transformers operate by taking a sequence of words as input and generating a numeric vector representation that encapsulates the semantic meaning of the entire text.

When using transformers as feature extractors, the process involves feeding a word sequence into the transformer model and obtaining a dense numeric vector that captures the essential semantic information of the input text. This vector representation can then be used as input to other machine learning algorithms or neural networks, enabling them to perform various prediction and analysis tasks effectively.
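
One common way to turn per-token outputs into a single vector is mean pooling, sketched below in plain Python. The token vectors here are invented numbers standing in for the output of an encoder, and the function name is hypothetical; taking the vector of a special first token is another frequently used option.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(dim) / len(kept) for dim in zip(*kept)]


# Hypothetical encoder output: 4 positions x 3 dimensions; the last position is padding.
token_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0], [9.0, 9.0, 9.0]]
attention_mask = [1, 1, 1, 0]
features = mean_pool(token_vectors, attention_mask)
print(features)
```

The resulting fixed-size vector can be passed to any downstream model, for example logistic regression or gradient-boosted trees, regardless of how long the original text was.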

### Benefits and Limitations of Transformers as Feature Extractors

One of the key advantages of using transformers as feature extractors is their ability to handle textual data effectively, making them particularly useful for tasks such as text classification or regression. Transformers also excel at unsupervised learning, meaning they can extract meaningful features from unlabeled text data, reducing the reliance on labeled datasets.

Moreover, transformers have the capacity to capture complex semantic relationships and contextual information within the text, enabling them to generate rich and informative feature representations. This ability to capture elaborate patterns and dependencies makes transformers a powerful tool for various NLP applications.

However, it is important to note that transformers also have some limitations. They are computationally intensive and require substantial amounts of data for training. Additionally, transformers have a significant memory footprint due to the need to store the weights of the neural network. This can make them challenging to deploy on resource-constrained devices such as mobile phones or embedded systems with limited memory.

<br>

> While transformers have their limitations in terms of computational complexity and memory requirements, their benefits in terms of feature extraction and unsupervised learning make them an indispensable tool in the NLP toolkit. As research in this area continues to advance, we can expect to see further improvements and innovations in the use of transformers as feature extractors, pushing the boundaries of what is possible in natural language understanding and processing.

# General Steps for Using Transformers

To use transformers on a specific task, we need to follow these steps:

### Step 1: Start with a Pretrained Model

The first step is to select a pretrained model from the [Hugging Face Transformers library](https://huggingface.co/transformers/pretrained_models.html). These pretrained models have been trained on large amounts of text data and have learned general language representations. Using a pretrained model provides a strong foundation for your specific task.

*Note:* Training a model from scratch is an advanced topic and rarely necessary. In most cases, you can warm-start your model from a pretrained model. If you're interested in learning more about training your own model from scratch, refer to [this resource](https://huggingface.co/blog/how-to-train).

### Step 2: Fine-tune the Model on Domain-Specific Text (Optional)

Fine-tuning involves adapting a pretrained model to a new domain by training it on domain-specific text. This step is optional but can enhance the model's performance on your specific task. By exposing the model to text that is similar to your target domain, it can learn domain-specific language patterns and representations.

### Step 3: Train the Model for Your Task

Once you have a fine-tuned model (or a pretrained model if you skipped step 2), you can train it for your specific task. This typically involves adding a classification or regression head on top of the model and training it using your task-specific data.

Alternatively, you can use the model as a feature extractor for your task, which is a more advanced approach.
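
The classification head mentioned in Step 3 can be pictured as a single linear layer followed by a softmax, applied to one vector that summarizes the text. The sketch below uses only the standard library and invented toy numbers (a 4-dimensional vector and 3 labels); the helper names `linear` and `softmax` are illustrative and not part of any library.

```python
import math

def linear(vector, weights, bias):
    # One output per label: dot product of the label's weight row with the vector, plus a bias
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-in for the text representation produced by the transformer (hypothetical values)
text_vector = [0.5, -1.0, 0.25, 2.0]

# Toy head parameters for 3 labels (hypothetical values; in practice these are learned)
weights = [
    [0.1, 0.0, -0.2, 0.3],
    [-0.3, 0.2, 0.1, 0.0],
    [0.0, -0.1, 0.4, 0.1],
]
bias = [0.0, 0.1, -0.1]

probabilities = softmax(linear(text_vector, weights, bias))
predicted_label = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_label, [round(p, 3) for p in probabilities])
```

During task-specific training, the head's weights and (usually) the transformer's weights are updated together so that the highest probability lands on the correct label.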

## Example: Classifying Court Decision Labels

Let's explore these steps using a subset of the [BrCAD-5](https://www.kaggle.com/datasets/eliasjacob/brcad5) dataset, which contains information on more than 765,000 legal cases from Brazilian Federal Courts. Our goal is to train a model to predict the label for a court decision based on its text.

1. **Select a Pretrained Model**: We'll choose a suitable pretrained model from the Hugging Face Transformers library that aligns with our task requirements, such as language support and model architecture.

2. **Fine-tune the Model (Optional)**: If we have a sufficient amount of domain-specific text (legal case information in this case), we can fine-tune the pretrained model on this data to capture domain-specific language patterns.

3. **Train the Model for Label Prediction**: We'll add a classification head on top of the model and train it using the labeled court decision data from BrCAD-5. The model will learn to predict the appropriate label based on the text of the court decision.

<br>

> Remember, the key is to start with a strong pretrained model and adapt it to your specific task through fine-tuning and task-specific training.

## Load pretrained models

In [1]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Define the model checkpoints for the base and large versions of the BERT model
model_checkpoint_base = "neuralmind/bert-base-portuguese-cased"
model_checkpoint_large = "neuralmind/bert-large-portuguese-cased"

# Load the tokenizer for the base BERT model
# The tokenizer is responsible for converting text into tokens that the model can understand
tokenizer_base = AutoTokenizer.from_pretrained(model_checkpoint_base)

# Load the masked language model (MLM) for the base BERT model
# The MLM is used for tasks like predicting masked words in a sentence
model_mlm_base = AutoModelForMaskedLM.from_pretrained(model_checkpoint_base)

# Load the tokenizer for the large BERT model
# This tokenizer works similarly to the base tokenizer but is tailored for the large model
tokenizer_large = AutoTokenizer.from_pretrained(model_checkpoint_large)

# Load the masked language model (MLM) for the large BERT model
# This MLM is used for tasks like predicting masked words in a sentence, similar to the base model but with more parameters
model_mlm_large = AutoModelForMaskedLM.from_pretrained(model_checkpoint_large)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at neuralmind/bert-large-portuguese-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertFor

In [2]:
tokenizer_base.is_fast # A fast tokenizer from HF Transformers uses Rust under the hood for faster tokenization

True

In [3]:
tokenizer_large.is_fast

True

In [4]:
# Define a function to count the number of trainable parameters in a model
def count_parameters(model):
    # Sum the number of elements (numel) for each parameter in the model
    # Only include parameters that require gradients (i.e., are trainable)
    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    # Print the number of trainable parameters in a human-readable format with commas
    print(f"The model has {n_parameters:,} trainable parameters")

# Count and print the number of trainable parameters for the base BERT model
count_parameters(model_mlm_base)

# Count and print the number of trainable parameters for the large BERT model
count_parameters(model_mlm_large)

The model has 108,954,466 trainable parameters
The model has 334,428,258 trainable parameters


In [5]:
model_mlm_base

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

> Above, you can see the model architecture summary of a BERT (Bidirectional Encoder Representations from Transformers) model specifically designed for masked language modeling (MLM) tasks. Let's break it down:
>
> 1. `BertForMaskedLM`: This is the main class that represents the BERT model for masked language modeling.
>
> 2. `BertModel`: This is the fundamental BERT model that consists of the following components:
> - `BertEmbeddings`: This module handles the input embeddings, including word embeddings, position embeddings, and token type embeddings. It also applies layer normalization and dropout.
> - `BertEncoder`: This is the main encoder component of BERT, which consists of a stack of `BertLayer` modules.
> - `BertLayer`: Each layer in the encoder consists of a self-attention mechanism (`BertAttention`), an intermediate feed-forward network (`BertIntermediate`), and an output projection (`BertOutput`).
> - `BertAttention`: This module performs self-attention on the input representations using query, key, and value linear transformations, followed by dropout.
> - `BertIntermediate`: This is a feed-forward network with a GELU activation function.
> - `BertOutput`: This module applies a dense linear transformation, layer normalization, and dropout to the output of the intermediate layer.
>
> 3. `BertOnlyMLMHead`: This module is specific to the masked language modeling task and consists of the following components:
> - `BertLMPredictionHead`: This module performs the final prediction for the masked tokens.
> - `BertPredictionHeadTransform`: This module applies a dense linear transformation, GELU activation, and layer normalization to the output of the BERT encoder.
> - `decoder`: This is a linear layer that maps the transformed representations to the vocabulary size for predicting the masked tokens.
>
> The model architecture summary provides details about the dimensions of the embeddings, the number of layers in the encoder, and the sizes of the intermediate and output layers.
>

In [6]:
model_mlm_large

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-

> The main differences between the two BERT models are in the model size and architecture:
>
> 1. Embedding dimensions:
> - In the first model, the word embeddings, position embeddings, and token type embeddings have a dimension of 768.
> - In the second model, these embeddings have a dimension of 1024, indicating a larger embedding size.
>
> 2. Number of encoder layers:
> - The first model has 12 encoder layers (`(0-11): 12 x BertLayer`).
> - The second model has 24 encoder layers (`(0-23): 24 x BertLayer`)
>
> 3. Intermediate layer dimensions:
> - In the first model, the intermediate layer (`BertIntermediate`) has an output dimension of 3072.
> - In the second model, the intermediate layer has an output dimension of 4096, which is larger than the first model.
>
> 4. Hidden state dimensions:
> - The first model uses hidden states with a dimension of 768 throughout the architecture, including the self-attention layers, intermediate layers, and output layers.
> - The second model uses hidden states with a dimension of 1024 throughout the architecture.
>
> The rest of the architecture, including the self-attention mechanism, layer normalization, dropout, and the MLM head, remains the same between the two models.
>
> The large model has higher-dimensional embeddings, more encoder layers, and larger intermediate layer dimensions. This suggests that the large model has a higher capacity and can potentially capture more complex patterns and representations from the input data. However, the larger model size also means increased computational requirements and longer training times.
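
As a sanity check on the comparison above, the parameter counts printed by `count_parameters` can be reproduced by hand from the layer shapes. The sketch below makes two assumptions that are not spelled out in the text: the pooler is absent from `BertForMaskedLM` (consistent with the warning about unused `bert.pooler` weights), and the MLM `decoder` weight is shared with the word embeddings, so only its bias adds parameters. The function name is illustrative.

```python
def bert_mlm_parameter_count(vocab_size, hidden, intermediate, n_layers,
                             max_positions=512, type_vocab=2):
    layer_norm = 2 * hidden  # one weight and one bias per hidden unit

    # Word, position and token type embeddings, followed by a LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and attention output projections (weights + biases), then a LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Intermediate (expansion) and output (projection) dense layers, then a LayerNorm
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden) + layer_norm

    encoder = n_layers * (attention + feed_forward)

    # MLM head: dense transform + LayerNorm + decoder bias (decoder weight assumed tied)
    mlm_head = (hidden * hidden + hidden) + layer_norm + vocab_size

    return embeddings + encoder + mlm_head

base_total = bert_mlm_parameter_count(29794, 768, 3072, 12)
large_total = bert_mlm_parameter_count(29794, 1024, 4096, 24)
print(f"base:  {base_total:,}")   # base:  108,954,466
print(f"large: {large_total:,}")  # large: 334,428,258
```

Under these assumptions the totals match the values reported earlier. The encoder stack dominates in both cases, which is why doubling the number of layers and widening the hidden size roughly triples the model.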

## Load the dataset and create a train/validation split

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the unlabeled dataset from a Parquet file
# Only the 'text' column is read from the file
df_unlabeled = pd.read_parquet('data/legal/unlabeled_texts.parquet', columns=['text'])

# Split the unlabeled dataset into training and validation sets
# 10% of the data is used for validation, and the split is reproducible with a fixed random state
df_unlabeled_train, df_unlabeled_valid = train_test_split(df_unlabeled, test_size=0.10, random_state=271828)

# Display the shapes of the training and validation sets
# This shows the number of rows and columns in each set
df_unlabeled_train.shape, df_unlabeled_valid.shape

((58529, 1), (6504, 1))

In [8]:
import datasets

# Convert the pandas DataFrame containing the unlabeled training data into a Hugging Face Dataset
# This allows for easier manipulation and integration with Hugging Face's tools and models
dataset_unlabeled_train = datasets.Dataset.from_pandas(df_unlabeled_train)

# Convert the pandas DataFrame containing the unlabeled validation data into a Hugging Face Dataset
# This allows for easier manipulation and integration with Hugging Face's tools and models
dataset_unlabeled_valid = datasets.Dataset.from_pandas(df_unlabeled_valid)

In [9]:
dataset_unlabeled_train

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 58529
})

In [10]:
dataset_unlabeled_valid

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 6504
})

In [11]:
from pathlib import Path

# Define the path to save the outputs of the base BERT masked language model
path_to_save_lm_base = Path('./outputs/transformers_basics/bert_masked_lm_base')
# Create the directory (and any necessary parent directories) if it doesn't already exist
path_to_save_lm_base.mkdir(parents=True, exist_ok=True)

# Define the path to save the outputs of the large BERT masked language model
path_to_save_lm_large = Path('./outputs/transformers_basics/bert_masked_lm_large')
# Create the directory (and any necessary parent directories) if it doesn't already exist
path_to_save_lm_large.mkdir(parents=True, exist_ok=True)

## Fine-tune the Language Model on the domain text

Remember our transfer learning class. During this stage, the general-domain language model adapts itself to the idiosyncrasies of the domain-specific text. This is done by training the model on the domain-specific text. This step is optional, but it can improve the performance of the model on your task.

In [12]:
from functools import partial
from multiprocessing import cpu_count

def tokenize_function(examples, tokenizer):
    """
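    Tokenizes the input text in the given examples using the tokenizer object.

    Args:
    - examples: A dictionary containing the input text to be tokenized under the "text" key.
    - tokenizer: The Hugging Face tokenizer used to tokenize the text.

    Returns:
    - The tokenized batch, plus a "word_ids" entry when the tokenizer is a fast tokenizer.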
    """
    result = tokenizer(examples["text"])  # Tokenize the input text
    if tokenizer.is_fast:
        # If the tokenizer is a fast tokenizer, add word IDs to the result
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

# Create partial functions for tokenizing using the base and large tokenizers
# This allows us to pass the tokenizer as a fixed argument to the tokenize_function
tokenize_function_base = partial(tokenize_function, tokenizer=tokenizer_base)
tokenize_function_large = partial(tokenize_function, tokenizer=tokenizer_large)

# Tokenize the training dataset using the base tokenizer
# The map function applies the tokenize_function_base to each example in the dataset
# The batched=True argument processes the examples in batches for efficiency
# The remove_columns argument removes the specified columns from the dataset after tokenization
dataset_train_tokenized_mlm_base = dataset_unlabeled_train.map(
    tokenize_function_base, batched=True, remove_columns=["text", '__index_level_0__']
)

# Tokenize the validation dataset using the base tokenizer
dataset_valid_tokenized_mlm_base = dataset_unlabeled_valid.map(
    tokenize_function_base, batched=True, remove_columns=["text", '__index_level_0__']
)

# Tokenize the training dataset using the large tokenizer
dataset_train_tokenized_mlm_large = dataset_unlabeled_train.map(
    tokenize_function_large, batched=True, remove_columns=["text", '__index_level_0__']
)

# Tokenize the validation dataset using the large tokenizer
dataset_valid_tokenized_mlm_large = dataset_unlabeled_valid.map(
    tokenize_function_large, batched=True, remove_columns=["text", '__index_level_0__']
)

Map:   0%|          | 0/58529 [00:00<?, ? examples/s]

Map:   0%|          | 0/6504 [00:00<?, ? examples/s]

Map:   0%|          | 0/58529 [00:00<?, ? examples/s]

Map:   0%|          | 0/6504 [00:00<?, ? examples/s]

In [13]:
import numpy as np

def group_texts(examples):
    """
    This function groups together a set of texts as contiguous text of fixed length (chunk_size). 
    It's useful for training masked language models.

    Args:
    - examples: A dictionary containing the examples to group. Each key corresponds to a feature, 
                and each value is a list of lists of tokens.

    Returns:
    - A dictionary containing the grouped examples. Each key corresponds to a feature, 
      and each value is a list of lists of tokens.
    """
    # Concatenate all texts for each feature
    concatenated_examples = {k: np.concatenate(examples[k]) for k in examples.keys()}
    
    # Compute the total length of the concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    
    # Adjust the total length to be a multiple of chunk_size, dropping the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    
    # Split the concatenated texts into chunks of size chunk_size using NumPy
    result = {
        k: np.split(t[:total_length], total_length // chunk_size)
        for k, t in concatenated_examples.items()
    }
    
    # Create a new 'labels' column that is a copy of the 'input_ids' column
    result["labels"] = result["input_ids"].copy()
    
    return result

# Define the chunk size for grouping texts
chunk_size = 512

# Apply the group_texts function to the tokenized training dataset for the base BERT model
dataset_train_tokenized_mlm_base = dataset_train_tokenized_mlm_base.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

# Apply the group_texts function to the tokenized validation dataset for the base BERT model
dataset_valid_tokenized_mlm_base = dataset_valid_tokenized_mlm_base.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

# Apply the group_texts function to the tokenized training dataset for the large BERT model
dataset_train_tokenized_mlm_large = dataset_train_tokenized_mlm_large.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

# Apply the group_texts function to the tokenized validation dataset for the large BERT model
dataset_valid_tokenized_mlm_large = dataset_valid_tokenized_mlm_large.map(
    group_texts,
    batched=True,  # Process the examples in batches for efficiency
)

Map:   0%|          | 0/58529 [00:00<?, ? examples/s]

Map:   0%|          | 0/6504 [00:00<?, ? examples/s]

Map:   0%|          | 0/58529 [00:00<?, ? examples/s]

Map:   0%|          | 0/6504 [00:00<?, ? examples/s]
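
To make the effect of `group_texts` concrete, here is a minimal pure-Python sketch of the same concatenate-then-chunk idea on toy data. The token ids and the tiny `chunk_size=4` are invented for illustration, and `chunk_token_ids` is a hypothetical helper, not the function used above.

```python
def chunk_token_ids(sequences, chunk_size):
    # Concatenate all tokenized documents into one long list
    flat = [token_id for sequence in sequences for token_id in sequence]

    # Keep only as many tokens as fit into whole chunks
    usable_length = (len(flat) // chunk_size) * chunk_size

    chunks = [flat[i:i + chunk_size] for i in range(0, usable_length, chunk_size)]
    leftover = flat[usable_length:]  # tokens that would be dropped
    return chunks, leftover

# Three toy "documents" of different lengths (hypothetical token ids)
documents = [
    [101, 7, 8, 102],
    [101, 9, 102],
    [101, 10, 11, 12, 13, 102],
]

chunks, leftover = chunk_token_ids(documents, chunk_size=4)
print(chunks)    # [[101, 7, 8, 102], [101, 9, 102, 101], [10, 11, 12, 13]]
print(leftover)  # [102]
```

Note that chunk boundaries ignore document boundaries: the second chunk contains the end of one document and the start of the next. This is acceptable for masked language modeling, where the goal is to expose the model to as much domain text as possible without padding.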

In [14]:
from transformers import DataCollatorForLanguageModeling

# Create a data collator for masked language modeling (MLM) using the base BERT tokenizer
# The data collator will dynamically mask tokens in the input text with a probability of 0.15
data_collator_mlm_base = DataCollatorForLanguageModeling(tokenizer=tokenizer_base, mlm_probability=0.15)

# Create a data collator for masked language modeling (MLM) using the large BERT tokenizer
# The data collator will dynamically mask tokens in the input text with a probability of 0.15
data_collator_mlm_large = DataCollatorForLanguageModeling(tokenizer=tokenizer_large, mlm_probability=0.15)
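
To illustrate what "dynamic masking" means, here is a simplified standard-library sketch of the usual BERT masking recipe: each non-special token is selected with probability 0.15; a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. Positions that were not selected get the label `-100`, which the loss ignores. The token ids below are hypothetical toy values and `mask_tokens` is an illustrative helper, not the collator's actual implementation.

```python
import random

IGNORE_INDEX = -100  # label value that the loss function skips

def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    rng = rng or random.Random()
    masked_ids = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if token_id in special_ids:
            continue  # special tokens are never selected
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = token_id  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked_ids[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            masked_ids[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked_ids, labels

# Hypothetical toy ids for the special tokens; real ids depend on the tokenizer
CLS_ID, SEP_ID, MASK_ID = 1, 2, 3
example = [CLS_ID] + list(range(100, 140)) + [SEP_ID]

masked_ids, labels = mask_tokens(
    example, MASK_ID, vocab_size=29794, special_ids={CLS_ID, SEP_ID},
    rng=random.Random(271828),
)
n_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{n_selected} of {len(example)} tokens selected for prediction")
```

Because the masking is redone every time a batch is built, the model sees different masked positions for the same text in each epoch.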

In [15]:
from transformers import TrainingArguments

# Define the batch size for training and evaluation using the base BERT model
batch_size_base = 20

# Extract the model name from the model checkpoint path for the base BERT model
model_name_base = model_checkpoint_base.split("/")[-1]

# Set up the training arguments for fine-tuning the base BERT model on a masked language modeling task
training_args_mlm_base = TrainingArguments(
    output_dir=path_to_save_lm_base / f"{model_name_base}-finetuned-mlm",  # Directory to save the model checkpoints
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
    learning_rate=5e-5,  # Learning rate for the optimizer
    weight_decay=0.01,  # Weight decay for regularization
    per_device_train_batch_size=batch_size_base,  # Batch size for training
    per_device_eval_batch_size=batch_size_base,  # Batch size for evaluation
    bf16=True,  # Use bfloat16 precision (set fp16=True instead if your GPU lacks bf16 support, e.g. a free-tier GPU)
    num_train_epochs=3,  # Number of training epochs
    save_total_limit=1,  # Limit the total number of saved checkpoints
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
    logging_steps=1,  # Log the training loss at every optimizer step (logging_strategy defaults to "steps")
    eval_steps=1,  # Has no effect here: eval_strategy="epoch" evaluates once per epoch
    save_steps=1,  # Has no effect here: save_strategy="epoch" saves once per epoch
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model="eval_loss",  # Metric to use for selecting the best model
    greater_is_better=False,  # Lower evaluation loss is better
    gradient_accumulation_steps=3,  # Number of gradient accumulation steps
    seed=271828,  # Random seed for reproducibility
)

In [17]:
from transformers import Trainer

# Initialize the Trainer for the base BERT model
# The Trainer class provides an easy-to-use API for training and evaluating models
trainer_mlm_base = Trainer(
    model=model_mlm_base,  # The model to be trained (base BERT masked language model)
    args=training_args_mlm_base,  # Training arguments defined earlier
    train_dataset=dataset_train_tokenized_mlm_base,  # Tokenized training dataset
    eval_dataset=dataset_valid_tokenized_mlm_base,  # Tokenized validation dataset
    data_collator=data_collator_mlm_base,  # Data collator for dynamically masking tokens
    tokenizer=tokenizer_base,  # Tokenizer for processing the input text
)

In [18]:
# This took around 5 hours to train on 2 x NVIDIA RTX 3090 GPUs
trainer_mlm_base.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: eliasjacob. Use `wandb login --relogin` to force relogin




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0 | 0.5797 | 0.473443 |
| 1 | 0.4326 | 0.408905 |
| 2 | 0.3782 | 0.3882 |


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


TrainOutput(global_step=6786, training_loss=0.5313287478850508, metrics={'train_runtime': 17450.1629, 'train_samples_per_second': 46.679, 'train_steps_per_second': 0.389, 'total_flos': 2.143305955958661e+17, 'train_loss': 0.5313287478850508, 'epoch': 2.9991160872127285})

In [19]:
# Save the trained model
trainer_mlm_base.save_model(path_to_save_lm_base / f"{model_name_base}-finetuned-mlm")
tokenizer_base.save_pretrained(path_to_save_lm_base / f"{model_name_base}-finetuned-mlm")

trainer_mlm_base.evaluate()



{'eval_loss': 0.38791826367378235,
 'eval_runtime': 293.0877,
 'eval_samples_per_second': 101.755,
 'eval_steps_per_second': 2.545,
 'epoch': 2.9991160872127285}

In [20]:
print(path_to_save_lm_base / f"{model_name_base}-finetuned-mlm")

outputs/transformers_basics/bert_masked_lm_base/bert-base-portuguese-cased-finetuned-mlm


In [16]:
import gc
import torch

# Set the trainer, model, and tokenizer for the base BERT model to None
# This helps free up memory by removing references to these objects
trainer_mlm_base = None
model_mlm_base = None
tokenizer_base = None

# Force garbage collection to free up memory
gc.collect()

# Clear the CUDA memory cache to free up GPU memory
torch.cuda.empty_cache()

In [17]:
from transformers import TrainingArguments

# Define the batch size for training and evaluation
batch_size_large = 14

# Extract the model name from the model checkpoint path
# This will be used to name the output directory for the trained model
model_name_large = model_checkpoint_large.split("/")[-1]

# Define the training arguments for the large masked language model (MLM)
training_args_mlm_large = TrainingArguments(
    output_dir=path_to_save_lm_large / f"{model_name_large}-finetuned-mlm",  # Output directory for the trained model
    overwrite_output_dir=True,  # Overwrite the output directory if it already exists
    learning_rate=5e-5,  # Learning rate for the optimizer
    weight_decay=0.01,  # Weight decay for regularization
    per_device_train_batch_size=batch_size_large,  # Batch size for training
    per_device_eval_batch_size=batch_size_large,  # Batch size for evaluation
    bf16=True,  # Use bfloat16 precision (set fp16=True instead if your GPU lacks bf16 support, e.g. a free-tier GPU)
    num_train_epochs=3,  # Number of training epochs
    save_total_limit=1,  # Limit the total amount of checkpoints and delete the older ones
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
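    logging_steps=1,  # Log the training loss at every optimizer step (logging_strategy defaults to "steps")
    eval_steps=1,  # Has no effect here: eval_strategy="epoch" evaluates once per epoch
    save_steps=1,  # Has no effect here: save_strategy="epoch" saves once per epoch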
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model="eval_loss",  # Use the evaluation loss to determine the best model
    greater_is_better=False,  # Lower evaluation loss is better
    gradient_accumulation_steps=4,  # Number of steps to accumulate gradients before updating the model parameters
    seed=271828,  # Random seed for reproducibility
)

In [18]:
from transformers import Trainer

# Initialize the Trainer for the large masked language model (MLM)
trainer_mlm_large = Trainer(
    model=model_mlm_large,  # The pre-trained large BERT model for masked language modeling
    args=training_args_mlm_large,  # The training arguments defined earlier for the large model
    train_dataset=dataset_train_tokenized_mlm_large,  # The tokenized training dataset for the large model
    eval_dataset=dataset_valid_tokenized_mlm_large,  # The tokenized validation dataset for the large model
    data_collator=data_collator_mlm_large,  # The data collator for dynamic masking during training
    tokenizer=tokenizer_large,  # The tokenizer used to process the input text for the large model
)

In [19]:
# Train the large masked language model (MLM)
# This process involves multiple epochs of training on the training dataset
# Note: This training process took almost 14 hours on 2 x NVIDIA RTX 3090 GPUs
trainer_mlm_large.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: eliasjacob. Use `wandb login --relogin` to force relogin




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0 | 0.3995 | 0.381371 |
| 2 | 0.3323 | 0.309468 |


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


TrainOutput(global_step=7272, training_loss=0.4201359052510217, metrics={'train_runtime': 48363.8901, 'train_samples_per_second': 16.842, 'train_steps_per_second': 0.15, 'total_flos': 7.590524853366497e+17, 'train_loss': 0.4201359052510217, 'epoch': 2.999381315735203})
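
The `global_step` values reported for the two runs can be related to the batch settings with a little arithmetic. The sketch below assumes that both GPUs were used in a data-parallel fashion, so each optimizer step consumes `per_device_train_batch_size × gradient_accumulation_steps × number of GPUs` sequences; that assumption and the helper name are not stated in the notebook.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, n_devices):
    # Number of sequences that contribute to a single optimizer step
    return per_device_batch_size * gradient_accumulation_steps * n_devices

n_devices = 2  # assumption: both RTX 3090 GPUs are used for data parallelism

base_effective = effective_batch_size(20, 3, n_devices)
large_effective = effective_batch_size(14, 4, n_devices)

# global_step values taken from the two TrainOutput results, each covering 3 epochs
base_steps_per_epoch = 6786 // 3
large_steps_per_epoch = 7272 // 3

# Approximate number of 512-token sequences seen per epoch
base_sequences = base_steps_per_epoch * base_effective
large_sequences = large_steps_per_epoch * large_effective

print(base_effective, base_steps_per_epoch, base_sequences)     # 120 2262 271440
print(large_effective, large_steps_per_epoch, large_sequences)  # 112 2424 271488
```

Under this assumption both runs imply roughly the same number of training sequences per epoch, which is what we would expect since they were trained on the same chunked text.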

In [20]:
# Save the trained large masked language model (MLM) to the specified directory
trainer_mlm_large.save_model(path_to_save_lm_large / f"{model_name_large}-finetuned-mlm")

# Save the tokenizer used for the large MLM to the same directory
tokenizer_large.save_pretrained(path_to_save_lm_large / f"{model_name_large}-finetuned-mlm")

# Evaluate the trained large MLM on the validation dataset
# This will return a dictionary containing the evaluation metrics
trainer_mlm_large.evaluate()



{'eval_loss': 0.30810099840164185,
 'eval_runtime': 709.2879,
 'eval_samples_per_second': 42.046,
 'eval_steps_per_second': 1.503,
 'epoch': 2.999381315735203}

In [21]:
print(path_to_save_lm_large / f"{model_name_large}-finetuned-mlm")

outputs/transformers_basics/bert_masked_lm_large/bert-large-portuguese-cased-finetuned-mlm


## Assessing a Language Model

To ensure that a language model is effective and reliable, we need to assess its performance. This is usually done by evaluating how well the model can predict a word in a sentence. The primary metric used for this purpose is known as 'Perplexity'.

### Understanding Perplexity

Perplexity is a quantitative measure of how well a probability model predicts a sample. In the context of language models, it gauges how surprised or 'thrown-off' the model is upon encountering new data. Essentially, it is a measure of "surprise".

A lower perplexity indicates that the model was less surprised by the new data, signifying that it was better trained and has a good understanding of the language patterns in the provided data. Therefore, a lower perplexity value is indicative of better training.

### Calculating Perplexity

Perplexity is defined as the exponential of the entropy. Entropy is a measure of the uncertainty associated with a random variable. Since the loss function of the language model is the cross-entropy loss, which is computed with natural logarithms, we can calculate the perplexity directly from the loss value. The formula for perplexity is:

$$\text{Perplexity} = e^{\text{loss}}$$

Where:
- $e$ is the base of the natural logarithm (Euler's number, approximately 2.71828)
- $\text{loss}$ is the cross-entropy loss
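
To make the link between the loss and perplexity explicit, here is a short derivation. It assumes the loss is the mean negative natural-log probability over the $N$ predicted tokens (for masked language modeling, the masked positions); the symbols $N$, $x_i$ and $\text{context}_i$ are notation introduced here for illustration.

$$\text{loss} = -\frac{1}{N}\sum_{i=1}^{N} \ln p(x_i \mid \text{context}_i)$$

$$\text{Perplexity} = e^{\text{loss}} = \left(\prod_{i=1}^{N} p(x_i \mid \text{context}_i)\right)^{-1/N}$$

In other words, perplexity is the inverse of the geometric mean of the probabilities the model assigns to the correct tokens. For example, a model that assigns probability $0.5$ to every correct token has a loss of $\ln 2 \approx 0.693$ and a perplexity of $2$: it is as uncertain as if it were choosing uniformly between two options.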

### Choice of Logarithm Base

The choice of base for the logarithm in calculating perplexity or entropy often depends on the context or the historical convention of the field.

- In information theory, the base of the logarithm is typically 2, resulting in units of bits (binary digits). This is because information was originally conceptualized in terms of binary decisions (yes/no, true/false, 0/1), which makes a base-2 logarithm intuitive: in a message space of $2^n$ equally likely messages, each message carries $n$ bits of information.

- In machine learning, $e$ is the conventional base. Much of the underlying mathematics (calculus and optimization in particular) is simpler with natural logarithms, so deep learning libraries typically compute the cross-entropy loss with the natural logarithm. Exponentiating that loss with base $e$ therefore undoes the logarithm and yields the perplexity directly.

> It's important to note that the base of the logarithm doesn't change the fundamental interpretation of entropy or perplexity. For entropy, the base is merely a scaling factor: base-2 logarithms give a measure in bits, while natural logarithms give a measure in nats (natural units of information). Perplexity itself comes out the same in either base, as long as the exponentiation uses the same base as the logarithm.
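
The relationship between nats, bits, and perplexity can be checked numerically. The short sketch below converts the two evaluation losses reported earlier from nats to bits and confirms that the perplexity is the same whichever base is used, provided the exponentiation matches the logarithm. The helper `nats_to_bits` is an illustrative name.

```python
import math

def nats_to_bits(loss_nats):
    # Changing the logarithm base only rescales the value: log2(x) = ln(x) / ln(2)
    return loss_nats / math.log(2)

# Evaluation losses (in nats) reported by trainer.evaluate() for the two models
eval_losses = {"base": 0.38791826367378235, "large": 0.30810099840164185}

for name, loss_nats in eval_losses.items():
    loss_bits = nats_to_bits(loss_nats)
    perplexity_from_nats = math.exp(loss_nats)
    perplexity_from_bits = 2 ** loss_bits
    print(f"{name}: {loss_nats:.4f} nats = {loss_bits:.4f} bits, "
          f"perplexity {perplexity_from_nats:.4f} (base e) vs {perplexity_from_bits:.4f} (base 2)")
```

The two perplexity columns agree up to floating-point rounding, which is why the choice of base is a matter of convention rather than substance.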

In [22]:
import gc
import torch

# Set the trainer for the large masked language model (MLM) to None to free up memory
trainer_mlm_large = None

# Set the large masked language model (MLM) to None to free up memory
model_mlm_large = None

# Set the tokenizer for the large MLM to None to free up memory
tokenizer_large = None

# Collect garbage to free up memory
gc.collect()

# Empty the CUDA cache to free up GPU memory
torch.cuda.empty_cache()

In [23]:
import math

print(f'The perplexity for the base model is {math.exp(0.38791826367378235)}')
print(f'The perplexity for the large model is {math.exp(0.30810099840164185)}')

The perplexity for the base model is 1.4739093074325733
The perplexity for the large model is 1.3608384245024145


In [24]:
import os
from pathlib import Path
from transformers import pipeline, AutoModelForMaskedLM, AutoTokenizer

# Path to the directory where the fine-tuned base masked language model (MLM) was saved
path_to_save_lm_base = Path('./outputs/transformers_basics/bert_masked_lm_base')

# Path to the directory where the fine-tuned large masked language model (MLM) was saved
path_to_save_lm_large = Path('./outputs/transformers_basics/bert_masked_lm_large')

In [25]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the fine-tuned base masked language model (MLM) from the specified directory
# This model is a BERT base model fine-tuned on a Portuguese dataset
model_base = AutoModelForMaskedLM.from_pretrained(path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm")

# Load the tokenizer for the fine-tuned base MLM from the same directory
# The tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm")

# Load the fine-tuned large masked language model (MLM) from the specified directory
# This model is a BERT large model fine-tuned on a Portuguese dataset
model_large = AutoModelForMaskedLM.from_pretrained(path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm")

# Load the tokenizer for the fine-tuned large MLM from the same directory
# The tokenizer is used to preprocess the input text for the large model
tokenizer_large = AutoTokenizer.from_pretrained(path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm")

In [26]:
from transformers import pipeline

# Create a pipeline for the base masked language model (MLM)
# The pipeline is used to fill in the masked tokens in the input text
# 'fill-mask' specifies the task type for the pipeline
# model_base is the fine-tuned base MLM model
# tokenizer_base is the tokenizer for the base MLM model
# top_k=5 specifies that the top 5 predictions for the masked token will be returned
pipe_base = pipeline(
    'fill-mask',
    model=model_base,
    tokenizer=tokenizer_base,
    top_k=5
)

# Create a pipeline for the large masked language model (MLM)
pipe_large = pipeline(
    'fill-mask',
    model=model_large,
    tokenizer=tokenizer_large,
    top_k=5
)

In [27]:
pipe_base("O artigo 121 do Código Penal prevê o crime de [MASK]")

[{'score': 0.9281540513038635,
  'token': 131,
  'token_str': ':',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de :'},
 {'score': 0.012005731463432312,
  'token': 21982,
  'token_str': 'homicídio',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de homicídio'},
 {'score': 0.0050378949381411076,
  'token': 18144,
  'token_str': 'roubo',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de roubo'},
 {'score': 0.0032502533867955208,
  'token': 1112,
  'token_str': '“',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de “'},
 {'score': 0.0027919497806578875,
  'token': 184,
  'token_str': 're',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de re'}]

In [28]:
pipe_large("O artigo 121 do Código Penal prevê o crime de [MASK]")

[{'score': 0.9837086796760559,
  'token': 131,
  'token_str': ':',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de :'},
 {'score': 0.006448815111070871,
  'token': 21982,
  'token_str': 'homicídio',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de homicídio'},
 {'score': 0.0011397271882742643,
  'token': 119,
  'token_str': '.',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de.'},
 {'score': 0.0007863400387577713,
  'token': 1386,
  'token_str': 'morte',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de morte'},
 {'score': 0.0007723842863924801,
  'token': 9566,
  'token_str': 'corrupção',
  'sequence': 'O artigo 121 do Código Penal prevê o crime de corrupção'}]

In [29]:
pipe_base("O Código de Processo Civil prevê prazo em [MASK] para interposição de recurso pela Fazenda Pública")

[{'score': 0.3354406952857971,
  'token': 17225,
  'token_str': 'julgado',
  'sequence': 'O Código de Processo Civil prevê prazo em julgado para interposição de recurso pela Fazenda Pública'},
 {'score': 0.2519117295742035,
  'token': 5370,
  'token_str': 'aberto',
  'sequence': 'O Código de Processo Civil prevê prazo em aberto para interposição de recurso pela Fazenda Pública'},
 {'score': 0.2214008867740631,
  'token': 21244,
  'token_str': 'dobro',
  'sequence': 'O Código de Processo Civil prevê prazo em dobro para interposição de recurso pela Fazenda Pública'},
 {'score': 0.06615797430276871,
  'token': 3418,
  'token_str': 'curso',
  'sequence': 'O Código de Processo Civil prevê prazo em curso para interposição de recurso pela Fazenda Pública'},
 {'score': 0.03829769790172577,
  'token': 4712,
  'token_str': 'branco',
  'sequence': 'O Código de Processo Civil prevê prazo em branco para interposição de recurso pela Fazenda Pública'}]

In [30]:
pipe_large("O Código de Processo Civil prevê prazo em [MASK] para interposição de recurso pela Fazenda Pública")

[{'score': 0.5983186960220337,
  'token': 2241,
  'token_str': 'lei',
  'sequence': 'O Código de Processo Civil prevê prazo em lei para interposição de recurso pela Fazenda Pública'},
 {'score': 0.31673961877822876,
  'token': 21244,
  'token_str': 'dobro',
  'sequence': 'O Código de Processo Civil prevê prazo em dobro para interposição de recurso pela Fazenda Pública'},
 {'score': 0.015298635698854923,
  'token': 2502,
  'token_str': 'Lei',
  'sequence': 'O Código de Processo Civil prevê prazo em Lei para interposição de recurso pela Fazenda Pública'},
 {'score': 0.01301574520766735,
  'token': 20554,
  'token_str': 'razoável',
  'sequence': 'O Código de Processo Civil prevê prazo em razoável para interposição de recurso pela Fazenda Pública'},
 {'score': 0.009656175971031189,
  'token': 5370,
  'token_str': 'aberto',
  'sequence': 'O Código de Processo Civil prevê prazo em aberto para interposição de recurso pela Fazenda Pública'}]

## Train our Document Classifier Using Our Fine-Tuned Language Model

### Understanding the Language Model Output Structure

Before diving into document classification, it helps to understand the structure of the language model's output. For each input sequence, the model returns a matrix with dimensions `max_tokens` x `embedding_dimension`, that is, one vector per token. Taking BERT-base as an example, the embedding dimension is 768, so each token in the input text has a corresponding vector of size 768.

In practice, feeding the entire matrix to our classifier is rarely feasible: a 512-token input already yields 512 x 768 = 393,216 values. Instead, we use only the vector corresponding to the `[CLS]` token.

### The Significance of the `[CLS]` Token

The `[CLS]` token is a special token that the tokenizer places before the input text. In BERT models, its output vector is used as a summary representation of the entire input. This vector has size 768, which is far more manageable than the full matrix. `[CLS]` stands for `CL`a`S`sification and is specifically designed for classification tasks.

Here's an example to illustrate the usage of the `[CLS]` token:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
outputs = tokenizer('Eu gosto muito de farofa')
tokenizer.decode(outputs['input_ids'])
```

Resulting output: `'[CLS] Eu gosto muito de farofa [SEP]'`

In the output above, the tokenizer has added the `[CLS]` token to the start of the input text and the `[SEP]` token to the end. For classification purposes, we only need the `[CLS]` token and can ignore `[SEP]`. In BERT, `[SEP]` marks the boundary between two sentences; since our input contains a single sentence, it plays no role here, even though the tokenizer still adds it.

### Implementing Classification Using the `[CLS]` Token

Once we have the vector for the `[CLS]` token, we can use it as input for our classifier. The classifier's output is a vector of size `num_labels`, where `num_labels` is the number of labels in our dataset. For example, with 4 labels the classifier outputs a vector of size 4.

This output vector is used to compute the model's loss and update its weights during training. By comparing the predicted label probabilities with the actual labels, we can measure the model's performance and adjust the weights to improve its accuracy.

### Putting It All Together

To summarize, the process of document classification using a fine-tuned language model involves the following steps:

1. Tokenize the input text; the tokenizer adds the `[CLS]` token at the beginning.
2. Pass the tokenized input through the language model to obtain one output vector per token.
3. Extract the vector corresponding to the `[CLS]` token.
4. Use the `[CLS]` token vector as input for the classifier.
5. Obtain the classifier's output vector: one raw score (logit) per label, which a softmax turns into predicted label probabilities.
6. Calculate the loss by comparing the predicted probabilities with the actual labels.
7. Update the model's weights based on the calculated loss to improve its performance.
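
Steps 4 to 6 above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Hugging Face implementation: the `[CLS]` vector and the weights are random stand-ins, and `cls_vector`, `weights`, and `biases` are hypothetical names. The real `BertForSequenceClassification` head also applies a pooling layer and dropout before the final linear layer.

```python
import math
import random

random.seed(0)

hidden_size = 768   # BERT-base embedding dimension
num_labels = 4      # e.g. the four outcome labels in our dataset

# Stand-in for the [CLS] vector produced by the language model (random values, not real output)
cls_vector = [random.gauss(0.0, 1.0) for _ in range(hidden_size)]

# Hypothetical classification head: one weight row and one bias per label
weights = [[random.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
biases = [0.0] * num_labels

# Step 5: logits = W @ cls_vector + b, one raw score per label
logits = [sum(w * x for w, x in zip(row, cls_vector)) + b for row, b in zip(weights, biases)]

# Softmax turns logits into probabilities (subtracting the max keeps exp() numerically stable)
max_logit = max(logits)
exps = [math.exp(z - max_logit) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Step 6: cross-entropy loss for a single example whose true label id is 1
true_label = 1
loss = -math.log(probs[true_label])

print(len(logits), round(sum(probs), 6), loss)
```

During training, step 7 backpropagates this loss through both the classification head and the language model, so the `[CLS]` representation itself is adapted to the task.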

In [31]:
import os
from pathlib import Path
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

In [32]:
import pandas as pd
from pathlib import Path

# Load the training dataset from a Parquet file
# Only the 'text' and 'label' columns are read from the file
df_train = pd.read_parquet('data/legal/train.parquet', columns=['text', 'label'])

# Load the validation dataset from a Parquet file
# Only the 'text' and 'label' columns are read from the file
df_valid = pd.read_parquet('data/legal/valid.parquet', columns=['text', 'label'])

# Define the path to save the base masked language model (MLM)
# This path points to the directory where the base MLM model will be saved
path_to_save_lm_base = Path('./outputs/transformers_basics/bert_masked_lm_base')

# Define the path to save the large masked language model (MLM)
# This path points to the directory where the large MLM model will be saved
path_to_save_lm_large = Path('./outputs/transformers_basics/bert_masked_lm_large')

# Display the shapes of the training and validation datasets
# This shows the number of rows and columns in each dataset
df_train.shape, df_valid.shape

((52026, 2), (13007, 2))

In [33]:
# Create a dictionary to map each unique label in the training dataset to a unique ID
# df_train.label.unique() returns the unique labels in order of first appearance
# enumerate() pairs each label with its position, which becomes the label's ID
label2id = {label: i for i, label in enumerate(df_train.label.unique())}

# Create a dictionary to map each unique ID back to its corresponding label
# This is the reverse mapping of the label2id dictionary
# The dictionary comprehension iterates over the items in label2id and swaps the keys and values
id2label = {v: k for k, v in label2id.items()}

# Display the label-to-ID and ID-to-label mappings
label2id, id2label

({'IMPROCEDENTE': 0,
  'PROCEDENTE': 1,
  'PARCIALMENTE PROCEDENTE': 2,
  'EXTINTO SEM MÉRITO': 3},
 {0: 'IMPROCEDENTE',
  1: 'PROCEDENTE',
  2: 'PARCIALMENTE PROCEDENTE',
  3: 'EXTINTO SEM MÉRITO'})

In [34]:
# Map the labels in the training dataset to their corresponding IDs
# This replaces the label names with their respective IDs using the label2id dictionary
df_train['label'] = df_train['label'].map(label2id)

# Map the labels in the validation dataset to their corresponding IDs
# This replaces the label names with their respective IDs using the label2id dictionary
df_valid['label'] = df_valid['label'].map(label2id)

# Display the first few rows of the training dataset
# This shows the updated training dataset with labels replaced by their corresponding IDs
df_train.head()

| | text | label |
|---|---|---|
| 1387 | SENTENÇA Vistos etc. Dispensado o relatório, a... | 0 |
| 17972 | SENTENÇA Relatório dispensado. No caso, não há... | 0 |
| 34527 | SENTENÇA Vistos etc. Trata-se de pedido de res... | 1 |
| 58381 | TERMO DE AUDIÊNCIA DE INSTRUÇÃO Ação Especial ... | 1 |
| 56474 | SENTENÇA Trata-se de ação em que a parte autor... | 2 |


In [35]:
import datasets

# Convert the training DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for training and evaluation
dataset_labeled_train = datasets.Dataset.from_pandas(df_train)

# Convert the validation DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for validation and evaluation
dataset_labeled_valid = datasets.Dataset.from_pandas(df_valid)

In [36]:
from transformers import AutoTokenizer
from functools import partial

# Load the tokenizer for the fine-tuned base masked language model (MLM)
# This tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm")

# Load the tokenizer for the fine-tuned large masked language model (MLM)
# This tokenizer is used to preprocess the input text for the large model
tokenizer_large = AutoTokenizer.from_pretrained(path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm")

# Define a function to preprocess the input examples using a specified tokenizer
# The function tokenizes the input text, truncates it to a maximum length of 512 tokens,
# and pads each batch of examples to the length of its longest sequence
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)

# Create a partial function for preprocessing using the base tokenizer
preprocess_function_base = partial(preprocess_function, tokenizer=tokenizer_base)

# Create a partial function for preprocessing using the large tokenizer
preprocess_function_large = partial(preprocess_function, tokenizer=tokenizer_large)

In [37]:
# Tokenize the training dataset using the base tokenizer
# The preprocess_function_base tokenizes the text, truncates it to 512 tokens, and pads the sequences
# The batched=True argument processes the dataset in batches for efficiency
dataset_labeled_train_tokenized_base = dataset_labeled_train.map(preprocess_function_base, batched=True)

# Tokenize the validation dataset using the base tokenizer
dataset_labeled_valid_tokenized_base = dataset_labeled_valid.map(preprocess_function_base, batched=True)

# Tokenize the training dataset using the large tokenizer
dataset_labeled_train_tokenized_large = dataset_labeled_train.map(preprocess_function_large, batched=True)

# Tokenize the validation dataset using the large tokenizer
dataset_labeled_valid_tokenized_large = dataset_labeled_valid.map(preprocess_function_large, batched=True)


In [38]:
from transformers import DataCollatorWithPadding

# Create a data collator for the base tokenizer
# The data collator dynamically pads the input sequences to the maximum length in the batch
# This ensures that all sequences in a batch have the same length, which is required for efficient processing
data_collator_base = DataCollatorWithPadding(tokenizer=tokenizer_base)

# Create a data collator for the large tokenizer
data_collator_large = DataCollatorWithPadding(tokenizer=tokenizer_large)

In [39]:
# Import the evaluate module from the Hugging Face library
import evaluate

# Load the accuracy metric from the evaluate module
# This metric will be used to evaluate the performance of the model
accuracy = evaluate.load("accuracy")

In [40]:
import numpy as np

# Define a function to compute evaluation metrics
# This function will be used to evaluate the performance of the model during training and validation
def compute_metrics(eval_pred):
    # Unpack the predictions and labels from the evaluation tuple
    predictions, labels = eval_pred
    
    # Convert the model's output logits to predicted class labels
    # np.argmax(predictions, axis=1) selects the index of the maximum logit for each prediction
    predictions = np.argmax(predictions, axis=1)
    
    # Compute the accuracy metric using the predicted and true labels
    # accuracy.compute() calculates the accuracy of the predictions
    return accuracy.compute(predictions=predictions, references=labels)
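
To make the two operations inside `compute_metrics` concrete, the sketch below reproduces them without NumPy or `evaluate` on a tiny, made-up batch. The logits and labels are hypothetical values chosen only for illustration.

```python
# Hypothetical logits for three documents and four labels
logits = [
    [2.1, 0.3, -1.0, 0.0],
    [0.2, 1.7, 0.5, -0.3],
    [0.1, 0.4, 0.2, 0.9],
]
labels = [0, 1, 2]

# Equivalent of np.argmax(predictions, axis=1): index of the largest logit in each row
predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]

# Accuracy is the share of predictions that match the reference labels
acc = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

print(predicted, acc)
```

Note that no softmax is needed here: the largest logit is always the largest probability, so taking the argmax over raw logits gives the same predicted class.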

In [41]:
# Determine the number of unique labels in the training dataset
# This will be used to configure the classification model
n_labels = df_train.label.nunique()

# Load the configuration for the base masked language model (MLM) and modify it for sequence classification
# The configuration is loaded from the specified directory and the number of labels is set to n_labels
config_base = AutoConfig.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm", 
    num_labels=n_labels
)

# Load the base masked language model (MLM) and modify it for sequence classification
# The model is loaded from the specified directory and the configuration is set to config_base
classifier_base = AutoModelForSequenceClassification.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm", 
    config=config_base
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at outputs/transformers_basics/bert_masked_lm_base/bert-base-portuguese-cased-finetuned-mlm and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [49]:
from transformers import Trainer, TrainingArguments

# Define the training arguments for the base classifier
# These arguments configure various aspects of the training process
training_args_base = TrainingArguments(
    output_dir=path_to_save_lm_base/"base_classifier_legal",  # Directory to save the model and other outputs
    learning_rate=2e-5,  # Learning rate for the optimizer
    per_device_train_batch_size=48,  # Batch size for training (adjust based on GPU memory)
    per_device_eval_batch_size=64,  # Batch size for evaluation (adjust based on GPU memory)
    num_train_epochs=5,  # Number of training epochs
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients before updating
    weight_decay=0.01,  # Weight decay for regularization
    bf16=True,  # Use bfloat16 mixed precision for training (requires hardware support)
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    logging_strategy="steps",  # Log the training progress every `logging_steps` steps
    save_strategy="epoch",  # Save a checkpoint at the end of each epoch
    eval_steps=1,  # Only used when eval_strategy="steps"; has no effect here
    save_steps=1,  # Only used when save_strategy="steps"; has no effect here
    logging_steps=10,  # Log the training progress after every 10 steps
    load_best_model_at_end=True,  # Load the best model at the end of training
    seed=271828,  # Seed for reproducibility
)

# Create a Trainer instance for the base classifier
# The Trainer handles the training and evaluation of the model
trainer_base = Trainer(
    model=classifier_base,  # The model to be trained
    args=training_args_base,  # Training arguments
    train_dataset=dataset_labeled_train_tokenized_base,  # Training dataset
    eval_dataset=dataset_labeled_valid_tokenized_base,  # Evaluation dataset
    tokenizer=tokenizer_base,  # Tokenizer for preprocessing the input text
    data_collator=data_collator_base,  # Data collator for dynamic padding
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics
)

# Train the model using the Trainer
trainer_base.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6011 | 0.608936 | 0.744599 |
| 2 | 0.5274 | 0.584451 | 0.759899 |
| 3 | 0.485 | 0.558055 | 0.775967 |
| 4 | 0.4569 | 0.565196 | 0.777043 |
| 5 | 0.4197 | 0.580136 | 0.779657 |




TrainOutput(global_step=2710, training_loss=0.5146869243291031, metrics={'train_runtime': 3837.1369, 'train_samples_per_second': 67.793, 'train_steps_per_second': 0.706, 'total_flos': 6.844430787637248e+16, 'train_loss': 0.5146869243291031, 'epoch': 5.0})
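
The `global_step=2710` reported above can be sanity-checked against the training arguments. The notebook does not state how many devices were used, so the sketch below simply tries a few values; it assumes that the number of optimizer steps per epoch is the dataset size divided by the effective batch size, rounded up. The function name `expected_steps` is hypothetical.

```python
import math

n_train = 52026          # rows in df_train
per_device_batch = 48    # per_device_train_batch_size
grad_accum = 1           # gradient_accumulation_steps
epochs = 5               # num_train_epochs
reported_steps = 2710    # global_step from the TrainOutput above

def expected_steps(n_devices: int) -> int:
    """Optimizer steps for the whole run, assuming the last partial batch is kept."""
    effective_batch = per_device_batch * grad_accum * n_devices
    return math.ceil(n_train / effective_batch) * epochs

# Assumption: the device count is unknown, so we compare a few candidates
for n_devices in (1, 2, 4):
    steps = expected_steps(n_devices)
    print(n_devices, steps, steps == reported_steps)
```

Under these assumptions, only an effective batch size of 96 (for example, two devices with 48 examples each) is consistent with the reported step count.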

In [50]:
trainer_base.evaluate()



{'eval_loss': 0.5580551624298096,
 'eval_accuracy': 0.7759667871146306,
 'eval_runtime': 60.6451,
 'eval_samples_per_second': 214.477,
 'eval_steps_per_second': 1.682,
 'epoch': 5.0}
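
The evaluation above reports the epoch-3 numbers (loss 0.5581, accuracy 0.7760) rather than those of the final epoch. This is consistent with `load_best_model_at_end=True`: when `metric_for_best_model` is not set, the `Trainer` selects the checkpoint with the lowest validation loss. The sketch below replays that selection on the values copied from the training log; it only illustrates the selection rule and is not the `Trainer`'s internal code.

```python
# (epoch, validation loss, accuracy) copied from the training log above
history = [
    (1, 0.608936, 0.744599),
    (2, 0.584451, 0.759899),
    (3, 0.558055, 0.775967),
    (4, 0.565196, 0.777043),
    (5, 0.580136, 0.779657),
]

# Default rule with load_best_model_at_end=True: lowest validation loss wins
best_by_loss = min(history, key=lambda row: row[1])

# Alternative rule: highest accuracy wins
best_by_accuracy = max(history, key=lambda row: row[2])

print("best by loss:", best_by_loss)
print("best by accuracy:", best_by_accuracy)
```

If accuracy is the metric that matters, one option would be to set `metric_for_best_model="accuracy"` in `TrainingArguments`, which in this run would have selected the epoch-5 checkpoint instead.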

`Can you guess why the accuracy is so low?`


## Understanding Low Accuracy: The Limitation of 512 Tokens

When working with transformer models, it's essential to be aware of a key limitation: BERT and many similar models can only process a maximum of **512 tokens**. This restriction has a significant impact on the accuracy of predictions, especially when dealing with longer texts.

### The Self-Attention Mechanism and Quadratic Complexity

The 512-token limit is largely a consequence of the *quadratic complexity* of the **self-attention mechanism**, which is a fundamental component of transformer models. Self-attention allows the model to weigh the importance of each token in relation to all others, enabling it to capture context and dependencies within the input text.

However, the computational cost of self-attention grows quadratically with the number of tokens: doubling the input length roughly quadruples the memory and compute spent on attention. To keep these costs manageable, BERT and many similar models were pretrained with a maximum sequence length of 512 tokens, and their position embeddings cover only those 512 positions.
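
To get a feel for the quadratic growth, the sketch below counts the entries of a single attention score matrix (one head, one layer) for a few sequence lengths. It is simple arithmetic and ignores everything else in the model; `attention_scores` is a hypothetical helper name.

```python
def attention_scores(n_tokens: int) -> int:
    """Entries in one attention score matrix (per head, per layer): n x n."""
    return n_tokens * n_tokens

limit = 512
for n in (512, 1024, 2048, 4096):
    ratio = attention_scores(n) / attention_scores(limit)
    print(f"{n:>5} tokens -> {attention_scores(n):>12,} scores ({ratio:.0f}x the 512-token cost)")
```

Each doubling of the input length multiplies the number of attention scores by four, which is why simply raising the limit quickly becomes expensive.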

### The Impact of Truncation on Accuracy

When an input text exceeds 512 tokens and truncation is enabled, the tokenizer drops tokens until the text fits within the limit, by default from the end of the text. This truncation can have a detrimental effect on the model's accuracy.

Important information, such as key context or relevant details, may be lost during truncation. The model is forced to make predictions based on an incomplete representation of the original text, leading to lower accuracy scores.

### Strategies for Handling Longer Texts

While the 512-token limit can be challenging, there are several approaches to mitigate its impact:

1. **Sliding Window Approach**:
   - Divide the long text into smaller, overlapping chunks (windows).
   - Process each window individually and aggregate the results.
   - This approach can help capture local context, but it may struggle with long-range dependencies.

2. **Alternative Neural Network Architectures**:
   - Consider using other architectures, such as Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs).
   - These architectures can handle longer sequences without the same token limit constraints.
   - However, they may not capture long-range dependencies as effectively as transformers.

3. **Transformer Variants for Longer Sequences**:
   - Explore transformer-based models specifically designed for handling longer texts, such as Longformer and BigBird.
   - These models introduce modifications to the self-attention mechanism to reduce computational complexity.
   - Keep in mind that these models are relatively new and may have limitations or trade-offs compared to standard transformers.
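
The sliding window approach from the list above can be sketched on a plain list of token ids. This is a minimal illustration, not a full pipeline: `sliding_windows`, `window_size`, and `stride` are hypothetical names, and the default of 510 tokens assumes two positions are reserved for `[CLS]` and `[SEP]`.

```python
def sliding_windows(token_ids, window_size=510, stride=255):
    """Split a list of token ids into overlapping windows.

    window_size=510 leaves room for [CLS] and [SEP] in a 512-token input.
    stride is how far the window advances, so the overlap is window_size - stride.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        # Stop once the current window reaches the end of the document
        if start + window_size >= len(token_ids):
            break
        start += stride
    return windows

# Stand-in document of 1,200 token ids
doc = list(range(1200))
chunks = sliding_windows(doc)
print([(c[0], c[-1], len(c)) for c in chunks])
```

Each window would then be classified separately and the results combined, for instance by averaging the logits or by majority vote. Hugging Face tokenizers can also produce such windows directly through the `stride` and `return_overflowing_tokens=True` arguments, which may be more convenient in practice.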


To make informed decisions about handling longer texts, it's crucial to understand the characteristics of your dataset. Analyze the average number of tokens per text and the distribution of text lengths.

If a significant portion of your texts exceeds the 512-token limit, consider applying one of the strategies mentioned above. Experiment with different approaches and evaluate their impact on accuracy and computational efficiency.

In [51]:
from transformers import AutoTokenizer

# Load the tokenizer for the fine-tuned base masked language model (MLM)
# This tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm")

# Initialize an empty list to store the sizes of tokenized input sequences
sizes = []

# Iterate over each text in the training dataset
for txt in df_train.text:
    # Tokenize the text without truncation and get the length of the tokenized input sequence
    # Append the length of the tokenized input sequence to the sizes list
    sizes.append(len(tokenizer_base(txt, truncation=False)['input_ids']))

# Convert the sizes list to a Pandas Series and display descriptive statistics
# This provides an overview of the distribution of tokenized input sequence lengths
pd.Series(sizes).describe()

count    52026.000000
mean      2373.339407
std       1822.717847
min        151.000000
25%       1133.000000
50%       1799.000000
75%       3031.000000
max      11434.000000
dtype: float64

`As we can see above, the average number of tokens in our dataset is 2,373, well above the 512-token limit; even the 25th percentile (1,133 tokens) exceeds it. We therefore need a workaround to handle this limitation. We won't cover more complex approaches in this class, but we can use a simple and effective one: understanding our data! Let's see how we can do this.`
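
Beyond the summary statistics, it can help to quantify how much text truncation actually throws away. The sketch below does this for a short, hypothetical list of token counts; in the notebook, the `sizes` list computed in the previous cell would take its place. The numbers printed here are illustrative and are not results for the real dataset.

```python
# Hypothetical token counts; in the notebook this would be the `sizes` list computed above
sizes = [151, 480, 1133, 1799, 2373, 3031, 11434]
max_length = 512

# Documents that do not fit into a single 512-token input
over_limit = [n for n in sizes if n > max_length]
share_over = len(over_limit) / len(sizes)

# Tokens that survive truncation, and the share of all tokens that is discarded
kept = sum(min(n, max_length) for n in sizes)
share_discarded = 1 - kept / sum(sizes)

print(f"{share_over:.0%} of documents exceed {max_length} tokens")
print(f"{share_discarded:.0%} of all tokens would be discarded by truncation")
```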

In [71]:
df_train.sample(10, random_state=271828)['text'].iloc[0]

"SENTENÇA Tipo A RELATÓRIO Trata-se de ação declaratória de inexistência de débito e indenizatória por danos morais, com pedido de repetição de indébito, ajuizada por Lúcia Matias de Souza em face do Instituto Nacional do Seguro Social – INSS e do Banco Bradesco S/A, em razão da existência de contrato de empréstimo consignado celebrado perante a aludida instituição financeira que, segundo diz a autora, não foi por ela contratado. É o que importa relatar. Passo a decidir. FUNDAMENTAÇÃO Das preliminares arguidas Quanto à preliminar de ilegitimidade passiva alegada pelo INSS (anexo 11), entendo que a Autarquia ré detém legitimidade para figurar no pólo passivo da ação, tendo em vista que é responsável pelo gerenciamento e pagamento dos descontos realizados nos benefícios previdenciários em decorrência de empréstimo consignado. Assim, a partir do momento em que opera o desconto nos valores tem interesse e legitimidade para figurar no pólo passivo da presente demanda. Ademais, só o INSS tem


`Notice that the information that really matters for our classification task is not at the beginning of the text, but at the end.`

> (....)
>
> DISPOSITIVO Isso posto, `julgo PROCEDENTE` o pedido para determinar que o INSS cesse os descontos das parcelas do Contratono 808431996. Condeno, também, a título de danos materiais, o Banco Bradesco a devolver os valores descontados com relação aos citados contratos de empréstimo, em dobro, nos termos do art. 42, parágrafo único, do CDC, devendo tais valores serem acrescidos de juros de mora de 1% ao mês desde o evento danoso (súmula 54 – STJ) e correção monetária com base no IPCA-E desde o efetivo prejuízo (súmula 43 – STJ). Condeno, ainda, o bancoréua pagar, a título de indenização por danos morais, a quantia de R$ 5.000,00 (cinco mil reais), valor este que deve ser atualizado exclusivamente pela taxa SELIC desde a publicação desta sentença. Declaro a inexistência do contrato no808431996. Declaro extinto o processo com resolução do mérito, nos termos do art. 487, I, do Código de Processo Civil. Custas e honorários advocatícios indevidos em primeiro grau de jurisdição (art. 55 da Lei no 9.099/95, c/c art. 1o da Lei no 10.259/01). Registre-se. Intimem-se as partes (Lei no 10.259/01, art. 8o). Campina Grande-PB, data supra. JUIZ FEDERAL
>

This is very common in this kind of document. The judge starts with a thorough description of the case and only then states the decision. So, we can use the last 512 tokens of the text to train our model. We just need to set the tokenizer's `truncation_side` attribute to `'left'`.

Let's see how we can do this.

In [53]:
from transformers import AutoTokenizer
from functools import partial

# Load the tokenizer for the fine-tuned base masked language model (MLM)
# This tokenizer is used to preprocess the input text for the base model
tokenizer_base = AutoTokenizer.from_pretrained(path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm")

# Load the tokenizer for the fine-tuned large masked language model (MLM)
# This tokenizer is used to preprocess the input text for the large model
tokenizer_large = AutoTokenizer.from_pretrained(path_to_save_lm_large / "bert-large-portuguese-cased-finetuned-mlm")

In [54]:
tokenizer_base.truncation_side

'right'

In [55]:
# Tokenize the input text using the base tokenizer
# padding=True pads to the longest sequence in the batch; with a single input it adds no padding
# truncation=True truncates the sequence if it exceeds max_length
# max_length=5 limits the tokenized sequence to 5 tokens, including [CLS] and [SEP]
out_len5 = tokenizer_base('Eu gosto muito de farofa com banana', padding=True, truncation=True, max_length=5) # This is to simulate the truncation

# Decode the tokenized input IDs back to a string
# This converts the token IDs back to the corresponding text
# The decoded text will be truncated to the first 5 tokens
tokenizer_base.decode(out_len5['input_ids'])

'[CLS] Eu gosto muito [SEP]'

In [56]:
# Set the truncation side for the base tokenizer to 'left'
# This means that if the input text needs to be truncated, tokens will be removed from the beginning (left side) of the sequence
# This setting is useful when the most important information is at the end of the sequence
tokenizer_base.truncation_side = 'left'

In [57]:
# Tokenize the same input text again, now with truncation_side set to 'left'
# padding, truncation and max_length are the same as in the previous example
out_len5 = tokenizer_base('Eu gosto muito de farofa com banana', padding=True, truncation=True, max_length=5)

# Decode the tokenized input IDs back to a string
# This converts the token IDs back to the corresponding text
# With left truncation, the end of the text is kept and the beginning is dropped
tokenizer_base.decode(out_len5['input_ids'])

'[CLS] com banana [SEP]'
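
The difference between the two truncation sides can be mimicked with plain Python lists. In this sketch, whole words stand in for tokens and `truncate` is a hypothetical helper. The real tokenizer works on WordPiece tokens, so the exact span it keeps can differ from what is shown here.

```python
def truncate(tokens, max_length, side="right"):
    """Mimic tokenizer truncation on a list of tokens.

    Two positions are reserved for the special tokens [CLS] and [SEP].
    """
    budget = max_length - 2
    if len(tokens) > budget:
        if side == "right":
            tokens = tokens[:budget]                  # keep the beginning, drop the end
        else:
            tokens = tokens[len(tokens) - budget:]    # keep the end, drop the beginning
    return ["[CLS]", *tokens, "[SEP]"]

words = ["Eu", "gosto", "muito", "de", "farofa", "com", "banana"]
print(truncate(words, 5, side="right"))
print(truncate(words, 5, side="left"))
```

In both cases the special tokens are preserved; only the side from which the content is dropped changes. For court rulings, where the decision comes last, left truncation keeps the part of the text that carries the label.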

In [58]:
import datasets

# Convert the training DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for training and evaluation
dataset_labeled_train = datasets.Dataset.from_pandas(df_train)

# Convert the validation DataFrame to a Hugging Face Dataset
# This allows the use of Hugging Face's dataset utilities for validation and evaluation
dataset_labeled_valid = datasets.Dataset.from_pandas(df_valid)

In [59]:
from functools import partial

# Define a function to preprocess the input examples using a specified tokenizer
# The function tokenizes the input text, truncates it to a maximum length of 512 tokens
# (from the left for tokenizer_base, whose truncation_side was set to 'left' above),
# and pads each batch of examples to the length of its longest sequence
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)

# Create a partial function for preprocessing using the base tokenizer
# This partial function allows us to call preprocess_function with only the examples argument,
# as the tokenizer argument is already set to tokenizer_base
preprocess_function_base = partial(preprocess_function, tokenizer=tokenizer_base)

# Create a partial function for preprocessing using the large tokenizer
preprocess_function_large = partial(preprocess_function, tokenizer=tokenizer_large)

In [60]:
# Tokenize the training dataset using the base tokenizer
# The preprocess_function_base tokenizes the text, truncates it to 512 tokens, and pads the sequences
# The batched=True argument processes the dataset in batches for efficiency
dataset_labeled_train_tokenized_base = dataset_labeled_train.map(preprocess_function_base, batched=True)

# Tokenize the validation dataset using the base tokenizer
dataset_labeled_valid_tokenized_base = dataset_labeled_valid.map(preprocess_function_base, batched=True)


In [61]:
from transformers import DataCollatorWithPadding

# Create a data collator for the base tokenizer
# The data collator dynamically pads the input sequences to the maximum length in the batch
# This ensures that all sequences in a batch have the same length, which is required for efficient processing
data_collator_base = DataCollatorWithPadding(tokenizer=tokenizer_base)

In [62]:
# Import the evaluate module from the Hugging Face library
import evaluate

# Load the accuracy metric from the evaluate module
# This metric will be used to evaluate the performance of the model
accuracy = evaluate.load("accuracy")

In [63]:
import numpy as np

# Define a function to compute evaluation metrics
# This function will be used to evaluate the performance of the model during training and validation
def compute_metrics(eval_pred):
    # Unpack the predictions and labels from the evaluation tuple
    predictions, labels = eval_pred
    
    # Convert the model's output logits to predicted class labels
    # np.argmax(predictions, axis=1) selects the index of the maximum logit for each prediction
    predictions = np.argmax(predictions, axis=1)
    
    # Compute the accuracy metric using the predicted and true labels
    # accuracy.compute() calculates the accuracy of the predictions
    return accuracy.compute(predictions=predictions, references=labels)
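
The logic of `compute_metrics` can be checked on toy data. The sketch below computes the same quantity directly with NumPy instead of the `evaluate` library; the logits and labels are made-up values, not model output.

```python
import numpy as np

# Toy logits for 4 examples and 3 classes (made-up values)
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [0.0, 0.4, 0.9],
    [1.2, 1.1, 0.0],
])
labels = np.array([0, 1, 1, 0])

# Index of the largest logit in each row is the predicted class
predicted = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the true labels
accuracy_value = float((predicted == labels).mean())
print(predicted.tolist(), accuracy_value)
```

Three of the four toy predictions match their labels, so the accuracy here is 0.75.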

In [64]:
# Determine the number of unique labels in the training dataset
# This will be used to configure the classification model
n_labels = df_train.label.nunique()

# Load the configuration for the base masked language model (MLM) and modify it for sequence classification
# The configuration is loaded from the specified directory and the number of labels is set to n_labels
config_base = AutoConfig.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm", 
    num_labels=n_labels
)

# Load the base masked language model (MLM) and modify it for sequence classification
classifier_base = AutoModelForSequenceClassification.from_pretrained(
    path_to_save_lm_base / "bert-base-portuguese-cased-finetuned-mlm", 
    config=config_base
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at outputs/transformers_basics/bert_masked_lm_base/bert-base-portuguese-cased-finetuned-mlm and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [65]:
from transformers import Trainer, TrainingArguments

# Define the training arguments for the base classifier
# These arguments configure various aspects of the training process
training_args_base = TrainingArguments(
    output_dir=path_to_save_lm_base/"base_classifier_legal",  # Directory to save the model and other outputs
    learning_rate=2e-5,  # Learning rate for the optimizer
    per_device_train_batch_size=48,  # Batch size for training (adjust based on GPU memory)
    per_device_eval_batch_size=64,  # Batch size for evaluation (adjust based on GPU memory)
    num_train_epochs=5,  # Number of training epochs
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients before updating
    weight_decay=0.01,  # Weight decay for regularization
    bf16=True,  # Use bfloat16 mixed precision for training (requires GPU support)
    eval_strategy="epoch",  # Evaluate the model after each epoch
    logging_strategy="steps",  # Log the training progress every `logging_steps` steps
    save_strategy="epoch",  # Save the model after each epoch
    eval_steps=1,  # Only used when eval_strategy="steps"; has no effect with "epoch"
    save_steps=1,  # Only used when save_strategy="steps"; has no effect with "epoch"
    logging_steps=10,  # Log the training progress after every 10 steps
    load_best_model_at_end=True,  # Load the best model at the end of training
    seed=271828,  # Seed for reproducibility
)

# Create a Trainer instance for the base classifier
# The Trainer handles the training and evaluation of the model
trainer_base = Trainer(
    model=classifier_base,  # The model to be trained
    args=training_args_base,  # Training arguments
    train_dataset=dataset_labeled_train_tokenized_base,  # Training dataset
    eval_dataset=dataset_labeled_valid_tokenized_base,  # Evaluation dataset
    tokenizer=tokenizer_base,  # Tokenizer for preprocessing the input text
    data_collator=data_collator_base,  # Data collator for dynamic padding
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics
)

# Train the model using the Trainer
trainer_base.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.1213 | 0.125238 | 0.955178 |
| 2 | 0.0769 | 0.121206 | 0.957408 |
| 3 | 0.0936 | 0.12133 | 0.960483 |
| 4 | 0.0874 | 0.118434 | 0.961252 |
| 5 | 0.0613 | 0.12168 | 0.96179 |




TrainOutput(global_step=2710, training_loss=0.11554007523614102, metrics={'train_runtime': 3846.022, 'train_samples_per_second': 67.636, 'train_steps_per_second': 0.705, 'total_flos': 6.844430787637248e+16, 'train_loss': 0.11554007523614102, 'epoch': 5.0})
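
The `global_step=2710` value can be reproduced from the training arguments with a little arithmetic. The sketch below assumes training ran on 2 devices; that number is not stated in the notebook and is only inferred because it makes the step count match.

```python
import math

n_train_examples = 52026         # size of the training set shown in the Map progress bar
per_device_batch_size = 48       # per_device_train_batch_size
gradient_accumulation_steps = 1
num_epochs = 5
n_devices = 2                    # ASSUMPTION: inferred from global_step, not stated in the notebook

# Each optimizer step consumes one batch per device
batches_per_epoch = math.ceil(n_train_examples / (per_device_batch_size * n_devices))
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 542 2710
```

With a single device the same calculation would give 1084 steps per epoch, so checking the step count is a quick way to confirm the effective batch size.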

In [66]:
trainer_base.evaluate()



{'eval_loss': 0.11843354254961014,
 'eval_accuracy': 0.9612516337356808,
 'eval_runtime': 60.1118,
 'eval_samples_per_second': 216.38,
 'eval_steps_per_second': 1.697,
 'epoch': 5.0}
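
The evaluation above matches epoch 4 of the training log rather than epoch 5. This is consistent with `load_best_model_at_end=True`: as far as I understand the Trainer defaults, when no `metric_for_best_model` is set the best checkpoint is chosen by lowest validation loss. The sketch below repeats that selection by hand on the logged values.

```python
# Validation results copied from the training log above
history = [
    {"epoch": 1, "eval_loss": 0.125238, "eval_accuracy": 0.955178},
    {"epoch": 2, "eval_loss": 0.121206, "eval_accuracy": 0.957408},
    {"epoch": 3, "eval_loss": 0.12133, "eval_accuracy": 0.960483},
    {"epoch": 4, "eval_loss": 0.118434, "eval_accuracy": 0.961252},
    {"epoch": 5, "eval_loss": 0.12168, "eval_accuracy": 0.96179},
]

# Pick the epoch with the lowest validation loss (assumed selection criterion)
best = min(history, key=lambda row: row["eval_loss"])
print(best["epoch"], best["eval_accuracy"])
```

Note that the epoch with the lowest loss is not the one with the highest accuracy here; selecting on accuracy instead would pick epoch 5.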

We've achieved a significant improvement in our model's accuracy, which rose from 77.9% to 96.1%.

Let's examine this improvement in terms of the error rate, which is calculated as (1 - accuracy). Our initial error rate was 22.1%, and our improved error rate is 3.9%.

To put this into perspective, we've reduced the error rate by a factor of nearly six (22.1 / 3.9 ≈ 5.7). In other words, our model now makes far fewer mistakes than before.
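
The error-rate comparison can be worked through in a few lines. The sketch below takes the two accuracies quoted above (77.9% and 96.1%) as inputs; the variable names are illustrative.

```python
accuracy_before = 0.779  # accuracy before the change, as quoted in the text
accuracy_after = 0.961   # accuracy after the change

error_before = 1 - accuracy_before
error_after = 1 - accuracy_after

# How many times smaller the new error rate is
reduction_factor = error_before / error_after
# Share of the original errors that were eliminated
relative_reduction = 1 - error_after / error_before

print(f"error before: {error_before:.1%}, error after: {error_after:.1%}")
print(f"reduction factor: {reduction_factor:.2f}x, relative reduction: {relative_reduction:.1%}")
```

Looking at the error rate rather than the accuracy shows that roughly four out of five of the original mistakes were eliminated.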

By using the last 512 tokens of each text, we directed the model towards the most relevant information, since in legal documents the decision is often at the end. This is a simple yet effective workaround for the 512-token limit in transformers.

`Remember, sometimes simplicity is the key to mastering complex challenges!`
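
Keeping the end of a document instead of its beginning can be sketched on a plain list of token ids. This is an illustration only: `keep_last_tokens` is a hypothetical helper, and the `cls_id` and `sep_id` values are placeholders for the special tokens a BERT-style model expects. Hugging Face tokenizers also expose a `truncation_side` setting that can be set to `"left"` for the same effect; whether the notebook's tokenizer was configured that way is not shown in this section.

```python
def keep_last_tokens(token_ids, max_length=512, cls_id=101, sep_id=102):
    """Keep the tail of a long sequence, reserving two slots for the special tokens."""
    budget = max_length - 2  # room left after [CLS] and [SEP]
    if budget <= 0:
        raise ValueError("max_length must be greater than 2")
    # Slicing from the end discards the beginning of the document, not the end
    tail = token_ids[-budget:] if len(token_ids) > budget else list(token_ids)
    return [cls_id] + tail + [sep_id]


long_document = list(range(1000, 2000))  # 1000 made-up token ids
encoded = keep_last_tokens(long_document, max_length=512)
print(len(encoded), encoded[1], encoded[-2])
```

The default truncation would keep the first 510 content tokens instead, which is exactly the part that matters least when the decision appears at the end of the text.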

# Questions

1. What is the key advantage of transformers compared to traditional sequential models like RNNs and LSTMs?

2. What is the role of the attention mechanism in transformers?

3. What are the main components of a typical transformer architecture?

4. What is the impact of quadratic complexity on the performance of transformers for longer texts?

5. How have transformers been applied beyond natural language processing (NLP)?

6. What are some common transformer architectures used for NLP tasks?

7. How can transformers be used as feature extractors?

8. What are the key steps for using transformers on a specific task?

9. What is the limitation of the 512-token limit in transformers, and how does it impact accuracy?

10. What is a simple workaround to handle the 512-token limit and improve accuracy?

`Answers are commented inside this cell`
<!--
1. Transformers can handle long-range dependencies effectively and parallelize computations across the sequence, avoiding the vanishing gradient problem that affects RNNs and, to a lesser degree, LSTMs.

2. The attention mechanism allows the model to focus on the most relevant parts of the input sequence when predicting a specific output, enabling it to capture complex relationships and dependencies between words.

3. A typical transformer consists of an encoder and a decoder, each composed of multiple identical layers. The key components include self-attention layers, feed-forward neural networks, and positional encoding.

4. The quadratic complexity of transformers results in slower training times and high memory consumption when dealing with longer sequences, hindering their practicability in scenarios involving extensive texts.

5. Transformers have been successfully applied in various domains, including computer vision (e.g., Image Transformer, Vision Transformer), music generation (e.g., MuseNet), speech recognition (e.g., Speech-Transformer), and video processing (e.g., Video Transformer).

6. Common transformer architectures for NLP include BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pretrained Transformer), RoBERTa (Robustly Optimized BERT Approach), T5 (Text-to-Text Transfer Transformer), and XLNet.

7. Transformers can be used as feature extractors by leveraging their ability to capture rich syntactical and contextual information from text data. The extracted features, typically represented by the vector corresponding to the [CLS] token, can be used as input for downstream tasks like classification or regression.

8. The key steps for using transformers include starting with a pretrained model, optionally fine-tuning the model on domain-specific text, training the model for the specific task using task-specific data, and using the model as a feature extractor if needed.

9. Many transformer models, such as BERT, can only process a maximum of 512 tokens, a limit motivated by the quadratic complexity of the self-attention mechanism. When an input text exceeds this limit, it is truncated, potentially losing important information and leading to lower accuracy in predictions.

10. A simple workaround is to focus on the most relevant information in the text data. For example, in legal documents where the decision is often at the end, using the last 512 tokens of the text can significantly improve accuracy by directing the model's attention to the most important part of the document. -->