# Transfer Learning
## IMD1107 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

# Summary

## Keypoints
- Transfer learning leverages pre-trained models as a starting point, significantly reducing training time and data requirements for NLP tasks.

- Tokenization breaks down text into meaningful semantic units (tokens), enabling models to understand the base units and context of words. Subword tokenization handles complex morphologies and unseen words.

- Language modeling is a self-supervised learning task that captures language structure and patterns by predicting the next or masked words in a sequence, providing valuable insights for various NLP tasks.

- ULMFiT is a transfer learning approach that pre-trains a language model on a large corpus, fine-tunes it on domain-specific text, and then uses the fine-tuned model for downstream NLP tasks, improving performance even with limited labeled data.

- Whole Word Masking masks entire words instead of subwords or tokens, forcing the model to understand the contextual meaning of the masked word as a whole, leading to better semantic understanding.

- Fine-tuning a pre-trained language model on domain-specific text before training a classifier is crucial for adapting the model to the target domain and capturing task-specific nuances, resulting in superior performance compared to using the pre-trained model directly or training from scratch.

## Takeaways
- Transfer learning greatly improves the performance and efficiency of NLP models, especially when labeled data is limited, by employing knowledge from large-scale pre-training on vast amounts of unlabeled text.

- Self-supervised learning techniques like language modeling enable models to learn valuable information about language structure and semantics from unlabeled data, reducing reliance on labeled data and fostering a deep understanding of language that can be transferred to various NLP tasks.

- Proper tokenization strategies, such as subword tokenization and Whole Word Masking, are essential for models to effectively process and understand the semantic meaning of text, handling complex morphologies and out-of-vocabulary words.

- The practical example demonstrates the power of transfer learning and domain-specific fine-tuning, with the fine-tuned pre-trained model outperforming both the model trained from scratch and the pre-trained model without fine-tuning, highlighting the importance of adapting models to the target domain for optimal performance in NLP tasks like sentiment analysis.

# Deep Learning for Natural Language Processing (NLP)

Deep learning is a powerful tool to process and analyze large amounts of data, which has been particularly effective in the field of Natural Language Processing (NLP). In this document, we will cover the basics of deep learning for NLP, including understanding key concepts like neural networks, perceptrons, feedforward neural networks, and the backpropagation algorithm.


For a thorough yet accessible approach, please check the [Deep Learning for Coders](https://github.com/fastai/fastbook) book.

## Deep Learning Overview

Deep learning is a subfield of machine learning that relies on artificial neural networks, which are inspired by biological neural networks. By learning from examples, deep learning enables computers to perform tasks that come naturally to humans. It is a crucial technology behind innovations like driverless cars, voice-controlled devices, and advanced recommendation systems.

### Neural Networks

A neural network is an interconnected group of neurons, or nodes. These networks can be biological, consisting of real neurons, or artificial, designed for solving AI problems. The connections between neurons in the network are modeled as weights, with positive values representing excitatory connections and negative values representing inhibitory connections. Inputs are modified by weights and summed up in a process called linear combination. An activation function then controls the output's amplitude, usually within a specific range (e.g., 0 to 1 or -1 to 1).

In 1943, Warren McCulloch, a neurophysiologist, and Walter Pitts, a logician, teamed up to develop a mathematical model of an artificial neuron. In their [paper](https://link.springer.com/article/10.1007/BF02478259) "A Logical Calculus of the Ideas Immanent in Nervous Activity" they declared that:

> Because of the “all-or-none” character of nervous activity, neural events and the relations among them can be treated by means of propositional logic. It is found that the behavior of every net can be described in these terms.

McCulloch and Pitts realized that a simplified model of a real neuron could be represented using simple addition and thresholding. Pitts was self-taught, and by age 12, had received an offer to study at Cambridge University with the great Bertrand Russell. He did not take up this invitation, and indeed throughout his life did not accept any offers of advanced degrees or positions of authority. Most of his famous work was done while he was homeless. Despite his lack of an officially recognized position and increasing social isolation, his work with McCulloch was influential, and was taken up by a psychologist named Frank Rosenblatt.
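
To make "simple addition and thresholding" concrete, here is a minimal sketch of a threshold neuron. The function name `threshold_neuron` and the specific weights and thresholds are our own illustrative choices, not taken from the original paper; the point is only that a weighted sum followed by a threshold can express basic logical operations.

```python
def threshold_neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0


input_pairs = [(a, b) for a in (0, 1) for b in (0, 1)]  # (0,0), (0,1), (1,0), (1,1)

# Logical AND: both excitatory inputs must be active for the sum to reach 2.
and_outputs = [threshold_neuron(pair, [1, 1], threshold=2) for pair in input_pairs]

# Logical OR: a single active input is enough to reach 1.
or_outputs = [threshold_neuron(pair, [1, 1], threshold=1) for pair in input_pairs]

# Logical NOT: a negative (inhibitory) weight pushes the sum below the threshold.
not_outputs = [threshold_neuron([a], [-1], threshold=0) for a in (0, 1)]

print(and_outputs)  # [0, 0, 0, 1]
print(or_outputs)   # [0, 1, 1, 1]
print(not_outputs)  # [1, 0]
```

Note that the weights here are fixed by hand. Learning them from examples is the step that the perceptron adds.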

<img alt="Natural and artificial neurons" width="500" caption="Natural and artificial neurons" src="images/chapter7_neuron.png" id="neuron"/>



### The Perceptron

Rosenblatt's perceptron, often considered one of the earliest forms of machine learning, had its origin as an attempt to teach a machine to recognize images. Although we might not consider his original creation as a computer in the contemporary sense due to its hardware composition, the basic concept remains relevant and foundational.

#### Structure and Function of the Perceptron

The perceptron was essentially an assembly of photo-receptors and potentiometers, designed to mimic the biological neuron's structure and function. It effectively operated through identifying and processing features captured from small subdivisions of an entire image. Each feature, correlating to part of the image, would be assigned a weight indicating its significance or relevance to the overall interpretation of the image.

##### Perceiving the Image: The Role of Photo-Receptors

For the perceptual process, an image was exposed to a grid of photo-receptors. Each photo-receptor would perceive a minuscule section of the image and interpret the brightness level within that section. The perceived brightness would subsequently determine the strength of the signal that the photo-receptor would relay to its associated "dendrite."

##### Processing the Signal: The Role of Dendrites

In Rosenblatt's design, each dendrite was equipped with a potentiometer, functioning as a form of adjustable weight. These weights were critical in governing whether the signal received by the dendrite was deemed sufficiently sturdy to pass on to the next stage, the "nucleus" or main body of the "cell."

##### Classifying the Image: The Decision at the Nucleus

At the nucleus, signals from various dendrites amalgamated. If the combined strength of all incoming signals crossed a predetermined threshold, the perceptron would activate, transmitting a signal down its "axon." This activation was equivalent to a positive classification match for the input image. Simply put, the perceptron recognized the image based on its training. Conversely, if the combined signals failed to meet the threshold, the perceptron would not activate, indicating a negative classification match.

Through this process, the perceptron could perform binary classification tasks such as distinguishing between "dog" and "not dog" images. The perceptron's ability to recognize images was dependent on the weights assigned to each dendrite. These weights were adjusted during the training process, which involved exposing the perceptron to a series of images and adjusting the weights based on the perceptron's performance. The training process was iterative, and the perceptron's performance improved with each iteration.

Essentially, the perceptron model shows how complex pattern recognition can begin with simple, feature-based image analysis, and how weighting those features by importance contributes to accurate image classification.
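
The training procedure described above can be sketched in software. This is a minimal illustration of the classic perceptron learning rule, assuming a toy dataset in which each "image" is reduced to two brightness readings. The data, the learning rate, and the function names are hypothetical, and Rosenblatt's original machine adjusted potentiometers rather than Python lists.

```python
def predict(weights, bias, features):
    """Activate (return 1) when the weighted sum of the signals crosses the threshold."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=0.1):
    """Iteratively nudge the weights toward the correct answer after each mistake."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)  # -1, 0, or +1
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical data: bright readings are labeled 1, dark readings 0.
samples = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, features) for features in samples]
print(predictions)  # [1, 1, 1, 0, 0, 0]
```

Because this toy dataset is linearly separable, the rule settles on weights that classify every training example correctly. A single perceptron cannot do this for data that no straight line can separate.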

### Feedforward Neural Networks

An artificial neural network is reminiscent of a human brain in its function to process and analyze information. Among the many types of networks, a specific one, called the **Feedforward Neural Network**, is noteworthy for its simplicity and efficiency.

A feedforward neural network gets its name because information travels in just one direction: from the input nodes, through any hidden layers (if present), and finally to the output nodes. There are no loops or cycles; the data "feeds forward" through the network. The absence of cycles makes the architecture simpler than that of recurrent neural networks, yet it remains remarkably effective.
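
The one-directional flow can be sketched as a loop over layers. This is a minimal illustration in plain Python, assuming sigmoid activations and hand-picked weights (both are our own choices for the example, not values from any trained model).

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, followed by a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the inputs through each layer in order; nothing ever flows backward."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.6], [0.3, 0.8]], [0.1, -0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                   # output layer
]

output = forward([1.0, 0.0], layers)
print(output)  # a single value between 0 and 1
```

Each layer only ever sees the output of the layer before it, which is exactly the "no cycles" property described above.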

### Backpropagation Algorithm

Training an artificial neural network revolves around adjusting the individual weights of connections within the network until it can make accurate predictions. The key question is: how do we adjust these weights optimally?

This is where the **Backpropagation Algorithm** steps in. The term "backpropagation" stands for "backward propagation of errors," indicating that the prediction error starts at the output end and works its way backward through the network while the weights are updated. Backpropagation is the standard way to train deep neural networks, that is, networks with more than one hidden layer.

The algorithm calculates the gradient (i.e., the rate of change) of the loss function with respect to each weight in the network, which guides the adjustment of the weights. It uses the chain rule from calculus to compute gradients one layer at a time, iterating backward from the last layer.

This process prevents redundant calculations of intermediate terms, thereby enhancing computational efficiency. Due to these characteristics, backpropagation is deemed to be a classic example of dynamic programming.
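
To see the chain rule and the reuse of intermediate terms at work, here is a minimal sketch of backpropagation on a tiny network with one hidden neuron and one output neuron. The network shape, the squared-error loss, and the parameter values are assumptions chosen for illustration. The analytic gradients are compared against finite-difference estimates as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Tiny network: h = sigmoid(w1 * x + b1), y = w2 * h + b2."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return h, y


def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def backprop(params, x, target):
    """Apply the chain rule from the output back to the input, reusing earlier terms."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dL_dy = y - target              # error at the output
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy
    dL_dh = dL_dy * w2              # reuses dL_dy instead of recomputing it
    dL_dz = dL_dh * h * (1.0 - h)   # derivative of the sigmoid
    dL_dw1 = dL_dz * x
    dL_db1 = dL_dz
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]


def numerical_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used here only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))
    return grads


params = [0.4, -0.2, 0.7, 0.1]  # hypothetical starting values for w1, b1, w2, b2
analytic = backprop(params, x=1.5, target=1.0)
numeric = numerical_gradient(params, x=1.5, target=1.0)
max_difference = max(abs(a - n) for a, n in zip(analytic, numeric))

# One gradient descent step: move each parameter against its gradient.
updated = [p - 0.1 * g for p, g in zip(params, analytic)]
loss_before = loss(params, 1.5, 1.0)
loss_after = loss(updated, 1.5, 1.0)
print(max_difference, loss_before, loss_after)
```

The two gradient estimates should agree closely, and the loss after the update step should be lower than before it. In practice, deep learning frameworks compute these gradients automatically.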


# Supplementary Video Resources to Enhance Understanding

To further aid your understanding of machine learning concepts and especially neural networks, I have curated a list of video resources. These resources range from brief overviews to more in-depth explanations and visual examples.

### 1. **The Significance of Neural Networks**

This super-short video gives a concise yet effective explanation on the importance of neural networks.

- [Why Neural Networks are Important (in 45 seconds)](https://www.youtube.com/watch?v=PAZTIAfaNr8)

### 2. **Demystifying Neural Networks' Learning Ability**
   
Want to know how Neural Networks gain their remarkable learning abilities? This video breaks it down in an understandable manner.

- [Why Neural Networks Can Learn (Almost) Anything?](https://www.youtube.com/watch?v=0QczhVg5HaI)

### 3. **Observing Neural Networks in Action**

Watching neural networks learn can be a fascinating experience and this video offers exactly that.

- [Watching Neural Networks Learn](https://www.youtube.com/watch?v=TkwXa7Cvfr8)

### 4. **A Thorough Playlist About Neural Networks**

For those who are looking for a deeper dive into how neural networks function, this playlist is filled with well-crafted videos covering various aspects of neural networks.

- [Playlist with Four Insightful Videos about How Neural Networks Work](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)

Taking some time out and watching these videos will surely assist you in understanding the workings of neural networks more intuitively. Happy learning!

---

## Tokenization: The Breakdown of Semantic Units

### What is Tokenization?

Tokenization, a fundamental step in Natural Language Processing (NLP), is the process of segmenting running text into individual meaningful elements called tokens. Depending on the language and the task at hand, these tokens can be words, sentences, or subwords.

#### Subword Tokenization

Subword tokenization is a powerful technique that splits words into smaller, more manageable units. This allows NLP models to handle languages with large vocabularies and elaborate morphologies, and even words not seen during training, which is a common occurrence in morphologically rich languages like Portuguese.

The inclusion of subword tokenization methods allows the model to understand the base units of each word better, leading to an improved comprehension of the semantic meaning of each sentence.
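
As a rough illustration of how a subword tokenizer splits a word, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece, where `##` marks a piece that continues a word. The vocabulary `toy_vocab` is hypothetical and tiny; real tokenizers learn vocabularies with tens of thousands of entries from a corpus, and their implementations handle many details omitted here.

```python
def wordpiece_tokenize(word, vocab, unknown_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unknown_token]  # no piece fits: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


# Hypothetical vocabulary, chosen only for this example.
toy_vocab = {"ca", "casa", "##ci", "##mb", "##inha", "##s"}

print(wordpiece_tokenize("cacimbinha", toy_vocab))  # ['ca', '##ci', '##mb', '##inha']
print(wordpiece_tokenize("casas", toy_vocab))       # ['casa', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

A word that never appeared in training can still be represented as long as its pieces are in the vocabulary, which is why subword tokenization copes well with unseen words.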

<p align="center">
<img src="images/multifit_vocabularies.png" alt="" style="width: 30%; height: 30%"/>
</p>


### Examining Tokens in Detail

To grasp the concept of tokenization better, let's look at the Portuguese word "cacimbinha". Depending on the tokenizer we use, this one word can be broken down into a different number of tokens.

Consider this Python code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
tokens = tokenizer.tokenize("cacimbinha")
print(tokens)
```
This generates the following output:
```['ca', '##ci', '##mb', '##inha']```

However, when using a different tokenizer, we receive a different number of tokens:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("cacimbinha")
print(tokens)
```
Output:
```['ca', '##ci', '##mb', '##in', '##ha']```

The tokenization output depends heavily on the corpus on which the tokenizer was trained. Hence, the same word may be split into a different number of tokens by different tokenizers.

Tokenization operates as the foundation for higher-level language tasks, delivering nuanced understanding to NLP models by recognizing and analyzing the fundamental semantic units within a sentence. Therefore, choosing the correct tokenizer and understanding its operations forms an essential part of building successful and effective NLP models.

---

# Transfer Learning for NLP

## What is transfer learning?

Transfer learning is a potent machine learning methodology in which we capitalize on a model that was developed for one task as the basis or starting point for a model designed to tackle another task. It has gained popularity, particularly in the realms of deep learning, where pre-trained models are utilized as initial points on tasks related to computer vision and natural language processing (NLP).

The primary reason behind this approach is the immense computational and time cost of developing neural network models for these areas. In addition, pre-trained models often bring substantial performance improvements when applied to related problems.

### Origins and Applications of Transfer Learning

Historically, the first methodologies for transfer learning were developed for tasks related to computer vision. However, its practical applications have since then diversified significantly. For instance, in Computer Vision (CV), the pre-trained model typically takes the form of a neural network that underwent training on a benchmark dataset, such as ImageNet. This trained model can then be repurposed as the foundation for an image classification or object detection model, which will be trained on a new dataset.

The basic rationale here is that the model has acquired feature extraction capabilities. In other words, it has learned how to extract relevant features from photographs that are useful for describing a wide variety of image distributions. These might include different times of day, varying lighting conditions, and distinct objects present within the scene, among many others. This use of a pre-trained model is commonly referred to as 'feature extraction'.

More recently, around 2018, transfer learning proved its worth in the field of Natural Language Processing (NLP) and is now arguably the dominant technique in the field. The team at [FastAI](https://www.fast.ai/) demonstrated that transfer learning could yield effective results for NLP-related tasks by introducing the [Universal Language Model Fine-Tuning (ULMFiT)](https://arxiv.org/abs/1801.06146).

With ULMFiT, the process starts with pre-training a language model on a substantial corpus of unlabeled text (e.g., Wikipedia). Afterward, this pre-trained model is fine-tuned for a specific domain. This refined model can then serve as the basis for tackling other NLP tasks such as sentiment analysis, document classification, and question answering.

## The Value of Transfer Learning

Transfer learning brings many benefits to the field of deep learning. Above all, it provides a way to build accurate models quickly. As such, it is often the fastest path toward practical models for common NLP problems.

Transfer learning also proves especially useful when labeled training data is scarce. In such cases, we can use a model trained on a larger dataset within the same domain as our starting point. This frequently yields better performance than training a model from scratch.

<p align="center">
<img src="images/ulmfit_imdb.png" alt="" style="width: 30%; height: 30%"/>
</p>

Even in scenarios where ample labeled data exists, applying transfer learning may enhance our models' performance. This is attributable to the depth of information the pre-trained model has already learned about the language. For instance, a pre-trained model can offer insight into the relationships between words, syntax, sentiment, and sentence structure - all of which are crucial elements for tasks involving part-of-speech tagging, named entity recognition, solving analogy tasks, question answering, sentiment analysis, and machine translation.

Transfer learning is increasingly becoming a cornerstone in today's machine learning and artificial intelligence advancements, owing to several practical advantages:

- **Reduced Training Time**: Pre-trained models have already learned features from massive datasets. Hence, the time required to train these models on new tasks can be significantly lower than when training from scratch.

- **Lower Computational Requirements**: Because pre-trained models require less time to train, the computational resources necessary are also substantially decreased.

- **Better Performance with Limited Data**: In scenarios where available data for training is limited, transfer learning offers better model performance because it leverages knowledge from related tasks.

- **Broad Applicability**: One of the key strengths of transfer learning is its broad applicability across many domains including image processing, natural language processing, and audio recognition, among others.

## The ULMFiT Approach

The ULMFiT methodology is executed by following these key steps:

<p align="center">
<img src="images/ulmfit.svg" alt="" style="width: 60%; height: 60%"/>
</p>

1. **Step 1: Pre-training** - This is the first step in the ULMFiT approach, which involves training a language model on a large corpus of unlabeled text. The language model here tries to predict the next word in a sentence, which inherently forces the model to learn about grammar, semantics, and even some world facts.
2. **Step 2: Fine-tuning** - The second step involves fine-tuning the pre-trained language model on text from the target domain. During this stage, the general-domain language model adapts itself to the idiosyncrasies of the domain-specific text.
3. **Step 3: Classifier Fine-tuning** - Finally, we add a classifier to the model and fine-tune it for the target task. This way, the model leverages the transfer learning capabilities and, along with its classification layer and tuned parameters, can classify or predict outcomes.


## Precautions in Using Transfer Learning

While transfer learning is certainly powerful, users should be aware of a few considerations:

- **Data Similarity**: The success of transfer learning highly depends on the similarities between the source and target data. If the data sets have little in common, the performance gain from transfer learning will likely be diminished.

- **Task Complexity**: Transfer learning might not yield significant benefits for simple tasks that can be performed with minimal computational resources and time.

- **Fine-tuning**: Care should be taken when fine-tuning the pre-trained model. Over-fine-tuning may lead to overfitting, where the model performs poorly on new, unseen data.


---

## How to kickstart your NLP journey with Transfer Learning by using self-supervised learning

### The Power of Pre-training
When training a neural network, it is highly recommended to start from a pre-trained model. This is much more effective than starting from random weights, which amounts to training a model that initially knows nothing. Starting from a pre-trained model can greatly reduce the amount of data required for training, in some cases by as much as roughly 1000 times compared to starting from scratch.

### Pre-trained Models and Domain-Specific Challenges
However, you might wonder what can be done in scenarios where no pre-trained models exist within your specific domain. For instance, medical imaging is one such field with a scarce availability of pre-trained models. A recent insightful paper, [Transfusion: Understanding Transfer Learning for Medical Imaging](https://arxiv.org/abs/1902.07208), tackled this question. It found that even incorporating a few early layers from a pre-trained ImageNet model could potentially increase the speed of training and boost the final accuracy of medical imaging models. Therefore, regardless of the domain-specificity of your problem, employing a general-purpose pre-trained model would still benefit you.

Nonetheless, as indicated in the paper, the improvement achieved through applying an ImageNet pre-trained model on medical imaging is modest. We ideally need a method that works more effectively without requiring vast amounts of data. Enter "self-supervised learning".

### Unveiling Self-Supervised Learning

Self-supervised learning becomes our secret weapon under these circumstances. Definition-wise, self-supervised learning involves training a model using labels inherently contained in the input data, which eliminates the need for separate external labels.
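
The idea of "labels inherently contained in the input data" can be illustrated with a minimal sketch that turns raw, unlabeled text into training examples for next word prediction. The function name `next_word_examples`, the context size, and the sample sentence are hypothetical choices for this example.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, target) pairs; the target is simply the next word."""
    words = text.lower().split()
    examples = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        examples.append((context, words[i]))
    return examples


# A single unlabeled review (hypothetical). Nobody annotated it.
raw_review = "O produto chegou rápido e funciona muito bem"

examples = next_word_examples(raw_review)
for context, target in examples:
    print(context, "->", target)
# First pair: ('o', 'produto', 'chegou') -> rápido
```

Every sentence in an unlabeled corpus yields several training pairs like these for free, which is why self-supervised learning scales to very large datasets.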

ULMFiT leverages self-supervised learning to dramatically improve the state of the art in this field.

### How to use Self-Supervised Learning for NLP
In ULMFiT, we start by pre-training a "language model". Essentially, this is a model that learns to predict the next word in a sentence. Our primary interest is not the language model itself. However, a model capable of completing this task has to learn a good deal about the nature of language, and something about the world, during training. This pre-trained language model can then be fine-tuned for another task (sentiment analysis, for instance). What's even more remarkable is that this method quickly yields state-of-the-art results with very little data.

>
> Language modeling is the "de facto" way to apply self-supervised learning to NLP. However, it's worth noting that there are several ways to capitalize on self-supervised learning for language modeling tasks, which we'll explore in the next section.

---

# Language Modeling

Language modeling is a fundamental task in natural language processing that involves predicting the next word or character in a sequence of text. The goal of language modeling is to capture the fundamental structure and patterns of language, which can be used to generate new text that is similar to the original corpus. Language modeling is a key component in many NLP applications such as speech recognition, machine translation, part-of-speech tagging, parsing, optical character recognition, handwriting recognition, and information retrieval.

To model language, we frame it as a self-supervised learning problem. Self-supervised learning is a form of unsupervised learning that uses a pretext task to train a model on a large amount of unlabeled data. The pretext task is not the task we ultimately want to solve (the target task); it is a task whose labels can be derived automatically from the data itself. For example, in language modeling the pretext task is to predict the next word in a sequence of text, while the target task might be to classify a document into a category. The pretext task is cheaper to set up than the target task because it requires no human annotation. It is still useful because solving it forces the model to learn the basic structure of the data, which can then be used to solve the target task.

## Next Word Prediction

Next word prediction is a specific type of language modeling that involves predicting the next word in a sequence of text. This task is typically performed by training a model on a large corpus of text and then using that model to predict the probability distribution of the next word given the previous words in the sequence. Next word prediction is a key component in many NLP applications such as autocomplete, text completion, and text generation.

<p align="center">
<img src="images/nwp.png" alt="" style="width: 50%; height: 50%"/>
</p>
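
To illustrate what "the probability distribution of the next word" means, here is a minimal sketch of a bigram model built from word counts. It is a deliberate simplification: it conditions only on the single previous word, whereas neural language models condition on the whole preceding sequence. The corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each other word in the corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            counts[previous][current] += 1
    return counts


def next_word_distribution(counts, previous_word):
    """Normalize the follower counts into probabilities that sum to 1."""
    followers = counts.get(previous_word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word was never seen with a follower
    return {word: count / total for word, count in followers.items()}


# Hypothetical toy corpus.
corpus = [
    "eu amo viajar",
    "eu amo ler",
    "eu amo viajar para a praia",
    "eu quero viajar",
]

model = train_bigram_model(corpus)
distribution = next_word_distribution(model, "amo")
print(distribution)  # 'viajar' gets probability 2/3, 'ler' gets 1/3
```

An autocomplete feature would suggest the highest-probability word, while a text generator would sample from the distribution.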

---


## Masked Language Modeling

Masked language modeling is a task in natural language processing that involves predicting missing words in a sentence. In this task, some of the tokens in a sentence are chosen at random and replaced with a mask token (BERT selects about 15% of them), and the model is trained to predict the original tokens using the context on both sides of the mask. This objective is useful wherever it is important to understand the full context of a sentence. Masked language modeling is the pre-training objective of models such as BERT and RoBERTa; GPT-2, by contrast, is trained with next word prediction.


<p align="center">
<img src="images/MLM.png" alt="" style="width: 50%; height: 50%"/>
</p>
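
The masking step can be sketched as follows. This is a simplified illustration, assuming every selected token is replaced by `[MASK]`; BERT's actual recipe also sometimes keeps the selected token unchanged or swaps in a random token. The function name, the `None` convention for ignored positions, and the example sentence are our own choices.

```python
import random


def mask_tokens(tokens, mask_probability=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens and record the originals as prediction targets."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same masks
    masked = []
    labels = []
    for token in tokens:
        if rng.random() < mask_probability:
            masked.append(mask_token)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # this position is ignored by the loss
    return masked, labels


tokens = "o produto chegou rápido e funciona muito bem".split()
masked, labels = mask_tokens(tokens, mask_probability=0.3)
print(masked)
print(labels)
```

The input to the model is `masked`, and the loss is computed only at positions where `labels` holds a token. As with next word prediction, no human annotation is needed.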

---

## Improving MLM with Whole Word Masking

Whole Word Masking is a refinement of Masked Language Modeling. In the standard approach, individual tokens, which may be subword fragments, are selected at random and hidden by a mask token. Whole Word Masking instead masks all the tokens of a complete word together, treating the word as a single semantic unit.

### Understanding Whole Word Masking

In Whole Word Masking, a random word within a given sentence is masked (concealed) in its entirety. The model's goal is then to predict the original word. This compels the model to rely on the context of the sentence when predicting the masked word.

Consider this illustration. We have the phrase `"Eu amo viajar para Cacimbinha"`. In this sentence, "Cacimbinha" is a proper noun, and maintaining the integrity of the full word is crucial for understanding the sentence's context. With traditional token-level masking, "Cacimbinha" is fragmented into several tokens and only some of them may be masked, so the model can lean on the visible fragments instead of the context, as seen below:

`['eu', 'am', '##o', 'via', '##jar', 'para', 'ca', '##ci', '##mb', '##in', '[MASK]']`

Here, "Cacimbinha" is broken down into several tokens and only the last one is masked, so the model finds it too easy to predict the masked token from the visible fragments. With Whole Word Masking, however, every token of "Cacimbinha" is masked together, obliging the model to use the sentence's context to predict the concealed word.

`['eu', 'am', '##o', 'via', '##jar', 'para', '[MASK]', '[MASK]', '[MASK]', '[MASK]', '[MASK]']`

In this scenario, the model has to grapple with the overall context – "Cacimbinha" being a place the speaker loves travelling to. This method bolsters the understanding of the complex relationship between words, subsequently improving comprehension of sentence structure and semantics.
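
The grouping logic behind Whole Word Masking can be sketched in a few lines. This is a minimal illustration, assuming WordPiece-style tokens in which `##` marks the continuation of a word. The function names are hypothetical, and the word to mask is passed in explicitly for reproducibility, whereas real implementations choose words at random.

```python
def group_wordpieces(tokens):
    """Group token positions so that each group covers one whole word."""
    groups = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and groups:
            groups[-1].append(index)  # continuation piece: same word as before
        else:
            groups.append([index])    # a new word starts here
    return groups


def mask_whole_word(tokens, word_index, mask_token="[MASK]"):
    """Replace every piece of the chosen word with the mask token."""
    groups = group_wordpieces(tokens)
    masked = list(tokens)
    for index in groups[word_index]:
        masked[index] = mask_token
    return masked


tokens = ['eu', 'am', '##o', 'via', '##jar', 'para', 'ca', '##ci', '##mb', '##in', '##ha']

print(group_wordpieces(tokens))
# [[0], [1, 2], [3, 4], [5], [6, 7, 8, 9, 10]]

print(mask_whole_word(tokens, word_index=4))
# ['eu', 'am', '##o', 'via', '##jar', 'para', '[MASK]', '[MASK]', '[MASK]', '[MASK]', '[MASK]']
```

Masking word index 4 hides all five pieces of "cacimbinha" at once, so none of its fragments remain visible as a hint.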

### Advantages of Whole Word Masking

Whole Word Masking's importance is particularly palpable when utilized for tasks that entail language translation and text generation; tasks where preserving context and maintaining meaningful sentence structure are of utmost importance.

By masking complete words rather than fractions of words or random tokens, the NLP model pays greater attention to elaborate word relationships as well as broader syntactic and semantic structures within the sentences. As a result, it's able to provide more precise predictions about the masked words, which ultimately results in superior translations or generated texts.

Whole Word Masking may also improve the model's ability to handle rare words that are split into many subword pieces. Additionally, it can deal better with complicated phrases, idioms, and multi-word expressions that might otherwise be troublesome to process.

### Collaboration of Whole Word Masking and Tokenization Strategies

Pairing effective tokenization strategies with mechanisms like Whole Word Masking provides the necessary tools to build stable and efficient NLP models. These models can capture deeper linguistic nuances and offer more precise translations and text generation.

Whole Word Masking combined with thoughtful tokenization forms the backbone of fostering successful NLP models. They collectively pave the path for greater accuracy in comprehending and translating human languages, making machine interaction increasingly natural and effective.

# Understanding Transfer-Learning through Practice

We will now look at a practical application of transfer learning using the renowned [HuggingFace Transformers](https://huggingface.co/transformers/) library. This is an advanced tool for Natural Language Processing (NLP) that comes equipped with pre-trained models and fine-tuning utilities, and supports numerous NLP tasks.

Our task at hand is to predict whether product reviews in Portuguese are positive or negative by using a simple sentiment analysis dataset. The goal here is to demonstrate how transfer learning can be leveraged to create a sentiment analysis model with minimal data available.



## Setting the Scene

Let's suppose you work as a Machine Learning engineer at `Americanas`. The company is facing a financial challenge due to a 40-billion-dollar debt, which prevents it from paying to label a large dataset to train a sentiment analysis model. However, it does possess a modest dataset of 600 labeled product reviews. The objective is to use this small dataset to train a model that can accurately predict whether a given review is positive or negative.

Alongside the labeled examples, the company also has a massive unlabeled dataset of roughly 100,000 reviews. As unlabeled data requires no annotation effort, it can be used extensively without concerns about budget constraints.

---

## Our Approach

We'll adopt the following sequence of steps to address this scenario:

### Option 1 - Develop a Classifier from Scratch

To set the baseline, we'll start by creating a transformer classifier from scratch, without relying on any pre-trained models. This method, prevalent before 2018, will help us understand the significance of transfer learning in text classification.

By building a classifier from the ground up, we can gain valuable insights into the challenges and limitations of training models with limited labeled data. This approach will require significant computational resources and time, as the model will need to learn all the necessary features and patterns from our small dataset of 600 examples.

### Option 2 - Use a Pretrained Language Model

We'll employ a pretrained transformer model for Portuguese known as **BERTimbau** and train a classifier with our limited labeled dataset (600 examples). This will serve as our initial transfer learning model. Despite the small amount of labeled data, this model will likely outperform the baseline due to the massive volume of Portuguese text that the language model was trained on.

By employing the pretrained BERTimbau model, we can take advantage of the rich linguistic knowledge it has acquired from extensive training on large-scale Portuguese text corpora. This transfer learning approach allows us to build upon the model's existing understanding of the language, enabling it to better capture the nuances and semantics of our specific classification task.

### Option 3 - Use a Pretrained Language Model and Fine-tune it on Domain-Specific Text

For this step, we'll need to:

1. **Fine-tune the Pretrained Language Model:** We'll fine-tune the BERTimbau pretrained language model on our 100,000 unlabeled reviews, creating our second transfer learning model. Although this adaptation doesn't directly aid classification, it enables the model to comprehend the peculiarities of our domain (product reviews) and transfer this understanding to the classifier.

By fine-tuning BERTimbau on our domain-specific text, we can further adapt the model to the unique characteristics and vocabulary of product reviews. This process allows the model to capture the subtle nuances, sentiment expressions, and language patterns commonly found in customer feedback, ultimately enhancing its ability to accurately classify reviews.

2. **Train a Classifier with the Fine-tuned Pretrained Language Model:** Lastly, using the domain fine-tuned language model, we'll train another classifier on our small labeled dataset of 600 examples. This step takes full advantage of transfer learning, because the classifier can now rely on domain-specific features (product reviews) that a general language model does not ordinarily capture.

By training the classifier on top of the fine-tuned language model, we can capitalize on the model's deep understanding of both the Portuguese language and the product review domain. This combination of general linguistic knowledge and domain-specific insights will enable the classifier to make more accurate predictions, even with a limited amount of labeled data.

> We'll then compare the performance of these three approaches using a holdout dataset containing 41,354 reviews. This will simulate what would happen if we were to deploy our model in a real-world scenario.

By evaluating the models on a substantial holdout dataset, we can assess their generalization capabilities and robustness in handling unseen data. This comparison will provide valuable insights into the effectiveness of each approach and help us determine the most suitable model for real-world deployment.

Through this systematic comparison of transfer learning strategies, you'll see the power of pretrained language models and domain-specific fine-tuning in text classification tasks, particularly when labeled data is scarce. By comparing the performance of these approaches, we can make informed decisions about best practices for developing accurate and reliable sentiment analysis models for product reviews.

### Summary of Steps

- **Option 1:** Develop a classifier from scratch.
- **Option 2:** Use a pretrained language model (BERTimbau) and train a classifier on top of it.
- **Option 3:** Fine-tune the pretrained language model on domain-specific text, then train a classifier on the labeled dataset.

## Preparing ourselves for the task

In [None]:
import pandas as pd

df = pd.read_parquet("data/dataset_reviews.parquet")
df

| | source | review_id | text | label | split |
|---|---|---|---|---|---|
| 20938 | b2w | 47dd8d461db193a7050331933268cc536925fa3fb3fc21... | `nao gostei _##_ quando faz duas xicaras transb...` | 0 | train |
| 112876 | b2w | a922bcb60952b82664bada9c66c461603d43fce6e9a4fc... | `otimo aparelho _##_ nao e barulhenta... simple...` | 1 | train |
| 30813 | olist | fa810fdb06c8ca30b4b596a6f5a13aa1 | `_##_ meu produto n chegou entao n vou falar nada` | 0 | train |
| 108072 | olist | a898722ed49b88ab4b45b35776ad1180 | `nota 10!!! _##_ otimo site! entrega no prazo! ...` | 1 | train |
| 22060 | b2w | a121941db8eedaa20b5e0e16e38ed9025d48e56f890018... | `muito bom. _##_ excelente produto, com boa per...` | 1 | train |
| ... | ... | ... | ... | ... | ... |
| 16421 | b2w | c7c4904a68cd57abb0ff33bed6e5a6e0faa8c137f15d8c... | `produto otimo _##_ deixa seus cabelos lisos e ...` | 1 | valid |
| 71452 | b2w | c435fb3bd2c9eaf371dcb8d29fdd6fcede54f45427054e... | `produto riscado e com avarias _##_ o guarda ro...` | 0 | valid |
| 135326 | b2w | 9ab66ba65f4036d4bdbcaf7fc2cc883ec9a979f44b8dd8... | `otimo produto _##_ celular de otima qualidade....` | 1 | valid |
| 995 | olist | b12d5e7fb052eca4b86fd61f42a73ac7 | `_##_ nao recebi o produto` | 0 | valid |


In [None]:
# Select 600 random samples from the training data where 'split' column is 'train'. The idea here is to simulate having only 600 labeled samples and lots of unlabeled samples.
df_train_labeled = df.query('split == "train"').sample(600, random_state=271828)

# Select the remaining training data and drop the 'label' column
df_train_unlabeled = (
    df.query('split == "train"').drop(df_train_labeled.index).drop(columns=["label"])
)

# Select all data where 'split' column is not 'train' (validation data)
df_valid = df.query('split != "train"')

# Display the shapes of the labeled training data, unlabeled training data, and validation data
df_train_labeled.shape, df_train_unlabeled.shape, df_valid.shape

((600, 5), (95891, 4), (41354, 5))

In [3]:
import datasets

# Convert the labeled training DataFrame to a Hugging Face Dataset
dataset_train_labeled = datasets.Dataset.from_pandas(df_train_labeled)

# Convert the unlabeled training DataFrame to a Hugging Face Dataset
dataset_train_unlabeled = datasets.Dataset.from_pandas(df_train_unlabeled)

# Convert the validation DataFrame to a Hugging Face Dataset
dataset_valid = datasets.Dataset.from_pandas(df_valid)

In [4]:
dataset_train_labeled

Dataset({
    features: ['source', 'review_id', 'text', 'label', 'split', '__index_level_0__'],
    num_rows: 600
})

In [None]:
from transformers import AutoTokenizer

# Load the pre-trained BERT tokenizer for Portuguese with specific settings
tokenizer = AutoTokenizer.from_pretrained(
    "neuralmind/bert-base-portuguese-cased",
    use_fast=True,
    truncation=True,
    padding=True,
    max_length=512,
)


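# Define a preprocessing function to tokenize the text data
# Note: padding=True pads each `map` batch to its longest sequence, so the stored examples are already padded;
# omitting it would leave padding entirely to the data collator at training time
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)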


# Apply the preprocessing function to the labeled training dataset
dataset_train_tokenized_classification = dataset_train_labeled.map(
    preprocess_function, batched=True
)

# Apply the preprocessing function to the validation dataset
dataset_valid_tokenized_classification = dataset_valid.map(
    preprocess_function, batched=True
)


In [6]:
from transformers import DataCollatorWithPadding

# Create a data collator that dynamically pads the inputs to the maximum length in the batch
# This ensures that all inputs in a batch have the same length, which is required for efficient processing
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
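
To make dynamic padding concrete, here is a minimal pure-Python sketch of the idea. It is not the `DataCollatorWithPadding` implementation: the helper `pad_batch`, the toy token ids, and the pad id of 0 are all assumptions made for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in THIS batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical batches: the padded width depends only on each batch's content
short_batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
long_batch = pad_batch([[101, 7, 102], [101] + [7] * 10 + [102]])

print(len(short_batch["input_ids"][0]))  # width of the first batch
print(len(long_batch["input_ids"][0]))  # width of the second batch
```

A batch of short reviews is therefore never padded up to a global maximum such as 512, which saves computation when most texts are short.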

In [None]:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")


# Define a function to compute evaluation metrics
def compute_metrics(eval_pred):
    # Unpack the predictions and labels from the evaluation tuple
    predictions, labels = eval_pred

    # Convert the raw model outputs (logits) to predicted class indices
    predictions = np.argmax(predictions, axis=1)

    # Compute and return the accuracy metric using the predicted and true labels
    return accuracy.compute(predictions=predictions, references=labels)
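
The metric function above takes the `argmax` of the raw model outputs directly. The sketch below, using made-up logits and labels, illustrates why no softmax is needed: softmax preserves the ordering of the scores, so the predicted class is the same either way. The helper names are hypothetical and the example uses only the standard library.

```python
import math


def softmax(row):
    """Turn a list of logits into probabilities (numerically stable)."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the list."""
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical logits for four reviews (columns: negative, positive)
logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, -0.2], [1.5, 1.4]]
labels = [0, 1, 0, 0]

pred_from_logits = [argmax(row) for row in logits]
pred_from_probs = [argmax(softmax(row)) for row in logits]

# Accuracy is the fraction of predictions that match the true labels
accuracy_value = sum(p == y for p, y in zip(pred_from_logits, labels)) / len(labels)
print(pred_from_logits, pred_from_probs, accuracy_value)
```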

## Option 1 - Develop a Classifier from Scratch

For simplicity, we'll use the same tokenizer from `neuralmind/bert-base-portuguese-cased` for all three approaches. This will ensure that the tokenization process is consistent across all models.

In [None]:
from pathlib import Path

# Define the path where the language model will be saved
path_to_save_lm = Path("./outputs/nlp_deep_learning/bert_masked_lm")

# Create the directory (and any necessary parent directories) if it doesn't already exist
path_to_save_lm.mkdir(parents=True, exist_ok=True)

In [None]:
# Initialize a BERT model that is NOT pretrained (random weights)

# Import only the classes needed to build the model from a configuration
from transformers import BertConfig, BertForSequenceClassification
import torch

# Define the configuration for the BERT model
config = BertConfig(
    attention_probs_dropout_prob=0.1,  # Dropout probability for the attention probabilities
    directionality="bidi",  # Model directionality
    hidden_act="gelu",  # Activation function to use in the hidden layers
    hidden_dropout_prob=0.1,  # Dropout probability for the hidden layers
    hidden_size=768,  # Size of the hidden layers
    initializer_range=0.02,  # Range of the weight initializer
    intermediate_size=3072,  # Size of the "intermediate" (i.e., feed-forward) layer
    layer_norm_eps=1e-12,  # Epsilon for the layer normalization layers
    max_position_embeddings=512,  # Maximum number of position embeddings to use
    model_type="bert",  # Type of the model
    num_attention_heads=12,  # Number of attention heads
    num_hidden_layers=12,  # Number of hidden layers
    output_past=True,  # Whether or not to output the past hidden states
    pad_token_id=0,  # The ID of the padding token
    pooler_fc_size=768,  # Size of the pooling fully connected layer
    pooler_num_attention_heads=12,  # Number of attention heads for the pooling layer
    pooler_num_fc_layers=3,  # Number of fully connected layers in the pooling layer
    pooler_size_per_head=128,  # Size per head in the pooling layer
    pooler_type="first_token_transform",  # Type of pooling to use
    position_embedding_type="absolute",  # Type of position embedding to use
    type_vocab_size=2,  # Size of the type vocabulary
    use_cache=True,  # Whether or not to use caching
    vocab_size=29794,  # Size of the vocabulary
    num_labels=2,  # Number of labels for the classification task
)

# Initialize a BERT model for sequence classification with the defined configuration
# This is a randomly initialized model with a linear layer on top of the BERT model for classification
model_random = BertForSequenceClassification(config)

# Initialize all weights of the BERT model with a random normal distribution
model_random.bert.init_weights()
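
To get a feeling for how much this randomly initialized model has to learn, we can estimate its parameter count from the configuration values above. This is a back-of-the-envelope sketch that assumes the standard BERT layout (a bias on every linear layer, one LayerNorm after the embeddings, and two LayerNorms per encoder layer); it does not inspect the actual model object.

```python
# Values copied from the BertConfig above
hidden, layers, vocab = 768, 12, 29794
max_positions, type_vocab, intermediate, num_labels = 512, 2, 3072, 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by one LayerNorm
embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value and output projections, followed by one LayerNorm
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers (expand, then project back), followed by one LayerNorm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total = embeddings + encoder + pooler + classifier
print(f"{total:,} parameters, about {total // 600:,} per labeled example")
```

With roughly a hundred million weights and only 600 labeled reviews, the from-scratch model has very little evidence per parameter, which is the limitation this baseline is meant to expose.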

In [None]:
# Name of the base checkpoint, without the organization prefix
# This will be used to name the output directory for the trained model
model_name = "bert-base-portuguese-cased"

In [None]:
from transformers import Trainer, TrainingArguments

# Define the training arguments
training_args_random = TrainingArguments(
    output_dir=path_to_save_lm
    / f"{model_name}-random_model",  # Output directory for the trained model
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=32,  # Batch size for training. May need to be lowered for free GPUs
    per_device_eval_batch_size=256,  # Batch size for evaluation. May need to be lowered for free GPUs
    num_train_epochs=5,  # Number of training epochs
    weight_decay=0.01,  # Weight decay
    bf16=True,  # Use bf16 precision. May need to be changed to fp16 for free GPUs
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    logging_strategy="steps",  # Log at a fixed step interval (see logging_steps)
    save_strategy="epoch",  # Save a checkpoint at the end of each epoch
    eval_steps=1,  # Only used when eval_strategy="steps"; has no effect here
    save_steps=1,  # Only used when save_strategy="steps"; has no effect here
    logging_steps=10,  # Log the training loss every 10 steps
    load_best_model_at_end=True,  # Load the best model at the end of training
    seed=271828,  # Random seed for reproducibility
)

# Initialize the Trainer
trainer_random = Trainer(
    model=model_random,  # The model to train
    args=training_args_random,  # The training arguments
    train_dataset=dataset_train_tokenized_classification,  # The training dataset
    eval_dataset=dataset_valid_tokenized_classification,  # The evaluation dataset
    processing_class=tokenizer,  # The tokenizer
    data_collator=data_collator,  # The data collator
    compute_metrics=compute_metrics,  # The function to compute the metrics
)

# Train the model
trainer_random.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.8069 | 0.683252 | 0.623495 |
| 2 | 0.6569 | 0.619714 | 0.736567 |
| 3 | 0.5762 | 0.537847 | 0.745007 |
| 4 | 0.523 | 0.522069 | 0.764448 |
| 5 | 0.4918 | 0.487131 | 0.77371 |




TrainOutput(global_step=50, training_loss=0.6109645938873292, metrics={'train_runtime': 806.0846, 'train_samples_per_second': 3.722, 'train_steps_per_second': 0.062, 'total_flos': 369999921600000.0, 'train_loss': 0.6109645938873292, 'epoch': 5.0})
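
The `global_step=50` reported above can be checked with a little arithmetic. The sketch below counts optimizer steps from the dataset size, the per-device batch size, and the number of epochs. The number of devices is not stated anywhere in this notebook; `num_devices=2` is an assumption chosen only because it reproduces the logged step count, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, per_device_batch, epochs, num_devices=1):
    """Optimizer steps when each step consumes one batch per device."""
    effective_batch = per_device_batch * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


one_device = optimizer_steps(600, 32, 5)
two_devices = optimizer_steps(600, 32, 5, num_devices=2)  # assumed device count

print(one_device, two_devices)
```

Under the two-device assumption there are 10 steps per epoch, so `logging_steps=10` produces exactly one training-loss entry per epoch, which matches the table.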

In [12]:
results_random = trainer_random.evaluate()
results_random



{'eval_loss': 0.48713117837905884,
 'eval_accuracy': 0.7737099192339314,
 'eval_runtime': 136.2978,
 'eval_samples_per_second': 303.409,
 'eval_steps_per_second': 0.594,
 'epoch': 5.0}

## Option 2 - Use a Pretrained Language Model

In [13]:
import gc

# Set the random model to None to free up memory
model_random = None

# Set the trainer for the random model to None to free up memory
trainer_random = None

# Collect garbage to free up memory
gc.collect()

# Empty the CUDA cache to free up GPU memory
torch.cuda.empty_cache()

In [None]:
# Import only the classes needed to load the pretrained model
from transformers import AutoConfig, BertForSequenceClassification
import torch

# Load the configuration for the pre-trained BERT model
# This configuration includes model settings such as the number of hidden layers, attention heads, etc.
config = AutoConfig.from_pretrained("neuralmind/bert-base-portuguese-cased")

# Set the number of labels for the classification task (binary classification in this case)
config.num_labels = 2

# Initialize a pre-trained BERT model for sequence classification
# This model includes a linear layer on top of the BERT model for classification tasks
model_pretrained = BertForSequenceClassification.from_pretrained(
    "neuralmind/bert-base-portuguese-cased", config=config
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import Trainer, TrainingArguments

# Define the training arguments for the pretrained model
training_args_pretrained = TrainingArguments(
    output_dir=path_to_save_lm
    / f"{model_name}-pretrained_model",  # Output directory for the trained model
    learning_rate=2e-5,  # Learning rate for the optimizer
    per_device_train_batch_size=32,  # Batch size for training. May need to be lowered for free GPUs
    per_device_eval_batch_size=256,  # Batch size for evaluation. May need to be lowered for free GPUs
    num_train_epochs=5,  # Number of training epochs
    weight_decay=0.01,  # Weight decay for regularization
    bf16=True,  # Use bf16 precision. May need to be changed to fp16 for free GPUs
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    logging_strategy="steps",  # Log at a fixed step interval (see logging_steps)
    save_strategy="epoch",  # Save a checkpoint at the end of each epoch
    eval_steps=1,  # Only used when eval_strategy="steps"; has no effect here
    save_steps=1,  # Only used when save_strategy="steps"; has no effect here
    logging_steps=10,  # Log the training loss every 10 steps
    load_best_model_at_end=True,  # Load the best model at the end of training
    seed=271828,  # Random seed for reproducibility
)

# Initialize the Trainer for the pretrained model
trainer_pretrained = Trainer(
    model=model_pretrained,  # The pretrained model to train
    args=training_args_pretrained,  # The training arguments
    train_dataset=dataset_train_tokenized_classification,  # The training dataset
    eval_dataset=dataset_valid_tokenized_classification,  # The evaluation dataset
    processing_class=tokenizer,  # The tokenizer
    data_collator=data_collator,  # The data collator for dynamic padding
    compute_metrics=compute_metrics,  # The function to compute the evaluation metrics
)

# Train the pretrained model
trainer_pretrained.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5428 | 0.415471 | 0.880278 |
| 2 | 0.3521 | 0.261947 | 0.921604 |
| 3 | 0.2279 | 0.22344 | 0.92748 |
| 4 | 0.1745 | 0.192697 | 0.933404 |
| 5 | 0.1415 | 0.1874 | 0.934565 |




TrainOutput(global_step=50, training_loss=0.28772231817245486, metrics={'train_runtime': 785.4887, 'train_samples_per_second': 3.819, 'train_steps_per_second': 0.064, 'total_flos': 369999921600000.0, 'train_loss': 0.28772231817245486, 'epoch': 5.0})

In [16]:
# Evaluate the pretrained model on the validation dataset
# This will return a dictionary containing evaluation metrics such as loss and accuracy
results_pretrained = trainer_pretrained.evaluate()

# Display the evaluation results
results_pretrained



{'eval_loss': 0.18739964067935944,
 'eval_accuracy': 0.9345649755767278,
 'eval_runtime': 126.9486,
 'eval_samples_per_second': 325.754,
 'eval_steps_per_second': 0.638,
 'epoch': 5.0}
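
Accuracy differences are easier to interpret as error rates. The short calculation below uses only the two `eval_accuracy` values reported so far (Option 1 and Option 2) and the holdout size of 41,354 reviews to express the gain as a relative reduction in errors. The variable names are ours.

```python
holdout_size = 41354
acc_scratch = 0.7737099192339314  # Option 1: trained from scratch
acc_pretrained = 0.9345649755767278  # Option 2: pretrained BERTimbau

# Error rate is the complement of accuracy
err_scratch = 1 - acc_scratch
err_pretrained = 1 - acc_pretrained

# Share of the baseline's errors removed by starting from pretrained weights
relative_reduction = (err_scratch - err_pretrained) / err_scratch

# Approximate number of misclassified holdout reviews for each model
misclassified_scratch = round(err_scratch * holdout_size)
misclassified_pretrained = round(err_pretrained * holdout_size)

print(f"error rate: {err_scratch:.1%} -> {err_pretrained:.1%}")
print(f"relative error reduction: {relative_reduction:.1%}")
print(f"misclassified reviews: {misclassified_scratch} -> {misclassified_pretrained}")
```

Both models saw the same 600 labeled examples and the same training settings, so the difference comes from the starting weights alone.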

## Option 3 - Use a Pretrained Language Model and Fine-tune it on Domain-Specific Text

### 3.1 Fine-tune the Pretrained Language Model

In [17]:
import gc
import torch

# Set the pretrained model to None to free up memory
model_pretrained = None

# Set the trainer for the pretrained model to None to free up memory
trainer_pretrained = None

# Collect garbage to free up memory
gc.collect()

# Empty the CUDA cache to free up GPU memory
torch.cuda.empty_cache()

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

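# Load only the 'text' column from the Parquet file. Labels are never read, so the task is self-supervised masked language modeling (MLM)
# Note: this reads every row of the file, including the reviews in the 600-example labeled sample and in the holdout split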
df_unlabeled = pd.read_parquet("data/dataset_reviews.parquet", columns=["text"])

# Split the dataset into training and validation sets
# Use 10% of the data for validation and the remaining 90% for training
# Set a random seed for reproducibility
df_unlabeled_train, df_unlabeled_valid = train_test_split(
    df_unlabeled, test_size=0.10, random_state=271828
)

In [None]:
import datasets

# Convert the unlabeled training DataFrame to a Hugging Face Dataset
dataset_unlabeled_train = datasets.Dataset.from_pandas(df_unlabeled_train)

# Convert the unlabeled validation DataFrame to a Hugging Face Dataset
dataset_unlabeled_valid = datasets.Dataset.from_pandas(df_unlabeled_valid)

In [20]:
dataset_unlabeled_train

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 124060
})

In [21]:
dataset_unlabeled_valid

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 13785
})
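
The row counts above are worth reconciling with the splits created at the beginning of this section. The check below uses only numbers printed in this notebook and suggests that the language-modeling corpus contains every review in the file, including the text (but never the labels) of the holdout reviews.

```python
import math

# Shapes printed when the labeled, unlabeled and holdout sets were created
labeled, unlabeled, holdout = 600, 95891, 41354
# Row counts of the two language-modeling datasets shown above
mlm_train, mlm_valid = 124060, 13785

all_reviews = labeled + unlabeled + holdout
mlm_total = mlm_train + mlm_valid

print(all_reviews, mlm_total)
# A 10% validation share of the full file, rounded up
print(math.ceil(0.10 * all_reviews))
```

Since no labels are involved in masked language modeling, this is not label leakage. Still, if you want the holdout set to simulate completely unseen data, a stricter variant would fine-tune the language model only on `df_train_unlabeled`.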

In [None]:
from pathlib import Path

# Define the path where the language model will be saved
path_to_save_lm = Path("./outputs/nlp_deep_learning/bert_masked_lm")

# Create the directory (and any necessary parent directories) if it doesn't already exist
path_to_save_lm.mkdir(parents=True, exist_ok=True)

In [None]:
from transformers import AutoTokenizer

# Load the pre-trained BERT tokenizer for Portuguese without truncation
# This tokenizer will not truncate the input text, which means it will keep the full length of the text
# The 'use_fast=True' parameter enables the fast version of the tokenizer for better performance
tokenizer_no_truncation = AutoTokenizer.from_pretrained(
    "neuralmind/bert-base-portuguese-cased", use_fast=True, truncation=False
)



In [None]:
def tokenize_function_no_truncation(examples):
    """
    Tokenizes the input text in the given examples using the tokenizer object.

    Args:
    - examples: A dictionary containing the input text to be tokenized.

    Returns:
    - A dictionary containing the tokenized input text.
    """
    # Tokenize the input text without truncation
    result = tokenizer_no_truncation(examples["text"])

    # If using the fast tokenizer, also include word IDs for each token
    if tokenizer_no_truncation.is_fast:
        result["word_ids"] = [
            result.word_ids(i) for i in range(len(result["input_ids"]))
        ]

    return result


# Tokenize the unlabeled training dataset
# This step converts the text data into numerical representations (tokens) that the model can process
# The 'batched=True' parameter processes the data in batches for efficiency
# The 'remove_columns' parameter removes the original text and index columns from the dataset
dataset_train_tokenized_mlm = dataset_unlabeled_train.map(
    tokenize_function_no_truncation,
    batched=True,
    remove_columns=["text", "__index_level_0__"],
)

# Tokenize the unlabeled validation dataset
# Similar to the training dataset, this step converts the text data into tokens
dataset_valid_tokenized_mlm = dataset_unlabeled_valid.map(
    tokenize_function_no_truncation,
    batched=True,
    remove_columns=["text", "__index_level_0__"],
)


In [25]:
dataset_train_tokenized_mlm

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
    num_rows: 124060
})
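
The `word_ids` column records, for every token, which word of the original text it came from. The sketch below shows how that mapping groups subword tokens back into words, which is the information a Whole Word Masking strategy needs. The tokens and ids are a hypothetical illustration, not actual BERTimbau tokenizer output, and `group_by_word` is our own helper.

```python
from collections import defaultdict

# Hypothetical tokenization: the first word is split into two subword pieces
tokens = ["[CLS]", "ot", "##imo", "produto", "[SEP]"]
# None marks special tokens that belong to no word
word_ids = [None, 0, 0, 1, None]


def group_by_word(word_ids):
    """Map each word index to the token positions that spell it."""
    groups = defaultdict(list)
    for position, word_id in enumerate(word_ids):
        if word_id is not None:
            groups[word_id].append(position)
    return dict(groups)


groups = group_by_word(word_ids)

# Masking a whole word means masking all of its token positions together
positions_for_first_word = groups[0]
masked_tokens = [
    "[MASK]" if i in positions_for_first_word else tok for i, tok in enumerate(tokens)
]
print(groups)
print(masked_tokens)
```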

In [None]:
def group_texts(examples):
    """
    This function groups together a set of texts as contiguous text of fixed length (chunk_size).
    It's useful for training masked language models.

    Args:
    - examples: A dictionary containing the examples to group. Each key corresponds to a feature,
      and each value is a list of lists of tokens.

    Returns:
    - A dictionary containing the grouped examples. Each key corresponds to a feature,
      and each value is a list of lists of tokens.
    """
    # Concatenate all texts into a single list for each feature
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}

    # Compute the total length of the concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # Adjust the total length to be a multiple of chunk_size, dropping the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size

    # Split the concatenated texts into chunks of size chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }

    # Create a new 'labels' column that is a copy of the 'input_ids' column
    result["labels"] = result["input_ids"].copy()

    return result


# Define the chunk size for grouping texts
chunk_size = 512

# Apply the group_texts function to the tokenized training dataset
# This step groups the tokenized texts into chunks of size chunk_size
dataset_train_tokenized_mlm = dataset_train_tokenized_mlm.map(
    group_texts,
    batched=True,
)

# Apply the group_texts function to the tokenized validation dataset
# This step groups the tokenized texts into chunks of size chunk_size
dataset_valid_tokenized_mlm = dataset_valid_tokenized_mlm.map(
    group_texts,
    batched=True,
)
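
To see what the grouping step does, here is a self-contained toy run of the same concatenate-and-chunk logic with a chunk size of 4 instead of 512. The function is a simplified re-implementation for illustration (it takes `chunk_size` as an argument), and the token ids are made up.

```python
def group_into_chunks(examples, chunk_size):
    """Concatenate all sequences, then cut them into fixed-size chunks."""
    # Flatten the list of sequences of each feature into one long list
    concatenated = {
        key: [token for seq in seqs for token in seq] for key, seqs in examples.items()
    }
    # Keep only as many tokens as fit into complete chunks
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        key: [seq[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for key, seq in concatenated.items()
    }
    # Labels start as a copy of the inputs; masking happens later in the collator
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Three hypothetical reviews of different lengths (13 tokens in total)
toy = {"input_ids": [[101, 5, 6, 102], [101, 7, 102], [101, 8, 9, 10, 11, 102]]}
chunks = group_into_chunks(toy, chunk_size=4)

for chunk in chunks["input_ids"]:
    print(chunk)
```

Note that chunks can span review boundaries and that the leftover tokens at the end (here a single token) are dropped. Because `map` runs the function once per batch, a remainder is dropped for every batch rather than once for the whole dataset.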


In [None]:
from transformers import DataCollatorForLanguageModeling

# Create a data collator for masked language modeling (MLM)
# This collator dynamically selects 15% of the tokens in each batch as prediction targets
# By default, most selected tokens are replaced with the special [MASK] token, while a small share are swapped for a random token or left unchanged
data_collator_mlm = DataCollatorForLanguageModeling(
    tokenizer=tokenizer_no_truncation,  # The tokenizer used to process the input text
    mlm_probability=0.15,  # The probability of masking a token in the input text
)
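
The masking procedure can be sketched in a few lines of plain Python. This is an illustration of the 80/10/10 rule from the original BERT recipe, which, to our knowledge, is what the collator applies by default; it is not the library's code. The function `mask_tokens` and the special token ids are hypothetical, and `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
import random

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss
CLS_ID, SEP_ID, MASK_ID = 101, 102, 103  # hypothetical special token ids
VOCAB_SIZE = 29794  # vocabulary size used in this notebook


def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = rng or random.Random(0)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Special tokens are never selected; other tokens with probability 15%
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


rng = random.Random(271828)
input_ids = [CLS_ID] + [rng.randrange(1000, 2000) for _ in range(1000)] + [SEP_ID]
masked_ids, labels = mask_tokens(input_ids, MASK_ID, VOCAB_SIZE, {CLS_ID, SEP_ID}, rng=rng)

selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{selected} of {len(input_ids)} tokens selected as prediction targets")
```

Because the selection is redone every time a batch is built, the model sees a different masking pattern for the same chunk in each epoch.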

In [28]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Define the model checkpoint for the pre-trained BERT model
# This checkpoint corresponds to a BERT model pre-trained on Portuguese text
model_checkpoint = "neuralmind/bert-base-portuguese-cased"

# Load the pre-trained BERT model for masked language modeling (MLM)
# This model will be used to predict masked tokens in the input text
model_mlm = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
from transformers import TrainingArguments

# Define the batch size for training and evaluation
batch_size = 32

# Extract the model name from the model checkpoint path
# This will be used to name the output directory for the trained model
model_name = model_checkpoint.split("/")[-1]

# Define the training arguments for the masked language model (MLM)
training_args_mlm = TrainingArguments(
    output_dir=path_to_save_lm
    / f"{model_name}-finetuned-mlm",  # Output directory for the trained model
    overwrite_output_dir=True,  # Overwrite the output directory if it already exists
    learning_rate=5e-5,  # Learning rate for the optimizer
    weight_decay=0.01,  # Weight decay for regularization
    per_device_train_batch_size=batch_size,  # Batch size for training. May need to be lowered for free GPUs
    per_device_eval_batch_size=batch_size,  # Batch size for evaluation. May need to be lowered for free GPUs
    bf16=True,  # Use bf16 precision. May need to be changed to fp16 for free GPUs
    num_train_epochs=20,  # Number of training epochs
    save_total_limit=1,  # Limit the total amount of checkpoints and delete the older ones
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    logging_strategy="steps",  # Log at a fixed step interval (see logging_steps)
    save_strategy="epoch",  # Save a checkpoint at the end of each epoch
    eval_steps=1,  # Only used when eval_strategy="steps"; has no effect here
    save_steps=1,  # Only used when save_strategy="steps"; has no effect here
    logging_steps=10,  # Log the training loss every 10 steps
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model="eval_loss",  # Use the evaluation loss to determine the best model
    greater_is_better=False,  # Lower evaluation loss is better
    gradient_accumulation_steps=2,  # Number of steps to accumulate gradients before updating the model parameters
    seed=271828,  # Random seed for reproducibility
)

In [None]:
from transformers import Trainer

# Initialize the Trainer for the masked language model (MLM)
trainer_mlm = Trainer(
    model=model_mlm,  # The pre-trained BERT model for masked language modeling
    args=training_args_mlm,  # The training arguments defined earlier
    train_dataset=dataset_train_tokenized_mlm,  # The tokenized training dataset
    eval_dataset=dataset_valid_tokenized_mlm,  # The tokenized validation dataset
    data_collator=data_collator_mlm,  # The data collator for dynamic masking during training
    processing_class=tokenizer_no_truncation,  # The tokenizer used to process the input text
)

In [31]:
trainer_mlm.train()



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | 1.3482 | 1.253211 |
| 2 | 1.2342 | 1.16085 |
| 4 | 1.1685 | 1.113159 |
| 6 | 1.1298 | 1.088677 |
| 8 | 1.1013 | 1.074122 |
| 10 | 1.0811 | 1.038989 |
| 12 | 1.0657 | 1.043078 |
| 14 | 1.0685 | 1.040189 |
| 16 | 1.05 | 1.016487 |
| 18 | 1.0531 | 1.039519 |



TrainOutput(global_step=1440, training_loss=1.1465044491820866, metrics={'train_runtime': 3951.4596, 'train_samples_per_second': 46.697, 'train_steps_per_second': 0.364, 'total_flos': 4.82434530357289e+16, 'train_loss': 1.1465044491820866, 'epoch': 19.862068965517242})
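
A common way to read a language-modeling loss is to convert it to perplexity, the exponential of the mean cross-entropy. The sketch below does this for the first and the lowest validation losses logged above, assuming the loss is measured in nats (natural logarithm), as is the default for PyTorch's cross-entropy. For a masked language model this is sometimes called pseudo-perplexity, since only masked positions are scored.

```python
import math

first_loss = 1.253211  # validation loss in the first logged row
best_loss = 1.016487  # lowest validation loss among the logged rows

# Perplexity = exp(mean cross-entropy over the predicted tokens)
ppl_first = math.exp(first_loss)
ppl_best = math.exp(best_loss)

print(f"perplexity: {ppl_first:.2f} -> {ppl_best:.2f}")
```

Informally, the fine-tuned model is about as uncertain as if it were choosing among roughly three equally likely candidates for each masked token in product-review text, and that uncertainty shrinks as training progresses.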

In [32]:
# Save the trained model
trainer_mlm.save_model(path_to_save_lm / f"{model_name}-finetuned-mlm")
tokenizer_no_truncation.save_pretrained(path_to_save_lm / f"{model_name}-finetuned-mlm")

('outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm/tokenizer_config.json',
 'outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm/special_tokens_map.json',
 'outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm/vocab.txt',
 'outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm/added_tokens.json',
 'outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm/tokenizer.json')

In [33]:
print(path_to_save_lm / f"{model_name}-finetuned-mlm")

outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm


### 3.2 Train a Classifier with Fine-tuned Pretrained Language Model

In [34]:
# Set the pretrained model to None to free up memory
model_pretrained = None

# Set the trainer for the pretrained model to None to free up memory
trainer_pretrained = None

# Set the masked language model (MLM) to None to free up memory
model_mlm = None

# Set the trainer for the masked language model (MLM) to None to free up memory
trainer_mlm = None

# Set the tokenizer to None to free up memory
tokenizer = None

# Collect garbage to free up memory
gc.collect()

# Empty the CUDA cache to free up GPU memory
torch.cuda.empty_cache()
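
Setting a name to `None` only releases an object once nothing else references it, which is why the cell above clears each model together with its trainer before calling `gc.collect()`. A minimal sketch of that behaviour using only the standard library (the `FakeModel` class and the `holder` dictionary are hypothetical stand-ins, not objects from this notebook):

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large model object."""

    def __init__(self):
        self.weights = [0.0] * 1_000


model = FakeModel()
holder = {"model": model}  # plays the role of a Trainer that keeps a reference
probe = weakref.ref(model)  # lets us check whether the object is still alive

# Dropping only the `model` name is not enough: the holder still references it
model = None
gc.collect()
alive_with_holder = probe() is not None

# Once the holder is dropped too, no references remain and the object is freed
holder = None
gc.collect()
alive_without_holder = probe() is not None

print(alive_with_holder, alive_without_holder)
```

In CPython the first check reports the object as alive and the second as freed. Only after the Python objects are released can `torch.cuda.empty_cache()` hand the cached GPU blocks back to the driver.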

In [None]:
# Import necessary modules from the transformers library
from transformers import AutoConfig, AutoTokenizer, BertForSequenceClassification
import torch

# Load the configuration for the pre-trained BERT model
# This configuration is loaded from the directory where the fine-tuned masked language model (MLM) is saved
config = AutoConfig.from_pretrained(
    "./outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm"
)

# Set the number of labels for the classification task
# In this case, we are setting it to 2 for binary classification
config.num_labels = 2

# Initialize a BERT model for sequence classification using the fine-tuned MLM model
# This model will have a linear layer on top of the BERT model for classification
model_ft = BertForSequenceClassification.from_pretrained(
    "./outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm",
    config=config,
)

# Load the tokenizer saved alongside the fine-tuned MLM model
# The tokenizer will be used to preprocess the input text for the BERT model
tokenizer = AutoTokenizer.from_pretrained(
    "./outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm",
    use_fast=True,  # Use the fast version of the tokenizer for better performance
    model_max_length=512,  # Maximum sequence length used when truncation is requested
)
# Note: truncation and padding are options of the tokenizer call
# (e.g. tokenizer(text, truncation=True)), not of from_pretrained.
# Here the datasets are already tokenized and the data collator handles padding.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./outputs/nlp_deep_learning/bert_masked_lm/bert-base-portuguese-cased-finetuned-mlm and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
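
The warning above lists four tensors that are missing from the MLM checkpoint and therefore start from random values: the pooler's dense layer and the classification head. A rough count of how many parameters that is, assuming the usual BERT-base hidden size of 768 (an assumption here; `config.hidden_size` holds the actual value):

```python
# Assumption: BERT-base dimensions (hidden size 768); check config.hidden_size to confirm
hidden_size = 768
num_labels = 2  # matches config.num_labels set in the cell above


def linear_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features


pooler_params = linear_params(hidden_size, hidden_size)  # bert.pooler.dense
classifier_params = linear_params(hidden_size, num_labels)  # classifier
new_params = pooler_params + classifier_params

print(f"pooler: {pooler_params:,}")
print(f"classifier: {classifier_params:,}")
print(f"newly initialized: {new_params:,}")
```

Under that assumption only about 0.6 million parameters are new. Everything else in the encoder is loaded from the fine-tuned MLM checkpoint, which is what the classifier training builds on.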


In [None]:
from transformers import Trainer, TrainingArguments

# Define the training arguments for the fine-tuned model
training_args_ft = TrainingArguments(
    output_dir=path_to_save_lm / f"{model_name}-ft_model",  # Output directory for the trained model
    learning_rate=2e-5,  # Learning rate for the optimizer
    per_device_train_batch_size=32,  # Batch size for training. May need to be lowered for free GPUs
    per_device_eval_batch_size=256,  # Batch size for evaluation. May need to be lowered for free GPUs
    num_train_epochs=5,  # Number of training epochs
    weight_decay=0.01,  # Weight decay for regularization
    bf16=True,  # Use bf16 precision. May need to be changed to fp16 for free GPUs
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    logging_strategy="steps",  # Log the training progress every `logging_steps` steps
    save_strategy="epoch",  # Save a checkpoint at the end of each epoch
    eval_steps=1,  # Only used when eval_strategy="steps"; has no effect here
    save_steps=1,  # Only used when save_strategy="steps"; has no effect here
    logging_steps=10,  # Log the training progress every 10 steps
    load_best_model_at_end=True,  # Load the best checkpoint at the end of training
    seed=271828,  # Random seed for reproducibility
)

# Initialize the Trainer for the fine-tuned model
trainer_ft = Trainer(
    model=model_ft,  # The fine-tuned model to train
    args=training_args_ft,  # The training arguments defined above
    train_dataset=dataset_train_tokenized_classification,  # The tokenized training dataset
    eval_dataset=dataset_valid_tokenized_classification,  # The tokenized evaluation dataset
    processing_class=tokenizer,  # The tokenizer used to preprocess the input text
    data_collator=data_collator,  # The data collator for dynamic padding and batching
    compute_metrics=compute_metrics,  # The function to compute the evaluation metrics
)

# Train the fine-tuned model
trainer_ft.train()

TrainOutput(global_step=50, training_loss=0.24602191686630248, metrics={'train_runtime': 730.0486, 'train_samples_per_second': 4.109, 'train_steps_per_second': 0.068, 'total_flos': 369999921600000.0, 'train_loss': 0.24602191686630248, 'epoch': 5.0})
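
The numbers in this `TrainOutput` can be cross-checked against the training arguments. The sketch below recovers the training set size from the reported throughput and compares the observed steps per epoch with what one or two devices would produce. The two-device reading is an inference from the arithmetic (assuming no gradient accumulation), not something the notebook states.

```python
import math

# Values copied from the TrainOutput and TrainingArguments above
train_runtime = 730.0486  # seconds
samples_per_second = 4.109
global_step = 50
num_epochs = 5
per_device_batch_size = 32
logging_steps = 10

# Throughput times runtime gives the total number of samples processed
samples_seen = samples_per_second * train_runtime
samples_per_epoch = round(samples_seen / num_epochs)
steps_per_epoch = global_step // num_epochs

# Steps per epoch expected for 1 or 2 devices (assumption: no gradient accumulation)
expected_steps = {
    n_devices: math.ceil(samples_per_epoch / (per_device_batch_size * n_devices))
    for n_devices in (1, 2)
}

# Optimizer steps at which the training loss is logged
logged_steps = list(range(logging_steps, global_step + 1, logging_steps))

print(samples_per_epoch, steps_per_epoch, expected_steps, logged_steps)
```

The recovered 600 samples per epoch matches the size of the labeled training set, and the observed 10 steps per epoch is what an effective batch size of 64 would give.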

In [37]:
# Evaluate the fine-tuned model on the evaluation dataset
# This will return a dictionary containing the evaluation metrics
results_ft = trainer_ft.evaluate()

# Display the evaluation results
# The results will include metrics such as loss, accuracy, etc.
results_ft

{'eval_loss': 0.15792906284332275,
 'eval_accuracy': 0.9431977559607293,
 'eval_runtime': 122.4647,
 'eval_samples_per_second': 337.681,
 'eval_steps_per_second': 0.661,
 'epoch': 5.0}

In [None]:
# Create a DataFrame to store the evaluation results of different models
# The DataFrame will have three rows, each corresponding to a different model:
# - 'random': A model with random weights (trained from scratch)
# - 'pretrained': A pre-trained model trained as a classifier without domain-specific fine-tuning
# - 'finetuned': A pre-trained model whose language model was fine-tuned on domain-specific text before training the classifier
# The columns of the DataFrame will contain the evaluation metrics for each model

df_results = pd.DataFrame(
    [
        results_random,
        results_pretrained,
        results_ft,
    ],  # List of dictionaries containing the evaluation results
    index=["random", "pretrained", "finetuned"],  # Index labels for the rows
)

# Display the DataFrame with the evaluation results
df_results

| model | eval_loss | eval_accuracy | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch |
|---|---|---|---|---|---|---|
| random | 0.487131 | 0.77371 | 136.2978 | 303.409 | 0.594 | 5.0 |
| pretrained | 0.1874 | 0.934565 | 126.9486 | 325.754 | 0.638 | 5.0 |
| finetuned | 0.157929 | 0.943198 | 122.4647 | 337.681 | 0.661 | 5.0 |


## Comparing Models

The fine-tuned model from step 3 performs best, with the lowest evaluation loss (0.158) and the highest accuracy (94.3%) of the three models. This indicates that fine-tuning the language model on domain-specific text before training the classifier adapts the pretrained model to the target domain while retaining the knowledge gained from pretraining on a large corpus.

The pretrained model from step 2 also performs well (loss 0.187, accuracy 93.5%), slightly behind the fine-tuned model. Its strength comes from pretraining on a large amount of text, which provides a solid foundation for understanding language patterns and semantics. Without domain-specific fine-tuning of the language model, however, it may not fully capture the vocabulary and style of the target domain.

The random model performs worst on both loss (0.487) and accuracy (77.4%). This is expected: it starts from randomly initialized weights, with no prior knowledge of the language, and must learn everything from the labeled examples alone. It serves as a baseline that shows how much pretraining and fine-tuning contribute.

These results underscore the effectiveness of transfer learning in natural language processing (NLP), particularly when labeled data is limited. Transfer learning lets a model reuse the knowledge acquired from pretraining on large, diverse datasets for a specific downstream task. Fine-tuning the pretrained model on domain-specific text then specializes that knowledge for the target domain.

Note that the labeling cost (600 instances) was the same for all three models, yet accuracy ranged from 77.4% to 94.3%. Transfer learning therefore improves model performance without requiring additional labeled data.
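
Accuracies that are close together can hide how much the error rate changes. As a rough sketch, the snippet below converts the accuracies from the results table into relative error reductions. The helper `relative_error_reduction` is a hypothetical name introduced for this illustration.

```python
# Accuracies copied from the results table above
accuracy = {
    "random": 0.773710,
    "pretrained": 0.934565,
    "finetuned": 0.943198,
}


def relative_error_reduction(baseline_acc, improved_acc):
    """Fraction of the baseline's errors that the improved model removes."""
    baseline_err = 1.0 - baseline_acc
    improved_err = 1.0 - improved_acc
    return (baseline_err - improved_err) / baseline_err


pretrained_vs_random = relative_error_reduction(
    accuracy["random"], accuracy["pretrained"]
)
finetuned_vs_pretrained = relative_error_reduction(
    accuracy["pretrained"], accuracy["finetuned"]
)

print(f"pretrained vs random: {pretrained_vs_random:.1%}")
print(f"finetuned vs pretrained: {finetuned_vs_pretrained:.1%}")
```

Read this way, starting from the pretrained model removes roughly 71% of the errors made by the model trained from scratch, and the additional domain-specific fine-tuning removes roughly 13% of the errors that remain. These figures come from a single run, so small differences should be interpreted with care.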

# Questions

1. What is transfer learning and how does it benefit NLP tasks?

2. Explain the role of tokenization in NLP and how it helps models understand the semantic meaning of text.

3. What is language modeling and how does it enable models to learn from unlabeled data?

4. Describe the ULMFiT approach and its three main steps for transfer learning in NLP.

5. How does Whole Word Masking differ from traditional masking techniques and what advantages does it offer?

6. Compare and contrast training a classifier from scratch, using a pre-trained language model, and fine-tuning a pre-trained language model on domain-specific text. Which approach would likely yield the best performance and why?

7. In the Americanas scenario, why is it beneficial to fine-tune the pre-trained language model on the unlabeled reviews before training the classifier?

8. What are some key considerations and precautions to keep in mind when applying transfer learning to NLP tasks?

9. How does self-supervised learning, such as language modeling, enable models to learn valuable information about language structure and semantics without requiring labeled data?

10. Based on the comparison of the three model approaches (random, pre-trained, fine-tuned), what conclusions can you draw about the effectiveness of transfer learning and domain-specific fine-tuning for sentiment analysis tasks with limited labeled data?

`Answers are commented inside this cell`


<!-- 1. Transfer learning is a powerful technique in NLP that leverages pre-trained models as a starting point for training models on related tasks. By utilizing the knowledge gained from large-scale pre-training on vast amounts of text data, transfer learning significantly reduces the training time and data requirements for new tasks. This is particularly beneficial in NLP because pre-trained language models capture detailed patterns and relationships in language, which can be effectively transferred to downstream tasks.

2. Tokenization is a fundamental step in NLP that involves breaking down text into meaningful semantic units called tokens. It is crucial because it enables models to understand and process the base units of language. Subword tokenization is a specific technique that splits words into smaller, more manageable units. This approach allows models to handle languages with large vocabularies, complex morphologies, and even unseen words by recognizing the building blocks of words, leading to improved understanding of the semantic meaning of each sentence.

3. Language modeling is a self-supervised learning task that aims to capture the basic structure and patterns of language. It involves training a model to predict the next word or masked words in a sequence of text. By learning to predict the next word, the model gains insights into grammar, semantics, and even world knowledge. Language modeling helps in understanding the elaborate relationships between words, syntax, and sentence structure, which is essential for various NLP tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, and machine translation.

4. ULMFiT (Universal Language Model Fine-tuning) is a transfer learning approach that revolutionized the field of NLP. It involves pre-training a language model on a large corpus of unlabeled text, fine-tuning it on domain-specific text, and then using the fine-tuned model for downstream NLP tasks. ULMFiT improves the process of transfer learning by adapting the pre-trained model to the specific characteristics and nuances of the target domain. This fine-tuning step allows the model to capture domain-specific features and patterns, leading to improved performance on the target task, even with limited labeled data.

5. Whole Word Masking is an advanced masking technique used in language modeling that masks entire words instead of individual subwords or tokens. Unlike traditional token masking, where random subwords are masked, Whole Word Masking ensures that all the subwords corresponding to a word are masked together. This approach forces the model to understand the contextual meaning of the masked word as a whole, rather than relying on individual subwords. Whole Word Masking improves the model's ability to capture the semantic relationships between words and leads to better performance on downstream tasks.

6. Fine-tuning a pre-trained language model on domain-specific text is a crucial step in transfer learning for NLP tasks. By exposing the pre-trained model to text from the target domain, the model learns the specific vocabulary, writing style, and linguistic patterns of that domain. This fine-tuning process allows the model to adapt its knowledge to the characteristics of the target task, capturing domain-specific features that may not be present in the general pre-trained model. As a result, the fine-tuned model exhibits improved performance on the target task compared to using the pre-trained model directly.

7. Transfer learning offers several key benefits when dealing with limited labeled data in NLP tasks. Firstly, it enables the utilization of large-scale pre-trained models that have learned rich representations of language from vast amounts of unlabeled text. These pre-trained models can be fine-tuned on the target task with a relatively small labeled dataset, as they have already captured general language patterns. Secondly, transfer learning reduces the need for extensive labeled data by employing the knowledge transferred from the pre-training phase. This is particularly advantageous in domains where labeled data is scarce or expensive to obtain.

8. Self-supervised learning is a powerful approach in NLP that allows models to learn from unlabeled data by solving pretext tasks. In NLP, language modeling is a common self-supervised learning task. The process involves training a model to predict the next word or masked words in a sequence of text. By learning to solve this pretext task, the model captures valuable information about the structure and semantics of language without requiring explicit labels. Self-supervised learning enables models to learn from vast amounts of unlabeled text data, which is abundantly available. This approach reduces the reliance on labeled data and allows models to develop a deep understanding of language that can be transferred to various downstream NLP tasks.

9. In the practical example provided, three different approaches are used to build a sentiment analysis model. The first approach involves developing a classifier from scratch, without using any pre-trained models. This serves as a baseline to understand the importance of transfer learning. The second approach utilizes a pre-trained language model (BERTimbau) and trains a classifier on top of it using the limited labeled dataset. This approach leverages the knowledge captured by the pre-trained model to improve performance. The third approach goes a step further by fine-tuning the pre-trained language model on domain-specific text (unlabeled product reviews) before training the classifier. This fine-tuning step adapts the model to the specific characteristics of the target domain, leading to further improvements in performance.

10. In the given example, the performance of the three approaches is compared using a holdout dataset. The model trained from scratch, without utilizing any pre-trained knowledge, serves as a baseline. The pre-trained model (BERTimbau) outperforms the baseline model, even with limited labeled data, due to the extensive knowledge it has acquired during pre-training on large amounts of text. However, the fine-tuned pre-trained model, which is adapted to the specific domain of product reviews, achieves the best performance among the three approaches. This demonstrates the effectiveness of transfer learning and the importance of fine-tuning on domain-specific text to capture task-specific nuances and improve model performance. -->