<a href="https://colab.research.google.com/github/byrocuy/REA_AI_Bootcamp/blob/main/week-4/session-4/00_transfer_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transfer Learning

![](https://storage.googleapis.com/rg-ai-bootcamp/model_usage/transfer-learning.gif)

## Introduction

Struggling to train complex models due to limited high-quality data and resources? 🤔 Don't panic! The answer is **Transfer Learning**. This technique leverages pre-trained models, like BERT for NLP or models pre-trained on [ImageNet](https://www.image-net.org/) for image classification, to significantly cut down training time. Think of it as teaching an old dog new tricks: you can adapt, say, an ImageNet-pre-trained model to tasks like dog breed classification. And voilà! You've just made quick progress even with scarce data.

## How Does a Neural Network Change in Transfer Learning?

Imagine a chef who is skilled in baking cakes. Now, suppose this chef needs to cook a new dish, like pasta. Instead of starting from scratch, they leverage their existing culinary skills, adjusting only where necessary for the pasta-specific nuances.

Similarly, in machine learning, two possible approaches exist: "Training from Scratch" and "Transfer Learning". In the former, a model such as a CNN is trained on a new dataset, say Vehicles, without any prior knowledge. In the latter, the model leverages prior knowledge acquired from a different dataset, like Animals, and adjusts this understanding to the new task.

![](https://miro.medium.com/v2/resize:fit:720/0*xNjEPIZmPvKeqss6)

The image above illustrates this concept. As shown, a model trained from scratch (the top one) is set up to learn directly from the Vehicles dataset. It starts with no inherent understanding of images and must learn the features that differentiate one vehicle from another.

In contrast, a model using transfer learning (the bottom one) begins with a pre-trained network that has pre-existing knowledge about different animals. This model is fine-tuned to distinguish different types of vehicles, typically achieving faster and more efficient results than training from scratch.

In essence, while both models aim to classify different types of vehicles, they learn differently: the model trained from scratch learns all features independently, like a chef learning a new dish from scratch, whereas the transfer learning model refines existing knowledge for the new task, similar to a chef adapting their existing skills to a new recipe.

## Fine-Tuning a Model

In the context of transfer learning, fine-tuning a pre-trained model to a specific task is an efficient way to leverage existing knowledge and adapt it to novel use cases. To effectively carry out this process, the key steps are:
- **Load Dataset**: Prepare and load the new data.
- **Preprocess**: Tokenize and format the data to match the model's input.
- **Setting up Evaluation Metric**: Define the 'accuracy' metric to evaluate the model's performance.
- **Training and Evaluation**: Train and optimize the model through epochs, evaluating the model after each epoch.
- **Evaluating the Trained Model**: Test the model's performance on a separate test dataset.
- **Inference**: Use the trained model for predictions.

## Use Cases

Transfer learning has been widely applied for diverse applications. Some of these applications include:

1. **Natural Language Processing (NLP)**
   - Text Classification
   - Summarization
   - And more
2. **Audio**
   - Audio classification
   - Automatic speech recognition
   - And more
3. **Computer Vision**
   - Image Classification
   - Object detection
   - And more
4. **Multimodal**
   - Image Captioning
   - Document Question Answering
   - And more

By utilizing transfer learning, we can significantly improve the performance of these tasks and many more!

### Text classification

[![](https://storage.googleapis.com/rg-ai-bootcamp/model_usage/text-classification.png)](https://youtu.be/leNG9fN9FQU)

Tasks: Text Classification (source: [youtube.com](https://youtu.be/leNG9fN9FQU))

Transfer learning for text classification comes to the rescue when we aim to discern the sentiment of movie reviews. A pre-trained model such as [DistilBERT](https://huggingface.co/distilbert-base-uncased), which was pre-trained on a large corpus of English text (BookCorpus and English Wikipedia), gives us a robust feature-extraction foundation. We fine-tune this model on the [IMDb](https://huggingface.co/datasets/imdb) dataset to classify whether a movie review carries a positive or negative sentiment. Starting from the pre-trained model significantly cuts down our training time and needs less data than starting from scratch.

Before you begin, make sure you have all the necessary libraries installed:

In [None]:
%pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

**Load IMDb dataset**

Start by loading the IMDb dataset from the Datasets library:

In [None]:
from datasets import load_dataset

imdb = load_dataset("imdb")

Then take a look at an example:

In [None]:
imdb["test"][0]

There are two fields in this dataset:

- `text`: the movie review text.
- `label`: a value that is either 0 for a negative review or 1 for a positive review.

In [None]:
import pandas as pd

# Select a part of the dataset (e.g. 'train', 'test', or 'unsupervised', depending on what part you want to see)
imdb_set = imdb['train']
df = pd.DataFrame(imdb_set)
df

**Preprocess**

The next step is to load a `DistilBERT` tokenizer to preprocess the text field:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT's maximum input length:

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

Use the Datasets `map` function to apply the preprocessing function over the full dataset. Speed up the process by setting `batched=True`, which passes multiple dataset elements to the function at once.

In [None]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)
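
To make the effect of `batched=True` concrete, here is a minimal pure-Python sketch of a batched map. It is not how the Datasets library is implemented; `batched_map` and `toy_tokenize` are hypothetical helpers introduced only for illustration. The point it shows is that the function receives a dictionary of column lists (so `examples["text"]` is a list of strings) and returns new columns of the same length.

```python
def batched_map(rows, fn, batch_size=1000):
    """Apply fn to column-wise batches of rows and merge the new columns back."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_cols.items():
                merged[key] = values[i]
            out.append(merged)
    return out


def toy_tokenize(examples, max_length=4):
    # Hypothetical whitespace "tokenizer" with truncation; it only stands in
    # for the real DistilBERT tokenizer.
    return {"tokens": [text.split()[:max_length] for text in examples["text"]]}


rows = [
    {"text": "a truly great movie overall", "label": 1},
    {"text": "dull", "label": 0},
]
result = batched_map(rows, toy_tokenize, batch_size=2)
print(result[0]["tokens"])
```

The first review is cut to four tokens, the second is left as is, and the original `text` and `label` columns are kept alongside the new column.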

Use `DataCollatorWithPadding` to create a batch of examples. It's more efficient to pad sentences to the max length in a batch during collation, rather than padding the entire dataset to the max length.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
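
The idea behind padding per batch can be sketched in a few lines of plain Python. This is an illustration of dynamic padding under simplifying assumptions, not the collator's actual code; `pad_batch` is a hypothetical helper and the token ids are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 102], [101, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is recomputed for each batch, a batch of short reviews carries far fewer padding tokens than it would if every review were padded to the length of the longest review in the dataset.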

**Setting up Evaluation Metric**

We use 'accuracy' as the evaluation metric because it is a simple yet effective metric for classification tasks, and is especially useful when the classes are balanced, as in the case of the IMDb dataset.

In [None]:
%pip install evaluate

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to `accuracy.compute` to calculate the accuracy:

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
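
As a sanity check on what this function computes, the same logic can be written without NumPy or Evaluate. The sketch below uses made-up logits and labels; `argmax` and `accuracy_from_logits` are hypothetical names used only here.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    # Pick the highest-scoring class per example, then count the matches.
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.2], [-0.5, 0.5]]
labels = [0, 1, 1, 1]
metrics = accuracy_from_logits(logits, labels)
print(metrics)
```

Three of the four predictions match their labels, so the accuracy is 0.75.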

Your `compute_metrics` function is ready to go now, and you’ll return to it when you set up your training.

**Training and Evaluation**

Training alternates between optimizing the model's parameters and evaluating its performance. With `evaluation_strategy="epoch"`, the model is evaluated after each epoch on the dataset passed as `eval_dataset` (here, the IMDb test split), using the accuracy metric defined in the previous section. With `save_strategy="epoch"`, a checkpoint is saved after every epoch, and `load_best_model_at_end=True` reloads the best checkpoint once training finishes. Note that `Trainer` does not stop training early on its own.

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [None]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

You’re ready to start training your model now! Load DistilBERT with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/v4.31.0/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

After loading the model, it's time to set your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments), specifying `output_dir` and enabling `push_to_hub` to upload your model to Hugging Face.

One critical argument here is `load_best_model_at_end=True`, which ensures that the trainer will load the best model (with respect to the evaluation metric) at the end of training. This way, we'll always end up with the model that performed best on the validation set during the training phase.
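
Conceptually, picking the best model comes down to comparing the per-epoch evaluation results. The sketch below illustrates that selection with invented numbers; it is not the Trainer's implementation, and `pick_best_checkpoint` and the checkpoint names are hypothetical. One detail worth knowing: when `metric_for_best_model` is not set, Trainer compares checkpoints by evaluation loss (lower is better) rather than by accuracy.

```python
def pick_best_checkpoint(history, metric="eval_loss", greater_is_better=False):
    """Return the history entry with the best value for the chosen metric."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda entry: entry[metric])


# Invented per-epoch results, for illustration only.
history = [
    {"checkpoint": "checkpoint-1", "eval_accuracy": 0.91, "eval_loss": 0.21},
    {"checkpoint": "checkpoint-2", "eval_accuracy": 0.93, "eval_loss": 0.23},
    {"checkpoint": "checkpoint-3", "eval_accuracy": 0.92, "eval_loss": 0.30},
]

best_by_loss = pick_best_checkpoint(history)
best_by_accuracy = pick_best_checkpoint(
    history, metric="eval_accuracy", greater_is_better=True
)
print(best_by_loss["checkpoint"], best_by_accuracy["checkpoint"])
```

In this made-up history the two criteria disagree, which is why it can be worth setting `metric_for_best_model` explicitly when accuracy is the number you care about.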

To use Hugging Face `Trainer` you need to install the accelerate library version `0.20.1` or later. It is used for performance enhancement on PyTorch.

In [None]:
%pip install accelerate -U

To speed up training, **use a GPU if available**. If you're running your code on Google Colab, you can adjust the settings to utilize a GPU: click on `Runtime -> Change runtime type -> Hardware accelerator -> Choose GPU`.

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="model/text_classification",
    learning_rate=2e-5,                            # learning rate
    per_device_train_batch_size=16,                # training batch size
    per_device_eval_batch_size=16,                 # evaluation batch size
    num_train_epochs=2,                            # number of training epochs
    weight_decay=0.01,                             # weight decay for regularization
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,                   # reload the best checkpoint when training ends
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

**Evaluating the Trained Model**

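After training, it's time to evaluate the model on the IMDb test split. Note that this is the same split that was passed as `eval_dataset` during training, so it was also used to select the best checkpoint. For a stricter estimate of performance on unseen data, hold out a separate validation split for checkpoint selection.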

In [None]:
eval_result = trainer.evaluate(eval_dataset=tokenized_imdb["test"])
print(eval_result)

The evaluation results will give us the accuracy on the test set. If this accuracy is satisfactory, we could then decide to publish the model. If it's not, we might need to revisit the preprocessing, model architecture, or the training process (e.g., tuning hyperparameters, increasing number of epochs).

Once training is completed, share your model to the Hub with the `push_to_hub()` method so everyone can use your model:

In [None]:
trainer.push_to_hub()

The following is an example of a model that has been trained: <https://huggingface.co/aditira/text_classification>

**Inference**

Great, now that you’ve fine-tuned a model, you can use it for inference! Grab some text you’d like to run inference on:

In [None]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

The simplest way to try out your fine-tuned model for inference is to use it in a `pipeline()`. Instantiate a pipeline for sentiment analysis with your model, and pass your text to it:

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="model/text_classification")
classifier(text)
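
The pipeline returns a label together with a score. A rough sketch of that last step, assuming the score is a softmax over the model's two output logits and using invented logits, looks like this (`softmax` and `to_prediction` are hypothetical helpers, not pipeline internals):

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, id2label):
    # Report the most probable class and its probability.
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
prediction = to_prediction([-2.0, 2.0], id2label)
print(prediction)
```

With these invented logits the positive class wins with a score of roughly 0.98. This is also where the `id2label` mapping defined before training becomes visible to users of the model.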

### Audio Classification

[![](https://storage.googleapis.com/rg-ai-bootcamp/model_usage/audio-classification.png)](https://youtu.be/KWwzcmG98Ds)

Tasks: Audio Classification (source: [youtube.com](https://youtu.be/KWwzcmG98Ds))

Audio classification can be used for an array of applications such as detecting a speaker's intent, classifying languages, or identifying animal species by their sounds. In this example, we aim to classify the speaker's intent from the audio input. We take the pre-trained [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model and fine-tune it on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset, which is designed for this task. This method cuts down our training time and improves the model's performance even with less data.

**Load MInDS-14 dataset**

Start by loading the MInDS-14 dataset from the Datasets library:

In [None]:
from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

Split the dataset’s `train` split into a smaller train and test set with the [train_test_split](https://huggingface.co/docs/datasets/v2.13.1/en/package_reference/main_classes#datasets.Dataset.train_test_split) method. This’ll give you a chance to experiment and make sure everything works before spending more time on the full dataset.

In [None]:
minds = minds.train_test_split(test_size=0.2)

Then take a look at the dataset:

In [None]:
minds

While the dataset contains a lot of useful information, like `lang_id` and `english_transcription`, you’ll focus on the `audio` and `intent_class` columns in this guide. Remove the other columns with the [remove_columns](https://huggingface.co/docs/datasets/v2.13.1/en/package_reference/main_classes#datasets.Dataset.remove_columns) method:

In [None]:
minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

Take a look at an example now:

In [None]:
minds["train"][0]

There are two fields:

- `audio`: a 1-dimensional array of the speech signal that must be called to load and resample the audio file.
- `intent_class`: represents the class id of the speaker’s intent.

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

In [None]:
labels = minds["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

Now you can convert the label id to a label name:

In [None]:
id2label[str(2)]
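
The same two dictionaries can be built with dictionary comprehensions. The sketch below uses a short, illustrative list of intent names (an assumption standing in for the dataset's full label list) and checks that the two mappings are inverses of each other.

```python
# Illustrative subset of label names; the real list comes from the dataset.
labels = ["abroad", "address", "app_error"]

label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Every name maps to an id that maps back to the same name.
round_trip_ok = all(id2label[label2id[name]] == name for name in labels)
print(id2label["2"], round_trip_ok)
```

Note that the ids are stored as strings here, so a lookup such as `id2label[2]` with an integer key would raise a `KeyError`.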

**Preprocess**

The next step is to load a Wav2Vec2 feature extractor to process the audio signal:

In [None]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

The MInDS-14 dataset has a sampling rate of 8,000 Hz (8 kHz) (you can find this information in its [dataset card](https://huggingface.co/datasets/PolyAI/minds14)), which means you’ll need to resample the dataset to 16,000 Hz (16 kHz) to use the pretrained Wav2Vec2 model:

In [None]:
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]
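
To build intuition for what resampling from 8 kHz to 16 kHz means, here is a deliberately naive sketch based on linear interpolation. Real resamplers use band-limited filtering and will produce different values; `resample_linear` is a hypothetical helper and is not what the Datasets library does internally.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive resampling: linearly interpolate between neighbouring samples."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=8_000, dst_rate=16_000)
print(len(upsampled), upsampled)
```

Doubling the sampling rate doubles the number of samples for the same duration of audio, which is why the array printed for an example grows after the cast.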

Create a preprocessing function to:

1. Call the `audio` column to load and, if necessary, resample the audio file.
2. Check that the audio file's sampling rate matches the sampling rate of the audio data the model was pretrained with (as seen in the Wav2Vec2 [model card](https://huggingface.co/facebook/wav2vec2-base)).
3. Set a maximum input length (`max_length=16000` samples, which is one second at 16 kHz). With `truncation=True`, longer inputs are cut to this length so they can be batched.

In [None]:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs

Use the Datasets map function to apply the preprocessing function across the complete dataset. Speed it up by enabling `batched=True` to process multiple dataset elements simultaneously. Remove unnecessary columns and rename `intent_class` to `label`, as the model expects this name.

In [None]:
encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds = encoded_minds.rename_column("intent_class", "label")

**Setting up Evaluation Metric**

Incorporating a metric during training aids in gauging your model's performance. Use the Evaluate library to easily access an evaluation method. For this task, employ the accuracy metric.

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to `accuracy.compute` to calculate the accuracy:

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

Your `compute_metrics` function is ready to go now, and you’ll return to it when you set up your training.

**Training and Evaluation**

You’re ready to start training your model now! Load Wav2Vec2 with [AutoModelForAudioClassification](https://huggingface.co/docs/transformers/v4.31.0/en/model_doc/auto#transformers.AutoModelForAudioClassification) along with the number of expected labels, and the label mappings:

In [None]:
from transformers import AutoModelForAudioClassification

num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)

The training configuration is set in `TrainingArguments`: the learning rate, the batch sizes, gradient accumulation, and a warmup ratio. In addition, `load_best_model_at_end=True` together with `metric_for_best_model="accuracy"` makes the Trainer reload the checkpoint with the highest evaluation accuracy once training finishes.

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="model/audio_classification",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,                  # learning rate
    per_device_train_batch_size=32,      # training batch size
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,         # reload the best checkpoint when training ends
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()
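
The arguments above combine `per_device_train_batch_size=32` with `gradient_accumulation_steps=4`. The arithmetic below sketches what that implies for the effective batch size. The training-set size of 450 examples is an assumption for illustration, and the updates-per-epoch figure is only a rough estimate, since the Trainer's own rounding may differ.

```python
import math


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


def approx_updates_per_epoch(num_examples, per_device_batch,
                             accumulation_steps, num_devices=1):
    # Rough estimate only: one update per effective batch, rounded up.
    batch = effective_batch_size(per_device_batch, accumulation_steps, num_devices)
    return math.ceil(num_examples / batch)


batch = effective_batch_size(32, 4)                  # single device assumed
updates = approx_updates_per_epoch(450, 32, 4)       # 450 examples is an assumption
print(batch, updates)
```

Gradient accumulation lets you train with a large effective batch while only holding 32 examples in memory at a time, at the cost of fewer optimizer updates per epoch.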

**Evaluating the Trained Model**

After the training is completed, we need to evaluate the model. For that, we'll use the accuracy metric from the `evaluate` library to assess the model's performance.

In [None]:
eval_result = trainer.evaluate(eval_dataset=encoded_minds["test"])
print(eval_result) # print the evaluation results

The evaluation results will give us the accuracy on the test set. If this accuracy is satisfactory, we could then decide to publish the model. If it's not, we might need to revisit the preprocessing, model architecture, or the training process (e.g., tuning hyperparameters, increasing number of epochs).

Once training is completed, share your model to the Hub with the `push_to_hub()` method so everyone can use your model:

In [None]:
trainer.push_to_hub()

The following is an example of a model that has been trained: <https://huggingface.co/aditira/audio_classification>

**Inference**

Great, now that you’ve fine-tuned a model, you can use it for inference!

Load an audio file you’d like to run inference on. Remember to resample the sampling rate of the audio file to match the sampling rate of the model if you need to!

In [None]:
from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = dataset.features["audio"].sampling_rate
audio_file = dataset[0]["audio"]["path"]

The simplest way to try out your fine-tuned model for inference is to use it in a `pipeline()`. Instantiate a pipeline for audio classification with your model, and pass your audio file to it:

In [None]:
from transformers import pipeline

classifier = pipeline("audio-classification", model="model/audio_classification")
classifier(audio_file)

### Image classification

[![](https://storage.googleapis.com/rg-ai-bootcamp/model_usage/image-classification.png)](https://youtu.be/tjAIM7BOYhw)

Tasks: Image Classification (source: [youtube.com](https://youtu.be/tjAIM7BOYhw))

Image classification can be used in countless applications, ranging from detecting objects in an image and satellite image analysis to medical imaging. In our case, we aim to classify food items from given images. By fine-tuning a pre-trained [ViT](https://huggingface.co/docs/transformers/v4.31.0/en/tasks/model_doc/vit) model on the [Food-101](https://huggingface.co/datasets/food101) dataset, we reduce the training time while preserving high performance.

**Load Food-101 dataset**

Begin by loading a small subset of the Food-101 dataset from the Datasets library, allowing you to experiment and ensure everything works before committing to training on the full dataset.

In [None]:
from datasets import load_dataset

food = load_dataset("food101", split="train[:5000]")

Split the dataset’s train split into a train and test set with the `train_test_split` method:

In [None]:
food = food.train_test_split(test_size=0.2)

Then take a look at an example:

In [None]:
food["train"][0]

Each example in the dataset has two fields:

- `image`: a PIL image of the food item
- `label`: the label class of the food item

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

In [None]:
labels = food["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

Now you can convert the label id to a label name:

In [None]:
id2label[str(79)]

**Preprocess**

The next step is to load a ViT image processor to process the image into a tensor:

In [None]:
from transformers import AutoImageProcessor

checkpoint = "google/vit-base-patch16-224-in21k"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)

Apply some image transformations to the images to make the model more robust against overfitting. Here you’ll use torchvision’s transforms module, but you can also use any image library you like.

Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:

In [None]:
from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
size = (
    image_processor.size["shortest_edge"]
    if "shortest_edge" in image_processor.size
    else (image_processor.size["height"], image_processor.size["width"])
)
_transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
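
The normalization step is a per-channel `(value - mean) / std`. The sketch below works through it for single RGB pixels, assuming a mean and standard deviation of 0.5 per channel; check `image_processor.image_mean` and `image_processor.image_std` for the values your checkpoint actually uses. `normalize_pixel` is a hypothetical helper.

```python
def normalize_pixel(rgb, mean, std):
    """Normalize one RGB pixel (values in [0, 1]) channel by channel."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]


# Assumed statistics, for illustration only.
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]

white = normalize_pixel([1.0, 1.0, 1.0], mean, std)
black = normalize_pixel([0.0, 0.0, 0.0], mean, std)
gray = normalize_pixel([0.5, 0.5, 0.5], mean, std)
print(white, black, gray)
```

Under these assumed statistics, pixel values in the range 0 to 1 (as produced by `ToTensor`) are mapped to the range -1 to 1, centered on zero.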

Then create a preprocessing function to apply the transforms and return the pixel_values - the inputs to the model - of the image:

In [None]:
def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples

To apply the preprocessing function over the entire dataset, use the Datasets `with_transform` method. The transforms are applied on the fly when you load an element of the dataset:

In [None]:
food = food.with_transform(transforms)
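
"On the fly" means nothing is computed until an element is requested. The following sketch mimics that behaviour with a small wrapper class. It is an illustration only, not the Datasets implementation; `LazyTransformed` and `toy_transform` are hypothetical, and the "images" are just short lists of numbers.

```python
class LazyTransformed:
    """Wrap rows so the transform runs only when an item is accessed."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform
        self.calls = 0  # how many times the transform has run

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        self.calls += 1
        # Present the single row as a batch of size one, as a dict of lists.
        batch = {key: [value] for key, value in self.rows[index].items()}
        out = self.transform(batch)
        return {key: values[0] for key, values in out.items()}


def toy_transform(examples):
    # Stand-in for the image transforms: scale raw values to [0, 1].
    examples["pixel_values"] = [[p / 255 for p in img] for img in examples["image"]]
    del examples["image"]
    return examples


rows = [{"image": [0, 255], "label": 3}, {"image": [51, 102], "label": 7}]
lazy = LazyTransformed(rows, toy_transform)
item = lazy[1]
print(item, lazy.calls)
```

Because random crops are drawn each time an item is accessed, applying the transform lazily also means the model sees a slightly different crop of the same image in every epoch.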

Now create a batch of examples using DefaultDataCollator. Unlike other data collators in Transformers, the DefaultDataCollator does not apply additional preprocessing such as padding.

In [None]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

**Setting up Evaluation Metric**

Including a metric during training is often helpful for evaluating your model’s performance. You can quickly load an evaluation method with the Evaluate library:

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to `accuracy.compute` to calculate the accuracy:

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Your compute_metrics function is ready to go now, and you’ll return to it when you set up your training.

**Training and Evaluation**

You’re ready to start training your model now! Load ViT with `AutoModelForImageClassification`. Specify the number of expected labels and the label mappings:

In [None]:
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

At this point, only three steps remain:

- Set your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments). Set `remove_unused_columns=False` so the `image` column is kept; without it, `pixel_values` cannot be created. Specify `output_dir` to save your model and enable `push_to_hub` to upload it to Hugging Face. Trainer will evaluate the accuracy and save a checkpoint after each epoch.
- Pass these arguments to Trainer along with the model, dataset, image processor (via the `tokenizer` argument), data collator, and `compute_metrics` function.
- Use `train()` to fine-tune your model.

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="model/image_classification",
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,                 # learning rate
    per_device_train_batch_size=16,     # training batch size
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,        # reload the best checkpoint when training ends
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=food["train"],
    eval_dataset=food["test"],
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
)

trainer.train()

**Evaluating the Trained Model**

After the training is completed, we need to evaluate the model. For that, we'll use the accuracy metric from the `evaluate` library to assess the model's performance.

In [None]:
eval_result = trainer.evaluate(eval_dataset=food["test"])
print(eval_result)

The evaluation results will give us the accuracy on the test set. If this accuracy is satisfactory, we could then decide to publish the model. If it's not, we might need to revisit the preprocessing, model architecture, or the training process (e.g., tuning hyperparameters, increasing number of epochs).

Once training is completed, share your model to the Hub with the `push_to_hub()` method so everyone can use your model:

In [None]:
trainer.push_to_hub()

The following is an example of a model that has been trained: <https://huggingface.co/aditira/image_classification>

**Inference**

Great, now that you’ve fine-tuned a model, you can use it for inference!

Load an image you’d like to run inference on:

In [None]:
ds = load_dataset("food101", split="validation[:10]")
image = ds["image"][0]

The simplest way to try out your fine-tuned model for inference is to use it in a `pipeline()`. Instantiate a pipeline for image classification with your model, and pass your image to it:

In [None]:
from transformers import pipeline

classifier = pipeline("image-classification", model="model/image_classification")
classifier(image)