# **Transfer Learning**

Transfer learning is a machine learning technique in which knowledge gained through one task or dataset is used to improve model performance on another related task and/or different dataset.

### Why Is Transfer Learning Important in NLP?

Transfer Learning plays a vital role in Natural Language Processing (NLP) because it allows knowledge gained from one task or domain to be used in another, usually related, task or domain. This method is particularly beneficial in NLP for several reasons:

1. **Data Efficiency**: NLP models generally need a large amount of labeled data to perform effectively. Transfer Learning helps by pretraining models on large unlabeled corpora, such as Wikipedia, and then fine-tuning them on smaller, task-specific datasets. This reduces the amount of labeled data needed for each specific task.

2. **Resource Savings**: Training large NLP models from scratch is often costly and time-consuming. Using a pretrained model for fine-tuning requires fewer resources, making it more practical for researchers and practitioners.

3. **Performance Improvement**: Pretrained models already contain useful linguistic features and patterns learned from vast text data. Fine-tuning these models for specific tasks usually enhances performance, especially when there is limited labeled data available.

4. **Domain Adaptation**: Transfer Learning allows models to adapt to new domains or languages with minimal additional training, making it essential for NLP applications that need to perform well across various domains and languages.

5. **Continual Learning**: A model trained through Transfer Learning can be easily updated or adjusted with new data, enabling it to continuously learn and improve its performance over time.

### How Transfer Learning in NLP Works

1.   **Pre-training on Large Datasets**: Initially, models are trained on extensive and diverse text corpora to learn general language features, such as syntax and semantics. This is done using methods like masked language modeling or autoregressive language modeling.

2.   **Fine-Tuning on Specific Tasks**: After pre-training, these models are fine-tuned with smaller, specialized datasets to adjust their parameters for specific tasks, such as sentiment analysis or question answering.

3.   **Efficiency and Performance**: Transfer learning greatly reduces the need for extensive computational resources and time for training. It also enhances model performance, particularly in situations where there is limited data.

4.   **Applications Across Domains**: This approach is effective for adapting models to specialized domains, such as legal or medical fields, and for transferring knowledge from models trained in one language to others.

5.   **Challenges**: There can be challenges, such as mismatches between the data used in pre-training and the specific task data, as well as the high computational demands associated with using large, complex models.
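
The two pre-training objectives named in step 1 can be illustrated without any library. The sketch below is a simplified toy, not how any particular model is trained: `mask_tokens` and `next_token_pairs` are hypothetical helper names, the sentence is made up, and real implementations work on token ids and apply extra replacement rules that are omitted here.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Masked language modeling: hide some tokens and keep them as targets."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # the model sees the mask...
            labels.append(token)       # ...and must predict the original token
        else:
            masked.append(token)
            labels.append(None)        # no prediction target at this position
    return masked, labels


def next_token_pairs(tokens):
    """Autoregressive language modeling: predict each token from its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = "transfer learning reuses knowledge from one task on another".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3)
pairs = next_token_pairs(sentence)
print(masked)
print(pairs[0])  # (['transfer'], 'learning')
```

In both cases the training targets come from the text itself, which is why pre-training can use large unlabeled corpora.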

### List of Transfer Learning NLP Models

Here's a list of notable models in natural language processing that utilize transfer learning, each recognized for its contributions and advancements:

1. **BERT (Bidirectional Encoder Representations from Transformers)**: Developed by Google, BERT uses a transformer-based architecture. It is pretrained with masked language modeling and next sentence prediction, which lets it use context from both sides of each token.

2. **GPT (Generative Pre-trained Transformer)**: Created by OpenAI, GPT models are known for their strength in text generation, using autoregressive language modeling during their training.

3. **T5 (Text-To-Text Transfer Transformer)**: An innovation from Google, T5 reformulates all natural language processing tasks into a text-to-text framework, treating both inputs and outputs as text strings.

4. **DistilBERT**: This streamlined version of BERT is designed to be smaller and faster while maintaining most of BERT’s original language understanding capabilities.

5. **BART (Bidirectional and Auto-Regressive Transformers)**: BART pairs a bidirectional encoder, as in BERT, with an autoregressive decoder, as in GPT. It is trained by corrupting text and learning to reconstruct the original.

# Transformers

Let's have a quick look at the 😉 Transformers library features. The library downloads pretrained models for Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG) tasks, such as completing a prompt with new text or translating into another language.

First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we will dig a little bit more and see how the library gives you access to those models and helps you preprocess your data.

# Getting started on a task with a pipeline

The easiest way to use a pretrained model on a given task is to use pipeline. 😉 Transformers provides the following tasks out of the box:

* Sentiment analysis: is a text positive or negative?
* Text generation (in English): provide a prompt and the model will generate what follows.
* Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.)
* Question answering: provide the model with some context and a question, extract the answer from the context.
* Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill the blanks.
* Summarization: generate a summary of a long text.
* Translation: translate a text into another language.
* Feature extraction: return a tensor representation of the text.

Let's see how this works for sentiment analysis (the other tasks are all covered in the task summary):

In [1]:
! pip install transformers




In [2]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [3]:
classifier('We are very happy to be a part of this LLM course')


[{'label': 'POSITIVE', 'score': 0.9998376369476318}]

In [4]:
classifier('The LLM course is great except this module is too long')


[{'label': 'NEGATIVE', 'score': 0.9933103322982788}]

In [5]:
results = classifier(["We are very happy to show you the 😉 Transformers library.",
                      "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")


label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


You can see the second sentence has been classified as negative (the model must pick either positive or negative), but its score of 0.5309 is close to 0.5, so the prediction is nearly neutral.
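
Because the pipeline always returns one of the two labels, a near-0.5 score is easy to overlook. One way to surface it, sketched below, is to relabel low-confidence predictions. `flag_uncertain` is a hypothetical helper, not part of the library, and the 0.6 threshold is an arbitrary assumption; the input uses the same list-of-dicts shape and the scores printed above.

```python
def flag_uncertain(results, threshold=0.6):
    """Relabel predictions whose score falls below `threshold` as UNCERTAIN."""
    flagged = []
    for result in results:
        confident = result["score"] >= threshold
        label = result["label"] if confident else "UNCERTAIN"
        flagged.append({"label": label, "score": result["score"]})
    return flagged


# Labels and rounded scores copied from the pipeline output above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]
for item in flag_uncertain(results):
    print(item)
```

A sensible threshold depends on the application and would normally be chosen on held-out data.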

By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can look at its [model page](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) to get more information about it. It uses the [DistilBERT architecture](https://huggingface.co/transformers/model_doc/distilbert.html) and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.

Let's say we want to use another model; for instance, one that has been trained on French data. We can search through the [model hub](https://huggingface.co/models) that gathers models pretrained on a lot of data by research labs, but also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags "French" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's see how we can use it.

You can directly pass the name of the model to use to pipeline:

In [6]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")




In [7]:
classifier("Esperamos que no lo odie.")


[{'label': '3 stars', 'score': 0.33688199520111084}]

This classifier can now deal with texts in English, French, but also Dutch, German, Italian and Spanish! Unlike the default model, it returns a star rating (here "3 stars") rather than a positive or negative label. You can also replace that name with a local folder where you have saved a pretrained model. You can also pass a model object and its associated tokenizer.

We will need two classes for this. The first is AutoTokenizer, which we will use to download the tokenizer associated with the model we picked and instantiate it. The second is AutoModelForSequenceClassification (or TFAutoModelForSequenceClassification if you are using TensorFlow), which we will use to download the model itself. Note that if we were using the library on another task, the class of the model would change. The task summary tutorial summarizes which class is used for which task.

In [8]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification


Now, to download the model and tokenizer we found previously, we just have to use the from_pretrained method of each class (feel free to replace model_name by any other model from the model hub):

In [9]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.
Device set to use 0


# Under the Hood: pretrained models


In [11]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the tokenizer and model used by the default sentiment-analysis pipeline.
tokenizer = AutoTokenizer.from_pretrained(model_name)
# from_pt=True converts the PyTorch checkpoint into TensorFlow weights.
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)

# Quick test: tokenize one sentence and take the highest-scoring class.
inputs = tokenizer("Great movie!", return_tensors="tf")
outputs = tf_model(**inputs)
print("Predicted label id:", tf.math.argmax(outputs.logits, axis=-1).numpy())



All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.


Predicted label id: [1]


The tokenizer is responsible for preprocessing your texts. First, it splits a given text into words (or parts of words, punctuation symbols, etc.), usually called tokens. There are multiple rules that can govern that process (you can learn more about them in the tokenizer summary), which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was pretrained.

The second step is to convert those tokens into numbers, to be able to build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocab, which is the part we download when we instantiate it with the from_pretrained method, since we need to use the same vocab as when the model was pretrained.
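
The two steps above can be sketched with a toy tokenizer. This is only an illustration under stated assumptions: the vocabulary is hypothetical and tiny, `tokenize` and `encode` are made-up helper names, and splitting on words and punctuation stands in for the subword rules a real tokenizer applies. The special-token ids (101, 102, and 100 for unknown tokens) mirror the ones visible in the real outputs in this section.

```python
import re

# Hypothetical toy vocabulary; only the special-token ids mirror the real one.
VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "we": 1, "hope": 2, "you": 3, "like": 4, "it": 5, ".": 6,
}


def tokenize(text):
    """Step 1: lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def encode(text):
    """Step 2: map tokens to ids, wrapping the sequence in [CLS] ... [SEP]."""
    tokens = ["[CLS]"] + tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]


print(tokenize("We hope you like it."))  # ['we', 'hope', 'you', 'like', 'it', '.']
print(encode("We hope you like it."))    # [101, 1, 2, 3, 4, 5, 6, 102]
```

The sketch also shows why the vocabulary must match the one used in pre-training: the model only ever sees the ids, so a different mapping would feed it different tokens.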

To apply these steps on a given text, we can just feed it to our tokenizer:

In [12]:
inputs = tokenizer("We are very happy to show you the 😉 Transformers library.")


This returns a dictionary mapping strings to lists of ints. It contains the ids of the tokens, as mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an attention mask that the model will use to have a better understanding of the sequence:

In [13]:
print(inputs)


{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


You can pass a list of sentences directly to your tokenizer. If your goal is to send them through your model as a batch, you probably want to pad them all to the same length, truncate them to the maximum length the model can accept and get tensors back. You can specify all of that to the tokenizer:

In [14]:
tf_batch = tokenizer(
    ["We are very happy to show you the 😉 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)


The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding token the model was pretrained with. The attention mask is also adapted to take the padding into account:

In [15]:
for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")


input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
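
What the tokenizer does here can be reproduced by hand. The sketch below is a minimal stand-in, assuming right padding with a pad id of 0 as seen in the output above; `pad_batch` is a hypothetical helper, not the library's implementation, and it leaves out truncation.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids of the two sentences, taken from the output above.
first = [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102]
second = [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sentence gains three pad ids and three zeros in its mask, matching the second row of each tensor printed above.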


You can learn more about tokenizers in the library's preprocessing documentation.

Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the dictionary directly to the model; for a PyTorch model, you need to unpack the dictionary by adding **.

In [16]:
tf_outputs = tf_model(tf_batch)


In 😉 Transformers, model outputs are objects with named fields that can also be indexed like tuples. Here, the only field that is filled in holds the final activations of the model (the logits).

In [17]:
print(tf_outputs)


TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0832963 ,  4.3364143 ],
       [ 0.08181235, -0.04178774]], dtype=float32)>, hidden_states=None, attentions=None)


The model can return more than just the final activations, such as hidden states and attention weights, which is why the output has several fields. Here we only asked for the final activations, so loss, hidden_states and attentions are all None.

**NOTE**: All 😉 Transformers models (PyTorch or TensorFlow) return the activations of the model before the final activation function (like SoftMax) since this final activation function is often fused with the loss.
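
Since the model returns logits, turning them into probabilities is left to the caller. The sketch below applies a softmax in plain Python to the logits printed above. The label order in `id2label` is an assumption, chosen because it is consistent with the pipeline results shown earlier; a real script would read it from the model's configuration.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the TFSequenceClassifierOutput printed above.
logits = [[-4.0832963, 4.3364143], [0.08181235, -0.04178774]]
# Assumed label order, consistent with the pipeline results shown earlier.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(id2label[best], round(probs[best], 4))
```

Up to rounding, the two probabilities match the scores the pipeline reported for the same sentences, which suggests the pipeline applies this same post-processing step.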

# Accessing the Code

The AutoModel and AutoTokenizer classes are just shortcuts that will automatically work with any pretrained model. Behind the scenes, the library has one model class per combination of architecture and task head (for example, DistilBERT for sequence classification), so the code is easy to access and tweak if you need to.

In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's using the DistilBERT architecture. As AutoModelForSequenceClassification (or TFAutoModelForSequenceClassification if you are using TensorFlow) was used, the model automatically created is then a DistilBertForSequenceClassification. You can look at its documentation for all details relevant to that specific model, or browse the source code. This is how you would directly instantiate model and tokenizer without the auto magic:

In [20]:
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)


All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


# Customize the Model

If you want to change how the model itself is built, you can define your custom configuration class. Each architecture comes with its own relevant configuration (in the case of DistilBERT, DistilBertConfig) which allows you to specify the hidden dimension, dropout rate, and so on. If you do core modifications, like changing the hidden size, you won't be able to use a pretrained model anymore and will need to train from scratch. You would then instantiate the model directly from this configuration.

Here we use the predefined vocabulary of DistilBERT (hence load the tokenizer with the DistilBertTokenizer.from_pretrained method) and initialize the model from scratch (hence instantiate the model from the configuration instead of using the DistilBertForSequenceClassification.from_pretrained method).

In [21]:
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification(config)
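
When choosing custom sizes like these, one constraint is worth checking up front: multi-head attention splits the hidden dimension evenly across heads, so dim is generally expected to be divisible by n_heads. The helper below is a hypothetical sanity check written for this tutorial, not a library function.

```python
def per_head_dim(dim, n_heads):
    """Return the size of each attention head, or raise if dim does not split evenly."""
    if dim % n_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by n_heads={n_heads}")
    return dim // n_heads


# Values used in the configuration above.
print(per_head_dim(dim=512, n_heads=8))  # 64
```

With dim=512 and n_heads=8, each head works on 64 dimensions.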


For something that only changes the head of the model (for instance, the number of labels), you can still use a pretrained model for the body. For instance, let's define a classifier for 10 different labels using a pretrained body. We could create a configuration with all the default values and just change the number of labels, but more easily, you can directly pass any argument a configuration would take to the from_pretrained method and it will update the default configuration with it. The cell below does this through DistilBertConfig.from_pretrained, then hands the resulting configuration to the model:

In [23]:
from transformers import AutoTokenizer, DistilBertConfig, TFDistilBertForSequenceClassification

model_name = "distilbert-base-uncased"  # base encoder weights (PyTorch on HF)

config = DistilBertConfig.from_pretrained(model_name, num_labels=10)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Convert PyTorch -> TensorFlow, and re-init the classification head for 10 labels
model = TFDistilBertForSequenceClassification.from_pretrained(
    model_name,
    config=config,
    from_pt=True,
    ignore_mismatched_sizes=True,
)

# quick sanity check
import tensorflow as tf
inputs = tokenizer("hello world", return_tensors="tf")
logits = model(**inputs).logits
print(logits.shape)  # should be (1, 10)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'cla

(1, 10)
