# **What is NLP?**

NLP is a subfield of AI and computer science that focuses on the interaction between computers and human (natural) languages. The goal of NLP is to enable machines to understand, interpret, generate, and respond to human language in a way that is both meaningful and useful.

In essence, NLP bridges the gap between human communication and computer understanding, allowing machines to read, hear, and make sense of the vast amounts of human language data. This field combines computational linguistics, machine learning, and deep learning to process, analyze, and generate natural language text or speech.

The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.

NLP isn't limited to written text though. It also tackles complex challenges in speech recognition and computer vision, such as generating a transcript of an audio sample or a description of an image.

### Natural **Language** Processing

What is Language?
1. Way of communication?
2. Way of capturing complex communication?
    1. Many words for weather, temperature
    2. Many words for love, romance
    3. Domain Expert: Many words specific to that discipline
3. Way of storage of knowledge?
    1. Knowledge of physics, chemistry
    2. Knowledge of Engineering & Building machines
4. Way of building of intelligence? 
    1. Knowledge: as collection of facts
    2. Intelligence: ability to reason & think, find answers & grow knowledge
    3. From building machines to building intelligent machines

**NLP** -> everything related to human languages
1. Ability to understand, answer questions, translate, write
2. Ability to understand "Collection of existing knowledge"

**NLP** related to intelligence
1. "Knowledge as collection of existing facts"
2. Intelligence as ability to think, reason, experiment, find answer, grow knowledge

**NLP** comes easy to us humans
1. Walking is easy for us, but getting a robot to walk is hard
2. Driving is easy for us, but building a self-driving car is hard
3. Looking & identifying objects is easy for us, but for computers it is hard
4. Understanding words and language is easy for us; we do it all the time. But it is hard for machines.
5. NLP is deceptively hard: it feels easy to us, but it is not.

## Why is it challenging?

Computers don't process information in the same way as humans. For example, when we read the sentence "I am hungry", we can easily understand its meaning. Similarly, given two sentences such as "I am hungry" and "I am sad", we're able to easily determine how similar they are. For machine learning models, such tasks are more difficult. The text needs to be processed in a way that enables the models to learn from it. And because language is complex, we need to think carefully about how this processing must be done. There has been a lot of research done on how to represent text, and we will look at some methods in the next chapter.
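To make the similarity question concrete, here is a minimal sketch of one very simple text representation: a bag of words compared with cosine similarity. The helper names `bag_of_words` and `cosine_similarity` are made up for this illustration, and this is not the representation Transformer models use.

```python
import math
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lowercase the sentence, split on whitespace, and count each word."""
    return Counter(sentence.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


hungry = bag_of_words("I am hungry")
sad = bag_of_words("I am sad")
similarity = cosine_similarity(hungry, sad)
print(round(similarity, 3))
```

The score only reflects that two of the three words are shared. It knows nothing about how "hungry" and "sad" relate in meaning, which is exactly why richer representations are needed.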

# **Transformers**

#### **Transformers are everywhere!**

Transformer models are used to solve all kinds of NLP tasks, and many companies and organizations use Hugging Face and Transformer models.

The most basic object in the 🤗 Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:


In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I love you")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998656511306763}]

In [3]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [12]:
classifier(
    ["I love you", "I hate you", "I love you but as friend"]
    )

[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079},
 {'label': 'POSITIVE', 'score': 0.9997859597206116}]

There are three main steps involved when you pass some text to a pipeline:
1. The text is preprocessed into a format the model can understand.
1. The preprocessed inputs are passed to the model.
1. The predictions of the model are post-processed, so you can make sense of them.
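The three steps can be sketched with a toy stand-in written in plain Python. Everything here is an assumption made for illustration: the vocabulary, the hand-picked weights, and the function names are hypothetical and have nothing to do with the real DistilBERT checkpoint.

```python
import math

# Hypothetical toy vocabulary and weights, hand-picked for illustration only.
VOCAB = {"i": 0, "love": 1, "hate": 2, "you": 3, "<unk>": 4}
# One (negative, positive) weight pair per vocabulary entry.
WEIGHTS = [(0.0, 0.0), (-2.0, 2.0), (2.0, -2.0), (0.0, 0.0), (0.0, 0.0)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token IDs the model can consume."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def model(token_ids):
    """Step 2: produce one raw score (logit) per label."""
    negative = sum(WEIGHTS[i][0] for i in token_ids)
    positive = sum(WEIGHTS[i][1] for i in token_ids)
    return [negative, positive]


def postprocess(logits):
    """Step 3: convert logits to probabilities and pick the best label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring what a pipeline does internally."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I love you"))
print(toy_pipeline("I hate you"))
```

The shape of the result (a label plus a score) matches the pipeline outputs above, but the numbers come from the made-up weights, not from a trained model.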

Some of the currently [available pipelines](https://huggingface.co/transformers/main_classes/pipelines) are:

- `feature-extraction` (get the vector representation of a text)
- `fill-mask`
- `ner` (named entity recognition)
- `question-answering`
- `sentiment-analysis`
- `summarization`
- `text-generation`
- `translation`
- `zero-shot-classification`

### **zero-shot classification**

We'll start by tackling a more challenging task where we need to classify texts that haven't been labeled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the `zero-shot-classification` pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels, but it can also classify the text using any other set of labels you like.

In [13]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445961475372314, 0.1119762659072876, 0.04342758283019066]}

This pipeline is called `zero-shot` because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!
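One common way to get scores for arbitrary labels is to reuse a natural language inference (NLI) model such as the `facebook/bart-large-mnli` default shown in the log above: each label is turned into a hypothesis sentence, and the model's entailment scores are compared across labels. The sketch below illustrates only that idea. The template string is an assumption, the logits are made-up numbers, and the helper names are hypothetical.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a hypothesis sentence for an NLI model."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["education", "politics", "business"]
hypotheses = build_hypotheses(labels)
# Made-up entailment logits, one per (text, hypothesis) pair.
fake_logits = [2.5, -0.5, 0.5]
ranking = rank_labels(labels, fake_logits)
print(hypotheses)
print(ranking)
```

Because the labels only appear inside the hypothesis text, no retraining is needed when the label set changes.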

### Text Generation 

Now let's see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below.

In [14]:
classifier = pipeline("text-generation")
classifier("I am lloking for job in AI/ML")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I am lloking for job in AI/ML, where I\'m actually going to try to do something different to make something else interesting or interesting to make fun of with a friend and to make me happy."\n\nHow did things look from the'}]

In [16]:
classifier("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use a tool created by Microsoft that automatically detects what it detects, so you will not be left out of the action.\n\nYou will receive an email asking you to sign up for an Azure account'}]

You can control how many different sequences are generated with the argument `num_return_sequences` and the total length of the output text with the argument `max_length`.

### **Using any model from the Hub in a pipeline**
The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation.

In [24]:
generator = pipeline("text-generation", model="distilgpt2")
output = generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
    temperature=0.7,  # Lower temperature for less randomness
    top_k=50,         # Sample only from the 50 most likely tokens
    top_p=0.95,       # Nucleus sampling: keep the smallest token set with cumulative probability 0.95
    clean_up_tokenization_spaces=True
)

# Clean and print the results
for i, entry in enumerate(output):
    cleaned_text = entry["generated_text"].replace("\n", " ").strip()
    print(f"Output {i+1}: {cleaned_text}")

Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output 1: In this course, we will teach you how to create a "real" world.    The first thing you need to know is that
Output 2: In this course, we will teach you how to use your skills to learn new languages, learn basic language, and learn new languages.
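To see what `temperature`, `top_k`, and `top_p` do to the next-token distribution, here is a simplified sketch in plain Python. The logits are made-up numbers and `filter_distribution` is a hypothetical helper; the library's own implementation differs in its details.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: dividing logits by a value below 1 sharpens the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


# Made-up next-token logits for a prompt ending in "... teach you how to".
toy_logits = {" use": 3.0, " create": 2.5, " learn": 2.0, " zebra": -1.0}
probs = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.95)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
next_token = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs)
print(next_token)
```

With these settings the unlikely token is filtered out before sampling, which is why lower temperatures and tighter `top_k`/`top_p` values tend to produce more conservative text.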


### **Mask Filling** 

The next pipeline you’ll try is `fill-mask`. The idea of this task is to ***fill in the blanks*** in a given text:

In [30]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'score': 0.19198723137378693,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.042092349380254745,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

In [38]:
unmasker("This IIM will teach you all about <mask>.", top_k=5)

[{'score': 0.012076430954039097,
  'token': 17759,
  'token_str': ' physics',
  'sequence': 'This IIM will teach you all about physics.'},
 {'score': 0.012056591920554638,
  'token': 8326,
  'token_str': ' programming',
  'sequence': 'This IIM will teach you all about programming.'},
 {'score': 0.011760421097278595,
  'token': 10795,
  'token_str': ' evolution',
  'sequence': 'This IIM will teach you all about evolution.'},
 {'score': 0.011267362162470818,
  'token': 10638,
  'token_str': ' math',
  'sequence': 'This IIM will teach you all about math.'},
 {'score': 0.011056757532060146,
  'token': 29736,
  'token_str': ' teamwork',
  'sequence': 'This IIM will teach you all about teamwork.'}]

In [1]:
from transformers import pipeline

filler = pipeline("fill-mask", model="bert-base-cased")
result = filler("...")

  from .autonotebook import tqdm as notebook_tqdm
BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model

PipelineException: No mask_token ([MASK]) found on the input

The `top_k` argument controls how many possibilities you want to be displayed. Note that here the model fills in the special `<mask>` word, which is often referred to as a *mask token*. Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models: the `bert-base-cased` call above fails with `No mask_token ([MASK]) found on the input` because that checkpoint expects `[MASK]`, which the input did not contain. One way to check it is by looking at the mask word used in the widget.
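A small sketch of how you might keep prompts independent of the mask word. The lookup table only covers the two checkpoints that appear in this section, and `build_masked_input` is a hypothetical helper, not part of the library.

```python
# Mask tokens for the two checkpoints used above; other checkpoints may differ.
MASK_TOKENS = {
    "bert-base-cased": "[MASK]",
    "distilbert/distilroberta-base": "<mask>",
}


def build_masked_input(template: str, checkpoint: str) -> str:
    """Fill the `{mask}` placeholder with the checkpoint's mask token."""
    if "{mask}" not in template:
        raise ValueError("template must contain a {mask} placeholder")
    return template.format(mask=MASK_TOKENS[checkpoint])


template = "This course will teach you all about {mask} models."
bert_input = build_masked_input(template, "bert-base-cased")
roberta_input = build_masked_input(template, "distilbert/distilroberta-base")
print(bert_input)
print(roberta_input)
```

With a loaded pipeline you would normally not hard-code this table: the tokenizer attached to the pipeline exposes the mask word as `tokenizer.mask_token`.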

### **Named entity recognition**

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let’s look at an example:

In [39]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'entity_group': 'PER',
  'score': np.float32(0.9981694),
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.9796019),
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': np.float32(0.9932106),
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

We pass the option `grouped_entities=True` in the pipeline creation function to tell the pipeline to regroup together the parts of the sentence that correspond to the same entity: here the model correctly grouped “Hugging” and “Face” as a single organization, even though the name consists of multiple words. In fact, as we will see in the next chapter, the preprocessing even splits some words into smaller parts. For instance, Sylvain is split into four pieces: S, ##yl, ##va, and ##in. In the post-processing step, the pipeline successfully regrouped those pieces.
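The regrouping of word pieces can be sketched in a few lines of plain Python. This is only an illustration of the `##` convention described above, with a hypothetical helper name, not the pipeline's actual post-processing code. The second half checks that the `start` and `end` values in the output are character offsets into the input sentence.

```python
def merge_word_pieces(pieces):
    """Join WordPiece-style tokens: a leading '##' marks a continuation."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


pieces = ["S", "##yl", "##va", "##in"]
merged = merge_word_pieces(pieces)
print(merged)

# The start/end values reported by the pipeline are character offsets.
sentence = "My name is Sylvain and I work at Hugging Face in Brooklyn."
print(sentence[11:18], "|", sentence[33:45], "|", sentence[49:57])
```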

### **Question answering**

The question-answering pipeline answers questions using information from a given context:

In [40]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Dhaval and I am looking for job in AI or DS or ML. Ich wohne in Mannheim, Germany",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


{'score': 0.7723680734634399, 'start': 46, 'end': 48, 'answer': 'AI'}
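Note that this pipeline works by extracting information from the provided context; it does not generate the answer. A quick check in plain Python, reusing the context and the result shown above, makes this visible: `start` and `end` are character offsets into the context.

```python
context = (
    "My name is Dhaval and I am looking for job in AI or DS or ML. "
    "Ich wohne in Mannheim, Germany"
)
result = {"score": 0.7723680734634399, "start": 46, "end": 48, "answer": "AI"}

# Extractive QA: the answer is a slice of the context, not newly written text.
extracted = context[result["start"]:result["end"]]
print(extracted)
```

The context never says where the author works, yet the model still returns the span it finds most plausible, so a confident-looking score does not guarantee a correct answer.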

### **Summarization**

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Here’s an example:

In [41]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

Like with text generation, you can specify a `max_length` or a `min_length` for the result.

### **Translation**

For translation, you can use a default model if you provide a language pair in the task name (such as `"translation_en_to_fr"`), but the easiest way is to pick the model you want to use on the Model Hub. Here we’ll try translating from French to English:

In [46]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

ValueError: This tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed in order to use this tokenizer.

In [47]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, cache_dir="./models")
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="./models")

translator = pipeline("translation_fr_to_en", model=model, tokenizer=tokenizer)


KeyboardInterrupt: 

# **How do Transformers work?**

In this section, we will take a high-level look at the architecture of Transformer models.

### **A bit of Transformer history**

Here are some reference points in the (short) history of Transformer models:

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_chrono.svg)


The Transformer architecture was introduced in 2017. The focus of the original research was on translation tasks. This was followed by the introduction of several influential models, including:

* **June 2018**: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results
* **October 2018**: BERT, another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!)
* **February 2019**: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns

* **October 2019**: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance

* **October 2019**: BART and T5, two large pretrained models using the same architecture as the original Transformer model (the first to do so)

* **May 2020**: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)


This list is far from comprehensive, and is just meant to highlight a few of the different kinds of Transformer models. Broadly, they can be grouped into three categories:
* GPT-like (also called *auto-regressive* Transformer models)
* BERT-like (also called *auto-encoding* Transformer models)
* BART/T5-like (also called *sequence-to-sequence* Transformer models)

### **Transformers are language models**

All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as *language models*. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

This type of model develops a statistical understanding of the language it has been trained on, but it's not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called ***transfer learning***. During this process, the model is fine-tuned in a supervised way (that is, using human-annotated labels) on a given task.

An example of a task is predicting the next word in a sentence having read the *n* previous words. This is called ***causal language modeling*** because the output depends on the past and present inputs, but not the future ones.

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/causal_modeling.svg)

Another example is ***masked language modeling***, in which the model predicts a masked word in the sentence. 

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/masked_modeling.svg)
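To illustrate how both objectives compute their labels automatically from raw text, here is a minimal sketch in plain Python. It works on whole words for readability (real models operate on tokens), and the helper names and the `[MASK]` placeholder are assumptions made for this example.

```python
def causal_lm_examples(words):
    """Each prefix of the sentence becomes an input; the next word is the label."""
    return [(words[:i], words[i]) for i in range(1, len(words))]


def masked_lm_example(words, mask_index, mask_token="[MASK]"):
    """Hide one word; the hidden word becomes the label."""
    corrupted = list(words)
    label = corrupted[mask_index]
    corrupted[mask_index] = mask_token
    return corrupted, label


words = "My name is Sylvain".split()
causal = causal_lm_examples(words)
masked_input, masked_label = masked_lm_example(words, mask_index=2)

for prefix, target in causal:
    print(prefix, "->", target)
print(masked_input, "->", masked_label)
```

In both cases the label is a word taken from the sentence itself, which is why no human annotation is required.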

### **Transformers are big models**

Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is by increasing the models' sizes as well as the amount of data they are pretrained on.

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/model_parameters.png)

Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources. It even translates to environmental impact, as can be seen in the following graph.

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/carbon_footprint.svg)

And this is showing a project for a (very big) model led by a team consciously trying to reduce the environmental impact of pretraining. The footprint of running lots of trials to get the best hyperparameters would be even higher.

Imagine if each time a research team, a student organization, or a company wanted to train a model, it did so from scratch. This would lead to huge, unnecessary global costs!

This is why sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community.

By the way, you can evaluate the carbon footprint of your models’ training with several tools, for example ML CO2 Impact or Code Carbon, which is integrated in 🤗 Transformers. To learn more about this, you can read this blog post, which shows how to generate an `emissions.csv` file with an estimate of the footprint of your training, as well as the 🤗 Transformers documentation addressing this topic.

### **Transfer Learning**

*Pretraining* is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/pretraining.svg)

This pretraining is usually done on very large amounts of data. Therefore, it requires a very large corpus of data, and training can take up to several weeks.

*Fine-tuning*, on the other hand, is the training done **after** a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. Wait, why not simply train the model for your final use case from the start (**scratch**)? There are a couple of reasons:

* The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task). 
* Since the pretrained model was already trained on lots of data, the fine-tuning requires way less data to get decent results.
* For the same reason, the amount of time and resources needed to get good results are much lower. 

For example, one could leverage a pretrained model trained on the English language and then fine-tune it on an arXiv corpus, resulting in a science/research-based model. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is "transferred," hence the term *transfer learning*.

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/finetuning.svg)

Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.

This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model, one as close as possible to the task you have at hand, and fine-tune it.

### **General architecture**

#### **Introduction**

The model is primarily composed of two blocks:

* **Encoder (left)**: The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
* **Decoder (right)**: The decoder uses the encoder's representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_blocks.svg)

Each of these parts can be used independently, depending on the task:
* **Encoder-only models**: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition. 
* **Decoder-only models**: Good for generative tasks such as text generation.
* **Encoder-decoder models** or **sequence-to-sequence models**: Good for generative tasks that require an input, such as translation or summarization.

We will dive into those architectures independently in later sections. 

#### **Attention Layers**

A key feature of Transformer models is that they are built with special layers called *attention layers*. In fact, the title of the paper introducing the Transformer architecture was "**Attention Is All You Need**"! This layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.

This matters for any task associated with natural language: a word by itself has meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.
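As a rough illustration of "paying more attention to some words than others", here is a plain-Python sketch of scaled dot-product attention, the formulation used in the original paper: each word's output is a weighted average of all value vectors, with weights given by a softmax over scaled query-key dot products. The input vectors are made-up numbers, and real models add learned projections and multiple heads, which are omitted here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional vectors standing in for a three-word sentence.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how strongly one word attends to every word in the sentence, and each row sums to 1.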

Now that you have an idea of what attention layers are all about, let's take a closer look at the Transformer architecture.

#### **The Original architecture**

The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since the translation of a given word can depend on what comes after it as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.
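The restriction on future words is usually expressed as a lower-triangular mask. A minimal sketch, with a hypothetical helper name: row `i` lists which decoder input positions the word at position `i` may attend to, so the third row (positions 1 to 3 visible) is the one used to predict the fourth word.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


mask = causal_mask(4)
for row in mask:
    print(" ".join("x" if allowed else "." for allowed in row))
```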

The original Transformer architecture looked like this, with the encoder on the left and the decoder on the right:

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers.svg)

Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.

The *attention mask* can also be used in the encoder/decoder to prevent the model from paying attention to some special words, for instance the special padding word used to make all the inputs the same length when batching together sentences.
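Padding and its attention mask can be sketched as follows. The `<pad>` name and the `pad_batch` helper are assumptions for illustration; real tokenizers define their own padding token and produce the mask for you.

```python
PAD = "<pad>"  # hypothetical padding token name


def pad_batch(sentences):
    """Pad tokenized sentences to equal length and build the attention mask."""
    longest = max(len(tokens) for tokens in sentences)
    padded, mask = [], []
    for tokens in sentences:
        missing = longest - len(tokens)
        padded.append(tokens + [PAD] * missing)
        # 1 marks a real token, 0 marks padding the model should ignore.
        mask.append([1] * len(tokens) + [0] * missing)
    return padded, mask


batch = [["I", "am", "hungry"], ["Hello"]]
padded, attention_mask = pad_batch(batch)
print(padded)
print(attention_mask)
```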

#### **Architecture vs. checkpoints**

* **Architecture**: This is the skeleton of the model: the definition of each layer and each operation that happens within the model.
* **Checkpoints**: These are the weights that will be loaded in a given architecture.
* **Model**: This is an umbrella term that isn't as precise as "architecture" or "checkpoint": it can mean both. This course will specify *architecture* or *checkpoint* when it matters to reduce ambiguity.

# **Encoder Models** 

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having "bi-directional" attention, and are often called *auto-encoding models*.

The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.

Representatives of this family of models include:

* ALBERT
* BERT
* DistilBERT
* ELECTRA
* RoBERTa

# **Decoder models**

Decoder models use only the decoder of a Transformer model. At each stage, for a given word, the attention layers can only access the words positioned before it in the sentence. These models are often called *auto-regressive models*.

The pretraining of decoder models usually revolves around predicting the next word in the sentence.
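As a rough illustration, a single sentence yields one training example per position: the words seen so far are the context, and the word that follows is the target. The helper name below is made up for this sketch, and real models work on subword tokens rather than whitespace-split words.

```python
def next_word_examples(tokens):
    """Build (context, target) pairs: predict token i from tokens 0..i-1."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


examples = next_word_examples("the cat sat on the mat".split())
for context, target in examples:
    print(" ".join(context), "->", target)
# the -> cat
# the cat -> sat
# the cat sat -> on
# the cat sat on -> the
# the cat sat on the -> mat
```

The context never includes words to the right of the target, which is the restriction the decoder's attention layers enforce.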

These models are best suited for tasks involving text generation.

Representatives of this family of models include:

* CTRL
* GPT
* GPT-2
* Transformer XL

# **Sequence-to-sequence models**

Encoder-decoder models (also called *sequence-to-sequence models*) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.

The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. For instance, T5 is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces.
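The T5-style objective can be sketched as follows: each chosen span in the input is collapsed into one placeholder, and the target lists each placeholder followed by the words it hides. This is an illustration under assumptions: the function name is hypothetical, the spans are picked by hand rather than sampled at random, and the `<extra_id_N>` placeholder names follow the sentinel tokens of the T5 tokenizer in 🤗 Transformers.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span by a sentinel; end is exclusive.

    spans must be sorted and non-overlapping. Returns (inputs, targets),
    where targets pairs every sentinel with the words it replaced and
    ends with one final sentinel.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the untouched words
        inputs.append(sentinel)              # one sentinel stands in for the span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the words the model must recover
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = corrupt_spans(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(inputs))
# Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(targets))
# <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The encoder reads the corrupted input in full, and the decoder generates the target one word at a time.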

Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering. 

Representatives of this family of models include:

* BART
* mBART
* Marian 
* T5

# **Bias and limitations** 

If your intent is to use a pretrained model or a fine-tuned version in production, please be aware that, while these models are powerful tools, they come with limitations. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet.

To give a quick illustration, let's go back to the example of a `fill-mask` pipeline with the BERT model:

```python
from transformers import pipeline

# Load BERT with its masked-language-modeling head
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Each result is a dict; "token_str" holds the predicted word
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])
# Output: ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']

# Same prompt with "woman" in place of "man"
result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
```

When asked to fill in the missing word in "This man works as a [MASK]." and "This woman works as a [MASK].", the model gives only one gender-free answer (waiter/waitress). The others are occupations usually associated with one specific gender - and yes, prostitute ended up in the top 5 possibilities the model associates with "woman" and "work". This happens even though ***BERT*** is one of the rare Transformer models not built by scraping data from all over the internet, but rather using apparently neutral data (it's trained on the ***English Wikipedia*** and ***BookCorpus*** datasets).

When you use these tools, you therefore need to keep in the back of your mind that the original model you are using could very easily generate sexist, racist, or homophobic content. Fine-tuning the model on your data won't make this intrinsic bias disappear.
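One simple way to check a model for this kind of skew is to fill the same template for several subjects and look at the predictions that show up for only one of them. The helpers below are a hypothetical sketch, not part of 🤗 Transformers. They assume only that `unmasker` is a callable returning a list of dicts with a `"token_str"` key, as the `fill-mask` pipeline does.

```python
def top_tokens(unmasker, sentence):
    """Return the predicted words for the [MASK] slot, best first."""
    return [r["token_str"] for r in unmasker(sentence)]


def compare_subjects(unmasker, template, subjects):
    """Fill the template once per subject and report which predictions
    appear for one subject but for none of the others."""
    predictions = {
        s: top_tokens(unmasker, template.format(subject=s)) for s in subjects
    }
    exclusive = {}
    for subject, tokens in predictions.items():
        seen_elsewhere = {
            tok
            for other, other_tokens in predictions.items()
            if other != subject
            for tok in other_tokens
        }
        exclusive[subject] = [t for t in tokens if t not in seen_elsewhere]
    return predictions, exclusive


# Usage with the real pipeline (downloads the checkpoint on first run):
# from transformers import pipeline
# unmasker = pipeline("fill-mask", model="bert-base-uncased")
# predictions, exclusive = compare_subjects(
#     unmasker, "This {subject} works as a [MASK].", ["man", "woman"]
# )
# print(exclusive)
```

A check like this only surfaces differences for the prompts you think to try. It does not measure or remove the underlying bias.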

# **Summary**

In this chapter, you saw how to approach different NLP tasks using the high-level `pipeline()` function from 🤗 Transformers. You also saw how to search for and use models in the Hub, as well as how to use the Inference API to test the models directly in your browser.

We discussed how transformer models work at a high level, and talked about the importance of transfer learning and fine-tuning. A key aspect is that you can use the full architecture or only the encoder or decoder, depending on what kind of task you aim to solve. The following table summarizes this: 

| Model | Examples | Tasks | 
| ---- | ---- | ---- |
| Encoder | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering|
| Decoder | CTRL, GPT, GPT-2, Transformer XL | Text generation | 
| Encoder-decoder | BART, T5, Marian, mBART | Summarization, translation, generative question answering| 