## Transformer history

June 2018: GPT, the first pretrained Transformer model, which was fine-tuned on various NLP tasks and obtained state-of-the-art results

October 2018: BERT, another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!)

February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns

October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance

October 2019: BART and T5, two large pretrained models using the same architecture as the original Transformer model (the first to do so)

May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)

DistilBERT is very small compared to the other models in this list.

### Types of transformer models

The Transformer is made up of two parts, an encoder and a decoder. Each of these parts can be used independently, depending on the task:

Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.

Decoder-only models: Good for generative tasks such as text generation.


Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
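
As a rough illustration of the three families above, here is a minimal sketch of a lookup table that maps each family to the tasks listed for it. The grouping of the models from the history section into families (BERT and DistilBERT as encoder-only, the GPT series as decoder-only, BART and T5 as encoder-decoder) reflects how these models are commonly described, and `MODEL_FAMILIES` and `families_for_task` are hypothetical names made up for this example, not part of any library.

```python
from typing import Dict, List

# Illustrative only: tasks are taken from the list above; the example
# models are grouped by how their architectures are commonly described.
MODEL_FAMILIES: Dict[str, Dict[str, List[str]]] = {
    "encoder-only": {
        "examples": ["BERT", "DistilBERT"],
        "tasks": ["sentence classification", "named entity recognition"],
    },
    "decoder-only": {
        "examples": ["GPT", "GPT-2", "GPT-3"],
        "tasks": ["text generation"],
    },
    "encoder-decoder": {
        "examples": ["BART", "T5"],
        "tasks": ["translation", "summarization"],
    },
}


def families_for_task(task: str) -> List[str]:
    """Return the families whose task list contains `task` (hypothetical helper)."""
    return [name for name, info in MODEL_FAMILIES.items() if task in info["tasks"]]


print(families_for_task("translation"))  # ['encoder-decoder']
```

The table is only a memory aid: in practice the boundaries are soft, and a task can often be handled by more than one family.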

### The Transformer was introduced in "Attention Is All You Need"


![](https://huggingface.co/course/static/chapter1/transformers.png)

### Looking at transformer bias

If your intent is to use a pretrained model or a fine-tuned version in production, please be aware that, while these models are powerful tools, they come with limitations. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet.

In [4]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model="bert-base-uncased")

result = unmasker("This man works as a [MASK]")

# Build a list before printing; a bare generator expression would only print the generator object
print([r["token_str"] for r in result])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
result = unmasker("This woman works as a [MASK]")
# Build a list before printing; a bare generator expression would only print the generator object
print([r["token_str"] for r in result])


In [6]:
# from transformers import pipeline

# unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print("Profession for men : ", [r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print("Profession for women : ", [r["token_str"] for r in result])

Profession for men :  ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
Profession for women :  ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
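
To make the gap between the two printed lists concrete, here is a minimal sketch that measures how much they overlap. The two lists are copied verbatim from the output above; the `jaccard` helper is a hypothetical name introduced for this example, and Jaccard similarity is just one simple way to compare the predictions, not a standard bias metric.

```python
# Predictions copied from the notebook output above (bert-base-uncased, top 5).
men = ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
women = ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']


def jaccard(a, b):
    """Jaccard similarity of two token lists: |A & B| / |A | B| (hypothetical helper)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Two empty lists are treated as identical.
    return len(set_a & set_b) / len(union) if union else 1.0


shared = sorted(set(men) & set(women))
overlap = jaccard(men, women)

print("Shared professions:", shared)  # Shared professions: []
print("Jaccard overlap:", overlap)    # Jaccard overlap: 0.0
```

An overlap of 0.0 means the model's top five suggestions for the two sentences share no profession at all, even though the prompts differ by a single word.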
