# Types of Transformers

Transformer models fall into three main architectural types:
- Encoder-only
- Decoder-only
- Encoder-decoder

![types](res/6_transformers_types.png)

## Encoder-only Transformers - BERT Family

**Architecture**:
Encoder-only Transformers consist of a stack of identical encoder layers. Each layer has two main sublayers, each wrapped in a residual connection with layer normalization:
1. Multi-head self-attention mechanism
2. Feedforward neural network

**Key Characteristics**:
- Bidirectional context: Can look at the entire input sequence in both directions.
- Suitable for tasks that require understanding of the entire input context.
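
To make "bidirectional context" concrete, here is a minimal sketch of unmasked scaled dot-product attention in plain Python. It is a simplification rather than BERT's actual implementation: the toy vectors are made up, and the queries and keys are the inputs themselves instead of learned projections. The point is only that every position receives a non-zero weight for every other position, earlier or later.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_weights(vectors):
    """Unmasked scaled dot-product attention weights.

    Assumption: queries and keys are the input vectors themselves
    (no learned projections), which keeps the sketch short.
    """
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        # Score the query against every position, left and right alike.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    return weights

# Three toy 2-dimensional token vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = self_attention_weights(tokens)

for row in attn:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over the whole sequence; no entry is zero because nothing is masked.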

**Examples**: 
BERT (Bidirectional Encoder Representations from Transformers), RoBERTa, ALBERT

### **Use Cases**

1. **Text Classification**
   - Sentiment Analysis
   - Topic Classification
   - Spam Detection

   Example (Sentiment Analysis with BERT):

In [1]:
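from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# bert-base-uncased contains only the pre-trained encoder. The classification
# head added here is randomly initialized, so predictions are arbitrary until
# the model is fine-tuned on labeled sentiment data.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

text = "I absolutely loved this movie! The acting was superb."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1).item()

# Treating label 1 as "Positive" is a convention chosen for this example
print(f"Sentiment: {'Positive' if prediction == 1 else 'Negative'}")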

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment: Negative


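**Why:** BERT suits sentiment analysis because its bidirectional self-attention lets every word be interpreted in light of the words before and after it, which helps with negation and other context-dependent cues. Note that `bert-base-uncased` only provides the pre-trained encoder: as the warning above states, the classification head is newly initialized, so the `Negative` prediction for this clearly positive review is arbitrary. The model needs fine-tuning on labeled sentiment data before its predictions are meaningful.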
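
As a rough illustration of the last step of classification, the sketch below turns a pair of logits into a label and a confidence score using softmax. The logit values and the label order are made up for the example; they are not output from any model in this notebook.

```python
import math

def logits_to_prediction(logits, labels):
    """Convert raw logits to (label, probability) using softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical logits, as a fine-tuned two-class model might produce.
label, confidence = logits_to_prediction([-1.2, 2.3], ["Negative", "Positive"])
print(label, round(confidence, 3))
```

With an untrained head the two logits are typically close together, so the resulting probabilities sit near 0.5 and the chosen label carries little meaning.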

2. **Named Entity Recognition (NER)**
   - Identifying and classifying named entities (e.g., person names, organizations, locations) in text.

   Example (NER with BERT):

In [2]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Apple Inc. is planning to open a new store in New York City."

ner_results = nlp(example)
print(ner_results)



[{'entity': 'B-ORG', 'score': 0.9996038, 'index': 1, 'word': 'Apple', 'start': 0, 'end': 5}, {'entity': 'I-ORG', 'score': 0.9994128, 'index': 2, 'word': 'Inc', 'start': 6, 'end': 9}, {'entity': 'B-LOC', 'score': 0.99944466, 'index': 12, 'word': 'New', 'start': 46, 'end': 49}, {'entity': 'I-LOC', 'score': 0.9993468, 'index': 13, 'word': 'York', 'start': 50, 'end': 54}, {'entity': 'I-LOC', 'score': 0.9995746, 'index': 14, 'word': 'City', 'start': 55, 'end': 59}]


**Why:** Encoder-only Transformers excel at NER tasks because they can understand the context around each word, which is crucial for accurately identifying and classifying named entities.
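
The pipeline output above is per token, using `B-` (beginning) and `I-` (inside) prefixes. The following sketch shows one way to merge those tokens into whole entities using the character offsets. `group_entities` is a hypothetical helper written for this example, and the input list is the output printed above with the scores omitted. The `transformers` pipeline also accepts an `aggregation_strategy` argument that performs a similar grouping.

```python
def group_entities(text, tokens):
    """Merge B-/I- tagged tokens into (surface text, label) pairs."""
    spans = []
    for tok in tokens:
        prefix, label = tok["entity"].split("-", 1)
        if prefix == "I" and spans and spans[-1]["label"] == label:
            # Continuation of the current entity: extend its end offset.
            spans[-1]["end"] = tok["end"]
        else:
            # A B- tag (or a stray I- tag) starts a new entity.
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return [(text[s["start"]:s["end"]], s["label"]) for s in spans]

example = "Apple Inc. is planning to open a new store in New York City."
ner_results = [
    {"entity": "B-ORG", "word": "Apple", "start": 0, "end": 5},
    {"entity": "I-ORG", "word": "Inc", "start": 6, "end": 9},
    {"entity": "B-LOC", "word": "New", "start": 46, "end": 49},
    {"entity": "I-LOC", "word": "York", "start": 50, "end": 54},
    {"entity": "I-LOC", "word": "City", "start": 55, "end": 59},
]

entities = group_entities(example, ner_results)
print(entities)  # [('Apple Inc', 'ORG'), ('New York City', 'LOC')]
```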

3. **Question Answering**
   - Extracting answers from a given context based on questions.

   Example (Question Answering with DistilBERT):

In [6]:
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
# The question-answering head is required to get answer-span logits;
# the bare DistilBertModel only returns hidden states.
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

inputs = tokenizer(question, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end token positions of the answer span
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs.input_ids[0, start:end + 1])

print(answer)


**Why:** The bidirectional nature of encoder-only Transformers allows them to understand the relationship between the question and the context, making them highly effective for question-answering tasks.
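
Extractive question answering models typically produce a start score and an end score for every token, and the answer is the span whose combined score is highest. The sketch below illustrates that selection step only. The logits are made-up numbers, not model output, and `best_span` is a hypothetical helper.

```python
def best_span(start_logits, end_logits, max_answer_len=5):
    """Return the (start, end) token indices with the highest combined score."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider spans that end at or after the start position.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

tokens = ["Jim", "Henson", "was", "a", "nice", "puppet"]
start_logits = [0.1, 0.2, 0.0, 2.5, 1.0, 0.3]  # made-up values
end_logits = [0.0, 0.3, 0.1, 0.2, 0.9, 3.0]    # made-up values

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)  # a nice puppet
```

Restricting the search to `end >= start` avoids the invalid spans that taking two independent argmax values can produce.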

## Decoder-only Transformers - GPT Family

**Architecture**: 
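Decoder-only Transformers consist of a stack of decoder layers. Each layer typically includes:
1. Masked (causal) multi-head self-attention mechanism
2. Feedforward neural network

After the final layer, a single output head projects each hidden state onto the vocabulary, and a softmax turns the result into next-token probabilities.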

**Key Characteristics**:
- Autoregressive: Generates output sequentially, one element at a time.
- Unidirectional context: Each position can only attend to previous positions.
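
The unidirectional constraint can be sketched as attention in which position `i` scores only positions `0..i`. This is a simplified illustration in plain Python with made-up vectors and no learned projections. Real implementations usually get the same effect by setting the scores of future positions to negative infinity before the softmax.

```python
import math

def causal_attention_weights(vectors):
    """Scaled dot-product attention weights with a causal mask."""
    dim = len(vectors[0])
    weights = []
    for i, query in enumerate(vectors):
        # Position i may only look at itself and earlier positions.
        visible = vectors[: i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Future positions get exactly zero weight.
        row = [e / total for e in exps] + [0.0] * (len(vectors) - i - 1)
        weights.append(row)
    return weights

# Three toy 2-dimensional token vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = causal_attention_weights(tokens)

for row in attn:
    print([round(w, 3) for w in row])
```

The printed matrix is lower triangular: the first token attends only to itself, while the last token attends to all three.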

**Examples**: 
GPT (Generative Pre-trained Transformer) series, CTRL

### **Use Cases**

1. **Text Generation / Text Completion**
   - Creative Writing
   - Code Completion
   - Chatbots

   Example (Text Generation with GPT-2):

In [7]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

prompt = "In a world where AI has become sentient,"
# Calling the tokenizer returns both input_ids and attention_mask
inputs = tokenizer(prompt, return_tensors='pt')
output = model.generate(
    **inputs,
    max_length=100,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)


In a world where AI has become sentient, it's hard to imagine a better time to be a human.

"We're going to have to start thinking about how we're doing things," said Dr. Michael S. Hirsch, a professor of psychology at the University of California, San Francisco. "We need to think about what we can do to make it better."
...
, which is a collection of essays by the author, including a series of short stories by


**Why:** Decoder-only Transformers are ideal for text generation because they can generate coherent and contextually relevant text by predicting the next token based on the previous ones. This autoregressive nature allows them to maintain consistency and coherence in longer generated sequences.
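
The autoregressive loop itself is simple and can be sketched without a neural network. In the toy example below, a hand-written score table stands in for the model. This is a strong simplification: the table looks only at the last token, whereas a Transformer conditions on all previous tokens. All tokens and scores are made up.

```python
# Hypothetical next-token scores standing in for a trained model.
NEXT_TOKEN_SCORES = {
    "<bos>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.2},
    "cat": {"sat": 2.0, "<eos>": 0.5},
    "sat": {"<eos>": 3.0},
}

def generate(prompt, max_new_tokens=10):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Unknown contexts fall back to ending the sequence.
        scores = NEXT_TOKEN_SCORES.get(tokens[-1], {"<eos>": 0.0})
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break
        # The new token becomes part of the context for the next step.
        tokens.append(next_token)
    return tokens

generated = generate(["<bos>"])
print(generated)  # ['<bos>', 'the', 'cat', 'sat']
```

Greedy selection is only one decoding strategy; sampling and beam search replace the `max` step but keep the same token-by-token loop.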

## Encoder-Decoder Transformers - T5 Family

**Architecture**: 
Encoder-Decoder Transformers, also known as sequence-to-sequence (seq2seq) Transformers, consist of both encoder and decoder stacks. The architecture includes:
1. Encoder stack (similar to encoder-only Transformers), which reads the full input bidirectionally
2. Decoder stack (similar to decoder-only Transformers), which generates the output one token at a time
3. Cross-attention in each decoder layer, which lets the decoder attend to the encoder's output

**Key Characteristics**:
- Suitable for tasks that transform one sequence into another.
- Can handle input and output sequences of different lengths.
- Combines benefits of both encoder and decoder architectures.
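
Cross-attention is what allows the input and output lengths to differ: queries come from the decoder, while keys and values come from the encoder. The sketch below is a simplified illustration in plain Python with made-up vectors and no learned projections, and it reuses the encoder states as both keys and values.

```python
import math

def cross_attention(decoder_states, encoder_states):
    """Each decoder position attends over all encoder positions."""
    dim = len(encoder_states[0])
    outputs = []
    for query in decoder_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in encoder_states
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the encoder states (used here as the values).
        mixed = [
            sum(w * value[d] for w, value in zip(weights, encoder_states))
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[0.5, 0.5], [1.0, -1.0]]             # 2 target positions

context = cross_attention(decoder_states, encoder_states)
print(len(context), "context vectors, one per decoder position")
```

The result has one vector per decoder position regardless of how many encoder positions there are.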

**Examples**: 
T5 (Text-to-Text Transfer Transformer), BART, Marian

### **Use Cases**

1. **Machine Translation**
   - Translating text from one language to another

   Example (Translation with T5):

In [1]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

input_text = "translate English to German: The house is wonderful."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translated_text)



Das Haus ist wunderbar.


**Why:** Encoder-Decoder Transformers are ideal for machine translation because they can encode the meaning of the input sequence in one language and decode it into another language. The encoder captures the context of the input, while the decoder generates the translation.

2. **Text Summarization**
   - Generating concise summaries of longer texts

   Example (Summarization with T5):

In [2]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

long_text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to natural intelligence displayed by animals including humans. AI research has been defined as the field of study of intelligent agents, which refers to any system that perceives its environment and takes actions that maximize its chance of achieving its goals. The term "artificial intelligence" had previously been used to describe machines that mimic and display "human" cognitive skills that are associated with the human mind, such as "learning" and "problem-solving". This definition has since been rejected by major AI researchers who now describe AI in terms of rationality and acting rationally, which does not limit how intelligence can be articulated.
"""

input_text = "summarize: " + long_text
input_ids = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True).input_ids

outputs = model.generate(input_ids, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(summary)

AI research has been defined as the field of study of intelligent agents. the term "artificial intelligence" had previously been used to describe machines that mimic and display "human" cognitive skills associated with the human mind.


**Why:** Encoder-Decoder Transformers are well-suited for summarization tasks because they can encode the important information from a long input text and decode it into a concise summary. The encoder captures the key points, while the decoder generates a coherent summary.

## Conclusion

Each type of Transformer - encoder-only, decoder-only, and encoder-decoder - has its own strengths and suits different kinds of tasks.

- **Encoder-only Transformers** excel at understanding and analyzing text, making them ideal for classification, named entity recognition, and extractive question answering.
- **Decoder-only Transformers** are powerful for text generation tasks, including creative writing, code completion, and chatbots.
- **Encoder-Decoder Transformers** combine the strengths of both architectures, making them suitable for tasks that transform one sequence into another, such as translation, summarization, question answering, and certain types of conversational AI.