%%html
<style>
table {float:left}
</style>

## Step 1: Identify Task Type.
- **Understand Text**.
  - Encoder-Only Transformers.
- **Generative Tasks**.
  - Decoder-Only Transformers.
- **Transform Text to Text**.
  - Encoder-Decoder Transformers.
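
Step 1 can be sketched as a lookup table. The task-family keys and the function name are hypothetical. The class names are the 🤗 Transformers Auto classes usually paired with each architecture, kept as plain strings so the snippet runs without the library installed.

```python
# Hypothetical mapping from task family to (architecture, Auto class name).
ARCHITECTURE_BY_TASK = {
    "understand": ("encoder-only", "AutoModelForSequenceClassification"),
    "generate": ("decoder-only", "AutoModelForCausalLM"),
    "transform": ("encoder-decoder", "AutoModelForSeq2SeqLM"),
}


def pick_architecture(task_family):
    """Return (architecture, suggested Auto class name) for a task family."""
    try:
        return ARCHITECTURE_BY_TASK[task_family]
    except KeyError:
        raise ValueError(f"unknown task family: {task_family!r}") from None


print(pick_architecture("transform"))
```

The "understand" entry assumes a classification head; token-level tasks such as NER would use a different head on the same encoder.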

## Step 2: Computational Resources.
- **Smaller models need fewer resources.**
- **Number of Parameters**:
  - Small: <100M.
  - Medium: 100M ~ 500M.
  - Large: >1B.
- **Training and Inference Time**:
  - Training Time: Important for fine-tuning.
  - Inference Time: Important for deployment.
  - `PyTorchBenchmark`: 🤗 Transformers class for measuring training and inference time.
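
If a benchmarking utility is not available, latency can be measured by hand. The sketch below uses only the standard library; `measure_latency` and `fake_model` are hypothetical names, and `fake_model` is a stand-in for a real forward pass.

```python
import statistics
import time


def measure_latency(fn, *args, warmup=2, repeats=10):
    """Return (mean, stdev) wall-clock seconds for calling fn(*args)."""
    for _ in range(warmup):
        fn(*args)  # warm-up calls are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def fake_model(text):
    # Stand-in workload; replace with a real model call.
    return sum(ord(ch) for ch in text * 1000)


mean_s, std_s = measure_latency(fake_model, "a sample review")
print(f"{mean_s * 1e3:.3f} ms +/- {std_s * 1e3:.3f} ms")
```

When timing a model on a GPU, the device should be synchronized before each clock read (for example with `torch.cuda.synchronize()`), otherwise asynchronous kernel launches make the numbers look too small.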

## Step 3: Fine-Tuning Requirements.
- In general, larger models require more training data for fine-tuning.
- **Small**:
  - ~10,000 training examples.
  - Examples:
    - Few-shot learning tasks.
    - Domain-specific tasks with limited labeled data (e.g., legal documents, rare languages).
  - Use pre-trained models without extensive fine-tuning.
  - Smaller models (e.g., DistilBERT, T5-Small) can reduce the risk of overfitting.
  - Consider data augmentation.
- **Medium**:
  - Up to ~1M training examples.
  - Examples:
    - General-purpose datasets like SST-2 (67k sentences for sentiment analysis).
    - Tasks like NER or extractive QA with moderate data availability.
  - Find domain or task-specific base model and data for fine-tuning.
  - Also consider evaluation on pre-trained objectives.
- **Large**:
  - More than 1M training examples.
  - Examples:
    - Massive datasets like C4 (T5 pre-training, ~750GB of text).
    - Language modeling datasets (e.g., OpenAI's WebText for GPT).
  - Consider distributed training across multiple GPUs or TPUs.

## Step 4: Pretraining Data.
- **General-Purpose Pretraining**: 
   - Models like **BERT**, **GPT-2**, or **DistilBERT** are trained on broad datasets and can be applied to most NLP tasks.
- **Domain-Specific Pretraining**: 
   - If your task requires domain-specific knowledge (e.g., medical, legal, or financial tasks), look for models like **BioBERT** (biomedical), **LegalBERT** (legal), or fine-tune a general model on your domain-specific data.
 
## Additional Checklist.

### Consider Inference Speed.
   - **Real-time applications** (chatbots, recommendation systems): 
     - Choose models optimized for fast inference like **DistilGPT-2**, **MobileBERT**, or **DistilBERT**.
   - **Moderate speed required** (general tasks with balanced performance): 
     - **BERT**, **RoBERTa**, or **GPT-2**.
   - **Slower models** (if not time-sensitive but need better accuracy): 
     - **T5**, **Pegasus**, or **GPT-3**.

### Multilingual Model (If Needed).
   - **Multilingual Tasks**: 
     - If your task involves multiple languages, choose models like **XLM-RoBERTa**, **mBERT**, or **MarianMT**.
   - **Single Language Task**: 
     - Use monolingual models like **BERT** or **GPT-2** if you're focused on one language.


# 2. Types of Transformers.

## 2.1. Encoder-Only Models.
- **Overview**: These models utilize only the encoder stack of the transformer architecture. They focus on understanding and contextualizing input text.
- **Examples**: BERT, RoBERTa, DistilBERT.
- **Strengths**:
  - Ideal for tasks requiring deep understanding of text, such as text classification and named entity recognition (NER).
  - Trained using Masked Language Modeling (MLM), which captures bidirectional context effectively.
- **Limitations**:
  - Not suitable for tasks requiring text generation, like story writing or dialogue generation.
  - Produces contextual embeddings of the input rather than generating novel text.
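
The MLM objective mentioned above can be sketched in plain Python. The 15% selection rate and the 80/10/10 split (mask, random token, unchanged) follow the recipe described for BERT; the function name, the toy vocabulary, and the use of `None` for ignored positions are assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a toy masked-language-modeling step."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: keep unchanged
        else:
            inputs.append(tok)
            labels.append(None)  # position ignored by the loss
    return inputs, labels


tokens = "the movie was surprisingly good".split()
vocab = ["the", "movie", "was", "good", "bad", "film"]
print(mask_tokens(tokens, vocab, mask_prob=0.5))
```

Because the model sees the unmasked tokens on both sides of each masked position, the prediction uses bidirectional context.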


| **Model**            | **HF Shortcut**  | **Pretraining Objectives** | **Tokenizer**      | **Parameters** | **Max Seq. Length** | **Training Speed** | **Inference Speed** | **Pretraining Data** | **Fine-Tuning Requirements** | **Task-Specific Performance** | **Special Features**                | **Useful Tasks**                   |
|----------------------|------------------|----------------------------|--------------------|----------------|---------------------|--------------------|---------------------|---------------------|----------------------------|-----------------------------|-----------------------------------|-------------------------------------|
| BERT (2018)         | `bert-base-uncased` | MLM, NSP                   | WordPiece          | 110M           | 512                 | Moderate          | Moderate            | General             | Moderate                   | Strong                     | Bidirectional context           | Text classification, NER, QA       |
| RoBERTa (2019)      | `roberta-base`   | MLM                        | Byte-Pair          | 125M           | 512                 | Moderate          | Moderate            | Larger corpus       | Moderate                   | Strong                     | No NSP; improved over BERT      | Text classification, NER, QA       |
| DistilBERT (2019)   | `distilbert-base-uncased` | MLM                        | WordPiece          | 66M            | 512                 | Fast              | Fast                | General             | Low                        | Moderate                   | Smaller, faster version of BERT | Text classification, NER, QA       |
| ALBERT (2019)       | `albert-base-v2` | MLM, Sentence Ordering     | SentencePiece      | 12M            | 512                 | Fast              | Moderate            | General             | Low                        | Moderate                   | Parameter sharing, memory-efficient | Text classification, NER, QA       |
| XLNet (2019)        | `xlnet-base-cased` | Permuted LM                | SentencePiece      | 117M           | 512                 | Slow              | Moderate            | General             | High                       | Strong                     | Handles permuted context        | Text classification, QA            |
| ELECTRA (2020)      | `google/electra-base-discriminator` | Replaced Token Detection   | WordPiece          | 110M           | 512                 | Moderate          | Moderate            | General             | Moderate                   | Strong                     | Efficient pretraining           | Text classification, QA            |
| XLM-RoBERTa (2019)  | `xlm-roberta-base` | MLM                        | SentencePiece      | 270M           | 512                 | Slow              | Moderate            | Multilingual        | High                       | Strong (multilingual)      | Multilingual RoBERTa            | Multilingual tasks, QA, NER        |
| MobileBERT (2020)   | `google/mobilebert-uncased` | MLM, NSP                   | WordPiece          | 25M            | 512                 | Fast              | Fast                | General             | Low                        | Moderate                   | Optimized for mobile            | Text classification, NER, QA       |
| CamemBERT (2019)    | `camembert-base` | MLM                        | SentencePiece      | 110M           | 512                 | Moderate          | Moderate            | French              | Moderate                   | Strong (French)            | French variant of RoBERTa       | French text processing, NER, QA    |
| LayoutLM (2020)     | `microsoft/layoutlm-base-uncased` | MLM                        | WordPiece          | 113M           | 512                 | Moderate          | Moderate            | Document Layout     | High                       | Specialized                | For document understanding      | Document analysis, form extraction |
| Longformer (2020)   | `allenai/longformer-base-4096` | MLM                        | Byte-Pair          | 149M           | 4096                | Slow              | Moderate            | General             | High                       | Strong                     | Efficient attention for long texts | Long document QA, summarization   |



## 2.2. Decoder-Only Models.
- **Overview**: These models employ only the decoder stack and are optimized for autoregressive text generation.
- **Examples**: GPT series (GPT-2, GPT-3, etc.), GPT-Neo.
- **Strengths**:
  - Excels in generative tasks, such as text completion, story generation, and conversational AI.
  - Predicts the next token in a sequence, making it highly effective for language modeling.
- **Limitations**:
  - Less effective for tasks requiring bidirectional understanding of text, such as NER or extractive QA.
  - Slower inference for longer sequences due to sequential token generation.

| **Model**            | **HF Shortcut**  | **Pretraining Objectives** | **Tokenizer**      | **Parameters** | **Max Seq. Length** | **Training Speed** | **Inference Speed** | **Pretraining Data** | **Fine-Tuning Requirements** | **Task-Specific Performance** | **Special Features**                | **Useful Tasks**                   |
|----------------------|------------------|----------------------------|--------------------|----------------|---------------------|--------------------|---------------------|---------------------|----------------------------|-----------------------------|-----------------------------------|-------------------------------------|
| GPT-2 (2019)        | `gpt2`           | Autoregressive             | Byte-Pair          | 117M           | 1024                | Moderate          | Slow                | General             | Moderate                   | Moderate                   | Text generation focus           | Text generation, summarization     |
| CTRL (2019)         | `ctrl`            | Autoregressive             | Byte-Pair          | 1.6B           | 512                 | Slow              | Slow                | General             | High                       | Strong                     | Controlled text generation      | Controlled text generation         |
| DialoGPT (2019)     | `microsoft/DialoGPT-medium` | Autoregressive             | Byte-Pair          | 345M           | 1024                | Moderate          | Moderate            | Conversational      | Moderate                   | Moderate                   | Specialized for dialogues       | Dialogue systems, text generation  |


## 2.3. Encoder-Decoder Models.
- **Overview**: These models use both the encoder and decoder stacks, offering a balanced approach for understanding and generation.
- **Examples**: T5, BART, mT5.
- **Strengths**:
  - Versatile, suitable for a wide range of tasks such as summarization, translation, and generative QA.
  - Combines strong text understanding (encoder) with powerful text generation (decoder).
- **Limitations**:
  - Higher computational cost due to the dual-stack architecture.
  - Requires careful fine-tuning to achieve optimal results for specific tasks.


| **Model**            | **HF Shortcut**  | **Pretraining Objectives** | **Tokenizer**      | **Parameters** | **Max Seq. Length** | **Training Speed** | **Inference Speed** | **Pretraining Data** | **Fine-Tuning Requirements** | **Task-Specific Performance** | **Special Features**                | **Useful Tasks**                   |
|----------------------|------------------|----------------------------|--------------------|----------------|---------------------|--------------------|---------------------|---------------------|----------------------------|-----------------------------|-----------------------------------|-------------------------------------|
| T5 (2019)           | `t5-base`        | Text-to-Text               | SentencePiece      | 220M           | 512                 | Moderate          | Moderate            | General             | Moderate                   | Strong                     | Unified text-to-text tasks      | Text-to-text tasks, summarization  |
| BART (2020)         | `facebook/bart-base` | Denoising Objectives       | Byte-Pair          | 140M           | 1024                | Moderate          | Moderate            | General             | Moderate                   | Strong                     | Flexible for text generation    | Summarization, QA                  |
| MarianMT (2020)     | `Helsinki-NLP/opus-mt-en-ro` | Translation Task           | SentencePiece      | 61M            | 512                 | Moderate          | Moderate            | Multilingual        | Moderate                   | Moderate                   | Specialized for translation     | Translation                        |
| Pegasus (2020)      | `google/pegasus-large` | Gap Sentence Generation    | SentencePiece      | 568M           | 512                 | Slow              | Moderate            | Summarization       | High                       | Strong                     | Specialized for summarization   | Summarization                      |


# 3. Criteria.

## 3.1. **Type.**  
  - **Encoder-Only.**
    - Processes input sequences to extract meaningful representations.
    - Understanding tasks: text classification, NER, and text similarity.
  - **Decoder-Only.**
    - Predicts tokens autoregressively based on the previous context.
    - Generative tasks: text generation, summarization, and dialogue systems.
  - **Encoder-Decoder.**
    - Encodes input sequences and generates outputs.
    - Versatile for both, but specialized for input-to-output mapping such as translation, summarization, and generative QA.
    - Because both stacks are required, it can be slower and more resource-intensive, especially for simpler tasks like text classification or NER.
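
To make "predicts tokens autoregressively" concrete, here is a toy greedy decoding loop. A bigram count table stands in for the decoder, which is a strong simplification: a real decoder-only model conditions on the whole prefix, not just the last token. All names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count which token follows which, with start/end markers."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def greedy_generate(counts, max_new_tokens=10):
    """Generate one token at a time, always taking the most likely next token."""
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        candidates = counts.get(tokens[-1])
        if not candidates:
            break
        next_token = candidates.most_common(1)[0][0]
        if next_token == "</s>":
            break
        tokens.append(next_token)  # the new token becomes context for the next step
    return tokens[1:]


corpus = ["the model writes text", "the model writes code", "the model writes text"]
print(" ".join(greedy_generate(train_bigram(corpus))))
# the model writes text
```

Each step needs the output of the previous one, which is why generation cost grows with the number of tokens produced.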

## 3.2. **Pretraining Objectives.**  
  - **Masked Language Modeling (MLM).**  
    - Helps understand bidirectional context by predicting randomly masked tokens.  
    - Commonly used in models like BERT and RoBERTa.  
  - **Next Sentence Prediction (NSP).**  
    - Trains the model to understand relationships between sentences.  
    - Used in BERT but removed in RoBERTa, whose authors found that dropping it did not hurt downstream performance.  
  - **Autoregressive Language Modeling.**  
    - Predicts the next token based on previous tokens, ideal for generation tasks.  
    - Used in decoder-only models like GPT.  
  - **Text-to-Text.**  
    - Converts all tasks into a unified text-to-text format, simplifying multi-task learning.  
    - Central to models like T5.  
  - **Sentence Ordering.**  
    - Predicts whether two consecutive text segments appear in their original order or have been swapped.  
    - Used in ALBERT for improved coherence understanding.  
  - **Replaced Token Detection.**  
    - Identifies tokens replaced by a generator network instead of standard masking.  
    - Makes pretraining more sample-efficient in ELECTRA, because every token position contributes to the loss.  
  - **Gap Sentence Generation (GSG).**  
    - Removes whole sentences from the input document and trains the model to generate them.  
    - Emphasizes abstractive tasks and is used in Pegasus.  
  - **Permutation Language Modeling (PLM).**  
    - Predicts tokens autoregressively under randomly sampled factorization orders, so each token can condition on context from both sides.  
    - Used in XLNet for improved autoregressive and bidirectional learning.  
  - **Denoising Objectives.**  
    - Corrupts the input with noise such as token deletion or masking and trains the model to reconstruct the original text.  
    - Central to BART for robust generative performance.  
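
The text-to-text objective in the list above can be illustrated with a small prompt builder. The prefixes `summarize:` and `translate English to German:` match the ones used with the released T5 checkpoints; the function and its arguments are hypothetical.

```python
def to_text_to_text(task, text, source_lang=None, target_lang=None):
    """Render a task as a single input string in T5's text-to-text style."""
    if task == "summarize":
        return f"summarize: {text}"
    if task == "translate":
        if not (source_lang and target_lang):
            raise ValueError("translation needs source_lang and target_lang")
        return f"translate {source_lang} to {target_lang}: {text}"
    raise ValueError(f"unsupported task: {task!r}")


print(to_text_to_text("translate", "The house is small.", "English", "German"))
# translate English to German: The house is small.
```

Since every task becomes "string in, string out", one model and one loss function can serve all of them.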

## 3.3. **Tokenizer.**  
  - **WordPiece.**  
    - Breaks text into subwords, balancing vocabulary size and granularity.  
    - Useful Tasks: Text classification, NER, QA.  
    - Typical Models: BERT, DistilBERT.  
  - **Byte-Pair Encoding (BPE).**  
    - Repeatedly merges the most frequent adjacent symbol pairs into subwords; commonly used in generative models.  
    - Useful Tasks: Text generation, summarization.  
    - Typical Models: GPT, RoBERTa.  
  - **SentencePiece.**  
    - Learns a subword vocabulary directly from raw text without language-specific pre-tokenization, which suits multilingual applications.  
    - Useful Tasks: Translation, summarization.  
    - Typical Models: T5, XLM-RoBERTa.  
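
A toy version of the BPE merge loop, to show what "merging frequent pairs" means in practice. This is a simplified sketch: it works on characters with no end-of-word marker, whereas GPT-2 and RoBERTa use a byte-level variant. The word frequencies are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that occurrences of `pair` become one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Words as tuples of symbols, mapped to (made-up) corpus frequencies.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
for step in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(step, pair)
```

The learned merges, applied in order, are the tokenizer: frequent fragments such as `est` end up as single tokens, while rare words fall back to smaller pieces.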

## 3.4. **Parameters.**
  - **Small.**
    - Less than 50M parameters; faster and more resource-efficient.
    - Useful Tasks: Simple classification, small-scale tasks.
    - Typical Models: ALBERT, MobileBERT.
  - **Medium.**
    - 50–200M parameters; balances performance and efficiency.
    - Useful Tasks: General-purpose tasks, moderate complexity.
    - Typical Models: BERT, RoBERTa, MarianMT.
  - **Large.**
    - More than 200M parameters; captures complex patterns but requires more resources.
    - Useful Tasks: Complex tasks, high accuracy demands.
    - Typical Models: GPT-2, T5, Pegasus.

## 3.5. **Max Sequence Length.**
  - **Short.**
    - Well under 512 tokens (single sentences or short passages); sufficient for most tasks.
    - Useful Tasks: Text classification, sentiment analysis.
  - **Medium.**
    - Up to the 512-token limit common to models like BERT.
    - Useful Tasks: Named Entity Recognition (NER), QA.
  - **Long.**
    - Over 512 tokens; necessary for long documents or contexts (e.g., Longformer).
    - Useful Tasks: Long document summarization, document analysis.
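
When a long-context model is not an option, a common workaround is to split the input into overlapping windows that each fit the model's limit. A minimal sketch on a list of token IDs; the function name and the default `stride` of 128 overlapping tokens are assumptions.

```python
def chunk_tokens(token_ids, max_length=512, stride=128):
    """Split token_ids into windows of max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks


ids = list(range(1200))
print([len(chunk) for chunk in chunk_tokens(ids)])
# [512, 512, 432]
```

The overlap keeps some context around chunk boundaries, at the cost of running the model on more tokens in total.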

## 3.6. **Training Speed.**
  - **Fast.**
    - Models like DistilBERT or MobileBERT, optimized for quick fine-tuning.
    - Useful Tasks: Low-resource tasks, rapid deployment.
  - **Moderate.**
    - Standard models like BERT or RoBERTa.
    - Useful Tasks: General-purpose tasks with a balance of efficiency.
  - **Slow.**
    - Larger models like RoBERTa-large or Pegasus, requiring more time to fine-tune.
    - Useful Tasks: Complex tasks with significant resources.

## 3.7. **Inference Speed.**
  - **Real-Time Optimized.**
    - Models like MobileBERT, suitable for low-latency applications.
    - Useful Tasks: Real-time classification, dialogue systems.
  - **General-Purpose.**
    - Models like BERT or RoBERTa, offering a balance of speed and accuracy.
    - Useful Tasks: General-purpose inference tasks.
  - **Slow.**
    - Generative models like GPT-2 or Pegasus, which may not suit real-time tasks.
    - Useful Tasks: Creative text generation, summarization.

## 3.8. **Pretraining Data.**
  - **General-Purpose.**
    - Trained on datasets like Wikipedia and BooksCorpus.
    - Useful Tasks: Text classification, QA.
  - **Multilingual.**
    - Trained on diverse languages, enabling cross-lingual tasks (e.g., XLM-RoBERTa).
    - Useful Tasks: Multilingual text processing, translation.
  - **Domain-Specific.**
    - Tailored for specific fields like biomedical or legal text.
    - Useful Tasks: Specialized tasks in specific domains (e.g., medical text analysis).

## 3.9. **Fine-Tuning Requirements.**
  - **Low.**
    - Small models for simple tasks (e.g., DistilBERT).
    - Useful Tasks: Simple text classification, sentiment analysis.
  - **Moderate.**
    - Mid-size models for general-purpose tasks (e.g., BERT).
    - Useful Tasks: General-purpose tasks, moderate complexity.
  - **High.**
    - Large models for complex tasks or when fine-tuning requires significant data.
    - Useful Tasks: Complex language generation, translation, summarization.

## 3.10. **Task-Specific Performance.**
  - **Text Classification.**
    - Tasks like IMDb sentiment analysis, where BERT excels.
  - **Text Generation.**
    - Tasks like dialogue systems or story generation, suited for GPT-like models.
  - **Summarization.**
    - Best handled by models like BART or Pegasus.
  - **Translation.**
    - Specialized models like MarianMT or encoder-decoder architectures like T5.


# 4. Examples.

## Example 1: Sentiment Analysis.
- **Task**: Classify customer reviews as positive, negative, or neutral.
- **Dataset Size**: 50,000 labeled examples.
- **Resources**: Single GPU with 8GB VRAM.
- **Deployment Needs**: Moderate inference speed.
- **Model Choice**: **DistilBERT**, **ALBERT**, or **MobileBERT**.
  - Encoder-only models are ideal for classification tasks.
  - Lightweight models are chosen due to limited resources and small dataset size.
  - **Recommendation**: 
    - Use **DistilBERT** for simplicity.
    - Use **ALBERT** for better memory efficiency.
    - Use **MobileBERT** for faster inference.

---

## Example 2: Text Summarization.
- **Task**: Summarize news articles into concise summaries.
- **Dataset Size**: 1 million articles with summaries.
- **Resources**: GPU cluster for training but limited resources for deployment.
- **Deployment Needs**: High-quality summaries; inference speed is secondary.
- **Model Choice**: **T5** or **Pegasus**.
  - Encoder-decoder models are suitable for summarization.
  - **Recommendation**: 
    - Use **Pegasus** for specialized summarization tasks.
    - Use **T5** for broader flexibility across tasks.

---

## Example 3: Multilingual Machine Translation.
- **Task**: Translate texts between English, French, and German.
- **Dataset Size**: 10 million parallel sentences.
- **Resources**: Access to TPUs for training.
- **Deployment Needs**: Fast inference for real-time translation.
- **Model Choice**: **MarianMT**.
  - Designed explicitly for translation with strong multilingual capabilities.
  - Optimized for fast inference, meeting real-time deployment needs.

---

## Example 4: Model Size Limits on an RTX 3070 (8GB VRAM).

### Fine-Tuning (Training Mode):
- **With 32-bit precision**:
  - Supports ~300M to 400M parameters (e.g., BERT-Base, RoBERTa-Base).
- **With 16-bit mixed precision**:
  - Supports ~600M to 700M parameters (e.g., T5-Base, GPT-2 Medium).
- **Batch size and sequence length impact**:
  - Larger batch sizes or longer sequences will reduce the maximum parameter count.
  - Smaller batch sizes or shorter sequences allow for slightly larger models.
- **BERT-Base** (~110M parameters):
  - Easily fits with moderate batch sizes (16–32) and sequence lengths (up to 512 tokens).
- **GPT-2 Medium** (~345M parameters):
  - Fits with mixed precision and smaller batch sizes (~8–16).

### Inference Mode:
- Supports models of up to roughly 1B parameters for single-sequence inference.
- Inference requires less memory since there are no gradients or optimizer states.
- **T5-Large** (~770M parameters):
  - Possible for inference but may require reducing batch size or sequence length.
- **GPT-2 XL** (~1.5B parameters):
  - Might work for inference with single sequences using optimizations like offloading.
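
A back-of-the-envelope check of the figures above. The bytes-per-parameter values are a common rule of thumb, not a measurement: fp32 training with Adam is assumed to need about 16 bytes per parameter (4 for weights, 4 for gradients, 8 for the two Adam moments), and inference is assumed to need only the weights. Activations, which grow with batch size and sequence length, and framework overhead are ignored, so real usage will be higher. Mixed-precision training is not modeled.

```python
# Assumed bytes of GPU memory per parameter, excluding activations.
BYTES_PER_PARAM = {
    "train_fp32_adam": 16,  # 4 weights + 4 gradients + 8 Adam moments
    "infer_fp32": 4,        # weights only
    "infer_fp16": 2,        # half-precision weights only
}


def estimate_gb(num_params, mode):
    """Rough memory in GB (10^9 bytes) for weights and optimizer state."""
    return num_params * BYTES_PER_PARAM[mode] / 1e9


cases = [
    ("BERT-Base", 110_000_000, "train_fp32_adam"),
    ("GPT-2 Medium", 345_000_000, "train_fp32_adam"),
    ("T5-Large", 770_000_000, "infer_fp32"),
    ("GPT-2 XL", 1_500_000_000, "infer_fp32"),
]
for name, n_params, mode in cases:
    print(f"{name:13s} {mode:16s} {estimate_gb(n_params, mode):5.2f} GB")
```

Under these assumptions GPT-2 Medium already needs about 5.5 GB for fp32 training before any activations, which is consistent with it fitting on an 8GB card only with small batches.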