### Introduction to Transformers

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

#### Question 1: Understanding Pipelines

1. What is a pipeline in Hugging Face Transformers? What does it abstract away from the user?

A Hugging Face Transformers pipeline is a robust, modular, and high-level API that consolidates all steps required for machine learning inference *(covering natural language, vision, or audio)* inside a single callable object for user simplicity and efficiency.

Enhanced Details on Pipeline Architecture

- **Compositional Structure:** Each pipeline manages the lifecycle and orchestration of a model, its configuration, tokenizer or processor, device assignment (CPU/GPU), and any feature extractors or postprocessing layers needed for the target task.

- **Three-Step Execution:**
    - **Preprocessing:** Raw inputs are automatically converted to model-compatible representations *(tokenization for text, feature extraction for images/audio, and possible chunking for long inputs)*.

    - **Inference:** The processed inputs are moved to the correct device, efficiently batched if configured, and passed through the model *(including using quantized weights or mixed-precision where specified)*.

    - **Postprocessing:** Raw model outputs *(e.g., logits, hidden states)* are decoded or mapped into user-friendly formats, such as classification labels, generated text, or extracted features, depending on the pipeline class.

- **Advanced Configuration:** Pipelines support explicit batch size tuning, model quantization, device mapping across multi-GPU setups, custom model cards, chunked data processing, and advanced error recovery for out-of-memory issues.

By wrapping input handling, device logic, batching, and output formatting, pipelines let users avoid the manual drudgery of model loading, weight management, tokenization, tensor placement, and inference code *(making experimentation, deployment, and switching domains virtually seamless)*.
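The three stages described above can be sketched with a toy example. Everything here is hypothetical (the vocabulary, the weights, and the function names are made up for illustration); a real pipeline loads a tokenizer and model from a checkpoint and also converts logits into probabilities.

```python
# Hypothetical toy vocabulary and per-token weights (columns: NEGATIVE, POSITIVE).
VOCAB = {"good": 0, "bad": 1, "great": 2, "awful": 3}
WEIGHTS = [
    [-1.0, 1.0],   # good
    [1.0, -1.0],   # bad
    [-2.0, 2.0],   # great
    [2.0, -2.0],   # awful
]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Preprocessing: turn raw text into token ids, skipping unknown words."""
    return [VOCAB[w] for w in text.lower().split() if w in VOCAB]


def forward(token_ids):
    """Inference stand-in: sum the per-token weights into one logit per label."""
    logits = [0.0] * len(LABELS)
    for tid in token_ids:
        for j in range(len(LABELS)):
            logits[j] += WEIGHTS[tid][j]
    return logits


def postprocess(logits):
    """Postprocessing: map the highest logit to a human-readable label."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return {"label": LABELS[best]}


def toy_pipeline(text):
    """Chain the three stages, as a pipeline object does when it is called."""
    return postprocess(forward(preprocess(text)))


result = toy_pipeline("great movie but bad ending")
print(result)
```

The caller only sees `toy_pipeline(text)`; the tokenization, the forward pass, and the label mapping stay hidden, which is the abstraction the pipeline API provides.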

2. Visit the [Pipeline Tutorial — Hugging Face](https://huggingface.co/docs/transformers/main/en/pipeline_tutorial) and list at least 3 other tasks (besides text-classification) that are available.

Beyond text-classification, the Hugging Face Transformers pipeline API supports a wide range of tasks across different domains, as seen in the official documentation.

Example Tasks Available in Pipeline

- **Summarization:** Produces concise summaries from longer passages or documents, commonly used for abstracts, meetings, or articles.

- **Automatic Speech Recognition (ASR):** Converts spoken audio files into written text, supporting return of word-level timestamps for transcripts.

- **Image Classification:** Assigns labels or classes to images, enabling automated recognition of objects, scenes, or other categories in visual data.

These tasks demonstrate the versatility of the pipeline API, which also extends to modalities like vision and audio for even broader applications.

3. What happens when we do not specify a model in the pipeline? How can we specify a specific model?

If no model is specified in the pipeline, Hugging Face Transformers automatically loads a default pretrained model for the given task *(for example, a small distilled DistilBERT sentiment model for `text-classification`)* and logs a warning that relying on the default model is not recommended in production.

To specify a specific model, use the model argument in the pipeline call. We can provide the model’s name (as listed on the Hugging Face Hub) or a path to the model. For example:

In [3]:
import torch
from transformers import pipeline

classifier = pipeline(task="text-classification", 
                      model="distilbert-base-uncased-finetuned-sst-2-english")

Device set to use cpu


#### Question 2: Text Classification Deep Dive

1. What is the default model used for text-classification? Search for it on the [Hugging Face Model Hub](https://huggingface.co/models). What dataset was this model fine-tuned on? What kind of text does it work best with?

The default model used for text-classification in the Hugging Face Transformers pipeline is `distilbert-base-uncased-finetuned-sst-2-english`.

This model was fine-tuned on the **Stanford Sentiment Treebank version 2 (SST-2) dataset**. SST-2 is a well-known benchmark dataset in NLP for sentence-level binary sentiment analysis.

- Contains movie review sentences, each annotated as either “positive” or “negative” sentiment.

- It is the binary version of the original Stanford Sentiment Treebank, which included more nuanced sentiment classes.

Best Suited Text Types

- Short to moderately long sentences or snippets in English.

- Examples include user reviews, social media posts, single-sentence feedback, or other text fragments expressing sentiment, especially similar to those found in movie reviews.

The model is optimized for general sentiment analysis in discrete, clear English sentences where emotion or polarity is evident.

2. The output includes a `score` field. What does this score represent? What range of values can it have?

The `score` field in the Hugging Face text-classification pipeline output represents the model’s estimated probability *(confidence)* that the predicted label is correct for the given input.

- The score is the output of a **softmax function** applied to the model’s raw predictions *(logits)* for each label.

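- The **score** is always a floating-point number $p \in [0,1]$. Because the pipeline reports the top label, a two-label model such as this one yields a score of at least 0.5.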
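As a minimal sketch of how such a score arises, the snippet below applies a softmax to two made-up logits. The logit values are hypothetical and are not taken from the model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the labels [NEGATIVE, POSITIVE].
logits = [2.0, -0.2]
probs = softmax(logits)
score = max(probs)  # the probability of the top label is what gets reported
print(probs, score)
```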

In [4]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)

Unnamed: 0,label,score
0,NEGATIVE,0.901546


3. Find a different text-classification model on the Hugging Face Model Hub that classifies emotions. What is its name?

One example of a text-classification model on the Hugging Face Model Hub that classifies emotions is `j-hartmann/emotion-english-distilroberta-base`. This model is fine-tuned to predict Ekman's six basic emotions plus a neutral class *(anger, disgust, fear, joy, neutral, sadness, surprise)* from English text.

#### Question 3: Named Entity Recognition (NER)

1. What does the `aggregation_strategy="simple"` parameter do in the NER pipeline? Check the [Token Classification Documentation](https://huggingface.co/docs/transformers/main/en/tasks/token_classification).

The parameter `aggregation_strategy="simple"` in the Hugging Face NER *(Named Entity Recognition)* pipeline controls how adjacent tokens labeled as part of an entity are grouped together. With `"simple"`, the pipeline merges contiguous tokens that share the same NER tag (such as all "B-ORG" and "I-ORG"), reconstructing them into full word or phrase entities.

- It groups adjacent tokens with the same predicted entity type into contiguous entity spans. Subword pieces of one word are merged only when they receive the same type; pieces with different predicted types remain separate entries, and a lone continuation piece keeps its `##` prefix.

- Use `aggregation_strategy="simple"` when you want results reported as entity spans rather than one row per token. The `"first"`, `"average"`, and `"max"` strategies go further and force all pieces of a word to share one label.
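The grouping rule can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `group_simple` and the token dictionaries are hypothetical, and the averaged score is an assumption about how group scores are combined.

```python
def group_simple(tokens):
    """Merge adjacent tokens that share an entity type (simplified sketch)."""
    groups = []
    for tok in tokens:
        tag = tok["entity"]
        # Tags look like "B-ORG" / "I-ORG"; a bare tag such as "O" has no prefix.
        prefix, etype = tag.split("-", 1) if "-" in tag else ("I", tag)
        same_type = bool(groups) and groups[-1]["entity_group"] == etype
        if same_type and prefix != "B":
            groups[-1]["pieces"].append(tok)  # continue the current span
        else:
            groups.append({"entity_group": etype, "pieces": [tok]})

    results = []
    for g in groups:
        if g["entity_group"] == "O":
            continue  # non-entity tokens only serve to break spans
        words = [p["word"] for p in g["pieces"]]
        text = words[0]
        for w in words[1:]:
            # Continuation pieces attach without a space; new words get a space.
            text += w[2:] if w.startswith("##") else " " + w
        score = sum(p["score"] for p in g["pieces"]) / len(g["pieces"])
        results.append({"entity_group": g["entity_group"], "word": text, "score": score})
    return results


# Hypothetical per-token predictions.
tokens = [
    {"word": "Optimus", "entity": "I-MISC", "score": 0.99},
    {"word": "Prime", "entity": "I-MISC", "score": 0.98},
    {"word": "and", "entity": "O", "score": 0.99},
    {"word": "Mega", "entity": "I-MISC", "score": 0.56},
    {"word": "##tron", "entity": "I-PER", "score": 0.59},
]
grouped = group_simple(tokens)
for row in grouped:
    print(row)
```

`Optimus` and `Prime` merge into one span because both are tagged MISC, while `Mega` and `##tron` stay separate because their predicted types differ.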

2. What do the entity types mean?

Here are the meanings of the most common NER entity types:
- **ORG for Organization:** Companies, institutions, government agencies, teams, and other organized groups.

- **PER for Person:** Names of people, including first and last names as well as titles.

- **LOC for Location:** Geographical locations such as countries, cities, mountains or rivers.

- **MISC for Miscellaneous:** Entities that do not fall into the previous categories; this can include events, works of art, nationalities, products, or any other specific “named” things.

In [5]:
ner_tagger = pipeline(task="ner",
                      model="dbmdz/bert-large-cased-finetuned-conll03-english",
                      aggregation_strategy="simple")

outputs = ner_tagger(text)
pd.DataFrame(outputs)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Unnamed: 0,entity_group,score,word,start,end
0,ORG,0.879011,Amazon,5,11
1,MISC,0.990859,Optimus Prime,36,49
2,LOC,0.999755,Germany,90,97
3,MISC,0.55657,Mega,208,212
4,PER,0.590255,##tron,212,216
5,ORG,0.669692,Decept,253,259
6,MISC,0.49835,##icons,259,264
7,MISC,0.775362,Megatron,350,358
8,MISC,0.987854,Optimus Prime,367,380
9,PER,0.812096,Bumblebee,502,511


3. Why do some words appear with `##` prefix (like `##tron` and `##icons`)? What does this indicate about tokenization?

The `##` prefix in tokens indicates that the tokenizer is using **subword tokenization**, specifically the WordPiece algorithm *(used by BERT and compatible models)*.

- Tokens starting with `##` are not whole words but are **subword units** or **suffixes**.

- This prefix means the token should be attached to the previous token without a space, representing a continuation of a word.

- It allows the model to handle **uncommon** or **unknown words**, splitting them into known pieces. It enables robust processing of rare, compound, or morphologically complex words with a limited vocabulary.
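A minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to a single word is shown below. The vocabulary is a hypothetical toy set, not BERT's real vocabulary, and the function name is made up for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest known pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # no known piece covers this position
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary.
toy_vocab = {"Mega", "##tron", "Decept", "##icons", "Germany"}

print(wordpiece_tokenize("Germany", toy_vocab))
print(wordpiece_tokenize("Megatron", toy_vocab))
print(wordpiece_tokenize("Decepticons", toy_vocab))
```

A word that is in the vocabulary stays whole, while a word that is not is rebuilt from a first piece plus `##` continuation pieces.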

4. The model seems to have split "Megatron" and "Decepticons" incorrectly. Why might this happen? What does this tell you about the model's training data?

When the model splits words like "Megatron" and "Decepticons" into subword tokens *(`Mega`, `##tron`, `Decept`, `##icons`)*, it means these **whole words were not present in the tokenizer's vocabulary** because they did not appear with sufficient frequency in the model's training data.

- **Absence in Training Data:** Both words are proper nouns *(from Transformers fiction)* and likely rare or missing in the large, general-domain corpora used to pretrain BERT and build its vocabulary *(BooksCorpus and English Wikipedia)*.

- **Tokenizer Vocabulary Limits:** The tokenizer can only include a finite vocabulary *(usually 30,000–50,000 subword units)*. It prioritizes frequent subwords, so rare full words *(especially specialized names)* must be composed from subpieces.

The model was **primarily trained on general text** and not specifically on pop culture, fiction, or specialized domains. Words not seen often during training *(like "Megatron" or "Decepticons")* will be split into known subwords, indicating their rareness in the original dataset.

5. Find the model card for **dbmdz/bert-large-cased-finetuned-conll03-english**. What is the CoNLL-2003 dataset?

The model card for **dbmdz/bert-large-cased-finetuned-conll03-english** describes it as a large, case-sensitive BERT model specifically fine-tuned for NER using the CoNLL-2003 dataset.

The **CoNLL-2003 dataset** is a widely used benchmark for training and evaluating NER models. It consists of thousands of sentences drawn from news articles *(Reuters RCV1)* and is annotated for four types of named entities: PER, ORG, LOC and MISC. The dataset’s format is standardized for NER research and includes part-of-speech, chunking, and NER tags for each token.

#### Question 4: Question Answering Systems

1. What type of question answering is this? Check the [Question Answering Documentation](https://huggingface.co/docs/transformers/main/en/tasks/question_answering).

In [6]:
reader = pipeline(task="question-answering",
                  model="distilbert-base-cased-distilled-squad")

question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

Device set to use cpu


Unnamed: 0,score,start,end,answer
0,0.631292,335,358,an exchange of Megatron


This pipeline is performing **extractive question answering**.

- **Extractive QA** means the model selects and returns the exact span of text from the provided context that best answers the question, rather than generating new text in its own words.

- The answer will always be a substring of the `context` input.

2. The model outputs `start` and `end` indices. What do these represent? Why are they important?

The `start` and `end` indices in the output of an extractive question answering model indicate the positions *(character offsets)* in the context text where the model’s answer begins and ends.

- **Start index:** The position of the first character of the predicted answer span within the context string.

- **End index:** The position after the last character (exclusive) of the answer span within the context.

For example, if `start=335` and `end=358`, then the answer corresponds to `context[335:358]`, which would extract “an exchange of Megatron” from the text.

Role of Start–End Indices in Extractive QA

- They allow you to directly **extract the exact substring** from the context that answers the question.

- They make post-processing and highlighting of the answer easy and reliable, especially in large texts.

- These indices ensure the answer is traceable and explainable *(crucial for downstream applications and validation)*.
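The role of the offsets can be illustrated with plain string slicing. The shortened context below is only for illustration, and the offsets are computed with `str.find` rather than predicted by a model.

```python
# A shortened context; in the notebook the context is the full email text.
context = ("To resolve the issue, I demand an exchange of Megatron "
           "for the Optimus Prime figure I ordered.")
answer = "an exchange of Megatron"

# Locate the span: start is inclusive, end is exclusive.
start = context.find(answer)
end = start + len(answer)

# Slicing with the offsets recovers the answer exactly.
extracted = context[start:end]

# The same offsets make it easy to highlight the answer in place.
highlighted = context[:start] + "[" + extracted + "]" + context[end:]
print(start, end, extracted)
print(highlighted)
```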

3. What is the SQuAD dataset?

The **SQuAD** *(Stanford Question Answering Dataset)* is a large-scale reading comprehension dataset developed by Stanford University. It consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a continuous segment of text *(a span)* from the corresponding passage or sometimes no answer if the question is unanswerable.

- It includes over 100,000 question-answer pairs on 500+ Wikipedia articles.

- It is used primarily to train and benchmark extractive question answering models, where the model must find the exact span of text answering the question within a given passage.

- SQuAD 1.1 contains answerable questions only, while SQuAD 2.0 extends it with unanswerable ones to challenge models to detect when no answer is possible.

The model `distilbert-base-cased-distilled-squad` is fine-tuned on SQuAD for extractive question answering, meaning it learns to predict the start and end positions of the answer spans within the provided context.

4. Try to think of a question this model CANNOT answer based on the text. Why would it fail?

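Questions this model cannot answer include, for example, **What is Bumblebee’s email address?** or **What is Bumblebee’s order number?**

These questions cannot be answered because extractive QA models can only select answers that are explicit spans of text in the given context, and none of these details appear anywhere in the email. Because this checkpoint was fine-tuned on SQuAD 1.1, which contains only answerable questions, the pipeline will still return some span from the email, typically with a low score, rather than signal that no answer exists.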

5. What's the difference between extractive and generative question answering? Find an example of a generative QA model on the Hugging Face Model Hub.

Extractive question answering and generative question answering differ in how they produce answers from a context.

- **Extractive QA:** The model **selects a span of text directly from the provided context** as the answer. It outputs start and end indices pointing into the context, and the answer is always a substring of that context *(no new wording)*.

- **Generative QA:** The model **generates an answer in natural language**, possibly paraphrasing or synthesizing information, rather than copying an exact span. Often implemented with encoder-decoder or decoder-only LLMs, which can also handle open-ended questions.

An example of a generative QA model is `consciousAI/question-answering-generative-t5-v1-base-s-q-c`, a T5-based model fine-tuned to generate answer text given a question and context.

#### Question 5: Text Summarization

1. What is the difference between extractive and abstractive summarization? Check the [Summarization Documentation](https://huggingface.co/docs/transformers/main/en/tasks/summarization).

**Extractive summarization** selects and assembles sentences or phrases directly from the source text to create the summary. It copies the most relevant pieces without generating new sentences.

**Abstractive summarization** uses generative AI to paraphrase, compress, and synthesize the source content, producing new sentences that may not appear in the original. This allows more concise, human-like, and coherent summaries, but may sometimes introduce small inaccuracies.
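To make the extractive side concrete, here is a deliberately naive sketch that scores sentences by average word frequency and copies the best one. The scoring heuristic and the function name are assumptions chosen for illustration; practical extractive systems use stronger sentence scoring.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Copy the highest-scoring sentences verbatim, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency of the sentence's words across the whole text.
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


doc = ("I ordered an Optimus Prime figure. "
       "I received a Megatron figure instead of the Optimus Prime figure. "
       "Please reply soon.")
summary = extractive_summary(doc)
print(summary)
```

Whatever sentence is selected, it is copied from the input unchanged, which is the defining property of the extractive approach; an abstractive model has no such guarantee.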

2. What is the default model used for summarization? Search for it on the Hugging Face Model Hub and determine:

    - Is it an extractive or abstractive model?
    - What architecture does it use? 
    - What dataset was it trained on?

In [7]:
summarizer = pipeline(task="summarization",
                      model="sshleifer/distilbart-cnn-12-6")

outputs = summarizer(text, max_length=60, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

Device set to use cpu


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead. As a lifelong enemy of the Decepticons, I hope you can understand


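The default model used for summarization in the Hugging Face pipeline is `sshleifer/distilbart-cnn-12-6`. It is designed for abstractive summarization, meaning it can write new sentences in its own words rather than only copying passages verbatim from the input. In the output above, however, the summary stays very close to the source wording and is cut off at the `max_length` limit of 60 tokens.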

This model is built using the **DistilBART architecture**, which is a distilled, smaller, and faster version of BART *(a well-known transformer-based sequence-to-sequence model)*. The architecture consists of 12 encoder layers and 6 decoder layers, which allows it to efficiently process input text, encode its meaning, and then generate a summary.

For training, the model was fine-tuned on the **CNN/DailyMail dataset**. This dataset contains over 300,000 news articles paired with human-written summaries, serving as a benchmark for English-language text summarization models. By learning from this corpus, DistilBART is able to generalize to new inputs and produce high-quality abstractive summaries for journalistic and informational text.

3. What do the `max_length` and `min_length` parameters control? What happens if `min_length` > `max_length`?

The `min_length` and `max_length` parameters in the Hugging Face summarization pipeline let you control the possible size of the generated summary in terms of tokens.

- `max_length` determines the **maximum number of tokens** allowed in the generated summary, so the model will never produce a summary longer than this limit.

- `min_length` specifies the **minimum number of tokens** that the summary must contain, ensuring that the model does not output something too short to be meaningful.

*Remark:* Both parameters are measured in tokens, not words. Tokens are subword units, so punctuation marks count as tokens and a single rare word may be split into several tokens.

If `min_length` is set to a value greater than `max_length`, the model cannot generate any sequence that satisfies both constraints. In practice, depending on the library version, this triggers an error *(such as a ValueError)* or a warning message *(such as "Your min_length=56 must be inferior than your max_length=45.")*, because it is logically impossible for a summary to be at least `min_length` tokens long and at most `max_length` tokens long at the same time. So `min_length` should always be less than or equal to `max_length` in the summarization settings.

4. The parameter `clean_up_tokenization_spaces=True` is used. What does this parameter do? Why might it be useful for summarization?

The parameter `clean_up_tokenization_spaces=True` in Hugging Face’s summarization pipeline controls whether extra spaces, artifacts, or formatting issues *(introduced during tokenization and decoding)* are removed from the output summary.

Effects of Enabling clean_up_tokenization_spaces

- Removes the spaces that decoding can leave before punctuation marks *(periods, commas, question marks, and exclamation marks)*.

- Rejoins English contractions and possessives that were split into separate tokens (e.g., “do n't” becomes “don't” and “it 's” becomes “it's”).

- Leaves other spacing untouched: the clean-up is a small set of string replacements rather than a general text normalizer, so it does not repair hyphenated words or doubled spaces.

Improving Summary Formatting

- It enhances the readability and fluency of generated summaries, making them cleaner and better suited for downstream applications or human presentation.

- It reduces the likelihood of distracting tokenization artifacts in your final output, allowing the summary to look more professional and natural.

In summary, `clean_up_tokenization_spaces=True` is valuable because it automates post-processing to ensure summarized text is well-formatted, clear, and free of tokenization errors.
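The kind of clean-up involved can be sketched as a handful of string replacements. The rule list below is an assumption modeled on the tokenizer's English clean-up step, and `clean_up_spaces` is a hypothetical helper, not a library function.

```python
# Assumed replacement rules: drop spaces before punctuation and rejoin contractions.
RULES = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
    (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
    (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
]


def clean_up_spaces(text):
    """Apply each replacement rule in order to the decoded text."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text


raw = "I do n't know , but it 's fine ."
cleaned = clean_up_spaces(raw)
print(cleaned)
```

Under these rules, text such as `state - of - the - art` would pass through unchanged, since no rule targets hyphens.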

5. Find two different summarization models on the Hub and compare their architectures:
- One optimized for short texts (like news articles)
- One that can handle longer documents

Model for Short Texts: `facebook/bart-large-cnn`

This model is built with the BART *(Bidirectional and Auto-Regressive Transformers)* architecture, which uses an encoder-decoder setup to generate abstractive summaries. It is specifically fine-tuned on the CNN/DailyMail news dataset, where each news article is paired with a few human-written highlight bullet points. The training data shapes the model to cover short and medium-length news content, producing fluent and clear summaries intended for quick news digestion.

Model for Long Documents: `allenai/led-base-16384`

For lengthy documents, such as research papers or multipage reports, the model `allenai/led-base-16384` *(Longformer Encoder-Decoder, LED)* is better suited. This architecture replaces full self-attention in the encoder with a combination of local windowed attention and global attention, which allows processing of up to 16,384 tokens and makes it capable of ingesting much larger text inputs without slicing them into small chunks. The base checkpoint is a pretrained starting point intended for fine-tuning; LED variants fine-tuned on long scientific documents, such as the PubMed or arXiv summarization datasets, learn to distill complex, lengthy material into coherent summaries that cover key ideas across several pages.

Architecture and Training Data Comparison

| Model Name                | Architecture            | Optimized For      | Training Data          |
|---------------------------|--------------------------|---------------------|-------------------------|
| facebook/bart-large-cnn   | BART encoder-decoder     | News, short texts   | CNN/DailyMail           |
| allenai/led-base-16384    | Longformer Encoder-Decoder | Long documents    | Fine-tuned variants: PubMed/arXiv |

- BART handles a shorter context window *(up to 1,024 tokens for this checkpoint)* and generates high-quality, abstractive summaries when the document is a few hundred words long.

- LED incorporates sparse attention and memory optimizations to accept thousands of tokens at once, making it well suited for summarizing multi-page reports or research articles.

#### Question 6: Machine Translation

1. What is the architecture behind the Helsinki-NLP/opus-mt-en-de model? Look it up on the [Hugging Face Model Hub](https://huggingface.co/Helsinki-NLP/opus-mt-en-de).

- What does "OPUS" stand for?
- What does "MT" stand for?

The `Helsinki-NLP/opus-mt-en-de` model relies on the MarianMT *(Marian Machine Translation)* architecture, a neural machine translation framework based on the transformer model and designed for efficient multi-language translation tasks.

**OPUS** stands for the **Open Parallel Corpus**, which is a collection of open-access, multilingual parallel texts used for training machine translation models. The OPUS project, managed at the University of Helsinki, gathers large corpora from sources like movie subtitles, Wikipedia, technical documentation, and many other domains, supporting wide language coverage and multilingual NLP research.

**MT** stands for **Machine Translation**. In this context, "Opus-MT" denotes a series of NMT *(Neural Machine Translation)* models trained using the Marian framework and OPUS data. These models are built to provide automated translation between a huge variety of language pairs using high-quality, openly available parallel corpora.

In [8]:
translator = pipeline("translation_en_to_de", 
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Device set to use cpu


Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.


2. How would you find a model to translate from English to French? Visit the [translation documentation](https://huggingface.co/docs/transformers/main/en/tasks/translation) and the Hugging Face Model Hub to find at least 2 different models.

For English-to-French translation on Hugging Face *(without using a Helsinki-NLP model)*, consider these two alternatives:

The `AventIQ-AI/MarianMT-Text-Translation-AI-Model-en-fr` model is a general-purpose MarianMT model fine-tuned on English–French sentence pairs. Its architecture is a **sequence-to-sequence transformer** derived from the Marian project. This model is particularly suited for translating documents, emails, and educational content. It performs well with standard language but is not optimized for informal slang or highly specialized vocabulary. The training data comes from a carefully cleaned parallel **English-French corpus** that is not related to OPUS-MT.

Another option is `facebook/nllb-200-distilled-600M`, from the **NLLB *(No Language Left Behind)* family**. This is a multilingual encoder-decoder transformer supporting about 200 languages, so it can translate directly between many pairs, including English and French. The largest NLLB-200 model uses a sparse **Mixture of Experts** design, whereas this 600M checkpoint is a smaller dense model distilled from it. NLLB is trained over a diverse set of web-mined multilingual corpora and relies on language codes like `eng_Latn` *(English)* and `fra_Latn` *(French)*, which the documentation describes as BCP-47 tags, for specifying sources and targets, making it a robust choice for varied, real-world translation tasks on the Hub.

Architecture and Training Data Comparison

| Model Name                | Architecture            | Training Data        | Notable Features         |
|---------------------------|--------------------------|---------------------|-------------------------|
| AventIQ-AI/MarianMT-Text-Translation-AI-Model-en-fr   | MarianMT  | Cleaned English-French pairs | Accurate for general, formal texts |
| facebook/nllb-200-distilled-600M   | NLLB | Multilingual web/mined data   | 200+ languages, robust on varied text |

3. What is the difference between bilingual and multilingual translation models? What are the advantages and the disadvantages of each?

A **bilingual translation model** is trained to translate between one specific language pair *(for example, English ↔ French)*, whereas a **multilingual translation model** is trained to handle many language pairs within a single model *(for example, English ↔ 200 other languages)*.

Bilingual models

**Advantages:**
- Often higher quality for that specific pair, especially on high‑resource pairs, because all parameters are specialized on one direction or pair.

- Simpler behavior and debugging: errors and domain shifts are easier to attribute and analyze.

- Less risk of negative transfer between unrelated languages.

**Disadvantages:**
- Poor scalability: you need a separate model for each pair, which becomes unmanageable when you support many languages.

- Limited cross‑lingual transfer: low‑resource languages cannot benefit from data in other languages as easily.

- Operational overhead: more models to store, update, and deploy.

Multilingual models

**Advantages:**
- Parameter sharing across languages dramatically improves efficiency: one model can cover dozens or hundreds of language pairs.

- Better support for low‑resource languages, which can benefit from transfer learning from high‑resource languages.

- Easier deployment and maintenance: a single model to serve many directions.

**Disadvantages:**
- Capacity must be shared across all languages; some pairs can underperform relative to strong bilingual baselines, especially if model size is limited.

- Potential for interference or negative transfer between very different languages or scripts.

- Training and data balancing are more complex (must avoid overfitting to high‑resource languages while still serving them well).

In practice, bilingual models are often preferred when you need the best possible quality on a small set of language pairs, while multilingual models are preferred when coverage *(many languages)* and operational simplicity are more important.

4. In the code, we specify the task as `"translation_en_to_de"`. How does this relate to the model we are loading?

The argument `"translation_en_to_de"` tells the pipeline to build a **translation pipeline from English to German**. Internally, this selects the “translation” pipeline class and configures it with the expectation that inputs are English sentences and outputs should be German sentences. The model name `Helsinki-NLP/opus-mt-en-de` refers to a specific MarianMT-based model that has been trained precisely for **English → German** translation. Its identifier encodes this direction: `en-de` indicates English as the source language and German as the target language.

Because the task and the model’s training direction are consistent, the pipeline can safely assume that:
- The tokenizer and encoder are configured to process English input text.

- Any default pre- and post-processing *(such as language tags, special tokens, or text normalization)* is appropriate for this language pair.

If the task and model direction did not match *(for example, if you used `"translation_en_to_de"` with a model trained for German → English)*, the pipeline might still run but would produce incorrect or nonsensical results, since the model’s parameters are specialized for the opposite direction. Thus, the task string is a high-level declaration of the intended language direction, while the model name specifies a concrete parameterization that must be compatible with that declaration.
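The consistency check described above can be sketched with two small helpers. Both function names are hypothetical, and reading the direction from the model identifier relies on the OPUS-MT naming convention (`...-src-tgt`), which is an assumption and not something the library enforces.

```python
def parse_translation_task(task):
    """Split a 'translation_XX_to_YY' task string into (source, target)."""
    parts = task.split("_")
    if len(parts) == 4 and parts[0] == "translation" and parts[2] == "to":
        return parts[1], parts[3]
    raise ValueError(f"Not a translation_XX_to_YY task: {task!r}")


def model_direction(model_id):
    """Read (source, target) from an id that ends in '-src-tgt' (assumed convention)."""
    src, tgt = model_id.rsplit("-", 2)[-2:]
    return src, tgt


task_pair = parse_translation_task("translation_en_to_de")
model_pair = model_direction("Helsinki-NLP/opus-mt-en-de")
print(task_pair, model_pair, task_pair == model_pair)
```

A mismatch, for example pairing the same task string with a model id ending in `-de-en`, would show up as unequal tuples before any model is loaded.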

5. The output shows a warning about `sacremoses`. What is this library used for in NLP? Check the [MarianMT documentation](https://huggingface.co/docs/transformers/model_doc/marian).

The warning about `sacremoses` refers to a small preprocessing library used to replicate classic **Moses-style text normalization and tokenization** in Python.

In NLP, SacreMoses provides Python implementations of the tokenizer, detokenizer, truecaser, and punctuation normalizer that were standard in the Moses statistical machine translation toolkit. For MarianMT models *(like the Helsinki-NLP OPUS-MT family)*, these steps are used to make the input and output text consistent with how the data looked during training, which helps maintain translation quality and reproducibility.

6. Find a multilingual model *(like mBART or M2M100)* that can translate between multiple language pairs. How many language pairs does it support?

A good example is `facebook/m2m100_418M`, a many‑to‑many multilingual translation model that is available on the Hugging Face Model Hub. It supports **100 languages**, and because it can translate between any pair among them, it covers **9,900 distinct translation directions *(language pairs with directionality)***.
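The figure of 9,900 follows from counting ordered pairs of distinct languages, $100 \times 99$. A small arithmetic check, assuming exactly 100 supported languages as stated above:

```python
from itertools import permutations

num_languages = 100

# Ordered (source, target) pairs with source != target.
directions = num_languages * (num_languages - 1)

# Unordered pairs, if direction is ignored.
unordered_pairs = directions // 2

# Cross-check the closed-form count by explicit enumeration.
enumerated = sum(1 for _ in permutations(range(num_languages), 2))

print(directions, unordered_pairs, enumerated)
```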

#### Question 7: Text Generation

1. What is the default model used for text generation in the code below? Look it up on the Hub and answer:
- What architecture does GPT-2 use?
- How many parameters does the base GPT-2 model have?
- What type of generation does it perform?

The default model here is `openai-community/gpt2`, the base GPT-2 model on Hugging Face Hub.

Model architecture

GPT‑2 uses a **decoder‑only transformer** architecture, also called a causal *(unidirectional)* transformer language model. It consists of stacked self‑attention and feed‑forward blocks that only attend to past tokens, not future ones.

Number of parameters

The base GPT‑2 checkpoint *(the one simply named `gpt2`)* has about **124 million parameters** *(often referred to as “117M” in the original release, but reported as 124M in the Hugging Face card)*.

Type of generation

GPT‑2 performs **autoregressive generation**. At training time, it is optimized to predict the next token given all previous tokens; at inference time, it generates text one token at a time, feeding each newly generated token back into the model to sample the next.
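The generate-then-feed-back loop can be sketched with a toy stand-in for the model. The lookup table below is purely hypothetical and conditions only on the last token, whereas GPT‑2 conditions on the whole preceding context; the loop structure is the part being illustrated.

```python
# Hypothetical toy "model": maps the last token to next-token probabilities.
TOY_MODEL = {
    "<bos>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"sat": 0.2, "<eos>": 0.8},
    "sat": {"<eos>": 1.0},
}


def generate_greedy(prompt, max_new_tokens=10):
    """Append one token at a time, feeding each new token back as context."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_probs = TOY_MODEL[tokens[-1]]
        # Greedy choice: take the most probable next token.
        next_token = max(next_probs, key=next_probs.get)
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens


generated = generate_greedy(["<bos>"])
print(generated)
```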

In [17]:
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results

generator = pipeline("text-generation", 
                     model="openai-community/gpt2",
                     pad_token_id=50256)

response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response

outputs = generator(prompt, max_new_tokens=200) #max_length=200
print(outputs[0]['generated_text'])

Device set to use cpu


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. I would like to know if you know more about our service. Please let me know if we can arrange an exchange of Megatron for you.

The following quote from my customer service representative is from my review of the Optimus Prime action figure:

"Hi. I was a bit stunned when I saw the Optimus Prime action figure from your online store. I was hoping you could make me happy, but I was not able to

2. Why do we use `set_seed(42)` before generation? What would happen without it? Check the [generation documentation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation).

`set_seed(42)` makes the text generation process reproducible. It seeds the random number generators of Python's `random` module, NumPy, and PyTorch *(and TensorFlow, if installed)* with a fixed value, so that stochastic operations during generation *(chiefly the sampling of next tokens)* follow the same random sequence each time the code is run.

In text generation, methods like sampling *(`do_sample=True`, `top_k`, `top_p`, `temperature`, etc.)* rely on randomness to choose among candidate next tokens. With a fixed seed, the same prompt and the same generation parameters will produce identical output sequences across runs, which is crucial for debugging, experimentation, and comparing decoding settings in a controlled way. Without calling `set_seed(42)`, each execution would typically yield different continuations of the prompt, even if the code and model are unchanged, because the underlying random draws would vary from run to run.
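The effect of a fixed seed can be illustrated without loading a model. This sketch uses Python's standard `random` module as a stand-in for the sampling step; the vocabulary and weights are made up for illustration.

```python
import random

# Hypothetical next-token candidates and their probabilities.
VOCAB = ["sorry", "refund", "exchange", "Megatron", "Optimus"]
WEIGHTS = [0.4, 0.2, 0.2, 0.1, 0.1]


def sample_tokens(seed=None, n=8):
    """Draw n tokens; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    return rng.choices(VOCAB, weights=WEIGHTS, k=n)


run_a = sample_tokens(seed=42)
run_b = sample_tokens(seed=42)
unseeded = sample_tokens()

print(run_a == run_b)  # same seed, same sequence
print(unseeded)        # typically differs from run to run
```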

3. The code uses `max_length=200`. What other parameters can control text generation? Research and explain:
- `temperature`
- `top_k`
- `do_sample`

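`max_length` caps the total length of the sequence *(prompt plus generated tokens)*, whereas `max_new_tokens`, which the code above actually passes, caps only the newly generated tokens. Several other parameters shape how the model chooses tokens; three key ones are `temperature`, `top_k`, and `do_sample`.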

Temperature

`temperature` rescales the logits before sampling, directly affecting how "peaked" or "flat" the probability distribution over next tokens is.

- When `temperature < 1` *(for example 0.2–0.7)*, the distribution is sharpened. High‑probability tokens become even more likely and low‑probability tokens are suppressed. Outputs become **more deterministic** and conservative.

- When `temperature > 1` *(for example 1.1–1.5)*, the distribution is flattened. The model more often picks less likely tokens, increasing diversity and **creativity** but also the **risk of incoherent outputs**.

- When `temperature = 1.0`, logits are unchanged: behavior is governed by other sampling parameters.

Formally, if $p_i$ is the softmax probability for token $i$, using temperature $T$ replaces logits $z_i$ with $z_i/T$ before softmax, which changes the entropy of the distribution.
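Written out, the rescaled distribution is $p_i(T) = \exp(z_i/T) / \sum_j \exp(z_j/T)$. The following is a minimal sketch of that formula in plain Python, with made-up logits, showing how the entropy moves with $T$:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # illustrative values only
sharp = softmax_with_temperature(logits, temperature=0.5)
base = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=1.5)

for name, probs in [("T=0.5", sharp), ("T=1.0", base), ("T=1.5", flat)]:
    print(name, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Lower temperatures concentrate probability on the top token and reduce entropy; higher temperatures spread it out.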

Top k

`top_k` implements **top‑k sampling**, which restricts sampling at each step to only the k most probable tokens.

- With `top_k = 0` or `None`, no top‑k filtering is applied: all tokens are considered *(subject to other constraints)*.

- With `top_k = 50` *(the common default)*, only the 50 highest‑probability tokens are kept: their probabilities are renormalized to sum to 1, and the next token is sampled from that restricted set.

- Smaller `top_k` values make generations more focused and less diverse; larger values allow more variability but can introduce more noise.

This is often combined with `temperature` to balance fluency and diversity.
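The filter-and-renormalize step can be sketched as follows. The function name and the example distribution are hypothetical; real implementations operate on logit tensors rather than dictionaries.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1.

    probs: dict mapping token -> probability. k of 0 or None disables filtering.
    """
    if not k or k >= len(probs):
        return dict(probs)
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


probs = {"sorry": 0.5, "refund": 0.3, "exchange": 0.15, "banana": 0.05}
filtered = top_k_filter(probs, k=2)

# Sample the next token from the restricted, renormalized distribution.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]

print(filtered, next_token)
```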

Sampling

`do_sample` switches between greedy/beam search decoding and stochastic sampling.

- If `do_sample=False` *(default)*, the model uses deterministic decoding:
    - With `num_beams=1`, it is greedy decoding: at each step, it picks the single most probable next token.

    - With `num_beams>1`, it performs beam search, exploring several likely continuations but still deterministically choosing the best‑scoring sequence.

- If `do_sample=True`, the model samples the next token from the *(possibly filtered)* probability distribution:
    - `temperature`, `top_k`, `top_p`, and related parameters control that distribution.

    - This makes each run potentially different, even with the same prompt, unless you fix the random seed.

In practice, for fact‑style or highly constrained outputs you typically keep `do_sample=False` *(the default)*, possibly with beam search; note that `temperature` and `top_k` have no effect in that mode, because no sampling takes place. For creative or open‑ended continuation, you enable `do_sample=True` and tune `temperature` and `top_k` to trade off coherence and diversity.
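The difference between the two modes at a single decoding step can be sketched like this. The helper is hypothetical and works on a toy distribution; it only mirrors the idea behind `do_sample`, not the library's implementation.

```python
import random


def choose_next_token(probs, do_sample=False, rng=None):
    """Pick the next token greedily, or sample it when do_sample is True."""
    if not do_sample:
        # Deterministic: always the single most probable token.
        return max(probs, key=probs.get)
    rng = rng or random.Random()
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


probs = {"sorry": 0.5, "refund": 0.3, "exchange": 0.2}

greedy_picks = {choose_next_token(probs) for _ in range(20)}

rng = random.Random(42)
sampled_picks = [choose_next_token(probs, do_sample=True, rng=rng) for _ in range(200)]

print(greedy_picks)                # a single token, every time
print(sorted(set(sampled_picks)))  # several different tokens
```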

4. Looking at the output, you can see a warning about truncation. What does this mean? Why is the input being truncated?

This warning means that the input text *(prompt)* is longer than the limit set by `max_length`, but truncation was not explicitly requested from the tokenizer. As a result, the pipeline automatically shortens the input so that it fits within `max_length`, removing tokens from the prompt as needed. By default, it uses the `"longest_first"` truncation strategy, which, when encoding paired sequences, removes tokens from the longer sequence first.

Truncation is necessary because models like GPT-2 have a maximum input length that cannot be exceeded *(1,024 tokens for GPT-2; other models allow 2,048 or more)*. If your prompt is longer, only the first *(or last)* `max_length` tokens are used and the rest are discarded. This keeps the input compatible with the model's architecture and avoids runtime errors.
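Keeping "the first *(or last)*" tokens corresponds to truncating on the right or on the left. A minimal sketch on a plain list of token IDs, with a hypothetical helper name; in Transformers the tokenizer handles this, and the side is governed by its `truncation_side` attribute.

```python
def truncate(token_ids, max_length, side="right"):
    """Cut a token sequence down to max_length.

    side="right" keeps the first tokens; side="left" keeps the last ones.
    """
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "right":
        return list(token_ids[:max_length])
    if side == "left":
        return list(token_ids[-max_length:])
    raise ValueError(f"unknown side: {side!r}")


prompt_ids = list(range(12))  # stand-in for a tokenized prompt

kept_first = truncate(prompt_ids, max_length=8, side="right")
kept_last = truncate(prompt_ids, max_length=8, side="left")

print(kept_first)
print(kept_last)
```

For a generation prompt, keeping the last tokens usually preserves the text the model is meant to continue.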

5. What does `pad_token_id` being set to `eos_token_id` mean? Why is this necessary for GPT-2?

For GPT-2, `pad_token_id` being set to `eos_token_id` means that whenever the model or pipeline needs a padding token *(for batching or alignment)*, it will use the same token ID that marks “end of sequence” *(EOS)*, which for GPT‑2 is 50256.

Why does this happen for GPT-2?

GPT‑2 is a **decoder-only causal language model** that was originally trained **without a dedicated padding token**. It only has an EOS token to signal the end of a sequence. When using GPT‑2 in batched generation with modern libraries:

- Padding is required to create tensors of uniform length across different prompts.

- If no separate pad token is defined, the library falls back to using the EOS token as the padding token, and then masks those padded positions in the attention mask so the model does not treat them as real content.

So `pad_token_id = eos_token_id = 50256` is a pragmatic choice that:

- Avoids shape/attention-mask errors during batching and generation.

- Ensures the model can still run “open-ended” generation, while ignoring padded positions via the attention mask.

Why is it necessary?

Without any `pad_token_id`:
- The generation utilities cannot reliably build attention masks for batches containing sequences of different lengths.

- We would see warnings or potentially incorrect behavior, because the model would not know which positions are padding and which are real tokens.

By reusing the EOS token as padding and masking those positions, the library enables GPT‑2 to participate in padded, batched generation even though it was not originally designed with a distinct pad token.
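The padding-plus-masking idea can be sketched on plain lists. The helper below is hypothetical; it pads on the left by default, which is the usual choice for batched generation with decoder-only models, and the token IDs other than 50256 are arbitrary.

```python
EOS_TOKEN_ID = 50256  # GPT-2's end-of-sequence id, reused here as the pad id


def pad_batch(sequences, pad_token_id=EOS_TOKEN_ID, side="left"):
    """Pad sequences to equal length and build the matching attention mask.

    Mask value 1 marks a real token, 0 marks padding the model should ignore.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, zeros, ones = [pad_token_id] * n_pad, [0] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([[10, 11, 12], [20]])
print(input_ids)
print(attention_mask)
```

Because the mask, not the token ID, tells the model which positions to ignore, reusing the EOS ID for padding does not confuse real end-of-sequence tokens with filler.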

6. What are the trade-offs between model size and generation quality?

Larger language models generally produce higher-quality, more coherent generations, but they come with significant computational and practical costs.

**Quality vs. size**

- Larger models (more parameters) usually:
    - Capture more nuanced patterns and long-range dependencies in data.
    
    - Generate more fluent, coherent, and contextually appropriate text, especially on complex or open-ended tasks.

    - Are more robust across domains and languages because they have greater capacity to memorize and generalize.

- Smaller models:
    - Tend to be less coherent on long contexts, more repetitive, and more likely to miss subtle constraints.

    - Degrade faster when pushed out of distribution or asked to follow intricate instructions.

However, beyond a certain size, quality gains can become marginal for many everyday tasks, especially if prompts are short or the domain is simple.

**Computation, latency, and cost**

- Larger models:
    - Require much more GPU/CPU memory, making them harder or impossible to run on consumer hardware.

    - Are slower per token and per request, increasing latency.

    - Cost more to serve in production (more GPUs, energy), and are harder to scale for many concurrent users.

- Smaller models:
    - Fit on modest GPUs or even CPUs.

    - Respond faster and can serve more users per machine.

    - Are easier to fine-tune and iterate on, with shorter training and experimentation cycles.

This is why many applications use distilled or quantized models: they trade a modest loss in quality for large gains in efficiency.
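A rough back-of-the-envelope estimate shows why size and precision matter for memory. This sketch counts only the weights, using the standard byte widths per data type; activations, the key/value cache, and framework overhead are deliberately ignored, so real usage is higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="float32"):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


gpt2_params = 124_000_000  # base GPT-2, as quoted earlier in this notebook

for dtype in BYTES_PER_PARAM:
    print(dtype, round(weight_memory_gb(gpt2_params, dtype), 3), "GB")
```

Halving the precision halves the weight footprint, which is the main lever quantization pulls.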

**Control, safety, and behavior**

- Large models:
    - Can follow complex instructions and constraints more reliably, but can also hallucinate convincingly, which raises safety and trust issues.

    - Are harder to fully understand and debug, and alignment is more challenging.

- Small models:
    - Are often simpler to interpret and constrain, but their limited capacity may cause underfitting, brittle behavior, or failure on nuanced tasks.

**Practical rule of thumb**

- For lightweight tasks *(classification, short replies, simple QA, narrow domains)*, a smaller or medium-sized model often gives an excellent quality–cost balance.

- For rich dialog, creative writing, multi-step reasoning, or broad-domain assistants, a larger model *(or a strong medium model with good prompting and retrieval)* is typically worth the extra cost.

Choosing model size is ultimately an engineering trade-off: maximize quality subject to constraints on latency, hardware, cost, and deployment environment.

7. Change the model inside the pipeline to see other models.

`EleutherAI/gpt-neo-125M` is a small open-source **causal language model** from the GPT‑Neo family, designed as a GPT‑3–style transformer but with only **125 million parameters**. It uses a **decoder‑only transformer architecture** with masked self‑attention and is trained in an autoregressive fashion: at each step it predicts the next token given all previous tokens. The model was trained on **The Pile**, a large, diverse 800+GB English corpus, for roughly 300 billion tokens, which gives it reasonable coverage of general-domain English text and enables it to learn useful semantic and syntactic representations.

In [14]:
generator = pipeline("text-generation",
                    model="EleutherAI/gpt-neo-125M",
                    pad_token_id=50256)

response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response

outputs = generator(prompt, max_new_tokens=50) #max_length=200
print(outputs[0]['generated_text'])

Device set to use cpu


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. The Megatron figure I was ordered from is a little too large for the size of the Optimus Prime figure that you ordered. The Optimus Prime figure that you ordered was too small for the size of the Optimus Prime figure that you ordered. (The same


`facebook/opt-125m` is a 125-million-parameter **causal language model** released by Meta AI as part of the OPT *(Open Pre-trained Transformer)* family. It uses a **decoder-only transformer architecture** and is trained with a causal language modeling objective: predicting the next token given the previous context, just like GPT-2 and GPT-3. The model was trained on a large, diverse corpus of about 180 billion tokens *(approximately 800GB of primarily English text, with a small proportion of multilingual data)*, using a GPT-2 style byte-level BPE tokenizer with a vocabulary of 50,272 tokens.

In [11]:
generator = pipeline(
    "text-generation",
    model="facebook/opt-125m",
)

response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response

outputs = generator(prompt, max_new_tokens=200)
print(outputs[0]["generated_text"])

Device set to use cpu


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. We were unable to send you a correct product. The problem is that a few items in your order were mixed up. If you still have any questions about your order, please contact us directly. Thank you.
