# Lab 3 - Hello World Transformers 🤗

This notebook explores the Hugging Face Transformers library through various NLP tasks using pre-trained models.

## Quick Overview of Transformer Applications

The first cell below installs the dependencies; the second defines the sample text that every pipeline task in this notebook runs on.


In [19]:
%pip install transformers pandas torch sentencepiece sacremoses -q

Note: you may need to restart the kernel to use updated packages.


In [20]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""


## Text Classification

### Question 1: Understanding Pipelines

**1. What is a `pipeline` in Hugging Face Transformers?**

A `pipeline` is a high-level API that abstracts the complexity of using pre-trained models. It handles tokenization, model inference, and post-processing in a single callable object.

**2. Three other available tasks:**
- `question-answering`
- `summarization`
- `ner` (named entity recognition, an alias of `token-classification`)

**3. Default model behavior:**

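When no model is specified, the pipeline falls back to a default checkpoint for that task and logs a warning that relying on the default is not recommended in production. To pin a specific model, pass its name: `pipeline("task", model="model-name")`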
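The three stages named in item 1 can be sketched as a toy callable. This is only an illustration of the preprocess, inference, postprocess flow: the class name `ToySentimentPipeline` and its word weights are hypothetical stand-ins, not the library's implementation or a trained model.

```python
import math


class ToySentimentPipeline:
    """Toy stand-in for a text-classification pipeline (not the real API)."""

    # Hypothetical word weights playing the role of a trained model.
    WEIGHTS = {
        "horror": -2.0, "unfortunately": -1.5, "sorry": -1.0,
        "great": 2.0, "love": 1.5, "thanks": 1.0,
    }

    def preprocess(self, text):
        # "Tokenization": split on whitespace, strip surrounding punctuation, lowercase.
        return [w.strip(".,!?").lower() for w in text.split()]

    def forward(self, tokens):
        # "Inference": sum the word weights into a single raw score (a logit).
        return sum(self.WEIGHTS.get(t, 0.0) for t in tokens)

    def postprocess(self, logit):
        # Convert the logit into a probability and a human-readable label.
        p_positive = 1.0 / (1.0 + math.exp(-logit))
        if p_positive >= 0.5:
            return [{"label": "POSITIVE", "score": p_positive}]
        return [{"label": "NEGATIVE", "score": 1.0 - p_positive}]

    def __call__(self, text):
        # One call chains all three stages, which is what makes a pipeline convenient.
        return self.postprocess(self.forward(self.preprocess(text)))


toy = ToySentimentPipeline()
result = toy("Unfortunately, I discovered to my horror that it was wrong.")
print(result)
```

The return value mimics the list-of-dicts shape seen in the classifier output further down, so it can be passed to `pd.DataFrame` in the same way.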


In [21]:
from transformers import pipeline

classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")


Device set to use cpu


In [22]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)


| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901547 |


### Question 2: Text Classification Deep Dive

**1. Default model:** `distilbert-base-uncased-finetuned-sst-2-english`

**2. Training dataset:** SST-2 (Stanford Sentiment Treebank) - movie reviews for sentiment analysis.

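**3. Score meaning:** The score is the softmax probability the model assigns to the predicted label, between 0 and 1. Here the model puts about 0.90 on NEGATIVE. It measures the model's confidence, not a guarantee that the label is correct.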

**4. Emotion classification model:** `bhadresh-savani/distilbert-base-uncased-emotion` classifies into 6 emotions (sadness, joy, love, anger, fear, surprise).
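To make the score from item 3 concrete, here is a minimal sketch of how raw model outputs (logits) become a label and a probability. The logit values are made up for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [1.2, -1.0]  # made-up values, not real model output

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

Because the probabilities sum to 1, a score of 0.90 for one label in a two-class model leaves 0.10 for the other.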

---

## Named Entity Recognition


In [23]:
ner_tagger = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.879011 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 90 | 97 |
| 3 | MISC | 0.55657 | Mega | 208 | 212 |
| 4 | PER | 0.590255 | ##tron | 212 | 216 |
| 5 | ORG | 0.669692 | Decept | 253 | 259 |
| 6 | MISC | 0.49835 | ##icons | 259 | 264 |
| 7 | MISC | 0.775362 | Megatron | 350 | 358 |
| 8 | MISC | 0.987854 | Optimus Prime | 367 | 380 |
| 9 | PER | 0.812096 | Bumblebee | 502 | 511 |


### Question 3: Named Entity Recognition

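**1. `aggregation_strategy="simple"`:** Merges adjacent tokens that received the same entity tag into a single entity and reports the mean of their scores. It does not force the pieces of one word to agree on a tag; the `first`, `average`, and `max` strategies do.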

**2. Entity types:**
- **ORG:** Organization
- **PER:** Person
- **LOC:** Location
- **MISC:** Miscellaneous

**3. `##` prefix:** Marks a WordPiece continuation piece, that is, a subword that attaches to the preceding token. Words that are not in the vocabulary as a whole are split into such pieces.
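The splitting described in item 3 can be sketched with a greedy longest-match-first tokenizer. The vocabulary below is hypothetical and chosen only to reproduce the pieces visible in the table above; the real BERT vocabulary is much larger, and this is a simplified illustration rather than the library's tokenizer.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this example).
toy_vocab = {"Germany", "Mega", "##tron", "Decept", "##icons"}

print(wordpiece("Germany", toy_vocab))
print(wordpiece("Megatron", toy_vocab))
print(wordpiece("Decepticons", toy_vocab))
```

A word that is in the vocabulary stays whole, while a rare word falls apart into a first piece plus `##` continuations.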

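**4. Splitting issue:** "Megatron" and "Decepticons" are fictional names that are unlikely to appear in the training data (CoNLL-2003, drawn from news articles). The tokenizer breaks them into word pieces, the model assigns those pieces different tags (MISC vs. PER, ORG vs. MISC) with low confidence, and the `simple` strategy only merges adjacent pieces that share a tag, so each name shows up as two entities.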

**5. CoNLL-2003:** A benchmark dataset for NER containing news articles annotated with person, organization, location, and miscellaneous entities.
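The grouping behavior from items 1 and 4 can be sketched as follows. This is a simplified version that assumes plain tags without `B-`/`I-` prefixes, and the per-token tags and scores are made up to mirror the pattern in the table; `group_simple` is a hypothetical helper, not the library's code.

```python
from statistics import mean


def group_simple(tokens):
    """Merge adjacent tokens that share an entity tag and average their scores."""
    groups, current = [], []
    for tok in tokens:
        if tok["tag"] == "O":  # "O" means the token is outside any entity
            if current:
                groups.append(current)
                current = []
            continue
        if current and current[-1]["tag"] != tok["tag"]:
            groups.append(current)  # tag changed, so close the running group
            current = []
        current.append(tok)
    if current:
        groups.append(current)

    results = []
    for group in groups:
        word = group[0]["word"]
        for tok in group[1:]:
            # Glue continuation pieces directly; separate whole words with a space.
            word += tok["word"][2:] if tok["word"].startswith("##") else " " + tok["word"]
        results.append({
            "entity_group": group[0]["tag"],
            "score": mean(t["score"] for t in group),
            "word": word,
        })
    return results


# Made-up per-token predictions (assumed values for illustration).
tokens = [
    {"word": "Optimus", "tag": "MISC", "score": 0.99},
    {"word": "Prime", "tag": "MISC", "score": 0.98},
    {"word": "of", "tag": "O", "score": 0.99},
    {"word": "Mega", "tag": "MISC", "score": 0.56},
    {"word": "##tron", "tag": "PER", "score": 0.59},
]

grouped = group_simple(tokens)
for entity in grouped:
    print(entity)
```

"Optimus Prime" merges because both tokens share the MISC tag, while "Mega" and "##tron" stay separate because their tags disagree.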

---

## Question Answering


In [24]:
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])


Device set to use cpu


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 335 | 358 | an exchange of Megatron |


### Question 4: Question Answering Systems

**1. Type:** Extractive QA - the model extracts a span from the context rather than generating new text.

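**2. `start` and `end` indices:** Character offsets into the context that delimit the answer, with `end` exclusive, so `text[start:end]` returns the answer string. Here the span from 335 to 358 covers the 23 characters of "an exchange of Megatron".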

**3. SQuAD:** Stanford Question Answering Dataset - 100k+ reading comprehension questions based on Wikipedia articles.

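**4. Unanswerable questions:** Questions whose answer is not stated as a span in the context, for example those that need world knowledge or reasoning beyond the text. An extractive model trained on SQuAD v1.1, like this one, typically still returns some span, often with a low score.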

**5. Generative QA example:** `google/flan-t5-base` can generate answers rather than extract spans.
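Items 1 and 2 can be illustrated with a small span-selection sketch. It assumes whitespace tokens and made-up start and end logits; the real pipeline works on subword tokens and converts logits to probabilities, so this only shows the idea of picking the best span and mapping it back to character offsets.

```python
def best_span(start_logits, end_logits, max_answer_len=8):
    """Return the (start_token, end_token) pair with the highest combined logit."""
    best_score, best_pair = float("-inf"), None
    for i, start_score in enumerate(start_logits):
        # The end token may not precede the start token or exceed the length limit.
        last = min(i + max_answer_len, len(end_logits))
        for j in range(i, last):
            score = start_score + end_logits[j]
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair


context = "I demand an exchange of Megatron for the Optimus Prime figure."

# Record the character offsets of each whitespace-separated token.
offsets, pos = [], 0
for word in context.split():
    begin = context.index(word, pos)
    offsets.append((begin, begin + len(word)))
    pos = begin + len(word)

# Made-up logits, one value per token (assumed for illustration).
start_logits = [0.1, 0.3, 4.0, 1.0, 0.2, 0.5, 0.1, 0.1, 0.4, 0.2, 0.1]
end_logits = [0.1, 0.2, 0.3, 1.5, 0.4, 3.8, 0.2, 0.1, 0.3, 1.0, 0.9]

first_token, last_token = best_span(start_logits, end_logits)
start, end = offsets[first_token][0], offsets[last_token][1]
answer = context[start:end]
print({"start": start, "end": end, "answer": answer})
```

The answer is always a slice of the context, which is what makes this approach extractive.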

---

## Summarization


In [25]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
outputs = summarizer(text, max_length=56, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])


Device set to use cpu


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead. As a lifelong enemy of the Decepticons, I


### Question 5: Text Summarization

**1. Extractive vs Abstractive:**
- **Extractive:** Selects and concatenates important sentences from the original text
- **Abstractive:** Generates new text that captures the main ideas
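For contrast with the abstractive model used in this notebook, here is a minimal extractive sketch that scores sentences by word frequency and returns the top ones verbatim. The stopword list and the scoring rule are assumptions made for this example, not a standard algorithm from the library.

```python
import re
from collections import Counter

# Small hypothetical stopword list, just enough for the example.
STOPWORDS = {"a", "an", "the", "i", "my", "to", "of", "in", "that",
             "was", "and", "it", "is", "for", "you", "your"}


def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, n_sentences=1):
    # Split after sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))

    def score(sentence):
        # A sentence scores higher when it contains frequent content words.
        return sum(freq[w] for w in content_words(sentence))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)


sample = ("I ordered an Optimus Prime figure. "
          "The package contained a Megatron figure instead. "
          "Please send the Optimus Prime figure I ordered.")

summary = extractive_summary(sample, n_sentences=1)
print(summary)
```

Every sentence in the result is copied from the input, whereas an abstractive model is free to rephrase. Summing frequencies favors longer sentences, so a more careful version would normalize by length.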

**2. Default model:** `sshleifer/distilbart-cnn-12-6`
- Abstractive model using DistilBART architecture
- Trained on CNN/DailyMail news summarization dataset

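**3. `max_length` and `min_length`:** Bound the length of the generated summary, measured in tokens. With `max_length=56`, the summary above ends mid-sentence, which is consistent with hitting that limit. Setting `min_length > max_length` is an infeasible constraint: depending on the installed `transformers` version, it triggers a warning or raises an error.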

**4. `clean_up_tokenization_spaces`:** Removes extra spaces around punctuation for cleaner output.

**5. Two summarization models:**
- Short texts: `facebook/bart-large-cnn`
- Long documents: `google/long-t5-tglobal-base` (handles up to 16k tokens)

---

## Translation


In [26]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])


Device set to use cpu


Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.


### Question 6: Machine Translation

**1. Helsinki-NLP/opus-mt-en-de architecture:**
- MarianMT (based on Marian NMT framework)
- OPUS = Open Parallel Corpus
- MT = Machine Translation

**2. English to French models:**
- `Helsinki-NLP/opus-mt-en-fr`
- `facebook/mbart-large-50-many-to-many-mmt`

**3. Bilingual vs Multilingual:**
- **Bilingual:** Trained on one language pair, typically higher quality for that pair
- **Multilingual:** Handles multiple languages, more versatile but may sacrifice some quality
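One practical consequence of the bilingual approach is that every translation direction needs its own checkpoint. The sketch below counts them for a small set of languages. It assumes the `Helsinki-NLP/opus-mt-{src}-{tgt}` naming pattern seen in this notebook; the generated names are illustrative, and not every pair is guaranteed to exist on the Hub.

```python
from itertools import permutations


def bilingual_checkpoints(languages):
    """List one checkpoint name per ordered (source, target) language pair."""
    return [f"Helsinki-NLP/opus-mt-{src}-{tgt}" for src, tgt in permutations(languages, 2)]


names = bilingual_checkpoints(["en", "de", "fr"])
for name in names:
    print(name)

# n languages need n * (n - 1) bilingual models, versus a single multilingual one.
print(len(names), "bilingual checkpoints vs. 1 multilingual checkpoint")
```

With three languages there are already six directions, which is the versatility argument for multilingual models.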

---

## Text Generation


In [29]:
from transformers import set_seed
set_seed(42)

generator = pipeline("text-generation", model="gpt2", pad_token_id=50256)
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer Service Response:\n" + response
outputs = generator(prompt, max_new_tokens=150, do_sample=True, top_k=50, top_p=0.95, no_repeat_ngram_size=2, truncation=True)
print(outputs[0]['generated_text'])


Device set to use cpu


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer Service Response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. The order status was not correct. Upon further review, it is clear that my order received defective parts. However, the order number was correct on the page and I was able to see the parts in stock. As I said, there is no exchange and it appears that the original order does not appear to have been filled. Therefore, you may be able (or please) contact me at a later time to clarify the status

### Question 7: Text Generation

**1. `set_seed(42)`:** Ensures reproducibility by seeding the random number generators of Python's `random` module, NumPy, and PyTorch. Without it, each run with `do_sample=True` can produce a different continuation because tokens are drawn at random.

**2. Generation parameters:**
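- **temperature:** Rescales the logits before sampling (below 1 sharpens the distribution toward likely tokens, above 1 flattens it for more varied text)
- **top_k:** Restricts sampling to the k most probable tokens (50 in the cell above)
- **do_sample:** Switches from greedy decoding to random sampling; `temperature`, `top_k`, and `top_p` only take effect when it is `True`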
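The parameters above, together with the `top_p` value passed in the cell, can be sketched for a single decoding step. This is a simplified illustration over a made-up four-token distribution; `sample_next` is a hypothetical helper and not the library's `generate` implementation.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, do_sample=True, rng=random):
    """Pick the next token id from raw logits for one decoding step."""
    if not do_sample:
        # Greedy decoding: always take the most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature rescales the logits before the softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top_k keeps only the k most probable tokens (0 means no limit).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # top_p keeps the smallest prefix whose renormalized mass reaches top_p.
    kept_total = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / kept_total
        if cumulative >= top_p:
            break

    # Draw one token from the surviving candidates, weighted by probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary

# Seeding the generator makes the sampled sequence repeatable.
rng_a = random.Random(42)
rng_b = random.Random(42)
draws_a = [sample_next(logits, top_k=3, top_p=0.95, rng=rng_a) for _ in range(10)]
draws_b = [sample_next(logits, top_k=3, top_p=0.95, rng=rng_b) for _ in range(10)]
print(draws_a == draws_b)
print(sample_next(logits, do_sample=False))
```

Two generators seeded with the same value yield the same draws, which is the effect `set_seed(42)` has on the notebook cell.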

**3. `pad_token_id = eos_token_id`:** GPT-2 has no dedicated pad token, so the cell passes `pad_token_id=50256`, which is GPT-2's end-of-sequence token ID. Reusing it for padding avoids the warning that generation would otherwise print.
