# TP 3: Hello World Transformers

### Apolline HADJAL

In [1]:
!pip install transformers



In [2]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

## Question 1

In [3]:
from transformers import pipeline
classifier = pipeline("text-classification")  # pass model="model-name" to choose a specific model

import pandas as pd
outputs = classifier(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu


| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901546 |


1. *What is a pipeline?* In Hugging Face Transformers, a pipeline is a high-level API that wraps model loading, tokenization, inference, and post-processing behind a single call: it preprocesses the input, runs it through the model, and formats the output.

2. *3 other tasks available:*

- question-answering
- summarization/text-generation
- translation

3. *What happens without specifying a model?* The pipeline falls back to a default model for the task and prints a warning, as in the output above (here distilbert/distilbert-base-uncased-finetuned-sst-2-english).
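
The three stages described in answer 1 can be sketched with a toy stand-in. This is not the Transformers implementation: `TOY_WEIGHTS`, `preprocess`, `forward`, `postprocess`, and `toy_pipeline` are hypothetical names, and the "model" is just a hand-written word-weight table. Only the shape of the flow (text in, list of `{"label", "score"}` dictionaries out) mirrors the cell above.

```python
import math

# Hypothetical word weights standing in for a trained model.
TOY_WEIGHTS = {"horror": -2.0, "unfortunately": -1.5, "hope": 0.5, "sorry": -0.5}

def preprocess(text):
    """Tokenization stand-in: split on whitespace, strip punctuation, lowercase."""
    return [tok.strip(".,!?").lower() for tok in text.split()]

def forward(tokens):
    """Model stand-in: sum the per-word weights into a single logit."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)

def postprocess(logit):
    """Post-processing stand-in: squash the logit and attach a label."""
    p_positive = 1.0 / (1.0 + math.exp(-logit))
    if p_positive >= 0.5:
        return {"label": "POSITIVE", "score": p_positive}
    return {"label": "NEGATIVE", "score": 1.0 - p_positive}

def toy_pipeline(text):
    """Chain the three stages, returning a list like the real pipeline does."""
    return [postprocess(forward(preprocess(text)))]

toy_result = toy_pipeline("Unfortunately, I discovered to my horror a Megatron figure!")
print(toy_result)
```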

## Question 2

Here is what we get after having run the first cell:

1. Default model: distilbert/distilbert-base-uncased-finetuned-sst-2-english (shown in the output above)

2. Dataset: The model card says it was fine-tuned on SST-2 (Stanford Sentiment Treebank), an English sentiment dataset.

3. Score field: The model's confidence in the predicted label, a probability between 0 and 1. Here the model is fairly confident that the text is NEGATIVE, with a score of about 0.90.

4. An emotion classification model: j-hartmann/emotion-english-distilroberta-base or bhadresh-savani/distilbert-base-uncased-emotion
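
To make the score field from answer 3 concrete, here is a minimal sketch of how a probability can be derived from raw model outputs, assuming the classifier applies a softmax over its two labels. The logits below are hypothetical values chosen for illustration, not the ones the model produced.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]
logits = [1.5, -0.7]  # hypothetical logits, one per label

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result)
```

The score is the probability of the winning label only, so for a two-label model it always lies between 0.5 and 1.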

## Question 3

In [4]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.87901 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 90 | 97 |
| 3 | MISC | 0.55657 | Mega | 208 | 212 |
| 4 | PER | 0.590256 | ##tron | 212 | 216 |
| 5 | ORG | 0.669692 | Decept | 253 | 259 |
| 6 | MISC | 0.498349 | ##icons | 259 | 264 |
| 7 | MISC | 0.775362 | Megatron | 350 | 358 |
| 8 | MISC | 0.987854 | Optimus Prime | 367 | 380 |
| 9 | PER | 0.812096 | Bumblebee | 502 | 511 |


1. aggregation_strategy="simple": Groups together tokens that belong to the same entity (e.g., "New York" as one entity instead of two separate tokens).

2. Entity types:

- ORG = Organization
- MISC = Miscellaneous
- LOC = Location
- PER = Person


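3. `##` prefix: Indicates a subword token. The tokenizer splits words into smaller pieces, and "##tron" means that "tron" continues the previous token "Mega".

4. Why split incorrectly? "Megatron" and "Decepticons" are fictional names that are not single entries in the tokenizer's vocabulary and are unlikely to appear in the training data (CoNLL-2003 is built from real-world news), so the tokenizer breaks them into subword pieces it knows. The model then assigns different labels to the pieces (MISC for "Mega" but PER for "##tron"), so they are not grouped into one entity.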

5. CoNLL-2003 dataset: A benchmark dataset for NER containing news articles annotated with named entities (persons, locations, organizations, miscellaneous).
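
The split rows in the table can be stitched back together by hand. The sketch below is not part of the pipeline: `merge_wordpieces` is a hypothetical helper that joins a piece to the previous one when it starts with `##` and its `start` offset equals the previous `end`. The rows are copied from the output above.

```python
def merge_wordpieces(entities):
    """Join '##' continuation pieces onto the preceding contiguous piece."""
    merged = []
    for ent in entities:
        word = ent["word"]
        is_continuation = (
            bool(merged)
            and word.startswith("##")
            and ent["start"] == merged[-1]["end"]
        )
        if is_continuation:
            prev = merged[-1]
            prev["word"] += word[2:]  # drop the '##' marker
            prev["end"] = ent["end"]
            prev["labels"].append(ent["entity_group"])
        else:
            merged.append({
                "word": word,
                "start": ent["start"],
                "end": ent["end"],
                "labels": [ent["entity_group"]],
            })
    return merged

# Rows 3 to 6 of the NER output above.
rows = [
    {"entity_group": "MISC", "word": "Mega", "start": 208, "end": 212},
    {"entity_group": "PER", "word": "##tron", "start": 212, "end": 216},
    {"entity_group": "ORG", "word": "Decept", "start": 253, "end": 259},
    {"entity_group": "MISC", "word": "##icons", "start": 259, "end": 264},
]

merged = merge_wordpieces(rows)
for item in merged:
    print(item["word"], item["labels"])
```

The merged rows keep both labels, which makes the disagreement between the pieces visible instead of silently picking one.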

## Question 4

In [5]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 335 | 358 | an exchange of Megatron |
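
The `start` and `end` fields in this output are character offsets into the context, so the answer is a span extracted from the letter rather than newly generated text. A small check, using the values copied from the output above and the same `text` as in cell 2 (named `context` here to keep the block self-contained):

```python
context = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

# Values copied from the question-answering output above.
output = {"score": 0.631292, "start": 335, "end": 358, "answer": "an exchange of Megatron"}

# Slicing the context with the offsets recovers the answer string.
span = context[output["start"]:output["end"]]
print(span)  # an exchange of Megatron
```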



## Question 5

In [6]:
# !pip install --upgrade "torch>=2.6"

In [7]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu
Your min_length=56 must be inferior than your max_length=45.


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.


1. Extractive vs Abstractive: Extractive selects key sentences from the original text. Abstractive generates new sentences that capture the meaning (like paraphrasing).

2. Default model: sshleifer/distilbart-cnn-12-6
- Type: Abstractive
- Architecture: BART (distilled version)
- Dataset: CNN/DailyMail news articles

3. max_length/min_length: Control the output length in tokens. Here the model's default min_length=56 is larger than max_length=45, so the pipeline prints the warning seen in the output above.

4. clean_up_tokenization_spaces: Removes extra spaces around punctuation for cleaner output.

5. Challenge models:
- Short texts: facebook/bart-large-cnn
- Long documents: google/pegasus-large or allenai/led-base-16384
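
To illustrate the extractive side of answer 1, here is a minimal sketch of a frequency-based extractive summarizer. It is a toy for comparison only and has nothing to do with how distilbart works: `extractive_summary` is a hypothetical helper that scores each sentence by the average frequency of its words and keeps the top ones in their original order.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by spaces."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def extractive_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

sample = (
    "I ordered an Optimus Prime action figure. "
    "I was sent an action figure of Megatron instead! "
    "I demand an exchange of Megatron for the Optimus Prime figure I ordered."
)
summary = extractive_summary(sample, n_sentences=2)
print(summary)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization; an abstractive model is free to rephrase.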

## Question 6

In [8]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])


Device set to use cpu


Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.


- Architecture: MarianMT, a Transformer encoder-decoder.
- Name: OPUS = Open Parallel Corpus, MT = Machine Translation.
- English to French models:
  - Helsinki-NLP/opus-mt-en-fr
  - t5-base (with a task prefix)
- Bilingual vs multilingual: A bilingual model is trained on one language pair (better quality for that pair). A multilingual model handles many pairs (more versatile, but potentially lower quality per pair).
- Task name relation: translation_en_to_de specifies the translation direction, matching the model's training (English to German).
- sacremoses: Library for tokenization and text normalization in NLP, particularly for preprocessing in Moses-style translation systems.
- Challenge, multilingual model: facebook/mbart-large-50-many-to-many-mmt supports 50 languages, i.e. 1,225 unordered language pairs (2,450 translation directions).
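
The pair count for a 50-language model can be checked with a few lines of arithmetic. The sketch distinguishes unordered pairs (English/German counted once) from translation directions (English to German and German to English counted separately), assuming every language can be translated into every other one.

```python
from itertools import combinations, permutations

n_languages = 50
languages = range(n_languages)

# Unordered pairs: {en, de} is counted once.
unordered_pairs = sum(1 for _ in combinations(languages, 2))

# Directions: en->de and de->en are counted separately.
directions = sum(1 for _ in permutations(languages, 2))

print(unordered_pairs, directions)  # 1225 2450
```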

## Question 7

In [9]:
from transformers import set_seed
set_seed(42)

generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. I would like to know if you know more about our service. Please let me know if we can arrange an exchange of Megatron for you.

The following quote from my customer service representative is from my review of the Optimus Prime action figure:

"Hi. I was a bit stunned when I saw the Optimus Prime action figure from your online store. I was hoping you could make me happy, but I was not able to

Default model: openai-community/gpt2

- Architecture: decoder-only Transformer
- Parameters: 124M (base model)
- Type: autoregressive generation

set_seed(42): Ensures reproducible results. Without it, sampled generations would differ on each run because of randomness.

Other parameters:

- temperature: Controls randomness (lower = more deterministic)
- top_k: Considers only the top k tokens at each step
- do_sample: If True, samples probabilistically; if False, uses greedy decoding

Truncation warning: It appears because `max_length` was passed without `truncation=True`, so the tokenizer falls back to the 'longest_first' strategy. It does not mean the prompt exceeded GPT-2's maximum length of 1024 tokens. The last warning also shows that `max_length=200` was overridden by `max_new_tokens` (=256).

pad_token_id = eos_token_id: GPT-2 doesn't have a padding token, so the end-of-sequence token is reused. This is necessary for batch processing.

Trade-offs: Larger models generate higher-quality, more coherent text but are slower and require more memory and compute.
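
The effect of `temperature`, `top_k`, and `do_sample` on the choice of the next token can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's generation code: `next_token_probs` and `pick_token` are hypothetical helpers, and the logits are made-up values for a four-token vocabulary.

```python
import math
import random

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Softmax over logits after temperature scaling and optional top-k filtering."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep the k largest logits (ties at the cutoff are all kept).
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, do_sample=True, temperature=1.0, top_k=None, rng=random):
    """Greedy argmax when do_sample is False, otherwise sample from the distribution."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = next_token_probs(logits, temperature, top_k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary

print(next_token_probs(logits, temperature=1.0))
print(next_token_probs(logits, temperature=0.5))  # sharper: more mass on token 0
print(next_token_probs(logits, top_k=2))          # tokens 2 and 3 get probability 0
print(pick_token(logits, do_sample=False))        # greedy choice
```

Passing a seeded generator, for example `rng=random.Random(42)`, makes the sampled choice repeatable, which is the same idea as `set_seed(42)` in the cell above.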