# Transformers

[Hugging Face Docs Link](https://huggingface.co/docs/transformers/en/quicktour)

- The pipeline function

In [None]:
%pip install transformers datasets evaluate tensorflow tensorflow-metal


In [21]:
from transformers import pipeline

- **Text classification (sentiment analysis)**

In [22]:
classifier = pipeline("sentiment-analysis")

result = classifier("I feel really good today.")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


[{'label': 'POSITIVE', 'score': 0.999870777130127}]
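The warning in the output above recommends naming the model and revision explicitly. A minimal sketch of that, assuming the checkpoint and revision printed in the log are the ones you want to pin. `MODEL_ID`, `REVISION` and `build_classifier` are hypothetical names introduced here, and the import is deferred so that defining the helper does not download anything.

```python
# Checkpoint and revision copied from the warning printed by the default pipeline.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_classifier():
    """Create a sentiment-analysis pipeline pinned to one checkpoint and revision."""
    # Imported lazily: loading transformers and the model weights is the slow part.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)
```

Calling `build_classifier()` once and reusing the returned object avoids reloading the weights in every cell.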


In [23]:
classifier = pipeline("sentiment-analysis")

results = classifier("I feel really good today.")
for result in results:
    print(f"label: {result['label']}, with score: {result['score'] * 100:.2f}%")

label: POSITIVE, with score: 99.99%


In [24]:
classifier = pipeline("sentiment-analysis")

results = classifier(["I've been waiting for a HuggingFace course my whole life.", "I hope I don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")


label: POSITIVE, with score: 0.9598
label: NEGATIVE, with score: 0.7657
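When a list is passed in, the pipeline returns one result per input, in the same order. The sketch below pairs each text with its prediction. It uses the rounded scores from the output above as sample data instead of calling the model, and `pair_with_inputs` is a hypothetical helper name.

```python
texts = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hope I don't hate it.",
]
# Sample data: labels and rounded scores copied from the output above.
results = [
    {"label": "POSITIVE", "score": 0.9598},
    {"label": "NEGATIVE", "score": 0.7657},
]


def pair_with_inputs(texts, results):
    """Attach each prediction to the text that produced it."""
    if len(texts) != len(results):
        raise ValueError("expected one result per input text")
    return [{"text": text, **result} for text, result in zip(texts, results)]


for row in pair_with_inputs(texts, results):
    print(f"{row['label']} ({row['score']:.4f}): {row['text']}")
```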


- **Zero-shot classification (classify text against labels supplied at inference time, without task-specific training)**

In [29]:
classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

# print(f"Labels: {result['labels']}")
# print(f"Scores: {result['scores']}")

for label, score in zip(result['labels'], result['scores']):
    print(f"{label}: {score*100:.2f}%")

No model was supplied, defaulted to FacebookAI/roberta-large-mnli and revision 2a8f12d (https://huggingface.co/FacebookAI/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFRobertaForSequenceClassification.

All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


education: 95.62%
business: 2.70%
politics: 1.68%
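The result is a dictionary with `sequence`, `labels` and `scores` keys. A small sketch of turning it into a single decision follows. The scores are the percentages printed above written as fractions, and the `min_score` cutoff is a hypothetical choice, not something the pipeline applies.

```python
# Sample data in the shape of the zero-shot result; scores taken from the output above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.9562, 0.0270, 0.0168],
}


def top_label(result, min_score=0.5):
    """Return the best-scoring label, or None if it does not reach min_score."""
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score >= min_score else None
```

Returning `None` below the cutoff makes "none of the candidate labels fit" an explicit outcome rather than forcing a weak match.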


- **Text generation**

In [33]:
generator = pipeline('text-generation')
generator("Hello I am a transformer and I am used for") 
# print(result[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello I am a transformer and I am used for my electric light bulb and I am also home decorating. I found this lamp in my room and I think it looks good! I am a very happy camper and most recently a camper that'}]

In [34]:
generator = pipeline('text-generation', model='distilgpt2')
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=2)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, using the language. I had a few words for my second language at the back in 2012, so I just"},
 {'generated_text': "Hello, I'm a language model, but I don't have any language. There are lots of languages to build and write that together; there are"}]
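As the output shows, `generated_text` repeats the prompt before the new tokens. A minimal sketch for keeping only the continuation, using the first generation above as sample data; `continuation` is a hypothetical helper name.

```python
prompt = "Hello, I'm a language model,"
# Sample data: first generation copied from the output above.
outputs = [
    {
        "generated_text": "Hello, I'm a language model, using the language. "
        "I had a few words for my second language at the back in 2012, so I just"
    },
]


def continuation(prompt, generated_text):
    """Return only the newly generated part of the text."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # If the prompt is not a prefix, return the text unchanged.
    return generated_text


print(continuation(prompt, outputs[0]["generated_text"]))
```

The text-generation pipeline also accepts `return_full_text=False`, which should have the same effect without post-processing.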

- **Text completion (mask filling)**

In [36]:
unmasker = pipeline("fill-mask")
unmasker("This is a good <mask>.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFRobertaForMaskedLM.

All the weights of TFRobertaForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


[{'score': 0.09290468692779541,
  'token': 386,
  'token_str': ' start',
  'sequence': 'This is a good start.'},
 {'score': 0.07752888649702072,
  'token': 1114,
  'token_str': ' idea',
  'sequence': 'This is a good idea.'}]
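The `token_str` values start with a space because this tokenizer encodes the preceding space as part of the token. A short sketch of extracting clean candidate words, using the predictions above as sample data; `candidate_words` is a hypothetical helper name.

```python
# Sample data: predictions copied from the output above.
predictions = [
    {"score": 0.09290468692779541, "token": 386, "token_str": " start",
     "sequence": "This is a good start."},
    {"score": 0.07752888649702072, "token": 1114, "token_str": " idea",
     "sequence": "This is a good idea."},
]


def candidate_words(predictions):
    """Return (word, score) pairs with surrounding whitespace stripped from each token."""
    return [(p["token_str"].strip(), p["score"]) for p in predictions]


for word, score in candidate_words(predictions):
    print(f"{word}: {score:.3f}")
```

The mask placeholder differs between models (`<mask>` here), so reading it from `unmasker.tokenizer.mask_token` is safer than hard-coding it.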

- **Token classification (identify entities such as person, organisation, location in a sentence)** 

In [37]:
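# aggregation_strategy="simple" merges sub-word pieces into whole entities; it replaces the deprecated grouped_entities=True
ner = pipeline('ner', aggregation_strategy="simple")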
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.



[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
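The `start` and `end` values are character offsets into the input sentence. The sketch below uses them to collect the matched substrings per entity type, with the entities above as sample data (scores omitted); `group_by_type` is a hypothetical helper name.

```python
from collections import defaultdict

sentence = "My name is Sylvain and I work at Hugging Face in Brooklyn."
# Sample data: entities copied from the output above, without the scores.
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]


def group_by_type(sentence, entities):
    """Map each entity type to the exact substrings it covers in the sentence."""
    grouped = defaultdict(list)
    for entity in entities:
        span = sentence[entity["start"]:entity["end"]]
        grouped[entity["entity_group"]].append(span)
    return dict(grouped)


print(group_by_type(sentence, entities))
```

Slicing by offset, rather than trusting `word`, preserves the original casing and spacing of the source text.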

- **Question answering**

In [38]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="What is the capital of France?", 
    context="France is the capital of France.")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


All PyTorch model weights were used when initializing TFDistilBertForQuestionAnswering.

All the weights of TFDistilBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.


{'score': 0.9357569217681885, 'start': 0, 'end': 6, 'answer': 'France'}
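This model is extractive: it answers by selecting a span of the context, and `start` and `end` are the character offsets of that span. The context used here never mentions Paris, so "France" is the only kind of answer it can return. A context such as "Paris is the capital of France." would give the model a correct span to extract. The sketch below checks the offsets against the result above; `answer_span` is a hypothetical helper name.

```python
context = "France is the capital of France."
# Sample data: result copied from the output above.
result = {"score": 0.9357569217681885, "start": 0, "end": 6, "answer": "France"}


def answer_span(context, result):
    """Recover the answer by slicing the context with the returned offsets."""
    return context[result["start"]:result["end"]]


print(answer_span(context, result))
```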

- **Summarization**

In [39]:
summarizer = pipeline("summarization")
summarizer("""The Great Barrier Reef, stretching over 2,300 kilometers along Australia's eastern coast, is the world's largest coral reef ecosystem and the only living structure visible from space. Home to more than 1,500 species of tropical fish, 400 types of hard coral, one-third of the world's soft corals, 134 species of sharks and rays, and more than 30 species of whales and dolphins, it is one of Earth's most complex natural ecosystems. The reef faces multiple threats including climate change, ocean acidification, pollution from agricultural runoff, and coastal development. Scientists estimate that over 50% of the reef's coral cover has been lost since the 1980s, highlighting the urgent need for conservation efforts to protect this natural wonder for future generations. The reef also generates approximately $6.4 billion annually through tourism and supports over 64,000 jobs in Australia. Indigenous communities have deep cultural connections to the reef, with their traditional knowledge playing a crucial role in understanding and preserving the ecosystem. Recent restoration projects include coral farming, crown-of-thorns starfish control programs, and innovative heat-resistant coral breeding experiments. The reef's health serves as a critical indicator of global ocean health and climate change impacts, making its preservation not just an Australian priority but a global imperative.""")

No model was supplied, defaulted to google-t5/t5-small and revision df1b051 (https://huggingface.co/google-t5/t5-small).
Using a pipeline without specifying a model name and revision in production is not recommended.


All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.



[{'summary_text': "the reef is home to more than 1,500 species of tropical fish, 400 types of hard coral, one-third of the world's soft corals, 134 species of sharks and rays, and more than 30 species of whales and dolphins . the reef faces multiple threats including climate change, ocean acidification, pollution from agricultural runoff, and coastal development ."}]
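The summary above comes back lower-cased at sentence starts and with a space before each full stop. A minimal post-processing sketch, assuming that only this spacing and capitalization need fixing; `tidy_summary` is a hypothetical helper name, not part of the library.

```python
import re

# Sample data: summary text copied from the output above.
raw = (
    "the reef is home to more than 1,500 species of tropical fish, 400 types of hard coral, "
    "one-third of the world's soft corals, 134 species of sharks and rays, and more than "
    "30 species of whales and dolphins . the reef faces multiple threats including climate "
    "change, ocean acidification, pollution from agricultural runoff, and coastal development ."
)


def tidy_summary(text):
    """Remove whitespace before sentence-ending punctuation and capitalize sentence starts."""
    text = re.sub(r"\s+([.!?])", r"\1", text.strip())
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s[:1].upper() + s[1:] for s in sentences if s)


print(tidy_summary(raw))
```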

- **Translation**

In [41]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Je suis un étudiant en informatique.")


All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-fr-en.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


ValueError: This tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed in order to use this tokenizer.
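The error above says that the tokenizer for this model needs the `sentencepiece` package, which the install cell at the top does not include. A small sketch for checking whether it is available before building the pipeline; `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name):
    """Return True if the named top-level module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


if not has_module("sentencepiece"):
    print("sentencepiece is missing; run `%pip install sentencepiece` and restart the kernel.")
```

Restarting the kernel after installing is usually needed so that `transformers` detects the new package.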

- **Speech recognition**

In [None]:
from datasets import load_dataset, Audio

# create a speech-recognition pipeline
speech_recognizer = pipeline('automatic-speech-recognition', model='facebook/wav2vec2-base-960h')

# load an audio dataset
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

# resample the audio to the sampling rate facebook/wav2vec2-base-960h was trained on
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))

# transcribe the first four samples
result = speech_recognizer(dataset[:4]["audio"])
print([d["text"] for d in result])
