# Transformers

[Hugging Face Docs Link](https://huggingface.co/docs/transformers/en/quicktour)

- **The pipeline function**

In [5]:
%pip install transformers datasets evaluate tensorflow tensorflow-metal

Note: you may need to restart the kernel to use updated packages.


In [6]:
from transformers import pipeline

- **Text classification (sentiment analysis)**

In [7]:
classifier = pipeline("sentiment-analysis")

result = classifier("I feel really good today.")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


[{'label': 'POSITIVE', 'score': 0.999870777130127}]
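The log above warns against relying on the default checkpoint in production. A minimal sketch of pinning the model and revision follows, using the checkpoint name and revision hash printed in that log. `PINNED`, `pinned_kwargs`, and `make_pipeline` are hypothetical helper names, not part of transformers; `model` and `revision` are existing arguments of `pipeline`.

```python
# Checkpoint and revision copied from the "No model was supplied" log line
# printed by the default sentiment-analysis pipeline.
PINNED = {
    "sentiment-analysis": (
        "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        "714eb0f",
    ),
}


def pinned_kwargs(task):
    """Return keyword arguments that pin the checkpoint for a task."""
    model, revision = PINNED[task]
    return {"task": task, "model": model, "revision": revision}


def make_pipeline(task):
    """Build a pipeline with an explicit model and revision (hypothetical helper)."""
    # Imported lazily so the mapping above can be inspected without
    # loading transformers.
    from transformers import pipeline

    return pipeline(**pinned_kwargs(task))


print(pinned_kwargs("sentiment-analysis"))
```

Calling `make_pipeline("sentiment-analysis")` should then load the same checkpoint without the "No model was supplied" warning. Other tasks can be added to `PINNED` from their own log lines.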


In [8]:
classifier = pipeline("sentiment-analysis")

results = classifier("I feel really good today.")
for result in results:
    print(f"label: {result['label']}, with score: {result['score'] * 100:.2f}%")


label: POSITIVE, with score: 99.99%


In [9]:
classifier = pipeline("sentiment-analysis")

results = classifier(["I've been waiting for a HuggingFace course my whole life.", "I hope I don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")



label: POSITIVE, with score: 0.9598
label: NEGATIVE, with score: 0.7657


- **Zero-shot classification (assign text to one of several candidate labels without task-specific training)**

In [10]:
classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

# print(f"Labels: {result['labels']}")
# print(f"Scores: {result['scores']}")

for label, score in zip(result['labels'], result['scores']):
    print(f"{label}: {score*100:.2f}%")

No model was supplied, defaulted to FacebookAI/roberta-large-mnli and revision 2a8f12d (https://huggingface.co/FacebookAI/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFRobertaForSequenceClassification.

All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


education: 95.62%
business: 2.70%
politics: 1.68%
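The default zero-shot model here is an NLI checkpoint (`roberta-large-mnli`). As far as I understand the pipeline, each candidate label is turned into a hypothesis such as "This example is education.", the model scores how strongly the input entails each hypothesis, and those entailment scores are normalised across the labels. The sketch below mimics only that last step in plain Python. The logits are hypothetical placeholders, and `softmax` and `rank_labels` are illustrative helpers, not library functions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(entailment_logits):
    """Turn per-label entailment logits into sorted (label, score) pairs."""
    labels = list(entailment_logits)
    scores = softmax([entailment_logits[label] for label in labels])
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


candidate_labels = ["education", "politics", "business"]
# One NLI hypothesis per candidate label.
hypotheses = [f"This example is {label}." for label in candidate_labels]

# HYPOTHETICAL entailment logits, one per hypothesis; a real run would
# obtain these from the NLI model.
logits = {"education": 4.0, "politics": 0.0, "business": 0.5}

for label, score in rank_labels(logits):
    print(f"{label}: {score * 100:.2f}%")
```

Because the scores are normalised across the candidates, they always sum to 1, which is why adding or removing a candidate label changes every score.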


- **Text generation**

In [11]:
generator = pipeline('text-generation')
generator("Hello I am a transformer and I am used for") 
# print(result[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello I am a transformer and I am used for powering things around in my garage. Why are you so passionate about this project? I recently stopped by to check out the transformer, and it looks gorgeous (in black?). The thing I like about it'}]

In [12]:
generator = pipeline('text-generation', model='distilgpt2')
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=2)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, so when I'm thinking about Python there are some ways of representing both of these kinds of statements.\n\n"},
 {'generated_text': "Hello, I'm a language model, I'm a compiler, I'm running the compiler.\n\nHere are your instructions for creating a compiler."}]
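Each `generated_text` above starts with the prompt itself. A small sketch of separating the prompt from the newly generated continuation, using the two outputs from this run; `continuation` is a hypothetical helper name.

```python
def continuation(prompt, generated_text):
    """Return the part of generated_text that follows the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt is not a prefix.
    return generated_text


prompt = "Hello, I'm a language model,"
# Outputs copied from the run above.
outputs = [
    {"generated_text": "Hello, I'm a language model, so when I'm thinking about Python there are some ways of representing both of these kinds of statements.\n\n"},
    {"generated_text": "Hello, I'm a language model, I'm a compiler, I'm running the compiler.\n\nHere are your instructions for creating a compiler."},
]

for out in outputs:
    print(repr(continuation(prompt, out["generated_text"]).strip()))
```

The text-generation pipeline also accepts `return_full_text=False`, which should return only the continuation and make this helper unnecessary.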

- **Text completion (mask filling)**

In [13]:
unmasker = pipeline("fill-mask")
unmasker("This is a good <mask>.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFRobertaForMaskedLM.

All the weights of TFRobertaForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


[{'score': 0.09290468692779541,
  'token': 386,
  'token_str': ' start',
  'sequence': 'This is a good start.'},
 {'score': 0.07752888649702072,
  'token': 1114,
  'token_str': ' idea',
  'sequence': 'This is a good idea.'}]
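The `<mask>` placeholder is specific to the model's tokenizer: RoBERTa-style checkpoints such as the default here use `<mask>`, while BERT-style checkpoints use `[MASK]`. A hedged sketch of building the prompt from a template so that the same text works with either family; `masked_prompt` and `unmask` are hypothetical helpers, and `unmask` assumes it is given a fill-mask pipeline object.

```python
def masked_prompt(template, mask_token):
    """Fill the {mask} placeholder with a model-specific mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one {mask} placeholder")
    return template.replace("{mask}", mask_token)


def unmask(unmasker, template, top_k=2):
    """Query a fill-mask pipeline using its tokenizer's own mask token."""
    prompt = masked_prompt(template, unmasker.tokenizer.mask_token)
    return unmasker(prompt, top_k=top_k)


template = "This is a good {mask}."
print(masked_prompt(template, "<mask>"))  # RoBERTa-style token, as used above
print(masked_prompt(template, "[MASK]"))  # BERT-style token
```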

- **Token classification (identify entities such as person, organisation, location in a sentence)** 

In [14]:
# aggregation_strategy='simple' replaces the deprecated grouped_entities=True
ner = pipeline('ner', aggregation_strategy='simple')
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
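The `start` and `end` values in the output are character offsets into the input string, so each entity can be recovered by slicing. A minimal sketch that uses the offsets from this run to annotate the sentence; `annotate` is a hypothetical helper.

```python
sentence = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]


def annotate(text, spans):
    """Wrap each entity span as [text](TYPE), working right to left."""
    # Processing from the end keeps the earlier offsets valid after each edit.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        piece = text[span["start"]:span["end"]]
        tagged = f"[{piece}]({span['entity_group']})"
        text = text[:span["start"]] + tagged + text[span["end"]:]
    return text


print(annotate(sentence, entities))
# My name is [Sylvain](PER) and I work at [Hugging Face](ORG) in [Brooklyn](LOC).
```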

- **Question answering**

In [15]:
question_answerer = pipeline("question-answering")
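# Extractive QA copies its answer from the context. This context never
# mentions Paris, so the best span the model can return is "France".
question_answerer(
    question="What is the capital of France?",
    context="France is the capital of France.")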

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForQuestionAnswering.

All the weights of TFDistilBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.


{'score': 0.9357569217681885, 'start': 0, 'end': 6, 'answer': 'France'}

- **Summarization**

In [16]:
summarizer = pipeline("summarization")
summarizer("""The Great Barrier Reef, stretching over 2,300 kilometers along Australia's eastern coast, is the world's largest coral reef ecosystem and the only living structure visible from space. Home to more than 1,500 species of tropical fish, 400 types of hard coral, one-third of the world's soft corals, 134 species of sharks and rays, and more than 30 species of whales and dolphins, it is one of Earth's most complex natural ecosystems. The reef faces multiple threats including climate change, ocean acidification, pollution from agricultural runoff, and coastal development. Scientists estimate that over 50% of the reef's coral cover has been lost since the 1980s, highlighting the urgent need for conservation efforts to protect this natural wonder for future generations. The reef also generates approximately $6.4 billion annually through tourism and supports over 64,000 jobs in Australia. Indigenous communities have deep cultural connections to the reef, with their traditional knowledge playing a crucial role in understanding and preserving the ecosystem. Recent restoration projects include coral farming, crown-of-thorns starfish control programs, and innovative heat-resistant coral breeding experiments. The reef's health serves as a critical indicator of global ocean health and climate change impacts, making its preservation not just an Australian priority but a global imperative.""")

No model was supplied, defaulted to google-t5/t5-small and revision df1b051 (https://huggingface.co/google-t5/t5-small).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.

[{'summary_text': "the reef is home to more than 1,500 species of tropical fish, 400 types of hard coral, one-third of the world's soft corals, 134 species of sharks and rays, and more than 30 species of whales and dolphins . the reef faces multiple threats including climate change, ocean acidification, pollution from agricultural runoff, and coastal development ."}]

- **Translation**

In [17]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Je suis un étudiant en informatique.")


All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-fr-en.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


[{'translation_text': "I'm a computer student."}]

- **Speech recognition**

In [18]:
# from datasets import load_dataset, Audio

# # iterate over an entire dataset
# speech_recognizer = pipeline('automatic-speech-recognition', model='facebook/wav2vec2-base-960h')

# # load an audio dataset
# dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

# # make sure the sampling rate of the dataset matches the sampling rate facebook/wav2vec2-base-960h was trained on
# dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))

# result = speech_recognizer(dataset[:4]["audio"])
# print([d["text"] for d in result])


- **Model and Tokenizer in the pipeline**

In [19]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

In [20]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [26]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")
print(f"Label: {result[0]['label']}, Score: {result[0]['score'] * 100:.2f}%")

Label: 5 stars, Score: 72.73%
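When a model and tokenizer are passed explicitly, the pipeline still handles the steps around the model: it tokenizes the text, runs the model to get one logit per class, converts the logits to probabilities, and maps the best class index to a label name. The sketch below reproduces only the last two steps in plain Python. The logits are hypothetical, and the `ID2LABEL` mapping is an assumption based on the "5 stars" label printed above; the real mapping lives in the checkpoint's config.

```python
import math

# ASSUMED label names for the nlptown checkpoint ("1 star" ... "5 stars").
ID2LABEL = {0: "1 star", 1: "2 stars", 2: "3 stars", 3: "4 stars", 4: "5 stars"}


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Convert raw logits into the pipeline-style {'label', 'score'} dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# HYPOTHETICAL logits for one sentence, one value per star rating.
logits = [-2.1, -1.8, -0.3, 1.9, 3.0]
result = postprocess(logits, ID2LABEL)
print(f"Label: {result['label']}, Score: {result['score'] * 100:.2f}%")
```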
