# **🤗 Datasets, Models and Pipelines**

Learning Objectives
1. Apply existing models to a range of common applications.
2. Understand basic prompt engineering.
3. Understand search vs. sampling for LLM inference.
4. Get familiar with the main Hugging Face abstractions: datasets, pipelines, tokenizers, and models.

The goal of this notebook is to get your feet wet with several LLM applications and to show how easy it can be to get started with LLMs.  
As you go through the examples, note the datasets, models, APIs, and options used.  These simple examples can be starting points when you need to build your own application.

In [2]:
from datasets import load_dataset

from transformers import pipeline

## **Task 1 - Summarization**

Summarization can take two forms:
- **`Extractive`** - Selecting representative pieces of the original text
- **`Abstractive`** - Generating new text (i.e. novel) summaries

Here we will use a model which does *abstractive* summarization.
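For contrast with the abstractive model used in this section, here is a minimal sketch of the *extractive* idea. It is a toy word-frequency heuristic that we made up for illustration (the function name `extractive_summary` and the example text are assumptions, not part of any library): it scores each sentence by how frequent its words are in the whole text and returns the best sentences verbatim.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Toy extractive summarizer: keep the sentences whose words are most frequent."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased).
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    # Pick the top-scoring sentences, then restore their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)


# Made-up text for illustration only.
toy_text = (
    "Flood water damaged shops in the town. "
    "The flood also closed roads near the town. "
    "A local cafe owner asked for better defences."
)
print(extractive_summary(toy_text, num_sentences=1))
```

The output is always copied from the input, which is the defining property of extractive summarization. An abstractive model such as `t5-small` instead generates new wording.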

In this section, we will use:
- **Data**: [xsum](https://huggingface.co/datasets/xsum) dataset, which provides a set of BBC articles and summaries.
- **Model**: [t5-small](https://huggingface.co/t5-small) model, which has 60 million parameters (242MB for PyTorch).  T5 is an encoder-decoder model created by Google which supports several tasks such as summarization, translation, Q&A, and text classification.  For more details, see the [Google blog post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html), [code on GitHub](https://github.com/google-research/text-to-text-transfer-transformer), or the [research paper](https://arxiv.org/pdf/1910.10683.pdf).

**Steps**
1. Load the data
```python
from datasets import load_dataset

dataset = load_dataset('xsum')
```
2. Define the pipeline by specifying the task and model
```python
from transformers import pipeline

summarizer = pipeline(
                task="summarization",
                model="t5-small"
)
```
3. Use `summarizer` to summarize the articles
```python
summarizer(article)
```

In [4]:
# Step 1 - Load the data

# Note: load_dataset downloads the data on first use and caches it locally.
# Pass cache_dir=... if you want to reuse a predownloaded copy.
xsum_dataset = load_dataset("xsum")

# xsum_dataset = load_dataset("xsum", split="train")

In [5]:
xsum_dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [6]:
xsum_sample = xsum_dataset['train'].select(range(10))

xsum_sample

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 10
})

In [7]:
xsum_sample.to_pandas()

Unnamed: 0,document,summary,id
0,"The full cost of damage in Newton Stewart, one...",Clean-up operations are continuing across the ...,35232142
1,A fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,40143035
2,Ferrari appeared in a position to challenge un...,Lewis Hamilton stormed to pole position at the...,35951548
3,"John Edward Bates, formerly of Spalding, Linco...",A former Lincolnshire Police officer carried o...,36266422
4,Patients and staff were evacuated from Cerahpa...,An armed man who locked himself into a room at...,38826984
5,Simone Favaro got the crucial try with the las...,Defending Pro12 champions Glasgow Warriors bag...,34540833
6,"Veronica Vanessa Chango-Alverez, 31, was kille...",A man with links to a car that was involved in...,20836172
7,Belgian cyclist Demoitie died after a collisio...,Welsh cyclist Luke Rowe says changes to the sp...,35932467
8,"Gundogan, 26, told BBC Sport he ""can see the f...",Manchester City midfielder Ilkay Gundogan says...,40758845
9,The crash happened about 07:20 GMT at the junc...,A jogger has been hit by an unmarked police ca...,30358490


In [8]:
# Define the pipeline by specifying the task and model

summarizer = pipeline(
    task="summarization",
    model="t5-small",
    min_length=20,
    max_length=40,
    truncation=True
)

# min_length and max_length specify the length of summary to generate

# If we do not set truncation=True, we get the following warning during inference:
# Token indices sequence length is longer than the specified maximum 
# sequence length for this model (541 > 512). Running this sequence 
# through the model will result in indexing errors
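To make the effect of `truncation=True` concrete, here is a simplified sketch of what truncation does to an over-long input. It uses a plain list of integers as a stand-in for token ids (the helper `truncate_tokens` is hypothetical; a real tokenizer also has to account for special tokens), with the 541 and 512 figures taken from the warning above.

```python
from typing import List


def truncate_tokens(token_ids: List[int], max_length: int = 512) -> List[int]:
    """Keep only the first max_length token ids and drop the rest."""
    return token_ids[:max_length]


# Stand-in for an article that tokenizes to 541 ids, as in the warning above.
article_token_ids = list(range(541))
model_input = truncate_tokens(article_token_ids, max_length=512)
print(len(article_token_ids), "->", len(model_input))
```

One consequence worth keeping in mind: anything past the cut-off never reaches the model, so the summary can only reflect the beginning of a long article.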

In [9]:
# Passing a single article

summarizer(xsum_sample["document"][0])

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]

In [10]:
# Passing a batch of articles

results = summarizer(xsum_sample["document"])

In [11]:
results

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'},
 {'summary_text': 'a fire alarm went off at the Holiday Inn in Hope Street on Saturday . guests were asked to leave the hotel . the two buses were parked side-by-side in'},
 {'summary_text': 'Sebastian Vettel will start third ahead of team-mate Kimi Raikkonen . stewards only handed Hamilton a reprimand after governing body said "n'},
 {'summary_text': 'the 67-year-old is accused of committing the offences between March 1972 and October 1989 . he denies all the charges, including two counts of indecency'},
 {'summary_text': 'a man receiving psychiatric treatment at the clinic threatened to shoot himself and others . the incident comes amid tension in Istanbul following several attacks in crowded areas .'},
 {'summary_text': 'Gregor Townsend gave a debut to powerhouse wing Taqele Naiyaravor

In [12]:
# Comparing the generated summaries with the ground truth

import pandas as pd

generated_summaries = pd.DataFrame(results)

generated_summaries

Unnamed: 0,summary_text
0,the full cost of damage in Newton Stewart is s...
1,a fire alarm went off at the Holiday Inn in Ho...
2,Sebastian Vettel will start third ahead of tea...
3,the 67-year-old is accused of committing the o...
4,a man receiving psychiatric treatment at the c...
5,Gregor Townsend gave a debut to powerhouse win...
6,"Veronica Vanessa Chango-Alverez, 31, was kille..."
7,the 25-year-old was hit by a motorbike during ...
8,gundogan will not be fit for the start of the ...
9,the crash happened about 07:20 GMT at the junc...


In [13]:
sample_df = xsum_sample.to_pandas()

In [14]:
sample_df["generated_summaries"] = generated_summaries['summary_text']

sample_df

Unnamed: 0,document,summary,id,generated_summaries
0,"The full cost of damage in Newton Stewart, one...",Clean-up operations are continuing across the ...,35232142,the full cost of damage in Newton Stewart is s...
1,A fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,40143035,a fire alarm went off at the Holiday Inn in Ho...
2,Ferrari appeared in a position to challenge un...,Lewis Hamilton stormed to pole position at the...,35951548,Sebastian Vettel will start third ahead of tea...
3,"John Edward Bates, formerly of Spalding, Linco...",A former Lincolnshire Police officer carried o...,36266422,the 67-year-old is accused of committing the o...
4,Patients and staff were evacuated from Cerahpa...,An armed man who locked himself into a room at...,38826984,a man receiving psychiatric treatment at the c...
5,Simone Favaro got the crucial try with the las...,Defending Pro12 champions Glasgow Warriors bag...,34540833,Gregor Townsend gave a debut to powerhouse win...
6,"Veronica Vanessa Chango-Alverez, 31, was kille...",A man with links to a car that was involved in...,20836172,"Veronica Vanessa Chango-Alverez, 31, was kille..."
7,Belgian cyclist Demoitie died after a collisio...,Welsh cyclist Luke Rowe says changes to the sp...,35932467,the 25-year-old was hit by a motorbike during ...
8,"Gundogan, 26, told BBC Sport he ""can see the f...",Manchester City midfielder Ilkay Gundogan says...,40758845,gundogan will not be fit for the start of the ...
9,The crash happened about 07:20 GMT at the junc...,A jogger has been hit by an unmarked police ca...,30358490,the crash happened about 07:20 GMT at the junc...
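Reading summaries side by side is useful, but a rough number can help too. The sketch below computes a ROUGE-1-style word overlap between a reference and a generated summary. It is a simplified stand-in written for this notebook (the function `unigram_overlap` and both example sentences are made up); for real evaluation you would likely want an established ROUGE implementation.

```python
import re
from collections import Counter


def unigram_overlap(reference: str, candidate: str) -> dict:
    """ROUGE-1-style precision, recall and F1 based on shared word counts."""
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up sentences for illustration only (not taken from xsum).
reference = "the player hopes to return from injury within weeks"
candidate = "the player will return within weeks"
print(unigram_overlap(reference, candidate))
```

Word overlap rewards copying and penalizes valid paraphrases, so treat it as a coarse signal rather than a measure of summary quality.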


In [15]:
# As we have a sample containing 10 data points, idx can be 0 to 9

idx = 8
print("Full Document:\n", sample_df['document'][idx])
print()
print("Original Summary:\n", sample_df['summary'][idx])
print()
print("Generated Summary:\n", sample_df['generated_summaries'][idx])

Full Document:
 Gundogan, 26, told BBC Sport he "can see the finishing line" after tearing cruciate knee ligaments in December, but will not rush his return.
The German missed the 2014 World Cup following back surgery that kept him out for a year, and sat out Euro 2016 because of a dislocated kneecap.
He said: "It is heavy mentally to accept that."
Gundogan will not be fit for the start of the Premier League season at Brighton on 12 August but said his recovery time is now being measured in "weeks" rather than months.
He told BBC Sport: "It is really hard always to fall and fight your way back. You feel good and feel ready, then you get the next kick.
"The worst part is behind me now. I want to feel ready when I am fully back. I want to feel safe and confident. I don't mind if it is two weeks or six."
Gundogan made 15 appearances and scored five goals in his debut season for City following his £20m move from Borussia Dortmund.
He is eager to get on the field again and was impressed at 

## **Task 2 - Sentiment Analysis**

Sentiment analysis is a text classification task of estimating whether a piece of text is positive, negative, or another "sentiment" label.  The precise set of sentiment labels can vary across applications.

**Background reading**: See the Hugging Face [task page on text classification](https://huggingface.co/tasks/text-classification) or [Wikipedia on sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis).

In this section, we will use:
- **Data**: [poem sentiment](https://huggingface.co/datasets/poem_sentiment) dataset, which provides lines from poems tagged with sentiments `negative` (0), `positive` (1), `no_impact` (2), or `mixed` (3).
- **Model**: [fine-tuned version of BERT](https://huggingface.co/nickwong64/bert-base-uncased-poems-sentiment).  BERT, or Bidirectional Encoder Representations from Transformers, is an encoder-only model from Google usable for 11+ tasks such as sentiment analysis and entity recognition.  For more details, see this [Hugging Face blog post](https://huggingface.co/blog/bert-101) or the [Wikipedia page](https://en.wikipedia.org/wiki/BERT_&#40;language_model&#41;).

**Steps**
1. Load the data
```python
from datasets import load_dataset

dataset = load_dataset('poem_sentiment')
```
2. Define the pipeline by specifying the task and model
```python
from transformers import pipeline

classifier = pipeline(
                task="text-classification",
                model="nickwong64/bert-base-uncased-poems-sentiment"
)
```
3. Use `classifier` to classify the sentiment of a poem
```python
classifier(text)
```

In [16]:
poem_dataset = load_dataset("poem_sentiment", split='train')



Generating train split:   0%|          | 0/892 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/105 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/104 [00:00<?, ? examples/s]

In [17]:
poem_sample = poem_dataset.select(range(10))

poem_sample.to_pandas()

Unnamed: 0,id,verse_text,label
0,0,with pale blue berries. in these peaceful shad...,1
1,1,"it flows so long as falls the rain,",2
2,2,"and that is why, the lonesome day,",0
3,3,"when i peruse the conquered fame of heroes, an...",3
4,4,of inward strife for truth and liberty.,3
5,5,the red sword sealed their vows!,3
6,6,and very venus of a pipe.,2
7,7,"who the man, who, called a brother.",2
8,8,"and so on. then a worthless gaud or two,",0
9,9,to hide the orb of truth--and every throne,2


We load the pipeline using the task `text-classification` since we want to classify text with a fixed set of labels.

In [18]:
sentiment_classifier = pipeline(
    task="text-classification",
    model="nickwong64/bert-base-uncased-poems-sentiment"
)

In [33]:
results = sentiment_classifier(poem_sample["verse_text"])

results

[{'label': 'positive', 'score': 0.9965937733650208},
 {'label': 'no_impact', 'score': 0.9987409710884094},
 {'label': 'negative', 'score': 0.995965838432312},
 {'label': 'mixed', 'score': 0.9687354564666748},
 {'label': 'mixed', 'score': 0.9759674668312073},
 {'label': 'mixed', 'score': 0.9665797352790833},
 {'label': 'no_impact', 'score': 0.9986388087272644},
 {'label': 'no_impact', 'score': 0.9986108541488647},
 {'label': 'negative', 'score': 0.9965572357177734},
 {'label': 'no_impact', 'score': 0.9985186457633972}]
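Each result above is a single label with a score between 0 and 1. A plausible way to think about it: the model produces one raw score (logit) per class, a softmax turns those into probabilities, and the pipeline reports the most probable class. The sketch below illustrates that with hypothetical logits; the label order is taken from the dataset description, and we are assuming the model uses the same order.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Label ids as listed for the poem_sentiment dataset (assumed to match the model).
id2label = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}

# Hypothetical logits for one verse; a real model would compute these.
logits = [-1.2, 5.3, -0.8, -1.5]
probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

A score close to 1 means the other three classes received almost no probability mass, which matches the very confident predictions shown above.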

In [34]:
# Comparing the predicted labels with the ground truth

import pandas as pd

predictions = pd.DataFrame(results)

predictions

Unnamed: 0,label,score
0,positive,0.996594
1,no_impact,0.998741
2,negative,0.995966
3,mixed,0.968735
4,mixed,0.975967
5,mixed,0.96658
6,no_impact,0.998639
7,no_impact,0.998611
8,negative,0.996557
9,no_impact,0.998519


In [35]:
sample_df = poem_sample.to_pandas()

sample_df

Unnamed: 0,id,verse_text,label
0,0,with pale blue berries. in these peaceful shad...,1
1,1,"it flows so long as falls the rain,",2
2,2,"and that is why, the lonesome day,",0
3,3,"when i peruse the conquered fame of heroes, an...",3
4,4,of inward strife for truth and liberty.,3
5,5,the red sword sealed their vows!,3
6,6,and very venus of a pipe.,2
7,7,"who the man, who, called a brother.",2
8,8,"and so on. then a worthless gaud or two,",0
9,9,to hide the orb of truth--and every throne,2


In [36]:
sample_df['label'].value_counts()

label
2    4
3    3
0    2
1    1
Name: count, dtype: int64

In [38]:
sentiment_labels = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}

sample_df['label'] = sample_df['label'].map(sentiment_labels)

sample_df

Unnamed: 0,id,verse_text,label
0,0,with pale blue berries. in these peaceful shad...,positive
1,1,"it flows so long as falls the rain,",no_impact
2,2,"and that is why, the lonesome day,",negative
3,3,"when i peruse the conquered fame of heroes, an...",mixed
4,4,of inward strife for truth and liberty.,mixed
5,5,the red sword sealed their vows!,mixed
6,6,and very venus of a pipe.,no_impact
7,7,"who the man, who, called a brother.",no_impact
8,8,"and so on. then a worthless gaud or two,",negative
9,9,to hide the orb of truth--and every throne,no_impact


In [39]:
sample_df['predictions'] = predictions['label']

sample_df

Unnamed: 0,id,verse_text,label,predictions
0,0,with pale blue berries. in these peaceful shad...,positive,positive
1,1,"it flows so long as falls the rain,",no_impact,no_impact
2,2,"and that is why, the lonesome day,",negative,negative
3,3,"when i peruse the conquered fame of heroes, an...",mixed,mixed
4,4,of inward strife for truth and liberty.,mixed,mixed
5,5,the red sword sealed their vows!,mixed,mixed
6,6,and very venus of a pipe.,no_impact,no_impact
7,7,"who the man, who, called a brother.",no_impact,no_impact
8,8,"and so on. then a worthless gaud or two,",negative,negative
9,9,to hide the orb of truth--and every throne,no_impact,no_impact
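Instead of comparing the two columns by eye, we can compute the accuracy directly. The sketch below copies the ten label ids and predictions from the tables above into plain Python lists, so it runs without loading the model.

```python
# Label ids and predictions copied from the 10-row sample above.
sentiment_labels = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}
true_ids = [1, 2, 0, 3, 3, 3, 2, 2, 0, 2]
predicted = [
    "positive", "no_impact", "negative", "mixed", "mixed",
    "mixed", "no_impact", "no_impact", "negative", "no_impact",
]

# Map ids to names, then count exact matches.
true_labels = [sentiment_labels[i] for i in true_ids]
num_correct = sum(t == p for t, p in zip(true_labels, predicted))
accuracy = num_correct / len(true_labels)
print(f"accuracy on the sample: {accuracy:.2f}")
```

Keep in mind that these rows come from the `train` split, which the fine-tuned model may have seen during training, so a perfect score here says little about performance on unseen verses. Evaluating on the `test` split would give a fairer picture.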


## **Task 3 - Translation**

Translation models may be designed for specific pairs of languages, or they may support more than two languages.  We will see both below.

**Background reading**: See the Hugging Face [task page on translation](https://huggingface.co/tasks/translation) or the [Wikipedia page on machine translation](https://en.wikipedia.org/wiki/Machine_translation).  

In this section, we will use:  
- **Data**: We will use some example hard-coded sentences.  However, there are a variety of [translation datasets](https://huggingface.co/datasets?task_categories=task_categories:translation&sort=downloads) available from Hugging Face.
- **Models**:
    * [Helsinki-NLP/opus-mt-en-es](https://huggingface.co/Helsinki-NLP/opus-mt-en-es) is used for the first example of English ("en") to Spanish ("es") translation.  This model is based on [Marian NMT](https://marian-nmt.github.io/), a neural machine translation framework developed by Microsoft and other researchers.  See the [GitHub page](https://github.com/Helsinki-NLP/Opus-MT) for code and links to related resources.
    * [t5-small](https://huggingface.co/t5-small) model, which has 60 million parameters (242MB for PyTorch).  T5 is an encoder-decoder model created by Google which supports several tasks such as summarization, translation, Q&A, and text classification.  For more details, see the [Google blog post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html), [code on GitHub](https://github.com/google-research/text-to-text-transfer-transformer), or the [research paper](https://arxiv.org/pdf/1910.10683.pdf).  For our purposes, it supports translation for English, French, Romanian, and German.

**Steps**
1. Define the pipeline by specifying the task and model
```python
from transformers import pipeline

translator = pipeline(
            task="translation",
            model="Helsinki-NLP/opus-mt-en-es"
)
```
2. Use `translator` to translate the input English text into Spanish
```python
translator(text)
```

In [41]:
# sacremoses is for the translation model `Helsinki-NLP/opus-mt-en-es`

! pip install sacremoses

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
     -------------------------------------- 897.5/897.5 KB 3.8 MB/s eta 0:00:00
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1




In [3]:
! pip install transformers[sentencepiece]

# Make sure to restart the kernel after the installation

Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.2.0-cp39-cp39-win_amd64.whl (991 kB)
     -------------------------------------- 991.5/991.5 KB 4.5 MB/s eta 0:00:00
Collecting protobuf
  Using cached protobuf-4.25.3-cp39-cp39-win_amd64.whl (413 kB)
Installing collected packages: sentencepiece, protobuf
Successfully installed protobuf-4.25.3 sentencepiece-0.2.0




In [19]:
en_to_es_translation_pipeline = pipeline(
    task="translation",
    model="Helsinki-NLP/opus-mt-en-es"
)

In [20]:
en_to_es_translation_pipeline(
    "Hi, Hope you are doing good."
)

[{'translation_text': 'Hola, espero que lo estés haciendo bien.'}]

Other models are designed to handle multiple languages.  Below, we show this with `t5-small`.  Note that, since it supports multiple languages (and tasks), we give it an explicit instruction to translate from one language to another.
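Since the instruction is just a text prefix, it can be built with ordinary string formatting. The helper below is a hypothetical convenience function (not part of `transformers`) that produces prompts in the `translate X to Y: ...` form used in the following cells.

```python
def build_t5_translation_prompt(text: str, source_lang: str, target_lang: str) -> str:
    """Prepend the task prefix that tells T5 which translation to perform."""
    return f"translate {source_lang} to {target_lang}: {text}"


sentence = "Hi, Hope you are doing good."
prompts = [
    build_t5_translation_prompt(sentence, "English", target)
    for target in ("French", "Romanian", "German")
]
for prompt in prompts:
    print(prompt)
```

Each resulting string can be passed to the `text2text-generation` pipeline as-is. Language pairs outside the ones T5 was trained on are unlikely to work, whatever the prefix says.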

In [21]:
t5_small_pipeline = pipeline(
    task="text2text-generation",
    model="t5-small",
    max_length=50
)

In [22]:
t5_small_pipeline(
    "translate English to French: Hi, Hope you are doing good."
)

# This prompt combines a task prefix ("translate English to French:") with the input sentence.
# T5 was trained on a mix of tasks, each marked by a text prefix like this one,
# so the prefix tells the model which task to perform.

[{'generated_text': 'Bonjour, espérons que vous allez bien.'}]

In [23]:
t5_small_pipeline(
    "translate English to Romanian: Hi, Hope you are doing good."
)

[{'generated_text': 'Îmi exprim speranţa că veţi face bine.'}]

In [24]:
en_to_hi_translation_pipeline = pipeline(
    task="translation",
    model="Helsinki-NLP/opus-mt-en-hi"
)

In [26]:
en_to_hi_translation_pipeline(
    "Hello, Hope you are doing good."
)

[{'translation_text': 'हैलो, आशा है कि आप अच्छा कर रहे हैं।'}]

## **Task 4 - Zero-shot classification**
Zero-shot classification (or zero-shot learning) is the task of classifying a piece of text into one of a few given categories or labels, without having explicitly trained the model to predict those categories beforehand.  The idea appeared in literature before modern LLMs, but recent advances in LLMs have made zero-shot learning much more flexible and powerful.

**Background reading**: See the Hugging Face [task page on zero-shot classification](https://huggingface.co/tasks/zero-shot-classification) or [Wikipedia on zero-shot learning](https://en.wikipedia.org/wiki/Zero-shot_learning).

In this section, we will use:  
* **Data**: a few example articles from the [xsum](https://huggingface.co/datasets/xsum) dataset used in the Summarization section above.  Our goal is to label news articles under a few categories.
* **Model**: [nli-deberta-v3-small](https://huggingface.co/cross-encoder/nli-deberta-v3-small), a fine-tuned version of the DeBERTa model.  The DeBERTa base model was developed by Microsoft and is one of several models derived from BERT; for more details on DeBERTa, see the [Hugging Face doc page](https://huggingface.co/docs/transformers/model_doc/deberta), the [code on GitHub](https://github.com/microsoft/DeBERTa), or the [research paper](https://arxiv.org/abs/2006.03654).

**Steps**
1. Load the data or create an input datapoint
```python
input_text = """sample input text"""
```
2. Define the pipeline by specifying the task and model
```python
from transformers import pipeline

zero_shot_pipeline = pipeline(
                        task="zero-shot-classification",
                        model="cross-encoder/nli-deberta-v3-small"
)
```
3. Pass the input text and a list of candidate labels to `zero_shot_pipeline` to score each label.
```python
zero_shot_pipeline(input_text, candidate_labels=["politics", "finance", 
                                              "sports", "science and technology", 
                                              "pop culture", "breaking news"])
```

In [28]:
article = """
Simone Favaro got the crucial try with the last move of the game, following earlier touchdowns by Chris Fusaro, Zander Fagerson and Junior Bulumakau.
Rynard Landman and Ashton Hewitt got a try in either half for the Dragons.
Glasgow showed far superior strength in depth as they took control of a messy match in the second period.
Home coach Gregor Townsend gave a debut to powerhouse Fijian-born Wallaby wing Taqele Naiyaravoro, and centre Alex Dunbar returned from long-term injury, while the Dragons gave first starts of the season to wing Aled Brew and hooker Elliot Dee.
Glasgow lost hooker Pat McArthur to an early shoulder injury but took advantage of their first pressure when Rory Clegg slotted over a penalty on 12 minutes.
It took 24 minutes for a disjointed game to produce a try as Sarel Pretorius sniped from close range and Landman forced his way over for Jason Tovey to convert - although it was the lock's last contribution as he departed with a chest injury shortly afterwards.
Glasgow struck back when Fusaro drove over from a rolling maul on 35 minutes for Clegg to convert.
But the Dragons levelled at 10-10 before half-time when Naiyaravoro was yellow-carded for an aerial tackle on Brew and Tovey slotted the easy goal.
The visitors could not make the most of their one-man advantage after the break as their error count cost them dearly.
It was Glasgow's bench experience that showed when Mike Blair's break led to a short-range score from teenage prop Fagerson, converted by Clegg.
Debutant Favaro was the second home player to be sin-binned, on 63 minutes, but again the Warriors made light of it as replacement wing Bulumakau, a recruit from the Army, pounced to deftly hack through a bouncing ball for an opportunist try.
The Dragons got back within striking range with some excellent combined handling putting Hewitt over unopposed after 72 minutes.
However, Favaro became sinner-turned-saint as he got on the end of another effective rolling maul to earn his side the extra point with the last move of the game, Clegg converting.
Dragons director of rugby Lyn Jones said: "We're disappointed to have lost but our performance was a lot better [than against Leinster] and the game could have gone either way.
"Unfortunately too many errors behind the scrum cost us a great deal, though from where we were a fortnight ago in Dublin our workrate and desire was excellent.
"It was simply error count from individuals behind the scrum that cost us field position, it's not rocket science - they were correct in how they played and we had a few errors, that was the difference."
Glasgow Warriors: Rory Hughes, Taqele Naiyaravoro, Alex Dunbar, Fraser Lyle, Lee Jones, Rory Clegg, Grayson Hart; Alex Allan, Pat MacArthur, Zander Fagerson, Rob Harley (capt), Scott Cummings, Hugh Blake, Chris Fusaro, Adam Ashe.
Replacements: Fergus Scott, Jerry Yanuyanutawa, Mike Cusack, Greg Peterson, Simone Favaro, Mike Blair, Gregor Hunter, Junior Bulumakau.
Dragons: Carl Meyer, Ashton Hewitt, Ross Wardle, Adam Warren, Aled Brew, Jason Tovey, Sarel Pretorius; Boris Stankovich, Elliot Dee, Brok Harris, Nick Crosswell, Rynard Landman (capt), Lewis Evans, Nic Cudd, Ed Jackson.
Replacements: Rhys Buckley, Phil Price, Shaun Knight, Matthew Screech, Ollie Griffiths, Luc Jones, Charlie Davies, Nick Scott.
"""

In [29]:
zero_shot_pipeline = pipeline(
    task="zero-shot-classification",
    model="cross-encoder/nli-deberta-v3-small"
)

In [13]:
zero_shot_pipeline(article, candidate_labels=["politics", "finance", 
                                              "sports", "science and technology", 
                                              "pop culture", "breaking news"])

{'sequence': '\nSimone Favaro got the crucial try with the last move of the game, following earlier touchdowns by Chris Fusaro, Zander Fagerson and Junior Bulumakau.\nRynard Landman and Ashton Hewitt got a try in either half for the Dragons.\nGlasgow showed far superior strength in depth as they took control of a messy match in the second period.\nHome coach Gregor Townsend gave a debut to powerhouse Fijian-born Wallaby wing Taqele Naiyaravoro, and centre Alex Dunbar returned from long-term injury, while the Dragons gave first starts of the season to wing Aled Brew and hooker Elliot Dee.\nGlasgow lost hooker Pat McArthur to an early shoulder injury but took advantage of their first pressure when Rory Clegg slotted over a penalty on 12 minutes.\nIt took 24 minutes for a disjointed game to produce a try as Sarel Pretorius sniped from close range and Landman forced his way over for Jason Tovey to convert - although it was the lock\'s last contribution as he departed with a chest injury sh

In [14]:
article = """
The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.
Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.
Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct.
Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.
First Minister Nicola Sturgeon visited the area to inspect the damage.
The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.
Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.
However, she said more preventative work could have been carried out to ensure the retaining wall did not fail.
"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we're neglected or forgotten," she said.
"That may not be true but it is perhaps my perspective over the last few days.
"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"
Meanwhile, a flood alert remains in place across the Borders because of the constant rain.
Peebles was badly hit by problems, sparking calls to introduce more defences in the area.
Scottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.
The Labour Party's deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.
He said it was important to get the flood protection plan right but backed calls to speed up the process.
"I was quite taken aback by the amount of damage that has been done," he said.
"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."
He said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.
Have you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on selkirk.news@bbc.co.uk or dumfries@bbc.co.uk.
"""

zero_shot_pipeline(article, candidate_labels=["politics", "finance", 
                                              "sports", "science and technology", 
                                              "pop culture", "breaking news"])

 'labels': ['breaking news',
  'politics',
  'pop culture',
  'science and technology',
  'sports',
  'finance'],
 'scores': [0.20821063220500946,
  0.1737898588180542,
  0.17375348508358002,
  0.15718074142932892,
  0.15456238389015198,
  0.13250291347503662]}
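Note that the six scores above sum to 1 and that the top score is only about 0.21, so the model is far from certain about this article. The sketch below shows one way such scores can arise from a natural language inference (NLI) model: each candidate label is turned into a hypothesis sentence, the model rates how strongly the article entails each hypothesis, and a softmax over those ratings gives the final scores. The template string mirrors what we understand to be the pipeline's default, and the logits are hypothetical.

```python
import math


def zero_shot_scores(entailment_logits: dict) -> dict:
    """Softmax the per-label entailment logits so the scores sum to 1."""
    highest = max(entailment_logits.values())
    exps = {label: math.exp(v - highest) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


candidate_labels = ["politics", "sports", "finance"]

# Each label becomes a hypothesis that the NLI model checks against the article.
hypothesis_template = "This example is {}."  # assumed default template
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]
print(hypotheses)

# Hypothetical entailment logits for the (article, hypothesis) pairs.
entailment_logits = {"politics": 0.4, "sports": 2.1, "finance": -0.3}
scores = zero_shot_scores(entailment_logits)
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Because the scores are normalized across the candidate labels, adding or removing a label changes every score. Labels that overlap in meaning, such as "breaking news" and "politics" for the flood article, tend to split the probability mass.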

## **Task 5 - Few-shot learning**
In few-shot learning tasks, you give the model an instruction, a few query-response examples of how to follow that instruction, and then a new query.  The model must generate the response for that new query.  This technique has pros and cons: it is very powerful and allows models to be reused for many more applications, but it can be finicky and require significant prompt engineering to get good and reliable results.

**Background reading**: See the [Wikipedia page on few-shot learning](https://en.wikipedia.org/wiki/Few-shot_learning_&#40;natural_language_processing&#41;) or [this Hugging Face blog about few-shot learning](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api).

In this section, we will use:
* **Task**: Few-shot learning can be applied to many tasks.  Here, we will do sentiment analysis, which was covered earlier.  However, you will see how few-shot learning allows us to specify custom labels, whereas the previous model was tuned for a specific set of labels.  We will also show other (toy) tasks at the end.  In terms of the Hugging Face `task` specified in the `pipeline` constructor, few-shot learning is handled as a `text-generation` task.
* **Data**: We use a few examples, including a tweet example from the blog post linked above.
* **Model**: [gpt-neo-1.3B](https://huggingface.co/EleutherAI/gpt-neo-1.3B), a version of the GPT-Neo model discussed in the blog linked above.  It is a transformer model with 1.3 billion parameters developed by Eleuther AI.  For more details, see the [code on GitHub](https://github.com/EleutherAI/gpt-neo) or the [research paper](https://arxiv.org/abs/2204.06745).

**Steps**
1. Load the data or create an input datapoint
```python
customer_review = """sample input text"""
```
2. Define the pipeline by specifying the task and model
```python
from transformers import pipeline

few_shot_pipeline = pipeline(
                    task="text-generation",
                    model="EleutherAI/gpt-neo-1.3B",
                    max_new_tokens=10
)
```
3. Pass a few-shot prompt (a few examples followed by the new input) to `few_shot_pipeline` and generate a prediction for the input text.
```python
# Token ID for "###", which the pipeline will treat as the end-of-sequence (EOS) token.
eos_token_id = few_shot_pipeline.tokenizer.encode("###")[0]

# The prompt is left-aligned so that code indentation does not leak into the model input.
results = few_shot_pipeline(f"""For each tweet, describe its sentiment:

[Tweet]: "I hate it when my phone battery dies."
[Sentiment]: Negative
###
[Tweet]: "My day has been 👍"
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "{customer_review}"
[Sentiment]:""",
    eos_token_id=eos_token_id,
)

print(results[0]["generated_text"])
```

In [17]:
# few_shot_pipeline = pipeline(
#     task="text-generation",
#     model="EleutherAI/gpt-neo-1.3B",
#     max_new_tokens=10
# )

# # Note: gpt-neo-1.3B is a large language model.
# # Downloading it takes 5+ GB of storage.

**Tip**: In the few-shot prompts below, we separate the examples with a special token "###" and use the same token to encourage the LLM to end its output after answering the query.  We will tell the pipeline to use that special token as the end-of-sequence (EOS) token below.
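
Because the `text-generation` pipeline returns the prompt followed by the completion by default, it can help to post-process the result so that only the answer to the final query remains. The following is a minimal sketch under that assumption; `extract_answer` is a hypothetical helper, and `fake_output` is a made-up string standing in for a model response, not real model output.

```python
def extract_answer(generated_text, prompt, separator="###"):
    """Return only the answer to the final query (hypothetical helper)."""
    # Drop the echoed prompt if the generated text starts with it.
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    # Keep only what precedes the first separator, in case the output
    # includes the separator or continues past it.
    answer = completion.split(separator, 1)[0]
    return answer.strip()


prompt = '[Tweet]: "This new music video was incredible"\n[Sentiment]:'
# Made-up example of what a completion might look like.
fake_output = prompt + ' Positive\n###\n[Tweet]: "Another tweet"'
print(extract_answer(fake_output, prompt))
```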

In [20]:
# # Get the token ID for "###", which we will use as the EOS token below.

# eos_token_id = few_shot_pipeline.tokenizer.encode("###")[0]

# print(eos_token_id)

In [21]:
# # Without any examples, the model output is inconsistent and usually incorrect.

# results = few_shot_pipeline("""
# For each tweet, describe its sentiment:

# [Tweet]: "This new music video was incredible"
# [Sentiment]:
# """,
#     eos_token_id=eos_token_id,
# )

# print(results[0]["generated_text"])

In [22]:
# # With only 1 example, the model may or may not get the answer right.

# results = few_shot_pipeline("""
# For each tweet, describe its sentiment:

# [Tweet]: "This is the link to the article"
# [Sentiment]: Neutral
# ###
# [Tweet]: "This new music video was incredible"
# [Sentiment]:
# """,
#     eos_token_id=eos_token_id,
# )

# print(results[0]["generated_text"])

In [23]:
# # With 1 example for each sentiment, the model is more likely to understand!

# results = few_shot_pipeline("""
# For each tweet, describe its sentiment:

# [Tweet]: "I hate it when my phone battery dies."
# [Sentiment]: Negative
# ###
# [Tweet]: "My day has been 👍"
# [Sentiment]: Positive
# ###
# [Tweet]: "This is the link to the article"
# [Sentiment]: Neutral
# ###
# [Tweet]: "This new music video was incredible"
# [Sentiment]:""",
#     eos_token_id=eos_token_id,
# )

# print(results[0]["generated_text"])

In [24]:
# # This example sometimes works and sometimes does not, when sampling.  Too abstract?

# results = few_shot_pipeline("""
# Given a word describing how someone is feeling, suggest a description of that person.  The description should not include the original word.

# [word]: happy
# [description]: smiling, laughing, clapping
# ###
# [word]: nervous
# [description]: glancing around quickly, sweating, fidgeting
# ###
# [word]: sleepy
# [description]: heavy-lidded, slumping, rubbing eyes
# ###
# [word]: confused
# [description]:""",
#     eos_token_id=eos_token_id,
# )

# print(results[0]["generated_text"])

In [25]:
# # We override max_new_tokens to generate longer answers.
# # These book descriptions were taken from their corresponding Wikipedia pages.

# results = few_shot_pipeline("""
# Generate a book summary from the title:

# [book title]: "Stranger in a Strange Land"
# [book description]: "This novel tells the story of Valentine Michael Smith, a human who comes to Earth in early adulthood after being born on the planet Mars and raised by Martians, and explores his interaction with and eventual transformation of Terran culture."
# ###
# [book title]: "The Adventures of Tom Sawyer"
# [book description]: "This novel is about a boy growing up along the Mississippi River. It is set in the 1840s in the town of St. Petersburg, which is based on Hannibal, Missouri, where Twain lived as a boy. In the novel, Tom Sawyer has several adventures, often with his friend Huckleberry Finn."
# ###
# [book title]: "Dune"
# [book description]: "This novel is set in the distant future amidst a feudal interstellar society in which various noble houses control planetary fiefs. It tells the story of young Paul Atreides, whose family accepts the stewardship of the planet Arrakis. While the planet is an inhospitable and sparsely populated desert wasteland, it is the only source of melange, or spice, a drug that extends life and enhances mental abilities.  The story explores the multilayered interactions of politics, religion, ecology, technology, and human emotion, as the factions of the empire confront each other in a struggle for the control of Arrakis and its spice."
# ###
# [book title]: "Blue Mars"
# [book description]:""",
#     eos_token_id=eos_token_id,
#     max_new_tokens=50,
# )

# print(results[0]["generated_text"])


## **Search and sampling in inference**

You may see parameters like `num_beams`, `do_sample`, etc. specified in Hugging Face pipelines.  These are inference configurations.

LLMs work by predicting (generating) the next token, then the next, and so on.  The goal is to generate a high-probability sequence of tokens, which is essentially a search through the (enormous) space of potential sequences.

To do this search, LLMs use one of two main methods:
* **Search**: Given the tokens generated so far, deterministically pick the most likely continuation.
   * **Greedy search** (default): At each step, pick the single most likely next token.
   * **Beam search**: Extends greedy search by tracking several candidate sequences in parallel and returning the most likely one.  The number of candidates is set via the parameter `num_beams`.
* **Sampling**: Given the tokens generated so far, pick the next token by sampling from the predicted distribution of tokens.
   * **Top-k sampling**: The parameter `top_k` modifies sampling by limiting it to the `k` most likely tokens.
   * **Top-p sampling**: The parameter `top_p` modifies sampling by limiting it to the smallest set of most likely tokens whose cumulative probability reaches `p`.

You can toggle between search and sampling via parameter `do_sample`.
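
To make the difference between greedy selection, top-k, and top-p concrete, here is a toy sketch in plain Python. It does not call any Hugging Face code: the distribution `next_token_probs` and all helper names are invented for illustration, and the filters are a simplified version of the idea rather than the library's implementation.

```python
import random

# Hypothetical next-token distribution, made up for illustration.
next_token_probs = {"the": 0.5, "a": 0.2, "flood": 0.15, "river": 0.1, "cat": 0.05}


def greedy(probs):
    """Greedy search: always take the single most likely token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}


def sample(probs, rng):
    """Draw one token at random according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print("greedy:", greedy(next_token_probs))
print("top-k (k=2):", top_k_filter(next_token_probs, k=2))
print("top-p (p=0.8):", top_p_filter(next_token_probs, p=0.8))
print("samples:", [sample(top_k_filter(next_token_probs, k=2), rng) for _ in range(5)])
```

With this distribution, `k=2` keeps only "the" and "a", while `p=0.8` keeps "the", "a", and "flood" because the first two tokens cover only 0.7 of the probability mass.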

For more background on search and sampling, see [this Hugging Face blog post](https://huggingface.co/blog/how-to-generate).
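
Beam search can find a more likely sequence than greedy search because it does not commit to the best-looking first token. The toy sketch below shows this on a tiny hand-made "model" whose next-token probabilities depend only on the previous token. The `TRANSITIONS` table and the `beam_search` function are assumptions made for this example, not the algorithm as implemented in `transformers`.

```python
import math

# Hypothetical toy "model": probabilities of the next token given the previous one.
# All numbers are made up for illustration.
TRANSITIONS = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"x": 0.9, "y": 0.1},
}


def beam_search(num_beams, steps=2):
    """Return the best sequence found and its probability."""
    beams = [(["<s>"], 0.0)]  # each beam is (tokens, log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in TRANSITIONS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    tokens, score = beams[0]
    return tokens[1:], math.exp(score)


for n in (1, 2):
    sequence, prob = beam_search(num_beams=n)
    print(f"num_beams={n}: {sequence} with probability {prob:.2f}")
```

With `num_beams=1` (equivalent to greedy search), the search commits to "A" and ends with a sequence of probability 0.24. With `num_beams=2`, it also keeps "B" alive and finds the sequence "B", "x" with probability 0.36. The cost is that more candidates must be scored at every step, which is why beam search runs slower.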

We will illustrate these various options below using our summarization pipeline.

Recall the `xsum` dataset from the earlier **Summarization** section.

In [30]:
print("Sample Article:\n", xsum_sample["document"][0])

Sample Article:
 The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.
Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.
Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct.
Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.
First Minister Nicola Sturgeon visited the area to inspect the damage.
The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.
Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.
However, she said more preventative work could have been carried out to ensure the retaining wall did not fail.
"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate th

In [31]:
# We previously called the summarization pipeline using the default inference configuration.
# This does greedy search.

%time summarizer(xsum_sample["document"][0])

CPU times: total: 8.86 s
Wall time: 2.37 s


[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]

In [32]:
# We can instead do a beam search by specifying num_beams.
# This takes longer to run, but it might find a better (more likely) sequence of text.

%time summarizer(xsum_sample["document"][0], num_beams=100)

CPU times: total: 1min 33s
Wall time: 24.6 s


[{'summary_text': 'many businesses and householders were affected by flooding in Newton Stewart . the water breached a retaining wall, flooding many commercial properties . a flood alert remains in place across'}]

In [33]:
# Alternatively, we can use sampling, which makes the output random (stochastic).

%time summarizer(xsum_sample["document"][0], do_sample=True)

CPU times: total: 10.1 s
Wall time: 2.66 s


[{'summary_text': 'many businesses and householders were affected by flooding in Newton Stewart . the water breached a retaining wall, flooding many commercial properties . a flood alert remains in place across'}]

In [34]:
# We can make sampling more greedy by restricting it to the top_k most likely
# next tokens and/or to the most likely tokens up to probability mass top_p.

%time summarizer(xsum_sample["document"][0], do_sample=True, top_k=10, top_p=0.8)

CPU times: total: 9.66 s
Wall time: 2.51 s


[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]