# Large language models (LLMs) and Hugging Face

A large language model (LLM) is a deep learning model that can recognize, translate, predict, and generate text or other content. LLMs can write software code, classify text, answer questions, summarize documents, and translate between languages, and they power chatbots and AI assistants. Models of this kind have also been applied to protein structures.

Healthcare and science: Large language models can understand proteins, molecules, DNA, and RNA. This ability allows LLMs to assist in developing vaccines, finding cures for illnesses, and improving preventative care medicines. LLMs are also used as medical chatbots to perform patient intakes or basic diagnoses.

Large language models use **transformer models** and are trained on massive datasets. They are first pre-trained and can then be fine-tuned for specific tasks. For more information about transformers, see my [Transformers notebook](https://github.com/burcuozek/Transformersrepo/blob/main/TransformersHuggingFace.ipynb).

The original transformer architecture consists of an **encoder and a decoder with self-attention mechanisms**. Self-attention enables the model to weigh different parts of the sequence, or the entire context of a sentence, when generating predictions.

The main components are:
- Embedding layer (captures the semantic and syntactic meaning of the input, enabling the model to understand context).
- Feedforward layer (transforms the embeddings into higher-level abstractions, which helps the model discern the user's intent from the text input).
- Attention mechanism (lets the model focus on the parts of the input sequence that are most relevant to the task).
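
To make the self-attention idea concrete, here is a minimal sketch in plain Python. It assumes scaled dot-product attention and, to stay short, uses the token vectors themselves as queries, keys, and values; real transformers first apply learned projection matrices. The function names and the three toy embeddings are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors.

    Assumption: learned query/key/value projections are omitted.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        output = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence; the weights in a row sum to 1.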


In this project, we will see the following applications of LLMs:
1. Summarization
2. Sentiment analysis
3. Zero-shot classification
4. Few-shot learning

In [2]:
# !pip install sacremoses==0.0.53
# !pip install -U accelerate --quiet

In [3]:
from datasets import load_dataset
from transformers import pipeline

# 1- Summarization
We will use the [xsum](https://huggingface.co/datasets/EdinburghNLP/xsum) dataset, which provides a set of BBC articles and summaries.

As a model, we will use [t5-small](https://huggingface.co/t5-small), an encoder-decoder model created by Google.

In [4]:
!mkdir cache

In [5]:
xsum_dataset = load_dataset(
    "xsum", version="1.2.0", cache_dir="../working/cache/"
)  # Note: We specify cache_dir to use predownloaded data.
xsum_dataset  # The printed representation of this object shows the `num_rows` of each dataset split.

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [6]:
xsum_sample = xsum_dataset["train"].select(range(10))
display(xsum_sample.to_pandas())

|   | document | summary | id |
|---|---|---|---|
| 0 | The full cost of damage in Newton Stewart, one... | Clean-up operations are continuing across the ... | 35232142 |
| 1 | A fire alarm went off at the Holiday Inn in Ho... | Two tourist buses have been destroyed by fire ... | 40143035 |
| 2 | Ferrari appeared in a position to challenge un... | Lewis Hamilton stormed to pole position at the... | 35951548 |
| 3 | John Edward Bates, formerly of Spalding, Linco... | A former Lincolnshire Police officer carried o... | 36266422 |
| 4 | Patients and staff were evacuated from Cerahpa... | An armed man who locked himself into a room at... | 38826984 |
| 5 | Simone Favaro got the crucial try with the las... | Defending Pro12 champions Glasgow Warriors bag... | 34540833 |
| 6 | Veronica Vanessa Chango-Alverez, 31, was kille... | A man with links to a car that was involved in... | 20836172 |
| 7 | Belgian cyclist Demoitie died after a collisio... | Welsh cyclist Luke Rowe says changes to the sp... | 35932467 |
| 8 | Gundogan, 26, told BBC Sport he "can see the f... | Manchester City midfielder Ilkay Gundogan says... | 40758845 |
| 9 | The crash happened about 07:20 GMT at the junc... | A jogger has been hit by an unmarked police ca... | 30358490 |


## How to use the Hugging Face pipeline tool
We will use the Hugging Face pipeline tool to load a pre-trained model. Here we pass the following arguments:

- task: specifies the primary task.
- model: names the pre-trained model to load from the Hugging Face Hub.
- min_length, max_length: set the minimum and maximum length (in tokens) of the generated summaries.
- truncation: truncates input sequences that are longer than the model can accept.

In [7]:
summarizer = pipeline(
    task="summarization",
    model="t5-small",
    min_length=20,
    max_length=40,
    truncation=True,
    model_kwargs={"cache_dir": "../working/cache/"},
)  # Note: We specify cache_dir to use predownloaded models.

In [8]:
# Apply to 1 article
summarizer(xsum_sample["document"][0])

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]

In [9]:
# Apply to a batch of articles
results = summarizer(xsum_sample["document"])

In [10]:
# Display the generated summary side-by-side with the reference summary and original document.
# We use Pandas to join the inputs and outputs together in a nice format.
import pandas as pd

display(
    pd.DataFrame.from_dict(results)
    .rename({"summary_text": "generated_summary"}, axis=1)
    .join(pd.DataFrame.from_dict(xsum_sample))[
        ["generated_summary", "summary", "document"]
    ]
)

|   | generated_summary | summary | document |
|---|---|---|---|
| 0 | the full cost of damage in Newton Stewart is s... | Clean-up operations are continuing across the ... | The full cost of damage in Newton Stewart, one... |
| 1 | a fire alarm went off at the Holiday Inn in Ho... | Two tourist buses have been destroyed by fire ... | A fire alarm went off at the Holiday Inn in Ho... |
| 2 | Sebastian Vettel will start third ahead of tea... | Lewis Hamilton stormed to pole position at the... | Ferrari appeared in a position to challenge un... |
| 3 | the 67-year-old is accused of committing the o... | A former Lincolnshire Police officer carried o... | John Edward Bates, formerly of Spalding, Linco... |
| 4 | a man receiving psychiatric treatment at the c... | An armed man who locked himself into a room at... | Patients and staff were evacuated from Cerahpa... |
| 5 | Gregor Townsend gave a debut to powerhouse win... | Defending Pro12 champions Glasgow Warriors bag... | Simone Favaro got the crucial try with the las... |
| 6 | Veronica Vanessa Chango-Alverez, 31, was kille... | A man with links to a car that was involved in... | Veronica Vanessa Chango-Alverez, 31, was kille... |
| 7 | the 25-year-old was hit by a motorbike during ... | Welsh cyclist Luke Rowe says changes to the sp... | Belgian cyclist Demoitie died after a collisio... |
| 8 | gundogan will not be fit for the start of the ... | Manchester City midfielder Ilkay Gundogan says... | Gundogan, 26, told BBC Sport he "can see the f... |
| 9 | the crash happened about 07:20 GMT at the junc... | A jogger has been hit by an unmarked police ca... | The crash happened about 07:20 GMT at the junc... |


# 2- Sentiment analysis

We will use the [poem sentiment](https://huggingface.co/datasets/poem_sentiment) dataset, which provides lines from poems tagged with the sentiments negative (0), positive (1), no_impact (2), or mixed (3).

We will use a [fine-tuned version of BERT](https://huggingface.co/nickwong64/bert-base-uncased-poems-sentiment). BERT is an encoder-only model from Google that can be used for 11+ tasks such as sentiment analysis and entity recognition.

In [11]:
poem_dataset = load_dataset(
    "poem_sentiment", version="1.0.0", cache_dir="../working/cache/"
)
poem_sample = poem_dataset["train"].select(range(10))
display(poem_sample.to_pandas())

|   | id | verse_text | label |
|---|---|---|---|
| 0 | 0 | with pale blue berries. in these peaceful shad... | 1 |
| 1 | 1 | it flows so long as falls the rain, | 2 |
| 2 | 2 | and that is why, the lonesome day, | 0 |
| 3 | 3 | when i peruse the conquered fame of heroes, an... | 3 |
| 4 | 4 | of inward strife for truth and liberty. | 3 |
| 5 | 5 | the red sword sealed their vows! | 3 |
| 6 | 6 | and very venus of a pipe. | 2 |
| 7 | 7 | who the man, who, called a brother. | 2 |
| 8 | 8 | and so on. then a worthless gaud or two, | 0 |
| 9 | 9 | to hide the orb of truth--and every throne | 2 |


In [12]:
sentiment_classifier = pipeline(
    task="text-classification",
    model="nickwong64/bert-base-uncased-poems-sentiment",
    model_kwargs={"cache_dir": "../working/cache/"},
)

In [13]:
results = sentiment_classifier(poem_sample["verse_text"])
# Display the predicted sentiment side-by-side with the ground-truth label and original text.
# The score indicates the model's confidence in its prediction.

# Join predictions with ground-truth data
joined_data = (
    pd.DataFrame.from_dict(results)
    .rename({"label": "predicted_label"}, axis=1)
    .join(pd.DataFrame.from_dict(poem_sample).rename({"label": "true_label"}, axis=1))
)



In [14]:
# Change label indices to text labels
sentiment_labels = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}
joined_data = joined_data.replace({"true_label": sentiment_labels})

display(joined_data[["predicted_label", "true_label", "score", "verse_text"]])

|   | predicted_label | true_label | score | verse_text |
|---|---|---|---|---|
| 0 | positive | positive | 0.996594 | with pale blue berries. in these peaceful shad... |
| 1 | no_impact | no_impact | 0.998741 | it flows so long as falls the rain, |
| 2 | negative | negative | 0.995966 | and that is why, the lonesome day, |
| 3 | mixed | mixed | 0.968735 | when i peruse the conquered fame of heroes, an... |
| 4 | mixed | mixed | 0.975967 | of inward strife for truth and liberty. |
| 5 | mixed | mixed | 0.96658 | the red sword sealed their vows! |
| 6 | no_impact | no_impact | 0.998639 | and very venus of a pipe. |
| 7 | no_impact | no_impact | 0.998611 | who the man, who, called a brother. |
| 8 | negative | negative | 0.996557 | and so on. then a worthless gaud or two, |
| 9 | no_impact | no_impact | 0.998519 | to hide the orb of truth--and every throne |
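
The `score` column is the probability the classifier assigns to its predicted label. As a rough sketch of where such a number comes from, assuming the usual softmax over one logit per label, the snippet below converts made-up logits into a label and score. The logit values and the label order are illustrative assumptions, not outputs of the model used here.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["negative", "positive", "no_impact", "mixed"]
logits = [-1.2, 4.5, -0.3, 0.1]  # made-up values for one line of a poem

probabilities = softmax(logits)
best = max(range(len(labels)), key=lambda i: probabilities[i])
prediction = {"label": labels[best], "score": probabilities[best]}
print(prediction)
```

A score close to 1 means the winning logit is much larger than the others; it does not by itself guarantee that the prediction is correct.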


# 3- Zero-shot classification

Zero-shot classification (or zero-shot learning) is the task of classifying a piece of text into one of a few given categories or labels, **without having explicitly trained the model to predict those categories beforehand**. 

We will use the [xsum](https://huggingface.co/datasets/EdinburghNLP/xsum) dataset. We aim to label news articles under a few categories.

We will use the [nli-deberta-v3-small](https://huggingface.co/cross-encoder/nli-deberta-v3-small) model, a version of the DeBERTa model (developed by Microsoft) fine-tuned for natural language inference (NLI).
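
An NLI model can classify text it was never trained to label because each candidate label is turned into a hypothesis sentence, and the model scores how strongly the text entails it. The sketch below illustrates that idea only. `toy_entailment_score` is a hypothetical word-overlap stand-in for a real NLI model, and the template string is assumed to match the pipeline's default.

```python
import math


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model's entailment logit.

    It only counts how many hypothesis words also occur in the premise.
    A real NLI model returns a learned score instead.
    """
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().rstrip(".").split()
    return float(sum(word in premise_words for word in hypothesis_words))


def zero_shot_classify(text, candidate_labels, template="This example is {}."):
    # Build one hypothesis sentence per candidate label.
    hypotheses = [template.format(label) for label in candidate_labels]
    logits = [toy_entailment_score(text, h) for h in hypotheses]
    # Softmax across labels so the scores sum to 1.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


result = zero_shot_classify(
    "the striker scored twice and sports fans cheered",
    ["politics", "finance", "sports"],
)
print(result)
```

The returned dictionary mirrors the `labels` and `scores` keys that the notebook reads from the pipeline output.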

In [17]:
# !pip install protobuf

In [18]:
zero_shot_pipeline = pipeline(
    task="zero-shot-classification",
    model="cross-encoder/nli-deberta-v3-small",
    model_kwargs={"cache_dir": "../working/cache/"},
)



In [19]:
def categorize_article(article: str) -> None:
    """
    This helper function defines the categories (labels) which the model must use to label articles.
    Note that our model was NOT fine-tuned to use these specific labels,
    but it "knows" what the labels mean from its more general training.

    This function then prints out the predicted labels alongside their confidence scores.
    """
    results = zero_shot_pipeline(
        article,
        candidate_labels=[
            "politics",
            "finance",
            "sports",
            "science and technology",
            "pop culture",
            "breaking news",
        ],
    )
    # Print the results nicely
    del results["sequence"]
    display(pd.DataFrame(results))

In [22]:
categorize_article(
    """
The AI Doctor Is In. Here's How ChatGPT May Pave a New Era of Self-Diagnosis
The chatbot is more than fun to use: It may be the new health assistant for those who need it most in 2024 and beyond.
Katie Sarvela was sitting in her bedroom in Nikiksi, Alaska, on top of a moose-and-bear-themed bedspread, when she entered some of her earliest symptoms into ChatGPT. 

The ones she remembers describing to the chatbot include half of her face feeling like it's on fire, then sometimes being numb, her skin feeling wet when it's not wet and night blindness. 

ChatGPT's synopsis? 

"Of course it gave me the 'I'm not a doctor, I can't diagnose you,'" Sarvela said. But then: multiple sclerosis. An autoimmune disease that attacks the central nervous system. 
Now 32, Sarvela started experiencing MS symptoms when she was in her early 20s. She gradually came to suspect it was MS, but she still needed another MRI and lumbar puncture to confirm what she and her doctor suspected. While it wasn't a diagnosis, the way ChatGPT jumped to the right conclusion amazed her and her neurologist, according to Sarvela. 


ChatGPT is an AI-powered chatbot that scrapes the internet for information and then organizes it based on which questions you ask, all served up in a conversational tone. It set off a profusion of generative AI tools throughout 2023, and the version based on the GPT-3.5 large language model is available to everyone for free. The way it can quickly synthesize information and personalize results raises the precedent set by "Dr. Google," the researcher's term describing the act of people looking up their symptoms online before they see a doctor. More often we call it "self-diagnosing." 

For people like Sarvela, who've lived for years with mysterious symptoms before getting a proper diagnosis, having a more personalized search to bounce ideas off of may help save precious time in a health care system where long wait times, medical gaslighting, potential biases in care, and communication gaps between doctor and patient lead to years of frustration. 


"""
)

|   | labels | scores |
|---|---|---|
| 0 | science and technology | 0.435002 |
| 1 | breaking news | 0.158277 |
| 2 | pop culture | 0.142483 |
| 3 | sports | 0.103507 |
| 4 | finance | 0.083671 |
| 5 | politics | 0.07706 |


# 4- Few-shot learning

In few-shot learning tasks, you give the model an instruction, a few query-response examples of how to follow that instruction, and then a new query. 

Our aim is to do sentiment analysis. But few-shot learning can be applied to many tasks. In these examples, we will see how few-shot learning allows us to specify custom labels, whereas the previous model was tuned for a specific set of labels.

We will use a few example tweets as our data.
We will use the [gpt-neo-1.3B](https://huggingface.co/EleutherAI/gpt-neo-1.3B) model.
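
The prompts in this section all follow the same layout: an instruction, labeled examples separated by `###`, and a final query with an empty answer. A small helper can assemble that layout and pull the answer back out of the generated text. Both function names below are hypothetical, and `extract_answer` assumes the generated text starts with the prompt, as it does in the outputs of this notebook.

```python
def build_few_shot_prompt(instruction, examples, query, separator="###"):
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    blocks = [
        f'[Tweet]: "{tweet}"\n[Sentiment]: {sentiment}'
        for tweet, sentiment in examples
    ]
    # The last block leaves the answer empty for the model to complete.
    blocks.append(f'[Tweet]: "{query}"\n[Sentiment]:')
    return instruction + "\n\n" + f"\n{separator}\n".join(blocks)


def extract_answer(generated_text, prompt, separator="###"):
    """Return what the model wrote after the prompt, cut at the first separator."""
    completion = generated_text[len(prompt):]
    return completion.split(separator)[0].strip()


prompt = build_few_shot_prompt(
    "For each tweet, describe its sentiment:",
    [
        ("I hate it when my phone battery dies.", "Negative"),
        ("This is the link to the article", "Neutral"),
    ],
    "This new music video was incredible",
)
print(prompt)
```

Building prompts programmatically makes it easy to vary the number of examples and compare how the model's answers change.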

In [23]:
# We will limit the response length for our few-shot learning tasks.
few_shot_pipeline = pipeline(
    task="text-generation",
    model="EleutherAI/gpt-neo-1.3B",
    max_new_tokens=10,
    model_kwargs={"cache_dir": "../working/cache/"},
)

In [24]:
# Tip: In the few-shot prompts below, we separate the examples with a special token "###" 
# and use the same token to encourage the LLM to end its output after answering the query.
# We will tell the pipeline to use that special token as the end-of-sequence (EOS) token below.

# Get the token ID for "###", which we will use as the EOS token below.
eos_token_id = few_shot_pipeline.tokenizer.encode("###")[0]

In [25]:
# Without any examples, the model output is inconsistent and usually incorrect.
results = few_shot_pipeline(
    """For each tweet, describe its sentiment:

[Tweet]: "This new music video was incredible"
[Sentiment]:""",
    eos_token_id=eos_token_id,
)

print(results[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:21017 for open-end generation.


For each tweet, describe its sentiment:

[Tweet]: "This new music video was incredible"
[Sentiment]: "Liked, interesting"
[Tweet]:


In [26]:
# With only 1 example, the model may or may not get the answer right.
results = few_shot_pipeline(
    """For each tweet, describe its sentiment:

[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:""",
    eos_token_id=eos_token_id,
)

print(results[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:21017 for open-end generation.


For each tweet, describe its sentiment:

[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]: Neutral
###


In [27]:
# With 1 example for each sentiment, the model is more likely to understand!
results = few_shot_pipeline(
    """For each tweet, describe its sentiment:

[Tweet]: "I hate it when my phone battery dies."
[Sentiment]: Negative
###
[Tweet]: "My day has been 👍"
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:""",
    eos_token_id=eos_token_id,
)

print(results[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:21017 for open-end generation.


For each tweet, describe its sentiment:

[Tweet]: "I hate it when my phone battery dies."
[Sentiment]: Negative
###
[Tweet]: "My day has been 👍"
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]: Positive
###


**Prompt engineering** is a new but critical technique for working with LLMs. As you use more general and powerful models, constructing good prompts becomes more important.

# Details about Hugging Face

- Search and sampling to generate text
- Auto* loaders for tokenizers and models
- Model-specific loaders

## 1- Search and sampling in inference

LLMs work by predicting the next token, then the next, and so on. The goal is to generate a high-probability sequence of tokens, which requires searching over the possible sequences.

To do this, LLMs use one of two main methods:

1 - Search: Given the tokens generated so far, pick the most likely next token.
- Greedy search (default): Pick the single most likely next token at each step.
- Beam search: Extends greedy search by keeping several candidate sequences at each step; the number of candidates is set by the parameter num_beams.

2 - Sampling: Given the tokens generated so far, pick the next token by sampling from the predicted distribution of tokens.
- Top-K sampling: The parameter top_k limits sampling to the k most likely tokens.
- Top-p sampling: The parameter top_p limits sampling to the smallest set of most likely tokens whose cumulative probability reaches p.


We choose between search and sampling with the **do_sample** parameter.
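
To illustrate how these strategies differ, here is a small sketch that applies greedy selection, top-k filtering, and top-p filtering to a single made-up next-token distribution. The helper names and probabilities are hypothetical, and the library's own implementation works on logits over the full vocabulary rather than on a small dictionary.

```python
import random


def greedy(probs):
    """Pick the single most likely token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up distribution over the next token.
next_token_probs = {"water": 0.5, "flood": 0.3, "road": 0.15, "cat": 0.05}
rng = random.Random(0)

print(greedy(next_token_probs))
print(top_k_filter(next_token_probs, k=2))
print(top_p_filter(next_token_probs, p=0.75))
print(sample(top_k_filter(next_token_probs, k=2), rng))
```

Both filters remove unlikely tokens before sampling, which is why restricting top_k or top_p makes sampled text behave more like greedy output.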

In [28]:
# This does greedy search.
summarizer(xsum_sample["document"][0])

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]

In [29]:
# We can instead do a beam search by specifying num_beams.
# This takes longer to run, but it might find a better (more likely) sequence of text.
summarizer(xsum_sample["document"][0], num_beams=10)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]
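
Here beam search returned the same summary as greedy search, but that is not always the case. The toy example below, with entirely made-up probabilities, shows a situation where greedy search commits to the most likely first token and misses a more likely overall sequence. It is a simplified sketch: real implementations work with log-probabilities and add details such as length penalties and end-of-sequence handling.

```python
# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("B",): {"x": 0.9, "y": 0.1},
    ("C",): {"x": 1.0},
}


def beam_search(model, num_beams, length):
    """Return the most likely sequence found, together with its probability."""
    beams = [((), 1.0)]  # (sequence so far, probability of that sequence)
    for _ in range(length):
        candidates = []
        for sequence, prob in beams:
            for token, token_prob in model[sequence].items():
                candidates.append((sequence + (token,), prob * token_prob))
        # Keep only the num_beams most likely partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


greedy_result = beam_search(TOY_MODEL, num_beams=1, length=2)  # one beam = greedy search
beam_result = beam_search(TOY_MODEL, num_beams=2, length=2)
print(greedy_result)
print(beam_result)
```

With one beam the search follows "A" and ends with a sequence of probability about 0.2, while two beams keep "B" alive long enough to find a sequence of probability about 0.36.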

In [30]:
# Alternatively, we could use sampling.
summarizer(xsum_sample["document"][0], do_sample=True)

[{'summary_text': 'many businesses and householders were affected by flooding in Newton Stewart . the water breached a retaining wall, flooding many commercial properties . a flood alert remains in place across'}]

In [31]:
# We can modify sampling to be more greedy by limiting sampling to the top_k or top_p most likely next tokens.
summarizer(xsum_sample["document"][0], do_sample=True, top_k=10, top_p=0.8)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining'}]

## 2- Auto* loaders for tokenizers and models

While a **pipeline is a quick** way to set up an LLM for a given task, the slightly lower-level abstractions **model and tokenizer** permit a bit **more control** over options. The workflow is:

- Start with the input articles.
- Tokenize them (converting text to token indices).
- Apply the model to the tokenized data to generate summaries (represented as token indices).
- Decode the summaries into human-readable text.

More information about the model and tokenizer classes can be found in the [Auto classes documentation](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM).
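
Before running the real tokenizer, it can help to see what tokenizing, padding, and decoding do on a tiny scale. The sketch below uses a hypothetical whitespace vocabulary, not T5's actual SentencePiece tokenizer, to show how a batch becomes equal-length `input_ids` plus an `attention_mask`, and how IDs map back to text.

```python
PAD_ID = 0  # padding ID chosen for this toy vocabulary


def build_vocab(texts):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {"<pad>": PAD_ID}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_batch(texts, vocab):
    """Convert texts to equal-length ID lists plus an attention mask."""
    encoded = [[vocab[word] for word in text.split()] for text in texts]
    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in encoded]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return input_ids, attention_mask


def decode(ids, vocab):
    """Map IDs back to words, skipping padding."""
    id_to_word = {index: word for word, index in vocab.items()}
    return " ".join(id_to_word[i] for i in ids if i != PAD_ID)


texts = ["summarize: the road is flooded", "summarize: fire"]
vocab = build_vocab(texts)
input_ids, attention_mask = encode_batch(texts, vocab)
print(input_ids)
print(attention_mask)
print(decode(input_ids[1], vocab))
```

The shorter text is padded with zeros to the length of the longest one, and its attention mask is 0 at exactly those padded positions.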

In [32]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [33]:
# Load the pre-trained tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("t5-small", cache_dir="../working/cache/")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small", cache_dir="../working/cache/")

In [34]:
# For summarization, T5-small expects a prefix "summarize: ", so we prepend that to each article as a prompt.
articles = list(map(lambda article: "summarize: " + article, xsum_sample["document"]))
display(pd.DataFrame(articles, columns=["prompts"]))
# Tokenize the input
inputs = tokenizer(
    articles, max_length=1024, return_tensors="pt", padding=True, truncation=True
)
print("input_ids:")
print(inputs["input_ids"])
print("attention_mask:")
print(inputs["attention_mask"])

|   | prompts |
|---|---|
| 0 | summarize: The full cost of damage in Newton S... |
| 1 | summarize: A fire alarm went off at the Holida... |
| 2 | summarize: Ferrari appeared in a position to c... |
| 3 | summarize: John Edward Bates, formerly of Spal... |
| 4 | summarize: Patients and staff were evacuated f... |
| 5 | summarize: Simone Favaro got the crucial try w... |
| 6 | summarize: Veronica Vanessa Chango-Alverez, 31... |
| 7 | summarize: Belgian cyclist Demoitie died after... |
| 8 | summarize: Gundogan, 26, told BBC Sport he "ca... |
| 9 | summarize: The crash happened about 07:20 GMT ... |


input_ids:
tensor([[21603,    10,    37,  ...,     0,     0,     0],
        [21603,    10,    71,  ...,     0,     0,     0],
        [21603,    10, 21945,  ..., 18002,    21,     1],
        ...,
        [21603,    10, 21768,  ...,     0,     0,     0],
        [21603,    10,  9982,  ...,     0,     0,     0],
        [21603,    10,    37,  ...,     0,     0,     0]])
attention_mask:
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])


In [35]:
# Generate summaries
summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=2,
    min_length=0,
    max_length=40,
)
print(summary_ids)

tensor([[    0,     8,   423,   583,    13,  1783,    16, 20126, 16496,    19,
           341,   271, 14841,     3,     5,   186,  7540,    16,   158,    15,
          2296,     7,  5718,  2367, 14621,  4161,    57,  4125,   387,     3,
             5,     3,     9,  8347,  5685,  3048,    16,   286,   640,     8],
        [    0,  1472,  6196,   877,   326,    44,     8,  9108,    86,    29,
            16,  6000,  1887,    30,  1856,     3,     5,  2554,   130,  1380,
            12,  1175,     8,  1595,     3,     5,    80,    13,     8,   192,
         14264,    19,    45, 13692,    63,     6,     8,   119,    45, 20576],
        [    0,     3,   849,  2239,     7,   163, 14014,     3,    60,  8234,
           232,   227,     3, 19585,   643,   845,   150,  8033,    47,   787,
            30,   213,     3,    88,   225,  2447,     3,     5,     3,   849,
          2239,     7,   497,     3,    31,    29,    32,   964,  8033,    47],
        [    0,     8,     3,  3708,    18,  1201

In [36]:
# Decode the generated summaries
decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
display(pd.DataFrame(decoded_summaries, columns=["decoded_summaries"]))

|   | decoded_summaries |
|---|---|
| 0 | the full cost of damage in Newton Stewart is s... |
| 1 | fire alarm went off at the Holiday Inn in Hope... |
| 2 | stewards only handed reprimand after governing... |
| 3 | the 67-year-old is accused of committing the o... |
| 4 | a man receiving treatment at the clinic threat... |
| 5 | Gregor Townsend gave a debut to powerhouse win... |
| 6 | Veronica Vanessa Chango-Alverez, 31, was kille... |
| 7 | the 25-year-old was hit by a motorbike during ... |
| 8 | gundogan says he can see the finishing line af... |
| 9 | the crash happened about 07:20 GMT at the junc... |


## 3- Model-specific tokenizer and model loaders

We can also load specific tokenizer and model types directly, rather than relying on the Auto* classes to choose the right ones for us.

In [37]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small", cache_dir="../working/cache/")
model = T5ForConditionalGeneration.from_pretrained(
    "t5-small", cache_dir="../working/cache/"
)
# The tokenizer and model can then be used similarly to how we used the ones loaded by the Auto* classes.
inputs = tokenizer(
    articles, max_length=1024, return_tensors="pt", padding=True, truncation=True
)
summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=2,
    min_length=0,
    max_length=40,
)
decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

display(pd.DataFrame(decoded_summaries, columns=["decoded_summaries"]))

|   | decoded_summaries |
|---|---|
| 0 | the full cost of damage in Newton Stewart is s... |
| 1 | fire alarm went off at the Holiday Inn in Hope... |
| 2 | stewards only handed reprimand after governing... |
| 3 | the 67-year-old is accused of committing the o... |
| 4 | a man receiving treatment at the clinic threat... |
| 5 | Gregor Townsend gave a debut to powerhouse win... |
| 6 | Veronica Vanessa Chango-Alverez, 31, was kille... |
| 7 | the 25-year-old was hit by a motorbike during ... |
| 8 | gundogan says he can see the finishing line af... |
| 9 | the crash happened about 07:20 GMT at the junc... |


### Comparison of Summarization

In [None]:
### PIPELINE

summarizer = pipeline(
    task="summarization",
    model="t5-small",
    min_length=20,
    max_length=40,
    truncation=True,
    model_kwargs={"cache_dir": "../working/cache/"},
) 

In [None]:
## Auto* loaders for tokenizers and models

# Load the pre-trained tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("t5-small", cache_dir="../working/cache/")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small", cache_dir="../working/cache/")

inputs = tokenizer(
    articles, max_length=1024, return_tensors="pt", padding=True, truncation=True
)

summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=2,
    min_length=0,
    max_length=40,
)

In [None]:
## Model-specific tokenizer and model loaders

tokenizer = T5Tokenizer.from_pretrained("t5-small", cache_dir="../working/cache/")
model = T5ForConditionalGeneration.from_pretrained(
    "t5-small", cache_dir="../working/cache/"
)
# The tokenizer and model can then be used similarly to how we used the ones loaded by the Auto* classes.
inputs = tokenizer(
    articles, max_length=1024, return_tensors="pt", padding=True, truncation=True
)
summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=2,
    min_length=0,
    max_length=40,
)

### References

- https://www.kaggle.com/code/aliabdin1/llm-01-how-to-use-llms-with-hugging-face
- https://www.elastic.co/what-is/large-language-models