This notebook is part of the [tutorial](https://github.com/dhruvdcoder/unity_project/blob/main/summarization-hf.md) created for getting started with [Unity](https://unity.rc.umass.edu/). Please follow the steps mentioned [here](https://github.com/dhruvdcoder/unity_project/blob/main/summarization-hf.md) before running this notebook on Unity. 

Note: This tutorial uses fragments from the [HuggingFace Notebooks](https://huggingface.co/docs/transformers/notebooks), and we do not claim any rights.

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization), which contains BBC articles accompanied by single-sentence summaries.


We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [3]:
import transformers
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [4]:
from datasets import load_dataset, load_metric
raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Found cached dataset xsum (/home/dhruveshpate_umass_edu/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71)
100%|██████████| 3/3 [00:02<00:00,  1.18it/s]


The `raw_datasets` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation, and test sets:

In [5]:
# Look at the number of instances in the dataset
print(f" Instances:\n {raw_datasets}")

 Instances:
 DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})


To access an actual element, you need to select a split first, then give an index, for example `raw_datasets["train"][0]`.

To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset.

In [6]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"Team USA are due to host the International Ice Hockey Federation World Championships at the end of the month.\nBut the champions say are boycotting in an ongoing dispute about equitable support with the men's sport.\nThe hashtag #BeBoldForChange has been used by fans to support the players.\nHailing the team's stand, King, who has long pushed for equality in sport, tweeted: ""Being a world class athlete should not be a part time job.""\nKing, who co-founded the Women's Tennis Association tour and won 12 Grand Slam singles titles, succeeded in her campaign for the US Open to award equal prize money in the men's and women's tournaments in 1973.\nThe US women's ice hockey team have won six of the past eight world championships and achieved medals in every Olympics, including gold in 1998.\nThey point to the $3.5m (£2.8m) that USA Hockey spends annually on its men's team development program, and say there is no comparable setup for the women.\nSeveral players said USA Hockey has paid players only $1,000 a month during their six-month Olympic residency period. 
They say they only have contracts in Olympic years and are seeking a deal that covers them during the 3.5 years in between.\nThey are are being represented by the same law firm who represented the US women's soccer team in an equal pay dispute in 2016.\nThe boycott threat follows 14 months of failed negotiations between the team's players and USA Hockey.\n""We're unanimously united as a player pool,'' forward Hilary Knight said.\n""We're asking for equitable support and marketing and visibility and promotion in programming but also in some financial support.\n""It's 2017 and those things are not unreasonable.""\nThe team have been in a public war of words with USA Hockey, labelling its statement about an increased support offer as ""completely misleading and dishonest''.\nUSA Hockey President Jim Smith said: ""USA Hockey's role is not to employ athletes and we will not do so.""\nThe hockey board now say they are trying to field a team without the boycotting players.\nCaptain Meghan Duggan said it was ""one of the hardest decisions we've had to make as a team, I think, in all of our careers.\n""Being willing to stand up and sacrifice an opportunity like that - to host a world championship on home soil, to defend a gold medal - I think it just shows how passionate we are and how serious we are.''",Tennis legend Billie Jean King has backed the US women's national ice hockey team in their high profile equal pay dispute.,39296376
1,"Mr Kenny stood down as the party's leader at midnight, but will stay on as PM until his successor is chosen.\nLeo Varadkar and Simon Coveney are considered favourites to lead the party but the winner would then face a Dáil (parliament) vote to become taoiseach.\nThe nominations deadline is expected to be 17:00 local time on Saturday.\nFine Gael's ruling body - the executive council - will meet on Thursday evening to finalise its plans for the party's leadership contest.\nMr Varadkar, the social protection minister, and Mr Coveney, the housing minister, are expected to announce their candidacies shortly.\nHowever, Tánaiste (Deputy Prime Minister) Frances Fitzgerald has ruled herself out of the contest.\nMs Fitzgerald, who is also the current justice minister, said in a statement that she had ""seriously considered contesting the leadership election"".\nHowever, she added: ""I have decided that entering the contest is not the right decision for me.""\nMs Fitzgerald also paid tribute to Mr Kenny saying: ""His work on behalf of the country and our party has been immense and extraordinary.""\nAnother Fine Gael veteran, Finance Minister Michael Noonan, announced his decision to step down from Cabinet after the leadership election.\nMr Noonan also confirmed that he will not contest the next general election.\nIn order to get their name on the ballot paper, prospective candidates must secure signatures from at least eight members of Fine Gael's parliamentary party.\nThe parliamentary party comprises:\nThere had been speculation in the media for months about Mr Kenny's future.\nIt followed a series of scandals involving An Garda Síochána (Irish police force) and the party's disappointing performance in the 2016 general election.\nAs he resigned, Mr Kenny asked the executive council to ""expedite the process"" and chose a new party leader by Friday 2 June.\nSpeaking to Irish broadcaster RTÉ, Fine Gael chairman Martin Heydon said that meant the timescale would be tighter 
than had originally been planned, but would be manageable.\n""Anybody who wants to be nominated will need 10% of the parliamentary party to nominate them, and by Saturday evening we'll know how many runners are in the field,"" he said.\nAfter nominations close, Fine Gael will then choose its leader through an electoral college system in which weighed votes are given to different branches of the party.\nMr Heydon told RTÉ that while Fine Gael members have the power to pick the new leader of their party, Dáil TDs must then vote on whether or not they become taoiseach.\n""If we have a new leader of Fine Gael appointed on 2 June, the Dáil doesn't sit until the following week, for the June bank holiday, so there would be a 10-day period there before the Dáil would be back.""\nHe said the next opportunity for the Dáil to vote on a new taoiseach would be Tuesday 13 June.\nOver the last year, Mr Kenny led a minority government, which was propped up by an alliance of independent TDs and required the support of the opposition party - Fianna Fáil - to pass its budgets.\nMr Heydon added: ""We have a lot of partners in government - between the Independent Alliance, our supply and confidence arrangement with Fianna Fáil.\n""And I think that it is right that a new leader coming in would be given the time and space to be able to consult all of those parties, let them known their vision and our plans, to get that process in place.""\nMr Kenny leaves the Fine Gael leadership as the party's most successful taoiseach.\nIn a statement announcing his retirement, he said it had been a ""huge honour and privilege"" to lead the party over the course of 15 years.",The process of replacing outgoing Taoiseach (Irish Prime Minister) Enda Kenny begins officially later with a meeting of his Fine Gael party.,39955707
2,"The company will raise standard tariff electricity prices by 10.8% from 31 March, while gas prices will increase by 4.7%.\nIt means a typical dual fuel annual bill will rise by an average of 7.8%, or £86.\nScottish Power said in a statement that about a third of its customers - or about 1.1 million homes - would be affected by the increases.\nIt attributed the move in part to rises in energy wholesale markets and compulsory non-energy costs, including the upgrade to smart meters.\nScottish Power's announcement came as British Gas said it would extend a price freeze for its customers on its standard energy tariff until August.\nLast month, Npower faced a backlash after it said it would raise standard tariff electricity prices by 15% from 16 March, and gas prices by 4.8%.\nScottish Power's UK retail director, Colin McNeill, said: ""This increase will apply to one in three of our customers, and we continue to work hard to move even more customers to our fixed price deals.\n""We will be writing to all those affected, outlining the changes and encouraging more loyal customers to move to a deal that best suits them.\n""This price change follows months of cost increases that have already led to significant rises in fixed price products that now unfortunately have to be reflected in standard prices.""",Scottish Power has announced a sharp increase in energy prices.,38930236
3,"He has told senior Trump administration staff about the company's technology.\nUntil March Mr Luckey worked at Facebook, which paid $2bn (£1.55bn) for Oculus, the VR firm he founded.\nHe told the New York Times there was a need for a ""new kind"" of defence company using ""superior technology"" to protect troops and citizens.\nThe paper quoted insiders who said it planned to use sensors similar to those found on autonomous vehicles to monitor activity around fences and walls.\nSmart software would be able to tell the difference between things that can be ignored, such as birds and other animals, and those, like drones, that demand attention.\nDetails about the new firm, including its name, are scant.\nFormer staff from Oculus who have also left the company are believed to have been recruited for the new start-up.\nTech news site The Verge speculated that the firm could either be linked to Mr Luckey's support for Texas senator Ted Cruz, who has regularly called for improvements to border controls, or could be a smart business move.\nIn April, Mr Luckey hosted a fundraising event for Mr Cruz to help the politician's efforts to be re-elected in 2018.\nMr Luckey is also known to have funded a pro-Trump online advocacy group and gave cash to help pay for President Trump's inauguration ceremony.","Virtual reality pioneer Palmer Luckey has founded a start-up concentrating on technology to police borders and large events, reports the New York Times.",40158899
4,"Lyn has directed programmes including Doctor Who, Happy Valley and Broadchurch.\nThe award is presented to someone who has made a significant contribution to international feature films or network television.\nThe announcement was made on Thursday at a party to celebrate the nominees for this year's Bafta Cymru Awards.\nDylan Thomas biopic Set Fire to the Stars has received seven nominations.\nMeanwhile, Doctor Who's Peter Capaldi is in the running for best actor - up against Hinterland's Richard Harrington and Rhys Ifans for his role in Dan y Wenallt - the Welsh language film of Dylan Thomas's Under Milk Wood.\nThe ceremony will take place at St David's Hall in Cardiff on 27 September, hosted by BBC Radio 1 and Radio Cymru presenter Huw Stephens.",The director Euros Lyn will receive the Siân Phillips Award at this year's Bafta Cymru Awards ceremony.,34285007


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric). You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [8]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hi there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=0.5, recall=0.5, fmeasure=0.5), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.5, recall=0.5, fmeasure=0.5), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=0.5, recall=0.5, fmeasure=0.5), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=0.5, recall=0.5, fmeasure=0.5), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
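To see where the `mid` values come from, here is a minimal sketch that recomputes the ROUGE-1 F-measure by hand for the two fake pairs. It assumes whitespace tokenization and a plain average over examples; the real metric applies its own tokenizer and uses bootstrap resampling to obtain the `low`/`mid`/`high` values, so `rouge1_f` (a hypothetical helper, not part of any library) is an illustration only.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure for one prediction/reference pair."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a unigram counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hi there", "general kenobi"]
scores = [rouge1_f(p, r) for p, r in zip(fake_preds, fake_labels)]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)  # [0.5, 1.0] 0.75
```

The first pair shares one of two unigrams (0.5) and the second pair is identical (1.0); their mean, 0.75, matches the `mid` F-measure reported for `rouge1` above.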

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [9]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [10]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [11]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [12]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}




If you are using one of the five T5 checkpoints, you have to prefix the inputs with "summarize: " (the model can also translate, and it needs the prefix to know which task to perform).

In [13]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that any input longer than the `max_length` we pass (here `max_input_length` for documents and `max_target_length` for summaries) is truncated to that length. Padding is dealt with later (in a data collator), so that examples are padded to the longest length in the batch and not in the whole dataset.

In [14]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [15]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 37, 423, 583, 13, 1783, 16, 20126, 16496, 6, 80, 13, 8, 844, 6025, 4161, 6, 19, 341, 271, 14841, 5, 7057, 161, 19, 4912, 16, 1626, 5981, 11, 186, 7540, 16, 1276, 15, 2296, 7, 5718, 2367, 14621, 4161, 57, 4125, 387, 5, 15059, 7, 30, 8, 4653, 4939, 711, 747, 522, 17879, 788, 12, 1783, 44, 8, 15763, 6029, 1813, 9, 7472, 5, 1404, 1623, 11, 5699, 277, 130, 4161, 57, 18368, 16, 20126, 16496, 227, 8, 2473, 5895, 15, 147, 89, 22411, 139, 8, 1511, 5, 1485, 3271, 3, 21926, 9, 472, 19623, 5251, 8, 616, 12, 15614, 8, 1783, 5, 37, 13818, 10564, 15, 26, 3, 9, 3, 19513, 1481, 6, 18368, 186, 1328, 2605, 30, 7488, 1887, 3, 18, 8, 711, 2309, 9517, 89, 355, 5, 3966, 1954, 9233, 15, 6, 113, 293, 7, 8, 16548, 13363, 106, 14022, 84, 47, 14621, 4161, 6, 243, 255, 228, 59, 7828, 8, 1249, 18, 545, 11298, 1773, 728, 8, 8347, 1560, 5, 611, 6, 255, 243, 72, 1709, 1528, 161, 228, 43, 118, 4006, 91, 12, 766, 8, 3, 19513, 1481, 410, 59, 5124, 5, 96, 196, 17, 19, 1256, 68, 27, 103, 317, 132

To apply this function to all the document/summary pairs in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training, validation and testing data will be preprocessed in one single command.

In [16]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)



Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and the cached data therefore must not be reused). For instance, it will properly detect if you change the model checkpoint in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
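As a rough illustration of what `batched=True` changes, the sketch below mimics the batching in plain Python. `map_batched` and `add_lengths` are hypothetical stand-ins, not the 🤗 Datasets implementation; the point is only that the mapped function receives a dict of lists (one list per column) rather than a single example, and returns new columns as lists with one value per row.

```python
def map_batched(columns, function, batch_size=1000):
    """Apply `function` to slices of a column-oriented table and merge the results."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict of lists, like `raw_datasets["train"][:2]`.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Returned columns are added to (or overwrite) the original ones.
    return {**columns, **new_columns}

def add_lengths(batch):
    # Stand-in for `preprocess_function`: one output value per input row.
    return {"num_words": [len(doc.split()) for doc in batch["document"]]}

toy = {"document": ["a b c", "d e", "f"], "summary": ["x", "y", "z"]}
result = map_batched(toy, add_lengths, batch_size=2)
print(result["num_words"])  # [3, 2, 1]
```

This is why `preprocess_function` iterates over `examples["document"]`: with batching, each key holds a list of values rather than a single string.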

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [17]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Note that we don't get a warning about randomly initialized weights, as we would when loading a model for classification. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [18]:
# Training hyperparameters; checkpoints are saved under "<model_name>-finetuned-xsum"
batch_size = 4
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to make three saves maximum. Lastly, we use the `predict_with_generate` option (to properly generate summaries) and activate mixed precision training (to go a bit faster).

The last argument, `push_to_hub`, controls whether the model is pushed to the [Hub](https://huggingface.co/models) regularly during training. It is set to `False` here, so checkpoints are only saved locally. If you enable it and want to save your model locally under a name that differs from the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [19]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
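The idea behind this collator can be sketched in plain Python. This is a simplified illustration, not the `DataCollatorForSeq2Seq` implementation: `pad_batch` is a hypothetical helper, and it assumes a pad token ID of 0 (the value for T5) and a label padding value of -100, which the loss function ignores.

```python
PAD_TOKEN_ID = 0     # assumption: T5's pad token ID
LABEL_PAD_ID = -100  # positions with this label are ignored by the loss

def pad_batch(features):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        # The attention mask marks real tokens with 1 and padding with 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_label - len(f["labels"])))
    return batch

features = [
    {"input_ids": [8774, 6, 48, 80, 7142, 55, 1], "labels": [100, 19, 1]},
    {"input_ids": [100, 19, 430, 7142, 5, 1], "labels": [8774, 1]},
]
padded = pad_batch(features)
print(padded["input_ids"][1])  # [100, 19, 430, 7142, 5, 1, 0]
print(padded["labels"][1])     # [8774, 1, -100]
```

Because the padding length is computed per batch, a batch of short documents stays short instead of being padded to the longest document in the whole dataset.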

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [20]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
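The two array manipulations in `compute_metrics` are easy to check on toy data. The sketch below redoes them with plain Python lists instead of NumPy, assuming a pad token ID of 0 (the value for T5); `restore_padding` and `generated_length` are hypothetical helper names used only for this illustration.

```python
PAD_TOKEN_ID = 0  # assumption: T5's pad token ID

def restore_padding(labels):
    """Swap the -100 placeholders back to the pad ID so the labels can be decoded."""
    return [[tok if tok != -100 else PAD_TOKEN_ID for tok in row] for row in labels]

def generated_length(prediction):
    """Count the non-padding tokens in one generated sequence."""
    return sum(1 for tok in prediction if tok != PAD_TOKEN_ID)

labels = [[100, 19, 1], [8774, 1, -100]]
predictions = [[0, 100, 19, 1], [0, 8774, 1, 0]]

print(restore_padding(labels))  # [[100, 19, 1], [8774, 1, 0]]
lengths = [generated_length(p) for p in predictions]
print(sum(lengths) / len(lengths))  # 2.5
```

The value -100 is not a valid vocabulary ID, which is why it has to be replaced before calling `batch_decode`.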

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [21]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now finetune our model by just calling the `train` method:

In [22]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


TrainOutput(global_step=51012, training_loss=2.706265403571376, metrics={'train_runtime': 7325.8002, 'train_samples_per_second': 27.853, 'train_steps_per_second': 6.963, 'total_flos': 4.353357915271987e+16, 'train_loss': 2.706265403571376, 'epoch': 1.0})
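The numbers in `TrainOutput` can be sanity-checked against the dataset size and the batch size. The sketch below assumes a single device and no gradient accumulation, so that one optimizer step consumes exactly `batch_size` examples.

```python
import math

num_train_rows = 204045    # size of the training split shown earlier
batch_size = 4
num_train_epochs = 1
train_runtime = 7325.8002  # seconds, from the output above

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
global_step = steps_per_epoch * num_train_epochs
print(global_step)  # 51012

print(round(num_train_rows * num_train_epochs / train_runtime, 3))  # 27.853
print(round(global_step / train_runtime, 3))  # 6.963
```

All three values agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output.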

In [30]:
# save model locally in the current directory
trainer.save_model('our-t5-small-finetuned-xsum')

## Inference

Great, now that you’ve finetuned a model, you can use it for inference!

Come up with some text you’d like to summarize. For T5, you need to prefix your input depending on the task you’re working on. For summarization you should prefix your input as shown below:

In [24]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

The simplest way to try out your finetuned model for inference is to use it in a `pipeline()`. Instantiate a pipeline for summarization with your model, and pass your text to it:

In [31]:
from transformers import pipeline

summarizer = pipeline("summarization", model="our-t5-small-finetuned-xsum")
summarizer(text)

Your max_length is set to 200, but you input_length is only 103. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': 'The US government has passed a bill to reduce the cost of prescription drugs and reduce the cost of health care. ... ... ...'}]

You can also manually replicate the results of the pipeline if you’d like:

Tokenize the text and return the input_ids as PyTorch tensors:

In [32]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("our-t5-small-finetuned-xsum")
inputs = tokenizer(text, return_tensors="pt").input_ids

Use the `generate()` method to create the summary. For more details about the different text generation strategies and parameters for controlling generation, check out the Text Generation API.

In [34]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("our-t5-small-finetuned-xsum")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

Decode the generated token ids back into text:

In [35]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'The Affordable Care Act (ACA) is a measure to reduce the cost of prescription drugs and health care costs.'