You will need to install 🤗 Transformers and 🤗 Datasets as well as other dependencies in your python environment. 
Create a new conda environment and select the kernel for that environment.

## Creating a new conda environment

1. Open a shell on Unity. You can use the On Demand Portal for this.
2. Then create a conda environment named 'hf-summarization' using `conda create -n hf-summarization python=3.9.7 pip ipykernel`.
3. Activate the conda environment: `conda activate hf-summarization`.
4. Install packages using pip: `pip install datasets transformers rouge-score nltk` and `pip install torch --index-url https://download.pytorch.org/whl/cu118`.
5. Register your conda environment: `/modules/apps/ood/jupyterlab/bin/python -m ipykernel install --user --name hf-summarization`
6. Make sure that you load the right CUDA version (`module load cuda/11.8`) before starting Jupyter. This can be done from the job dashboard when you request a job for your Jupyter notebook.
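
Once the kernel is selected, a quick way to confirm that it really points at the new environment is to check that the packages from step 4 are importable. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical, and `rouge_score` is the import name of the `rouge-score` pip package.

```python
import importlib.util

# Import names of the packages installed in step 4
REQUIRED = ["transformers", "datasets", "rouge_score", "nltk", "torch"]

def check_packages(names=REQUIRED):
    """Map each package name to True if the current kernel can import it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages()
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

If any package is reported as missing, the notebook is most likely running on a different kernel than `hf-summarization`.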

In [1]:
import transformers



# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization), which contains BBC articles accompanied by single-sentence summaries.


We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [2]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [3]:
from datasets import load_dataset, load_metric
raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Found cached dataset xsum (/home/dhruveshpate_umass_edu/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71)
100%|██████████| 3/3 [00:00<00:00, 190.70it/s]


The `raw_datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets:

In [4]:
# Look at the number of instances in the dataset
print(f" Instances:\n {raw_datasets}")

 Instances:
 DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})


To access an actual element, you need to select a split first, then give an index, for example `raw_datasets["train"][0]`.

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [5]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
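
The retry loop in the function above guarantees that the picked indices are distinct. A shorter way to get the same guarantee, sketched here with only the standard library, is `random.sample`, which draws without replacement. The helper name `pick_unique_indices` is hypothetical and not part of the notebook.

```python
import random

def pick_unique_indices(dataset_len, num_examples=5, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    # sample() draws without replacement, so no duplicate check is needed
    return rng.sample(range(dataset_len), num_examples)

# 204045 is the size of the XSum training split shown earlier
picks = pick_unique_indices(204045, num_examples=5, seed=0)
print(picks)
```

Passing a seed makes the selection reproducible between runs, which the original loop does not offer.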

In [6]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"""I love Ireland,"" said Mr Trump, adding that he would visit the Republic of Ireland during his term in office.\nThe president told Mr Kenny he was his ""new friend"" and their governments would forge an even tighter bond.\nAfter their meeting on Thursday, Mr Kenny made an impassioned plea for the 50,000 ""undocumented"" Irish who live in US without legal permission.\n""This is what I said to your predecessor on a number of occasions - we would like this to be sorted,"" he told the president at a lunch event.\n""It would remove a burden off so many people that they can stand out in the light and say: 'Now I am free to contribute to America, as I know I can.'""\nNoting the presence at the lunch of Northern Ireland politicians Ian Paisley of the DUP and Sinn Féin president Gerry Adams, the taoiseach said: ""We want to protect this peace process and I know you are going to work with us in that context also.""\nMr Trump quoted from what he said was an Irish proverb that he had heard ""many, many years ago"".\n""Always remember to forget the friends that proved untrue, but never forget to remember those that have stuck by you.""\nLater, Mr Kenny presented Mr Trump with a bowl of shamrocks for St Patrick's Day.\nMr Kenny had breakfast with US Vice President Mike Pence in Washington earlier in the day, in the company of their wives Fionnuala and Karen.\nThat followed his attendance of the Ireland Funds America gala dinner, which included a speech in which Mr Pence emphasised the commitment of America to the island of Ireland.\nHe said the US was pledged to securing the gains of the Northern Ireland peace process.\nAt the dinner, Mr Pence congratulated the people of Northern Ireland for turning out to vote in high numbers during the recent assembly election.\n""The advance of peace and prosperity in Northern Ireland is one of the great success stories of the past 20 years,"" he said.\n""We thank those unsung heroes in Northern Ireland who, day-in and day-out, do the 
difficult and important work - strengthening communities, educating children, building that brighter future for the emerald isle and all who call it home.""\nMr Pence also recalled his Irish grandfather Richard Michael Cawley, who emigrated to the US from County Sligo in 1923, and spoke with pride about his Irish heritage.\nHe added that he had thought about his grandfather during inauguration day in January.\n""The truth is that whatever honours I will receive over the course of my service as vice president, to receive an honour in the name of the Irish people and my Irish heritage will count as chief among,"" he said.\n""All that I am and all that I will ever be and all the service that I will ever make is owing to my Irish heritage.""\nMr Kenny presented Mr Pence with a roll book from a County Sligo school that included the name of his grandfather.\nHe said Ireland ""took special pride in the fact that, for the first time in the history of this great republic, one Irish American has succeeded another in the office of vice president"".\nThe Irish prime minister added that immigration was the main focus of his trip to the US.\nHe said he was pursuing a process where Irish people living in the US illegally can ""come in from the cold, and feel the warmth of this great country they have made their home"".\nA number of politicians from Northern Ireland are in America this week.\nMr Paisley, a DUP MP, who also attended the Ireland Funds America gala dinner, said that he expects little progress to be made during political talks at Stormont until the last minute.\nHe said he was hopeful the political parties could get ""the show back on the road"".",Taoiseach (Prime Minister) Enda Kenny has held talks with US President Donald Trump in the White House.,39289707
1,"Disagreement over when the event was first held means the carnival's 50th anniversary will be marked again this year, as it was in 2014 and 2015.\nThe first parades in the 1960s had many similarities to those held now.\nBBC News looks at how the event has developed into what organisers claim is the largest street party in Europe.\nIn 1959 a Caribbean style cabaret was held at St Pancras Town Hall to showcase the styles of carnival.\nOther Caribbean events were held in London during the 1960s before a parade was organised in Notting Hill by a team led by social worker Rhuanne Laslett.\nThe organisers said the aim of the festival was to bring cultures together through arts and Caribbean steel bands.\nBy the 1970s the festival was attracting increasing numbers of performers each year to west London.\nBut racial tension simmered at some events with a riot following the carnival in 1976.\nStatic sound systems, like this one seen in 1981, have become a major part of the event. There are now 38 separate sound systems as well as the World Music Stage at Powis Square.\nRelations between police and the organisers have improved over the years.\nThe carnival has always attracted a diverse crowd, like this punk in 1984.\nMasquerade bands, sound systems, steel bands, calypso and soca music are all carnival mainstays.\nThe costume troupes who parade each year are known as Mas or Masquerade bands.\nInternational stars have made appearances at the carnival, like Wyclef Jean in 2003.\nJouvert marks the start of the carnival on Sunday morning, where performers cover each other in paint and chocolate.\nVeteran DJ Norman Jay, seen here in 2006, has become a mainstay at the carnival with his Good Times sound system.\nAs well the costumes, music and dancing, the carnival is well known for its street food...\n...as well as the enormous queues for the carnival toilets.\nSamba band Batala have made numerous appearances and will be returning again in 2016.\nThe Sunday parade is more 
child-friendly and proves popular with families.\nRevellers will be hoping for an improvement in the weather compared to 2015.\nThis year's parades begin at 10:00 BST on Sunday and Monday.",Every year more than one million people descend on the streets of west London to enjoy two days of festivities at the Notting Hill Carnival.,37024463
2,"Since opening the doors to his famous Koto - Know One Teach One - restaurant in Hanoi in 2000, he has helped around 400 homeless children to become industrious cooks.\nAt his non-profit hospitality training centre he has passed on both cooking and life skills.\n""I came to Vietnam never wanting to start a project as big as Koto, I just wanted to make a difference,"" he recalls.\n""I look back now and realise that it has given me this incredible joy.""\nBorn in Ho Chi Minh City to a single mum with six children during the Vietnam war, Mr Pham lived in Australia from the age of eight before he returned to his homeland in the early 1990s.\nIt was there his Koto project was born after he stumbled across a group of children selling coconuts on the streets in 1996.\n""I found these street kids carrying coconuts and working 16 hours a day,"" he explained to the BBC World Service's Outlook programme. ""They were living from hand to mouth.\n""So I took them and 60 other kids to dinner for the next two weeks.""\nBut it was another three years before the idea for his restaurant first came to fruition.\n""At the time I thought I knew better,"" he admitted. ""I gave them fish everyday for that period but then they pulled me aside.""\n""They said: 'Look we trust you now but you can't keep on looking after us this way. We're going to need a job. We need you to show us how to fish for ourselves'.""\nFrom there, his Koto project was launched. 
Children not only learned how to cook but were taught lessons in life too.\n""The first thing you receive is housing and medical checks along with vaccinations,"" Mr Pham explained.\n""You learn about team building and life skills programmes, vocational training and English, which gives you the confidence to meet people.""\nInterest in his restaurant gathered pace and within months former US President Bill Clinton dropped by for a bite to eat with an entourage of 80 reporters.\nSo suspicious were the Vietnamese government following Mr Clinton's stop-off that they feared Mr Pham was a member of the CIA.\n""I think I was under watch for about three or four years after that,"" he laughs. ""But I'm glad we went through that phase because I've got the green light now to go on and do the wonderful things that Koto is doing.""\nMr Pham - a former travel agent in Melbourne - has no formal cooking or hospitality qualifications.\nThe only culinary skills he possesses he picked up as a boy making doughnuts and selling sandwiches.\n""The funny thing is I don't have any hospitality, development or psychology skills,"" he said. ""I'm just someone who is very passionate about what I do and I just want to make a difference.""\nListen again to Outlook\nDownload as a podcast\nMore from BBC World Service\nLooking back, Jimmy Pham admits that despite feeling a sense of achievement, his Koto project has been very difficult to deal with emotionally over the years.\n""I've seen visible changes in front of me,"" he added. ""Four hundred kids later, I'm seeing them with their own families and breaking the cycle of poverty which gives me great joy.\n""But it has also given me incredible sorrow and sadness because I've seen so much pain caused to a kid.""\nCorrection 11 Nov 2010: This story has been amended since it was first published to correct the mis-spelling of Jimmy Pham's name. 
An earlier version of the story referred to him as Jimmy Sham.",Meet Jimmy Pham - Vietnam's answer to Jamie Oliver.,11701796
3,"The public vote, which is called a referendum, will happen on Thursday, 23 June 2016.\nThe European Union is a group of 28 countries in Europe whose governments work together.\nIt was set up to help make trading between European countries easier, as well as travel and immigration.\nFind out more about the EU referendum in our guide here.\nEU laws affect many areas of our lives as well - like health and safety rules, and even how many fish we're allowed to catch.\nFor the last 40 years that the UK's been a member of the EU, there's been a debate about our role within it.\nSome feel that being part of this bigger club makes the UK richer and more important.\nOthers argue that the EU takes power away from the UK. They feel that people who aren't British shouldn't be making laws for this country.",The Prime Minister David Cameron is asking people to vote on whether the UK should remain a member of the European Union (or EU).,36378481
4,"Not only do you get an exciting virtual reality experience, but Daydream also gives off enough heat to keep your house warm throughout the winter.\nDevices get hot, I get that, but the Pixel XL smartphone running a full throttle VR experience is something else all together - the McDonald's apple pie of gadgetry, strapped to your face.\nIt's a reminder, I think, of the strain we're putting our devices under to deliver the kind of experiences we demand in mobile computing.\nAnd there is also a sense that when it comes to smartphone-powered VR, we've still got some way to go before the hardware can truly match our expectations.\nAs I found from playing around with Daydream this past week, small sacrifices mean it falls short of the immersive future promised by Google boss Sundar Pichai.\nDaydream, first announced by Mr Pichai in May, goes on sale on Thursday.\nThe product is pretty simple: a pair of lenses housed in fabric-covered plastic, which comes with a smart little remote control that you'll use to control various aspects of your VR experience.\nTo get it to do anything, you'll need to add a crucial component - a smartphone.\nIn this case, I was using Google's new Pixel XL, but the plan is for Daydream to support a wide range of phones that may or may not be made by Google.\nIt's worth stressing that Daydream isn't designed to compete with the high-end virtual reality hardware we've seen hit the market this year.\nInstead, the intention is to act as a sort of gateway drug to virtual reality - an experience that is ""good enough"" without breaking the bank.\nDaydream is competing with Gear VR, a collaboration between Samsung and Facebook-owned Oculus that works on the same principle: you slot your phone into a cheap-ish empty headset, but with Gear VR it must be a Samsung device.\nThis race is so important for Google and Facebook as they are both utterly desperate to get ahead in the major new marketplace of flogging virtual reality apps and content.\nGear VR 
has a head start. You can already check out live sport such as NBA matches and settle in for virtual big-screen Netflix viewing, which is oddly enjoyable.\nIt's early days for Daydream, but there are promising signs that good content is on its way, mostly thanks to services that Google already owns such as Street View and YouTube.\nThe latter's 360-degree video offerings lack any actual interactivity, but do provide some breathtaking visuals without having to wait too long for content to download.\nOne of Google's goals with Daydream is to make something that doesn't look and feel overwhelmingly nerdy.\nThe headset is covered in a nice, light fabric that at least on the surface looks more comfortable than Samsung's plasticky Gear VR.\nBut once you put it on, you immediately notice Daydream's significant design flaws - problems that don't seem to have been created by hardware limitations but simply by bad decisions.\nI'll start with the most serious. When you fire it up, chances are Daydream will be out of focus. 
Blurry images in VR are a terrible thing - it increases the already high risk of feeling disorientated or nauseous.\nOn Gear VR, there's a focus dial on top of the device that gives you a few millimetres of adjustment space to sharpen things up.\nBut on Daydream there is nothing of the sort - you're instead required to shift the device around your forehead until you get something that works with your eyes.\nThe resulting position, in my case, wasn't as comfortable as it should have been.\nDaydream's rigid shape seems to ignore that peoples faces aren't all the same.\nFor me, the headset didn't sit snugly over my nose, which meant a line of natural light crept up through the bottom at all times.\nFlaws like this ruin the VR effect of being transported to a different world.\nThe illusion never has a chance to take hold.\nThis problem is made worse by Daydream's limited field of view.\nYou never get away from the feeling that you're viewing this virtual world through a small, restrictive circle.\nThat potential world is made smaller still due to Daydream's solid back.\nWith Gear VR, the phone's rear camera can be turned on to offer a sort of augmented reality that lets you look out into the real world while keeping the headset on.\nBut with Daydream, the phone is tucked into a little pouch, blocking the ability for the rear camera to see anything.\nAnd as I mentioned, the Pixel XL gets extremely hot.\nWear it long enough and you'll really start to feel the heat, though Daydream doesn't fog up in the same way Gear VR does - a major plus there.\nWhere Google has got it very right is with its little controller.\nYou use it to interact with everything in Daydream, and it's a terrific bit of hardware that is instantly intuitive and delightfully flexible.\nIn WonderGlade, a game that plonks you in a cartoon theme park, the controller is used to full effect.\nIn one section, it's used to control a game of tilt and roll. 
In the next mini-game, it changes into a magic wand, then a fire hose, and then a putter.\nIt has a tendency to disconnect every now and then, which is a real frustration, but the amount of interaction it offers gives Daydream a big selling point over Gear VR, and even some of the high-end headsets.\nPeople who want to use VR for gaming will naturally still want a wireless gaming controller. But for the vast majority of uses, the simple remote nails it.\nI had higher hopes for Daydream when it was first announced in May. And I find it irritating that minor things, such as the focus adjuster, are missing and would have made a big difference.\nBut Google's strategy to make Daydream the entry level headset that supports a wide range of smartphones gives it an undeniable advantage over its rival.\nWe're still lacking a real killer VR app, and Daydream is already attracting bad ideas such as the Wall Street Journal's app, which mostly lets you read text and look at market data in VR - the most pointless adoption of technology since my grandmother used a tablet computer as a coaster.\nBut there are signals that truly groundbreaking applications will come in time.\nThe Guardian's Daydream app, for example, places you in a tiny cell in a US jail, a feeling of solitary confinement that no amount of written words could possibly emulate.\nFor most people, VR won't be something they discover via the high-end PC gaming set-ups required for the top experiences today. It'll be this cheap and cheerful option instead.\nRight now the technology isn't quite there, but it's getting closer - and with Daydream, Google has gained an early advantage.\nFollow Dave Lee on Twitter and on Facebook","Google's Daydream virtual reality kit comes with a remarkable ""two in one"" feature.",37936942


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric). You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [7]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hi there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=0.5, recall=0.5, fmeasure=0.5), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.5, recall=0.5, fmeasure=0.5), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=0.5, recall=0.5, fmeasure=0.5), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=0.5, recall=0.5, fmeasure=0.5), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
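
To see where these numbers come from, here is a minimal hand computation of ROUGE-1 as clipped unigram overlap. This is a simplified sketch, not the library's implementation: it tokenizes on whitespace only and takes a plain mean, whereas the `rouge` metric applies its own tokenization and reports `low`/`mid`/`high` from bootstrap resampling. The function name `rouge1` is hypothetical.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F-measure on whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token is matched at most as often as it occurs in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f

fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hi there", "general kenobi"]
scores = [rouge1(p, r) for p, r in zip(fake_preds, fake_labels)]
mean_f = sum(f for _, _, f in scores) / len(scores)
print(scores)  # per-example (precision, recall, fmeasure)
print(mean_f)  # 0.75
```

The first pair shares one of two words (0.5 everywhere) and the second pair matches exactly (1.0), so the plain mean of 0.75 coincides with the `mid` value of `rouge1` printed above.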

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [8]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [9]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [10]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [11]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}




If you are using one of the five T5 checkpoints, we have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

In [12]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that any input longer than the `max_length` we pass (here `max_input_length` for documents and `max_target_length` for summaries) is truncated to that length. Padding is dealt with later on (in a data collator), so that we pad examples to the longest length in the batch and not in the whole dataset.
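
As a rough picture of what truncation does to a sequence of token ids, the sketch below clips a list to a maximum length while keeping the end-of-sequence token. It assumes the tokenizer reserves one position for EOS, as T5's does (the trailing `1` in the tokenizer outputs above is T5's EOS id). `truncate_with_eos` is a hypothetical helper for illustration; the real tokenizer handles this internally.

```python
EOS_ID = 1  # T5's end-of-sequence id, the trailing 1 in the outputs above

def truncate_with_eos(token_ids, max_length):
    """Keep at most `max_length` ids in total, always ending with EOS."""
    if len(token_ids) + 1 <= max_length:
        return token_ids + [EOS_ID]
    # Drop tokens from the right, leaving room for the final EOS
    return token_ids[: max_length - 1] + [EOS_ID]

# A short input is left intact and only gets EOS appended
short = truncate_with_eos([8774, 6, 48, 80, 7142, 55], max_length=1024)
# A 2000-token input (made-up ids) is clipped to exactly max_length
clipped = truncate_with_eos(list(range(100, 2100)), max_length=1024)
print(short)
print(len(clipped), clipped[-1])
```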

In [13]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [14]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 37, 423, 583, 13, 1783, 16, 20126, 16496, 6, 80, 13, 8, 844, 6025, 4161, 6, 19, 341, 271, 14841, 5, 7057, 161, 19, 4912, 16, 1626, 5981, 11, 186, 7540, 16, 1276, 15, 2296, 7, 5718, 2367, 14621, 4161, 57, 4125, 387, 5, 15059, 7, 30, 8, 4653, 4939, 711, 747, 522, 17879, 788, 12, 1783, 44, 8, 15763, 6029, 1813, 9, 7472, 5, 1404, 1623, 11, 5699, 277, 130, 4161, 57, 18368, 16, 20126, 16496, 227, 8, 2473, 5895, 15, 147, 89, 22411, 139, 8, 1511, 5, 1485, 3271, 3, 21926, 9, 472, 19623, 5251, 8, 616, 12, 15614, 8, 1783, 5, 37, 13818, 10564, 15, 26, 3, 9, 3, 19513, 1481, 6, 18368, 186, 1328, 2605, 30, 7488, 1887, 3, 18, 8, 711, 2309, 9517, 89, 355, 5, 3966, 1954, 9233, 15, 6, 113, 293, 7, 8, 16548, 13363, 106, 14022, 84, 47, 14621, 4161, 6, 243, 255, 228, 59, 7828, 8, 1249, 18, 545, 11298, 1773, 728, 8, 8347, 1560, 5, 611, 6, 255, 243, 72, 1709, 1528, 161, 228, 43, 118, 4006, 91, 12, 766, 8, 3, 19513, 1481, 410, 59, 5124, 5, 96, 196, 17, 19, 1256, 68, 27, 103, 317, 132

To apply this function to all the examples in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training, validation and testing data will be preprocessed in one single command.

In [15]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)



Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and the cached data must therefore not be reused). For instance, it will properly detect if you change the model checkpoint near the top of the notebook and rerun it. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to process the texts in a batch concurrently.
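
To make the effect of `batched=True` concrete, the sketch below mimics the calling convention in plain Python: the function receives a dict of lists (one list per column) rather than a single row. It is a toy stand-in, not the library's implementation; `batched_map` and `toy_preprocess` are hypothetical names, the default of 1000 mirrors the documented default batch size of `Dataset.map`, and unlike the real `map` it only collects the newly produced columns.

```python
def batched_map(rows, function, batch_size=1000):
    """Apply `function` to column-wise batches and collect the resulting columns."""
    output = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start : start + batch_size]
        # Turn a list of row dicts into a dict of lists, the layout a batched function receives
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in function(batch).items():
            output.setdefault(key, []).extend(values)
    return output

def toy_preprocess(batch):
    # Stand-in for preprocess_function: counts whitespace "tokens" instead of producing real ids
    return {"n_words": [len(doc.split()) for doc in batch["document"]]}

rows = [
    {"document": "a b c", "summary": "a"},
    {"document": "d e", "summary": "d"},
    {"document": "f", "summary": "f"},
]
result = batched_map(rows, toy_preprocess, batch_size=2)
print(result)  # {'n_words': [3, 2, 1]}
```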

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [16]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Note that we don't get a warning about randomly initialized weights here, as we would when loading a classification model. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [17]:
# Training configuration; the Trainer places the model on the GPU itself
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)

  return torch._C._cuda_getDeviceCount() > 0


ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA devices.

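Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to keep at most three checkpoints. Lastly, we use the `predict_with_generate` option (to properly generate summaries) and activate mixed precision training with `fp16=True` (to go a bit faster). Mixed precision requires a CUDA device: the `ValueError` shown above is what you get when no GPU is visible, in which case check that the CUDA module is loaded (step 6 of the setup) or set `fp16=False`.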

The last argument, `push_to_hub`, controls whether the model is pushed to the [Hub](https://huggingface.co/models) regularly during training. It is set to `False` here; set it to `True` only if you are logged in to the Hub. If you want to save your model locally under a name different from the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
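
The padding performed by the collator can be pictured with plain lists. The sketch below is an illustration under stated assumptions, not the collator's code: it pads on the right, uses `0` for inputs (T5's pad token id) and `-100` for labels (the value the loss ignores). The real `DataCollatorForSeq2Seq` additionally returns tensors and, when given `model`, prepares `decoder_input_ids`. The name `pad_batch` is hypothetical.

```python
PAD_ID = 0           # T5's pad token id
LABEL_PAD_ID = -100  # label value ignored by the loss

def pad_batch(features):
    """Pad input_ids, attention_mask and labels to the longest example in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_ID] * n_in)
        # Real tokens get 1, padding positions get 0
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_lab)
    return batch

features = [
    {"input_ids": [8774, 6, 48, 80, 7142, 55, 1], "labels": [8774, 1]},
    {"input_ids": [100, 19, 430, 7142, 5, 1], "labels": [100, 19, 430, 1]},
]
padded = pad_batch(features)
print(padded)
```

Because the target length is computed per batch, a batch of short documents stays short instead of being padded to the longest document in the dataset.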

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [None]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
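
Two steps of `compute_metrics` are easy to check by hand: swapping the `-100` label padding back to a real token id before decoding, and counting non-padding tokens to get the generated length. The sketch below reproduces both with plain lists instead of NumPy arrays; the helper names are hypothetical and the token ids are made up for illustration (with `0` as the pad id, as for T5).

```python
PAD_ID = 0  # T5's pad token id

def restore_padding(label_rows, pad_id=PAD_ID):
    """Replace the -100 markers, which cannot be decoded, with the pad token id."""
    return [[tok if tok != -100 else pad_id for tok in row] for row in label_rows]

def mean_generated_length(prediction_rows, pad_id=PAD_ID):
    """Average number of non-padding tokens per prediction."""
    lengths = [sum(tok != pad_id for tok in row) for row in prediction_rows]
    return sum(lengths) / len(lengths)

labels = [[100, 19, 1, -100, -100], [8774, 1, -100, -100, -100]]
# Generated sequences are padded to a common length within the batch
predictions = [[0, 100, 19, 1, 0], [0, 8774, 1, 0, 0]]

restored = restore_padding(labels)
gen_len = mean_generated_length(predictions)
print(restored)
print(gen_len)  # 2.5
```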

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by just calling the `train` method:

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.7211 | 2.479327 | 28.3009 | 7.7211 | 22.243 | 22.2496 | 18.8225 | 326.3338 | 34.725 |


TrainOutput(global_step=12753, training_loss=2.7692033505520146, metrics={'train_runtime': 4909.3835, 'train_samples_per_second': 2.598, 'total_flos': 7.774481450954342e+16, 'epoch': 1.0, 'init_mem_cpu_alloc_delta': 335248, 'init_mem_gpu_alloc_delta': 242026496, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 2637782, 'train_mem_gpu_alloc_delta': 728138240, 'train_mem_cpu_peaked_delta': 138226182, 'train_mem_gpu_peaked_delta': 14677017088})
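
The numbers in this output can be sanity-checked against the dataset sizes printed earlier. The arithmetic below assumes a single device and no gradient accumulation, so that one optimizer step consumes exactly `batch_size` examples; under that assumption the step count matches `global_step`, and the reported samples per second matches the validation set size divided by the Runtime value, which suggests that column refers to the evaluation pass.

```python
import math

train_rows = 204045  # size of the training split
val_rows = 11332     # size of the validation split
batch_size = 16

# One epoch needs one step per batch, with a final partial batch
steps_per_epoch = math.ceil(train_rows / batch_size)
print(steps_per_epoch)  # 12753, the global_step reported by TrainOutput

# Evaluation throughput implied by the reported runtime of 326.3338 seconds
eval_samples_per_second = val_rows / 326.3338
print(round(eval_samples_per_second, 3))
```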

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```