# Fine-tuning T5 for summarization

Based on code from https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb

In [1]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/furyhawk/text_summarization/raw/master'
    os.system(f'wget {GIT_ROOT}/notebooks/setup.py')

%run -i setup.py


You are working on a local system.
Files will be searched relative to "..".


In [2]:
%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'png'

# to print output of all statements and not just the last
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# otherwise text between $ signs will be interpreted as formula and printed in italic
pd.set_option('display.html.use_mathjax', False)

# path to import blueprints packages
sys.path.append(BASE_DIR + '/packages')

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets as well as other dependencies. Run the following cell to install them.

In [3]:

! pip install datasets transformers rouge-score nltk



If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then uncomment and execute the call in the following cell and input your username and password:

In [4]:
from huggingface_hub import notebook_login

#notebook_login()

Then you need to install Git-LFS. Uncomment the following instructions:

In [5]:
#!apt install git-lfs

In [6]:
#!git clone https://github.com/jessevig/bertviz.git

In [7]:

import os.path
import pandas as pd
import numpy as np
from tqdm.auto import tqdm  # progress bars used by loadDataset below

## Load BBC News summary dataset

In [8]:
root_path = '../data/BBC News Summary'


# root_path = f'/kaggle/input/bbc-news-summary/BBC News Summary'


def loadDataset(root_path):
    """Read every article/summary pair under root_path into one DataFrame."""
    types_of_articles = ['business',
                         'entertainment', 'politics', 'sport', 'tech']
    rows = []

    for type_of_article in types_of_articles:
        # Articles and summaries share file names (001.txt, 002.txt, ...),
        # so counting the articles is enough to enumerate both folders.
        num_of_article = len(os.listdir(
            f"{root_path}/News Articles/{type_of_article}"))

        print(f'"Reading {type_of_article} articles"')

        for i in tqdm(range(num_of_article)):
            with open(f'{root_path}/News Articles/{type_of_article}/{(i+1):03d}.txt', 'r', encoding="utf8", errors='ignore') as f:
                # The first line is the headline; everything after it is the body.
                article = f.read().partition("\n")
            with open(f'{root_path}/Summaries/{type_of_article}/{(i+1):03d}.txt', 'r', encoding="utf8", errors='ignore') as f:
                summary = f.read()

            rows.append({'title': article[0],
                         'article': article[2].replace('\n', ' ').replace('\r', ''),
                         'summary': summary})

    # DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows.
    return pd.DataFrame(rows, columns=['title', 'article', 'summary'])


In [9]:
df = loadDataset(root_path)

"Reading business articles"


  0%|          | 0/510 [00:00<?, ?it/s]

"Reading entertainment articles"


  0%|          | 0/386 [00:00<?, ?it/s]

"Reading politics articles"


  0%|          | 0/417 [00:00<?, ?it/s]

"Reading sport articles"


  0%|          | 0/511 [00:00<?, ?it/s]

"Reading tech articles"


  0%|          | 0/401 [00:00<?, ?it/s]
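A minimal sketch of the parsing step inside `loadDataset`, run on a made-up string instead of a real file (the sample text is hypothetical). `str.partition("\n")` splits at the first newline only, so the headline lands in the first element and the body in the third. The blank line that follows the headline is what produces the leading space in the `article` values.

```python
# Hypothetical file content: headline on the first line, body after a blank line.
raw = "Ad sales boost Time Warner profit\n\nProfits jumped 76%.\nSales rose 2%.\n"

# partition() returns (text before the first "\n", the "\n" itself, the rest).
title, _, body = raw.partition("\n")

# Flatten the body onto one line, as loadDataset does.
article = body.replace("\n", " ").replace("\r", "")

print(repr(title))
print(repr(article))
```

Calling `.strip()` on the result would remove the leading and trailing spaces if they are not wanted.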

In [10]:

sample_text = df['title'][0]
print( 'Title' + sample_text)
sample_text = df['article'][0]
print( 'article' + sample_text)
sample_text = df['summary'][0]
print( 'summary' + sample_text)

TitleAd sales boost Time Warner profit
article Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.  The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.  Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to

In [11]:
df.head()

| | title | article | summary |
|---|---|---|---|
| 0 | Ad sales boost Time Warner profit | Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. The firm, which is now one of the biggest investors in Goo... | TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09b... |
| 1 | Dollar gains on Greenspan speech | The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise. And Alan Greenspan highlighted the US g... | The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.China's currency remains pegged to the dol... |
| 2 | Yukos unit buyer faces loan claim | The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan. State-owned Rosneft bought the Yugansk unit for $9.3bn in a s... | Yukos' owner Menatep Group says it will ask Rosneft to repay a loan that Yugansk had secured on its assets.State-owned Rosneft bought the Yugansk unit for $9.3bn in a sale forced by Russia to part... |
| 3 | High fuel prices hit BA's profits | British Airways has blamed high fuel prices for a 40% drop in profits. Reporting its results for the three months to 31 December 2004, the airline made a pre-tax profit of £75m ($141m) compared ... | Rod Eddington, BA's chief executive, said the results were "respectable" in a third quarter when fuel costs rose by £106m or 47.3%.To help offset the increased price of aviation fuel, BA last year... |
| 4 | Pernod takeover talk lifts Domecq | Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard. Reports in the Wall Street Journal and the Financia... | Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.Shares in UK drinks and food firm Allied ... |


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    2225 non-null   object
 1   article  2225 non-null   object
 2   summary  2225 non-null   object
dtypes: object(3)
memory usage: 52.3+ KB


In [13]:
#df.to_csv('bbc.csv')

Make sure your version of Transformers is at least 4.11.0, since the functionality for sharing models described above was introduced in that version:

In [14]:
import transformers

print(transformers.__version__)

4.12.0
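The check can also be done in code rather than by eye. The sketch below is one way to do it with the standard library only; `version_tuple` and `is_at_least` are hypothetical helper names, and the parser deliberately stops at the first non-numeric component (such as `dev0`). Comparing tuples of integers matters because a plain string comparison would rank "4.9.2" above "4.11.0".

```python
def version_tuple(version):
    """Turn '4.12.0' into (4, 12, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, required):
    """Return True if the installed version is not older than the required one."""
    return version_tuple(installed) >= version_tuple(required)


print(is_at_least("4.12.0", "4.11.0"))
print(is_at_least("4.9.2", "4.11.0"))
```

In the notebook, `transformers.__version__` would be passed as the first argument.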


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [BBC dataset](https://www.kaggle.com/pariza/bbc-news-summary), which contains BBC news articles together with their summaries.

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [15]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to wrap the DataFrame built earlier and to get the metric we need to use for evaluation. This can be easily done with `Dataset.from_pandas` and the function `load_metric`.

In [16]:
from datasets import Dataset, DatasetDict, load_metric
raw_datasets = Dataset.from_pandas(df)  # Load from dataframe created earlier

metric = load_metric("rouge")

At this point `raw_datasets` is a single `Dataset`, not yet a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict). The training, validation and test splits are created further down:

In [17]:
raw_datasets

Dataset({
    features: ['title', 'article', 'summary'],
    num_rows: 2225
})

In [18]:
raw_datasets.info 

DatasetInfo(description='', citation='', homepage='', license='', features={'title': Value(dtype='string', id=None), 'article': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name=None, config_name=None, version=None, splits=None, download_checksums=None, download_size=None, post_processing_size=None, dataset_size=None, size_in_bytes=None)

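Decide whether to use the news headline or the summary as the label. To train on summaries only, the `title` column can be dropped by uncommenting the following line: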

In [19]:
# raw_datasets = raw_datasets.remove_columns("title")

## Splitting the dataset into train, validation and test splits

In [20]:
# 90% train, 10% test + validation
train_testvalid = raw_datasets.train_test_split(test_size=0.1)
# Split the 10% test + validation in half test, half validation
test_valid = train_testvalid["test"].train_test_split(test_size=0.5)
# gather everyone if you want to have a single DatasetDict
raw_datasets = DatasetDict({
    "train": train_testvalid["train"],
    "test": test_valid["test"],
    "valid": test_valid["train"]})

raw_datasets


DatasetDict({
    train: Dataset({
        features: ['title', 'article', 'summary'],
        num_rows: 2002
    })
    test: Dataset({
        features: ['title', 'article', 'summary'],
        num_rows: 112
    })
    valid: Dataset({
        features: ['title', 'article', 'summary'],
        num_rows: 111
    })
})
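The split sizes above follow from simple arithmetic on the 2225 rows. The sketch below reproduces the two-stage split on plain indices with the standard library. Rounding the held-out share up is an assumption chosen because it matches the sizes shown, and `split_indices` is a hypothetical helper, not the 🤗 Datasets implementation.

```python
import math
import random


def split_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # assumption: held-out share is rounded up
    return indices[n_test:], indices[:n_test]


# Stage 1: 90% train, 10% held out.
train_idx, holdout_idx = split_indices(2225, 0.1)

# Stage 2: split the held-out rows in half into validation and test.
valid_pos, test_pos = split_indices(len(holdout_idx), 0.5)
valid_idx = [holdout_idx[i] for i in valid_pos]
test_idx = [holdout_idx[i] for i in test_pos]

print(len(train_idx), len(test_idx), len(valid_idx))  # 2002 112 111
```

Fixing the seed makes the split reproducible. As far as I know, `train_test_split` accepts a `seed` argument for the same purpose.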

To access an actual element, you need to select a split first, then give an index:

In [21]:
raw_datasets["train"][0]

{'summary': "Monsanto also has admitted to paying bribes to a number of other high-ranking officials between 1997 and 2002.A former senior manager at Monsanto directed an Indonesian consulting firm to give a $50,000 bribe to a high-level official in Indonesia's environment ministry in 2002.The US agrochemical giant Monsanto has agreed to pay a $1.5m (£799,000) fine for bribing an Indonesian official.Monsanto faced both criminal and civil charges from the Department of Justice and the SEC.Monsanto has agreed to pay $1m to the Department of Justice, adopt internal compliance measures, and co-operate with continuing civil and criminal investigations.Monsanto admitted one of its employees paid the senior official two years ago in a bid to avoid environmental impact studies being conducted on its cotton.",
 'title': 'Monsanto fined $1.5m for bribery',
 'article': ' The US agrochemical giant Monsanto has agreed to pay a $1.5m (£799,000) fine for bribing an Indonesian official.  Monsanto admi

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [22]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
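The retry loop that draws distinct indices can be replaced by `random.sample`, which picks without replacement. This is a small sketch of that alternative. `pick_unique_indices` is a hypothetical helper and the optional `seed` parameter is an addition for reproducibility.

```python
import random


def pick_unique_indices(n_items, num_examples=5, seed=None):
    """Return num_examples distinct indices in the range [0, n_items)."""
    if num_examples > n_items:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # sample() draws without replacement, so no duplicate check is needed.
    return random.Random(seed).sample(range(n_items), num_examples)


picks = pick_unique_indices(2002, num_examples=5, seed=0)
print(picks)
```

The resulting list can be used in place of `picks` in `show_random_elements`.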

In [23]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,title,article,summary
0,Ocean's Twelve raids box office,"Ocean's Twelve, the crime caper sequel starring George Clooney, Brad Pitt and Julia Roberts, has gone straight to number one in the US box office chart. It took $40.8m (£21m) in weekend ticket sales, according to studio estimates. The sequel follows the master criminals as they try to pull off three major heists across Europe. It knocked last week's number one, National Treasure, into third place. Wesley Snipes' Blade: Trinity was in second, taking $16.1m (£8.4m). Rounding out the top five was animated fable The Polar Express, starring Tom Hanks, and festive comedy Christmas with the Kranks. Ocean's Twelve box office triumph marks the fourth-biggest opening for a December release in the US, after the three films in the Lord of the Rings trilogy. The sequel narrowly beat its 2001 predecessor, Ocean's Eleven which took $38.1m (£19.8m) on its opening weekend and $184m (£95.8m) in total. A remake of the 1960s film, starring Frank Sinatra and the Rat Pack, Ocean's Eleven was directed by Oscar-winning director Steven Soderbergh. Soderbergh returns to direct the hit sequel which reunites Clooney, Pitt and Roberts with Matt Damon, Andy Garcia and Elliott Gould. Catherine Zeta-Jones joins the all-star cast. ""It's just a fun, good holiday movie,"" said Dan Fellman, president of distribution at Warner Bros. However, US critics were less complimentary about the $110m (£57.2m) project, with the Los Angeles Times labelling it a ""dispiriting vanity project"". 
A milder review in the New York Times dubbed the sequel ""unabashedly trivial"".","Ocean's Twelve, the crime caper sequel starring George Clooney, Brad Pitt and Julia Roberts, has gone straight to number one in the US box office chart.The sequel narrowly beat its 2001 predecessor, Ocean's Eleven which took $38.1m (£19.8m) on its opening weekend and $184m (£95.8m) in total.A remake of the 1960s film, starring Frank Sinatra and the Rat Pack, Ocean's Eleven was directed by Oscar-winning director Steven Soderbergh.Ocean's Twelve box office triumph marks the fourth-biggest opening for a December release in the US, after the three films in the Lord of the Rings trilogy.Soderbergh returns to direct the hit sequel which reunites Clooney, Pitt and Roberts with Matt Damon, Andy Garcia and Elliott Gould.A milder review in the New York Times dubbed the sequel ""unabashedly trivial""."
1,Kilroy-Silk quits 'shameful' UKIP,"Ex-chat show host Robert Kilroy-Silk has quit the UK Independence Party and accused it of betraying its supporters. The MEP said he was ashamed to have joined the party, which he labelled as a ""joke"". He plans to stand in the next general election but refused to confirm he is setting up a new political party called Veritas - Latin for truth. UKIP leader Roger Knapman said he would ""break open the champagne"", adding: ""It was nice knowing him, now 'goodbye'."" However, he did say the ex-chat show host had been ""quite useful initially"". ""He has remarkable ability to influence people but, sadly, after the (European) election it became clear that he was more interested in the Robert Kilroy-Silk Party than the UK Independence Party so it was nice knowing him, now 'goodbye',"" Mr Knapman told BBC Radio 4's Today programme. Mr Knapman rejected the idea Mr Kilroy-Silk posed a threat to UKIP and queried why he had failed to confirm rumours he was starting a new political party. Mr Kilroy-Silk explained his reasons to his East Midlands constituents at a meeting in Hinckley, Leicestershire. His decision came as UKIP officials began a process which could have triggered Mr Kilroy-Silk's expulsion. It marks the end of his membership of UKIP after just nine months. It began with a flood of publicity which helped UKIP into third place in last June's European elections but became dominated by rancour as he tried to take over the party leadership. Mr Kilroy-Silk accused his fellow UKIP MEPs of being content with growing fat ""sitting on their backsides"" in Brussels. He told BBC News 24: ""I tried to change the party, I nagged all the way through the summer to do things, to get moving because I thought it was criminal what they were doing, it was a betrayal."" Mr Kilroy-Silk also told Sky News there was ""masses of support"" for him to form a new party - something he has yet to confirm will happen. 
UKIP won 12 seats and 16.1% of the vote at the European elections on the back of its call for the UK to leave the European Union In his speech, Mr Kilroy-Silk says the result offered UKIP an ""amazing opportunity"" but the party's leadership had done nothing and ""gone AWOL"". There were no policies, no energy, no vision and no spokespeople, he said. ""The party is going nowhere and I'm embarrassed with its allies in Europe and I'm ashamed to be a member of the party,"" said Mr Kilroy-Silk. He said his conviction in Britain's right to govern itself had not changed. He would continue that campaign outside UKIP when he contested the general election in an East Midlands constituency. Reports of his new party plans have prompted a formal complaint to UKIP's disciplinary committee for bringing the party into ""disrepute"". On Thursday, the party challenged Mr Kilroy-Silk to stand down as an MEP so voters can get a genuine UKIP candidate.","Mr Knapman rejected the idea Mr Kilroy-Silk posed a threat to UKIP and queried why he had failed to confirm rumours he was starting a new political party.""He has remarkable ability to influence people but, sadly, after the (European) election it became clear that he was more interested in the Robert Kilroy-Silk Party than the UK Independence Party so it was nice knowing him, now 'goodbye',"" Mr Knapman told BBC Radio 4's Today programme.On Thursday, the party challenged Mr Kilroy-Silk to stand down as an MEP so voters can get a genuine UKIP candidate.""The party is going nowhere and I'm embarrassed with its allies in Europe and I'm ashamed to be a member of the party,"" said Mr Kilroy-Silk.Mr Kilroy-Silk also told Sky News there was ""masses of support"" for him to form a new party - something he has yet to confirm will happen.The MEP said he was ashamed to have joined the party, which he labelled as a ""joke"".Ex-chat show host Robert Kilroy-Silk has quit the UK Independence Party and accused it of betraying its supporters.UKIP won 
12 seats and 16.1% of the vote at the European elections on the back of its call for the UK to leave the European Union In his speech, Mr Kilroy-Silk says the result offered UKIP an ""amazing opportunity"" but the party's leadership had done nothing and ""gone AWOL"".Mr Kilroy-Silk accused his fellow UKIP MEPs of being content with growing fat ""sitting on their backsides"" in Brussels."
2,Highbury tunnel players in clear,"The Football Association has said it will not be bringing charges over the tunnel incident prior to the Arsenal and Manchester United game. Arsenal's Patrick Vieira had earlier denied accusations that he threatened Gary Neville before the 4-2 defeat. Vieira also clashed with opposing skipper Roy Keane and referee Graham Poll had to separate them. ""The referee has confirmed that he is satisfied he dealt with the incident at the time,"" said an FA statement. It means United's win will pass off without further intervention from the governing body, whose new chief executive Brian Barwick was in the Highbury stands. ""I didn't threaten anybody. They are big enough players to handle themselves,"" said Vieira. ""I had a talk with Roy Keane and that's it. Gary Neville is a big lad, he can handle himself. ""They just played better than us and deserved to win."" Neville admitted there had been incidents before the game, but insisted it had not distracted his focus. ""There were a couple of things that did happen before the game which disappoint you,"" he said. ""Especially from players of that calibre, but it's a tough game and we've been around a long time."" Neville admitted that he had not enjoyed the match, which was punctuated by fouls and the sending off of Mikael Silvestre for head-butting Freddie Ljungberg . ""I thought it was a horrible game in the first half, and it was not much better in the second,"" he said. ""There is no way that should have happened in a football match."" After the match, Keane accused Vieira of starting the row. ""Patrick Vieira is 6ft 4in and having a go at Gary Neville. So I said, 'have a go at me',"" he said. ""If he wants to intimidate our players and thinks that Gary Neville is an easy target, I'm not having it."" Manchester United manager Sir Alex Ferguson added: ""Vieira was well wound up for it. ""I've heard different stories. 
Patrick Vieira has apparently threatened some of our players and things like that.""","They are big enough players to handle themselves,"" said Vieira.""Patrick Vieira is 6ft 4in and having a go at Gary Neville.Arsenal's Patrick Vieira had earlier denied accusations that he threatened Gary Neville before the 4-2 defeat.""I thought it was a horrible game in the first half, and it was not much better in the second,"" he said.Patrick Vieira has apparently threatened some of our players and things like that.""The Football Association has said it will not be bringing charges over the tunnel incident prior to the Arsenal and Manchester United game.So I said, 'have a go at me',"" he said.After the match, Keane accused Vieira of starting the row.""There were a couple of things that did happen before the game which disappoint you,"" he said.Gary Neville is a big lad, he can handle himself."
3,Ebbers 'aware' of WorldCom fraud,"Former WorldCom boss Bernie Ebbers was directly involved in the $11bn financial fraud at the firm, his closest associate has told a US court. Giving evidence in the criminal trial of Mr Ebbers, ex-finance chief Scott Sullivan implicated his colleague in the accounting scandal at the firm. Mr Sullivan, WorldCom's former number two, is the government's chief witness in its case against Mr Ebbers. Mr Ebbers has denied multiple charges of conspiracy and fraud. Senior WorldCom executives are accused of orchestrating a huge fraud at the former telecoms company in which they exaggerated revenues and hid the cost of expenses. The firm was forced into bankruptcy, the largest in US history. Mr Sullivan, 42, pleaded guilty to fraud last year and agreed to assist the government with its case against Mr Ebbers. Prosecutors have alleged that Mr Ebbers, 63, directed Mr Sullivan to hide the true state of the company's finances by providing false information to the firm's accountants. Mr Ebbers has denied all the charges, saying he was unaware of the fraud. His lawyers claim that their client was unfamiliar with detailed accounting practices and left that side of the business to Mr Sullivan. However, on Monday Mr Sullivan named Mr Ebbers as one of five executives who participated in the accounting fraud. ""He [Ebbers] has got a hands-on grasp of financial information,"" Mr Sullivan told a New York court. On his first day of questioning, Mr Sullivan admitted to falsifying the company's financial statements. ""We did not disclose these adjustments,"" he said. ""We did not talk about these adjustments and the information was false."" Mr Sullivan said his former boss knew more about accounting matters than many chief financial officers and described him as ""detail-oriented"". He portrayed Mr Ebbers, a charismatic businessman who built up WorldCom from a small regional operator into one of America's largest telecoms firms, as obsessed with costs. 
""He would talk about that there were more coffee filters than coffee bags and that means employees are taking coffee home,"" he said. ""We needed to cut expenses. We needed to cut a lot more than coffee expenses."" Mr Sullivan is at the centre of the government's case against Mr Ebbers. Mr Ebbers could face a sentence of 85 years if convicted of all the charges he is facing.","Mr Sullivan is at the centre of the government's case against Mr Ebbers.However, on Monday Mr Sullivan named Mr Ebbers as one of five executives who participated in the accounting fraud.Mr Sullivan, WorldCom's former number two, is the government's chief witness in its case against Mr Ebbers.Mr Sullivan, 42, pleaded guilty to fraud last year and agreed to assist the government with its case against Mr Ebbers.Mr Ebbers has denied all the charges, saying he was unaware of the fraud.Prosecutors have alleged that Mr Ebbers, 63, directed Mr Sullivan to hide the true state of the company's finances by providing false information to the firm's accountants.Mr Ebbers has denied multiple charges of conspiracy and fraud.""He [Ebbers] has got a hands-on grasp of financial information,"" Mr Sullivan told a New York court.Mr Sullivan said his former boss knew more about accounting matters than many chief financial officers and described him as ""detail-oriented""."
4,God cut from Dark Materials film,"The director and screenwriter of the film adaptation of Philip Pullman's His Dark Materials is to remove references to God and the church in the movie. Chris Weitz, director of About a Boy, said the changes were being made after film studio New Line expressed concern. The books tell of a battle against the church and a fight to overthrow God. ""They have expressed worry about the possibility of perceived anti-religiosity,"" Weitz told a His Dark Materials fans' website. Pullman's trilogy has been attacked by some Christian teachers and by the Catholic press as blasphemy. Weitz, who admitted he would not be many people's first choice to direct the films, said he regarded the film adaptation as ""the most important work of my life"". ""In part because it is one of the few books to have changed my life,"" he told bridgetothestars.net. The award-winning trilogy - Northern Lights, The Subtle Knife and The Amber Spyglass - tell the story of Oxford school child Lyra Belacqua. She is drawn into an epic struggle against the Church, which has been carrying out experiments on children in an attempt to remove original sin. As the books progress the struggle turns into a battle to overthrow the Authority, a figure who is God-like in the books. Weitz, who directed American Pie and About A Boy, said New Line feared that any anti-religiosity in the film would make the project ""unviable financially"". He said: ""All my best efforts will be directed towards keeping the film as liberating and iconoclastic an experience as I can. ""But there may be some modification of terms."" Weitz said he had visited Pullman, who had told him that the Authority could ""represent any arbitrary establishment that curtails the freedom of the individual, whether it be religious, political, totalitarian, fundamentalist, communist, what have you"". 
He added: ""I have no desire to change the nature or intentions of the villains of the piece, but they may appear in more subtle guises."" There are a number of Christian websites which attack the trilogy for their depiction of the church and of God, but Pullman has denied his books are anti-religious. His agent told the Times newspaper that Pullman was happy with the adaptation so far. ""Of course New Line want to make money, but Mr Weitz is a wonderful director and Philip is very supportive. ""You have to recognise that it is a challenge in the climate of Bush's America,""","Chris Weitz, director of About a Boy, said the changes were being made after film studio New Line expressed concern.There are a number of Christian websites which attack the trilogy for their depiction of the church and of God, but Pullman has denied his books are anti-religious.The director and screenwriter of the film adaptation of Philip Pullman's His Dark Materials is to remove references to God and the church in the movie.Weitz, who directed American Pie and About A Boy, said New Line feared that any anti-religiosity in the film would make the project ""unviable financially"".The books tell of a battle against the church and a fight to overthrow God.Weitz, who admitted he would not be many people's first choice to direct the films, said he regarded the film adaptation as ""the most important work of my life"".""They have expressed worry about the possibility of perceived anti-religiosity,"" Weitz told a His Dark Materials fans' website.Weitz said he had visited Pullman, who had told him that the Authority could ""represent any arbitrary establishment that curtails the freedom of the individual, whether it be religious, political, totalitarian, fundamentalist, communist, what have you""."


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [24]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each predictions
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_agregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [25]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
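To make the numbers above less opaque, here is a simplified ROUGE-1 written from scratch. It is only a sketch: it assumes lowercased whitespace tokenization, no stemming and a single reference, so its scores will not always match the `rouge` metric loaded above. The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Return (precision, recall, f-measure) based on unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


print(rouge1("hello there", "hello there"))
print(rouge1("the cat sat", "the cat sat on the mat"))
```

The second call shows the trade-off the metric captures: every predicted word appears in the reference (precision 1.0), but only half of the reference is covered (recall 0.5).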

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [26]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) # t5-base

By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

Load some required javascript libraries for displaying visualization in notebook

You can directly call this tokenizer on one sentence or a pair of sentences:

In [27]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [28]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


In [29]:
sentence = "He didnt want to talk about cells on the cell phone because he considered it boring"
inputs = tokenizer.encode(sentence, return_tensors='pt', add_special_tokens=True) # return PyTorch tensors
tokens = tokenizer.convert_ids_to_tokens(list(inputs[0])) # Extract sample of batch index 0 from inputs list of lists
print(tokens)

['▁He', '▁didn', 't', '▁want', '▁to', '▁talk', '▁about', '▁cells', '▁on', '▁the', '▁cell', '▁phone', '▁because', '▁', 'he', '▁considered', '▁it', '▁boring', '</s>']
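
In the output above, the `▁` character marks the start of a new word, and `</s>` is the end-of-sequence token. The following sketch shows how such pieces map back to the original sentence. The helper `detokenize` is hypothetical and only illustrates the convention; in practice `tokenizer.decode` does this for you.

```python
# Token list copied from the output above.
tokens = ['▁He', '▁didn', 't', '▁want', '▁to', '▁talk', '▁about', '▁cells',
          '▁on', '▁the', '▁cell', '▁phone', '▁because', '▁', 'he',
          '▁considered', '▁it', '▁boring', '</s>']

def detokenize(pieces, eos_token="</s>"):
    """Rejoin SentencePiece-style pieces; '▁' stands for a preceding space."""
    kept = [piece for piece in pieces if piece != eos_token]
    return "".join(kept).replace("▁", " ").strip()

print(detokenize(tokens))
```

Note that `didnt` is split into `▁didn` and `t`, and `he` into `▁` and `he`: pieces without a leading `▁` are glued to the previous piece.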


If you are using one of the five T5 checkpoints, you have to prefix the inputs with "summarize: " (the model can also translate, and it needs the prefix to know which task it has to perform).

In [30]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""
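
As a small sketch of what the prefix does to the inputs, the snippet below prepends it to a list of documents. The names `T5_CHECKPOINTS` and `add_prefix` are hypothetical helpers introduced for illustration, and the second checkpoint name in the usage lines is only an example of a non-T5 identifier.

```python
T5_CHECKPOINTS = {"t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"}

def add_prefix(documents, checkpoint):
    """Prepend the summarization task prefix for the original T5 checkpoints."""
    prefix = "summarize: " if checkpoint in T5_CHECKPOINTS else ""
    return [prefix + doc for doc in documents]

print(add_prefix(["Shares rose sharply on Monday."], "t5-small"))
print(add_prefix(["Shares rose sharply on Monday."], "facebook/bart-base"))
```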

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than `max_length` is truncated to that length (or to the maximum length accepted by the model when no `max_length` is given). The padding will be dealt with later on (in a data collator), so we pad examples to the longest length in the batch and not the whole dataset.

In [31]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    # x var for summarization. Column article.
    inputs = [prefix + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets y. We are using column summary as label.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [32]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 37, 837, 3, 9, 3844, 14676, 6079, 2963, 7, 288, 32, 65, 4686, 12, 726, 3, 9, 1514, 16593, 51, 41, 19853, 4440, 23938, 61, 1399, 21, 3, 2160, 115, 53, 46, 9995, 29, 2314, 5, 2963, 7, 288, 32, 10246, 80, 13, 165, 1652, 1866, 8, 2991, 2314, 192, 203, 977, 16, 3, 9, 6894, 12, 1792, 3262, 1113, 2116, 271, 4468, 30, 165, 7282, 5, 86, 811, 12, 8, 10736, 6, 2963, 7, 288, 32, 92, 4686, 12, 386, 203, 31, 885, 4891, 13, 165, 268, 2869, 57, 8, 797, 5779, 5, 94, 243, 34, 4307, 423, 3263, 21, 125, 34, 718, 22187, 1087, 5, 71, 1798, 2991, 2743, 44, 2963, 7, 288, 32, 6640, 46, 9995, 29, 9157, 1669, 12, 428, 3, 9, 29788, 3, 2160, 346, 12, 3, 9, 306, 18, 4563, 2314, 16, 9995, 31, 7, 1164, 8409, 16, 4407, 5, 37, 2743, 1219, 8, 349, 12, 31993, 46, 10921, 21, 8, 3, 2160, 346, 38, 96, 29492, 53, 3051, 1280, 2963, 7, 288, 32, 47, 5008, 14537, 8263, 45, 19053, 11, 7208, 113, 130, 2066, 53, 581, 165, 1390, 12, 4277, 6472, 1427, 18, 7360, 3676, 7282, 16, 9995, 5, 3, 4868, 8, 3, 2160, 

To apply this function to all the examples in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training, validation and testing data will be preprocessed in a single command.

In [33]:
# raw_datasets['train'] = raw_datasets['train'].shard(num_shards=100, index=3)
# raw_datasets['validation'] = raw_datasets['validation'].shard(num_shards=100, index=3)
# raw_datasets['test'] = raw_datasets['test'].shard(num_shards=100, index=3)
print(raw_datasets.num_rows)

{'train': 2002, 'test': 112, 'valid': 111}


In [34]:

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
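
To illustrate what a batched function receives, here is a toy pure-Python stand-in for a batched `map`. It is a sketch under stated assumptions, not the 🤗 Datasets implementation: `map_batched` and `count_words` are hypothetical names, rows are plain dicts, and the batch size of 1000 mirrors what I understand to be the library default.

```python
def map_batched(rows, fn, batch_size=1000):
    """Apply fn to chunks of rows, passing each chunk as a dict of lists."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # List of dicts -> dict of lists: the shape a batched function sees.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Add the new columns, then convert back to one dict per row.
        merged = {**batch, **result}
        out.extend(dict(zip(merged, values)) for values in zip(*merged.values()))
    return out

def count_words(batch):
    # Stand-in for tokenization: one output value per input text.
    return {"n_words": [len(text.split()) for text in batch["article"]]}

rows = [{"article": "a b c"}, {"article": "d e"}, {"article": "f"}]
processed = map_batched(rows, count_words, batch_size=2)
print(processed)
```

With 2002 training examples and chunks of 1000, such a loop runs three times, which matches the `0/3` progress bar shown above for the training split.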

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [35]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [36]:
batch_size = 2
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-bbc",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to make three saves maximum. Lastly, we use the `predict_with_generate` option (to properly generate summaries) and activate mixed precision training (to go a bit faster).

The last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [37]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
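
The padding step can be sketched in plain Python as follows. This is a simplified illustration, not the collator's actual code: `pad_batch` is a hypothetical helper, it works on lists instead of tensors, and it assumes a pad token id of 0 (T5's) and a label padding value of -100, the index that PyTorch's cross-entropy loss ignores by default.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in this batch only."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_inputs - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get attention 0 so the model does not attend to them.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels are padded with -100 so padded positions do not contribute to the loss.
        n_label_pad = max_labels - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch

# Token ids reused from the tokenizer output earlier in this notebook.
features = [
    {"input_ids": [8774, 6, 48, 80, 7142, 55, 1], "labels": [8774, 1]},
    {"input_ids": [100, 19, 430, 7142, 5, 1], "labels": [100, 19, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])
print(batch["labels"][0])
```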

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [38]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
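
Two steps of `compute_metrics` are easy to check in isolation: swapping the -100 label padding back to the pad token before decoding, and measuring the mean generated length. The sketch below reproduces both with plain lists instead of NumPy arrays. The helper names are hypothetical, and `PAD_ID = 0` is an assumption matching T5's pad token id.

```python
PAD_ID = 0  # assumption: the tokenizer's pad_token_id (0 for T5)

def restore_pad(labels, pad_id=PAD_ID, ignore_id=-100):
    """Replace the ignore index with the pad id so the labels can be decoded."""
    return [[pad_id if tok == ignore_id else tok for tok in seq] for seq in labels]

def mean_generated_length(predictions, pad_id=PAD_ID):
    """Average number of non-pad tokens per generated sequence."""
    lengths = [sum(tok != pad_id for tok in seq) for seq in predictions]
    return sum(lengths) / len(lengths)

print(restore_pad([[100, 19, 1, -100, -100]]))
print(mean_generated_length([[0, 100, 19, 1], [0, 8774, 1, 0]]))
```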

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [39]:
!git lfs install

Updated git hooks.
Git LFS initialized.


In [40]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

W:\workspace\text_summarization\notebooks\t5-small-finetuned-bbc is already a clone of https://huggingface.co/furyhawk/t5-small-finetuned-bbc. Make sure you pull the latest changes with `repo.git_pull()`.
Using amp fp16 backend


We can now finetune our model by just calling the `train` method:

In [41]:
import torch

In [42]:
torch.cuda.empty_cache()
import gc
# del variables
gc.collect()
torch.cuda.memory_summary(device=None, abbreviated=False)

42



In [43]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\furyx\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [44]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: summary, title, article.
***** Running training *****
  Num examples = 2002
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 1001


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.4882,0.323812,21.2266,16.0927,19.6785,19.8849,19.0


Saving model checkpoint to t5-small-finetuned-bbc\checkpoint-500
Configuration saved in t5-small-finetuned-bbc\checkpoint-500\config.json
Model weights saved in t5-small-finetuned-bbc\checkpoint-500\pytorch_model.bin
tokenizer config file saved in t5-small-finetuned-bbc\checkpoint-500\tokenizer_config.json
Special tokens file saved in t5-small-finetuned-bbc\checkpoint-500\special_tokens_map.json
tokenizer config file saved in t5-small-finetuned-bbc\tokenizer_config.json
Special tokens file saved in t5-small-finetuned-bbc\special_tokens_map.json
Deleting older checkpoint [t5-small-finetuned-bbc\checkpoint-1000] due to args.save_total_limit
Saving model checkpoint to t5-small-finetuned-bbc\checkpoint-1000
Configuration saved in t5-small-finetuned-bbc\checkpoint-1000\config.json
Model weights saved in t5-small-finetuned-bbc\checkpoint-1000\pytorch_model.bin
tokenizer config file saved in t5-small-finetuned-bbc\checkpoint-1000\tokenizer_config.json
Special tokens file saved in t5-small-fin

TrainOutput(global_step=1001, training_loss=0.6741910312440131, metrics={'train_runtime': 384.3744, 'train_samples_per_second': 5.208, 'train_steps_per_second': 2.604, 'total_flos': 400159183208448.0, 'train_loss': 0.6741910312440131, 'epoch': 1.0})
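
The step count and throughput figures reported above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: a single device, and a final partial batch that is kept rather than dropped (irrelevant here, since 2002 divides evenly by 2).

```python
import math

num_examples = 2002         # "Num examples" in the training log
per_device_batch_size = 2   # batch_size set in the training arguments
num_devices = 1             # assumption: single GPU
grad_accum_steps = 1        # "Gradient Accumulation steps" in the log
num_epochs = 1
train_runtime = 384.3744    # seconds, from TrainOutput

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(total_steps, samples_per_second, steps_per_second)
```

The result agrees with `Total optimization steps = 1001` and with the `train_samples_per_second` and `train_steps_per_second` values in `TrainOutput`.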

You can now upload the result of the training to the Hub by executing this instruction:

In [45]:
trainer.push_to_hub()

Saving model checkpoint to t5-small-finetuned-bbc
Configuration saved in t5-small-finetuned-bbc\config.json
Model weights saved in t5-small-finetuned-bbc\pytorch_model.bin
tokenizer config file saved in t5-small-finetuned-bbc\tokenizer_config.json
Special tokens file saved in t5-small-finetuned-bbc\special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 3.36k/231M [00:00<?, ?B/s]

Upload file runs/Oct29_18-54-01_ALIENMEDIA/events.out.tfevents.1635504852.ALIENMEDIA.6320.0:  67%|######6   | …

To https://huggingface.co/furyhawk/t5-small-finetuned-bbc
   6c9ad8d..2b4559c  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Sequence-to-sequence Language Modeling', 'type': 'text2text-generation'}, 'metrics': [{'name': 'Rouge1', 'type': 'rouge', 'value': 21.2266}]}
To https://huggingface.co/furyhawk/t5-small-finetuned-bbc
   2b4559c..9dfd6be  main -> main



'https://huggingface.co/furyhawk/t5-small-finetuned-bbc/commit/2b4559c91b0a0a31965dd90fd8928245ce4f4274'

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```