# Text Summary & Scoring Project
##### Michael Creegan, Yungfeng Dai, Hong Gyu Ji, Ziling Zeng
##### Python for Data Analysis
##### Columbia University

# Abstract

Summarization is a common problem in the 21st century as the world has become increasingly driven by data. A summary helps a reader quickly determine whether a document is relevant or worth reading in full. Summaries of articles can also be stored in a backend so that downstream tasks can run on them. Comparing a summary with its source may also indicate how well the semantic integrity of the text is preserved, which can serve as a signal of quality.

To explore this topic, we will leverage the extreme summarization dataset (XSUM), which consists of BBC articles accompanied by single-sentence summaries. Each article is prefaced with an introductory sentence (which serves as a summary) that is professionally written, typically by the author of the article.

To summarize articles, we will use an encoder-decoder (sequence-to-sequence) transformer, which combines an encoder and a decoder because we need to perform both input and output tasks: taking in text and then generating a summary. The encoder accepts the input text and computes a high-level representation of it, which is then passed to the decoder to generate the output summary. This has advantages over a standalone encoder such as BERT, ALBERT, ELECTRA, RoBERTa, or DistilBERT: encoders are pre-trained by filling in randomly masked words in sentences and are therefore better suited to understanding input text than to generating output. A standalone decoder such as GPT-2 would also not be optimal: decoders are trained to predict the next word in a sequence and see context on only one side of each position, so they are well suited to generating text but less suited to building a full representation of an input document.
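To make the difference in visible context concrete, the following minimal sketch builds the two kinds of attention mask in plain Python. The function names `bidirectional_mask` and `causal_mask` are illustrative assumptions and are not part of the `transformers` API; real models apply equivalent masks internally. In an encoder-decoder model the decoder uses the causal mask over its own output and additionally attends to the full encoder representation.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every other token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Print both masks for a hypothetical 4-token sequence.
print("encoder (bidirectional):")
for row in bidirectional_mask(4):
    print(row)

print("decoder (causal):")
for row in causal_mask(4):
    print(row)
```

Each row is one token and each column is a position it may look at; the zeros in the causal mask are the hidden context that a standalone decoder never sees.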

Our scoring will compare the output of the BART encoder-decoder model to the professionally written summaries in the XSUM dataset, to see how semantically similar a machine-generated summary is to a professional one as well as to its source article. Our scoring methodology focuses on semantic textual similarity, computed as the cosine similarity between the professional human-written summary and the machine-generated one. Cosine similarity measures how similar two documents are irrespective of their size. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. This is advantageous because even if two similar documents are far apart by Euclidean distance (due to a difference in document size), they may still point in nearly the same direction. The smaller the angle, the higher the cosine similarity.
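As a small worked example of the point about size, the sketch below computes cosine similarity, cos θ = (a · b) / (‖a‖ ‖b‖), next to Euclidean distance in plain Python. The three vectors are hypothetical stand-ins for document embeddings, not values produced by the project's model, and the helper names are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]   # same direction, ten times the magnitude
unrelated = [3.0, -1.5, 0.0]    # orthogonal to short_doc: 3 - 3 + 0 = 0

print(cosine_similarity(short_doc, long_doc))    # approximately 1.0
print(euclidean_distance(short_doc, long_doc))   # sqrt(1134), about 33.67
print(cosine_similarity(short_doc, unrelated))   # 0.0
```

The first pair is far apart by Euclidean distance yet has a cosine similarity of about 1 because the vectors point the same way, which is the property the scoring relies on.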

# Importing Transformers & Dependencies

In [2]:
import pandas as pd
import numpy as np
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
from datasets import load_dataset, load_metric
from sentence_transformers import SentenceTransformer, util
import random
from IPython.display import display, HTML

# Load XSUM Dataset

In [3]:
xsum = load_dataset('xsum')

Using custom data configuration default
Reusing dataset xsum (C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934)
100%|██████████| 3/3 [00:00<00:00, 14.71it/s]


### We can see that the dataset is a "DatasetDict" whose keys are strings naming each split and whose values are the corresponding dataset objects. In the XSUM dataset, the keys are "train", "validation", and "test", and each dataset has the columns "document", "summary", and "id"

In [4]:
xsum

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

# View Underlying Data

In [5]:
xsum['test'][0]

{'document': 'Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation.\nWorkers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders.\nThe Welsh Government said more people than ever were getting help to address housing problems.\nChanges to the Housing Act in Wales, introduced in 2015, removed the right for prison leavers to be given priority for accommodation.\nPrison Link Cymru, which helps people find accommodation after their release, said things were generally good for women because issues such as children or domestic violence were now considered.\nHowever, the same could not be said for men, the charity said, because issues which often affect them, such as post traumatic stress disorder or drug dependency, were often viewed as less of a priority.\nAndrew Stevens, who works in Welsh prisons trying to secure housing for prison leavers, said the

## We can use a function to view a random selection of articles and summaries to get a more accurate depiction of what the data looks like in a synthesized format

In [6]:
def display_function(xsum, num_examples=3):
    assert num_examples <= len(xsum)                # cannot request more examples than the split contains

    selections = []                                 # indices of the records chosen so far

    for _ in range(num_examples):                   # _ is used because the loop counter itself is never needed
        selection = random.randint(0, len(xsum) - 1)
        while selection in selections:              # redraw until the index has not been picked before
            selection = random.randint(0, len(xsum) - 1)
        selections.append(selection)

    xsumPd = pd.DataFrame(xsum[selections])         # indexing with a list of indices returns a dict of columns
    display(HTML(xsumPd.to_html()))                 # render the table once (looping over features repeated it per column)
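The selection of unique random indices could also be written with `random.sample`, which draws without replacement and so needs no retry loop. This is a minimal sketch of that alternative; `pick_unique_indices` and its `seed` parameter are hypothetical names and not part of the notebook.

```python
import random


def pick_unique_indices(n_records, num_examples=3, seed=None):
    """Return num_examples distinct indices in the range 0..n_records-1."""
    if num_examples > n_records:
        raise ValueError("num_examples cannot exceed the number of records")
    rng = random.Random(seed)          # a local generator keeps the draw reproducible
    return rng.sample(range(n_records), num_examples)


# 11334 is the number of rows in the XSUM test split shown earlier.
indices = pick_unique_indices(11334, num_examples=3, seed=0)
print(indices)
```

Passing a seed is optional; it only makes the same three rows come back on every run, which can help when comparing output before and after cleaning.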

# Cleaning
Our end goal is to create accurate summaries using this model; therefore, we need to remove the text characters that do not provide any contextual value. There are characters in the article that are not present in the summary and that could cause discrepancies between our machine-generated summary and the professional human-written one. Newline characters appear in the document column but not in the summary column and could present challenges when summarizing and scoring; the cleaning step also strips single and double quote characters from the documents.

In [7]:
display_function(xsum["test"])

Unnamed: 0,document,summary,id
0,"Musgrove Park Hospital said the cause of the ""technical issues"" in the hired facility was being investigated.\nSurgery was carried out on 62 patients with just under half reporting complications while some 15 people had ""more significant"" corneal issues.\nThe use of the mobile facility was ceased when the issues were discovered.\nAll affected patients have been spoken to and care plans are now in place for their ongoing treatment, a hospital spokesperson said.\nVanguard Healthcare had been hired to provide the theatre to help clear a backlog for some ophthalmic services.\nThe unit, which consists of an anaesthetic room, operating theatre and two-bed recovery, is at the hospital until the end of the year.\nVanguard said it was ""co-operating fully"" with the hospital for the investigation which would focus on the drugs, equipment, sterilisation and protocols used.\n""The majority of operations were successful, however a number experienced an unusual level of discomfort after surgery,"" added a spokesperson.\n""We have, with the hospital, conveyed to the patients our concern and sympathy for the discomfort or distress they have experienced.""\nThe hospital's chief executive, Jo Cubbon, said the issues meant many patients who had already waited longer than they should have, will now have to wait again for their operation.\n""We are very sorry this has happened and are working to put a solution in place so that these patients will receive their treatment as soon as possible,"" Ms Cubbon said.",About 30 patients in Somerset who had cataract surgery inside a mobile hospital theatre have been left with blurred vision or other complications.,27606486
1,"The Point stars Caerphilly-born TV presenter and actor Matt Johnson, who plays the younger brother of a man who is convinced he does not want to live anymore.\nThe film will be shot in the Brecon Beacons, near the home of its scriptwriter Jasper Warry.\nMr Johnson is an active ambassador for the mental health charity Mind.\nHe recently presented the documentary Iselder a Fi (Depression and Me) discussing his own personal experiences with depression for S4C.\nAs well as appearing in The Point, he will also be the executive producer.\nHe said the script was ""one of the best he's ever read"".\n""It wonderfully tackles the hugely important issues of mental health in a distinct, moving and humorous way.""\nMr Warry said he was ""made up"" with Mr Johnson's involvement and hopes to start shooting in November.\nHe said they will be using local talent wherever possible, both in front and behind of the camera.",A Welsh feature film which tackles mental health issues is due to be shot in Powys later this year.,40004104
2,"Amirah Droudis, 37, will spend at least 33 years behind bars for killing the woman -who cannot be identified - in 2013.\nDroudis's boyfriend, Man Haron Monis, took 18 people hostage in a Lindt cafe in central Sydney in 2014.\nThe 16-hour siege ended with the deaths of Monis and two hostages when police stormed the building.\nMonis had been charged with being an accessory to his ex-wife's killing, and was on bail at the time of the siege.\nThe Supreme Court of New South Wales heard that Monis planned the 2013 murder and Droudis carried it out.\nThe victim, identified by the pseudonym Helen Lee, was stabbed 18 times before being doused in petrol and set alight outside an apartment in western Sydney.\nAfter the trial, Justice Peter Johnson ruled that Monis recruited Droudis to murder his ex-wife.\n""The offender uncritically adopted and espoused Monis's foul beliefs and acted in public support of him in public protests,"" he said in his sentencing remarks on Wednesday.\nThe judge described Monis as ""an evil man"" whose death was ""a result of his own criminal and murderous acts"".\n""No-one mourns his passing and many have been left to grapple the consequences of his destructive acts,"" he said.\nThe judge acknowledged claims that Droudis had been repeatedly assaulted by Monis.\nDroudis was sentenced to a maximum 44 years in jail with a non-parole period of 33 years.\nDetective Inspector Jason Dickinson, who worked on the case, said he was satisfied with the sentence handed to Droudis.\n""This was a brutal and callous crime and I think the sentence today has reflected that brutality,"" he told the Australian Broadcasting Corp.\nThe victim's family made a statement outside court, thanking the judge, prosecutors and police.\n""Today we are very happy that justice has been served to our only daughter,"" the statement said.\nThe findings of an inquest into the cafe siege are due to be handed down this year.\nHow the Sydney siege unfolded",The girlfriend of a man 
behind a deadly siege in a Sydney cafe has been jailed for murdering his ex-wife.,38822269




## We can address the problem mentioned above by defining a cleaning function that replaces newlines with spaces and removes single and double quote characters (the `\'` and `\"` in the code are escaped quotes, not backslashes).

In [8]:
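def clean(row):
    # Replace newlines with spaces, then drop single and double quote characters.
    # Note: '\'' and '\"' are escaped quotes, so backslashes themselves are left untouched.
    row['document'] = row['document'].replace('\n', ' ')\
                                     .replace('\'', '').replace('\"','')
    return row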
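To check what the cleaning rule does to a single record, the sketch below applies an equivalent `clean` function to a hand-written row. The sample row is hypothetical and only mimics the shape of an XSUM record; the function body is the same rule rewritten with plain quote literals.

```python
def clean(row):
    """Replace newlines with spaces and remove single and double quotes."""
    row["document"] = (
        row["document"].replace("\n", " ").replace("'", "").replace('"', "")
    )
    return row


sample = {
    "document": 'He said the script was "one of the best".\nIt\'s moving.',
    "summary": "A film tackles mental health.",
    "id": "0",
}

cleaned = clean(dict(sample))   # copy first so the original sample is kept
print(cleaned["document"])
# He said the script was one of the best. Its moving.

# A literal backslash is not affected by the rule.
print(clean({"document": "a\\b"})["document"])
# a\b
```

Removing apostrophes also turns contractions such as "It's" into "Its", which is visible in the cleaned articles further down (for example "Weve" and "todays"). Whether that matters for scoring is a design choice this sketch does not settle.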

## We can now map the cleaning function over our data (it is applied to the train, validation, and test splits)

In [9]:
xsum = xsum.map(clean)

Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-fd36b556705cbe4d.arrow
Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-edb3a2dc2f06b92c.arrow
Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-a4042da98a2992a2.arrow
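As a rough mental model of what mapping over the dataset does, the sketch below applies a row-level function to every row of every split using plain dictionaries and lists. `map_splits` and `toy_xsum` are hypothetical stand-ins written for illustration; they are not the `datasets` library implementation, which also handles caching and Arrow-backed storage.

```python
def clean(row):
    """Same cleaning rule as in the notebook, written with plain quotes."""
    row["document"] = (
        row["document"].replace("\n", " ").replace("'", "").replace('"', "")
    )
    return row


def map_splits(splits, fn):
    """Apply fn to a copy of every row in every split and return a new mapping."""
    return {
        name: [fn(dict(row)) for row in rows]
        for name, rows in splits.items()
    }


toy_xsum = {
    "train": [{"document": "First line.\nSecond line.", "summary": "s1", "id": "1"}],
    "validation": [{"document": "It's \"quoted\".", "summary": "s2", "id": "2"}],
    "test": [{"document": "No change needed.", "summary": "s3", "id": "3"}],
}

cleaned = map_splits(toy_xsum, clean)
for name, rows in cleaned.items():
    print(name, "->", rows[0]["document"])
```

The three cache messages above correspond to the three splits, which is consistent with the function being applied once per split.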


### Voila!

In [10]:
display_function(xsum["test"])

Unnamed: 0,document,summary,id
0,"The owner of the 1977 Ford Mustang died in August and when his daughter checked his garage in Welwyn Garden City, she found that the car had disappeared. The family last recall seeing the car, registration VTM 648S, in 1995. Hertfordshire Police said: We have exhausted all lines of inquiry and are waiting for any new information. It is understood the reason the cars disappearance went unnoticed for two decades is the owner was not well enough to visit the garage it was housed in. The force said it was convinced the car was not sold by the owner, as his daughter said she would have been informed, and all of the documentation is still in her late fathers house. Police appealed for information about the missing Mustang six months ago but have yet to receive any leads in the case.",Police say they have reached the end of the road in their search for a classic car thought to have been stolen during the last 20 years.,36368304
1,"The company currently employs 500 people at premises in Glasgow, Robroyston, Livingston, Ayr, Clarkston, Hamilton, Lanark and Clydebank. The expansion plans include new stores in Port Glasgow and Irvine, while other locations are in negotiation. Existing stores in Clarkston, Ayr and Livingston will also be enlarged. The family-owned business, which is currently celebrating its centenary, has forecast a turnover of between £25m and £30m in 2015. Its sells a wide range of products, including clothing and accessories, housewares, jewellery, watches, books, toys and confectionery. Owner Willie Watt said: Our business has changed dramatically since it first opened its doors in Glasgows Sauchiehall Street, when the focus was exclusively on high-end ladies fashions. We have evolved as a business and recognised that a retail offer including a wider range of great value products in a department store-style setting has greater appeal to todays consumer. Weve seen positive growth in terms of both turnover and customer numbers, even in the recent recessionary period, and weve continued our expansion in terms of new store openings. My long-term plan is to have a total of 16 stores throughout Scotland, and were constantly looking at new opportunities to secure ideal locations.",Retailer Watt Brothers has announced plans to create 350 new jobs by opening six new stores and expanding three others over the next few years.,32591214
2,"Amber Rudd told the BBC the manifesto was not going to be identical to the last one and said things had changed since 2015 because of Brexit. The target, set by David Cameron in 2010, has never been met and recent figures put net migration at 273,000. The PM indicated in April that she would stick with the aim. Speaking on a campaign visit last month, Theresa May, who was Ms Rudds predecessor as home secretary, told the BBC: We want to see sustainable net migration in this country. I believe that sustainable net migration is in the tens of thousands. Questions had been raised about whether the commitment would feature in the Conservative manifesto after Culture Secretary Karen Bradley said that immigration was not about putting numbers on it but about ensuring Britain had the skilled workers it needed. Asked whether she agreed with her colleague, Ms Rudd told BBC Radio 5lives Pienaars Politics: Its too early to say. I appreciate you want to push me on this but we are going to have to wait until the manifesto comes out. Pressed on the issue again, she added: Thats why were having a new manifesto. Its not going to be identical to the last one. Were setting it out for hopefully for a five year term, weve got a lot to think through to work out whats the best way to deliver on our priorities. She added: My personal view is, we need to continue to bring immigration down. I want to make sure that we do it in a way that supports businesses, you know were ending freedom of movement when we leaving the European Union. So the situation from that time the [2015] manifesto... has changed because were leaving the European Union, so its right that we look at it again. Ms Rudd also played down the potential impact of excluding students from net migration figures, saying: Its a complete red herring to talk about taking students out of those numbers and it making a big impact. 
This was because, in theory, roughly the same numbers of students should be leaving the UK at the end of their courses as are arriving each year. Official figures out last month show EU migrants make up more than one in 10 manufacturing sector workers in the UK. The government has promised new migration controls after the UK leaves the EU, when freedom of movement rules will no longer apply, but it has yet to set out the precise model it will adopt. Labour says it accepts that the principle of the free movement of people - which EU leaders say goes hand-in-hand with single market membership - would have to end after Brexit. But shadow Brexit secretary Sir Keir Starmer has said new immigration controls should not be the overarching priority as the UK leaves. UKIP has said that Mrs Mays failure to reduce net migration to less than 100,000 while she was home secretary suggests that she could yet back slide on delivering Brexit.","The home secretary has refused to say whether the Conservative manifesto will repeat their 2015 pledge to cut net migration to the ""tens of thousands"".",39837199


Unnamed: 0,document,summary,id
0,"The owner of the 1977 Ford Mustang died in August and when his daughter checked his garage in Welwyn Garden City, she found that the car had disappeared. The family last recall seeing the car, registration VTM 648S, in 1995. Hertfordshire Police said: We have exhausted all lines of inquiry and are waiting for any new information. It is understood the reason the cars disappearance went unnoticed for two decades is the owner was not well enough to visit the garage it was housed in. The force said it was convinced the car was not sold by the owner, as his daughter said she would have been informed, and all of the documentation is still in her late fathers house. Police appealed for information about the missing Mustang six months ago but have yet to receive any leads in the case.",Police say they have reached the end of the road in their search for a classic car thought to have been stolen during the last 20 years.,36368304
1,"The company currently employs 500 people at premises in Glasgow, Robroyston, Livingston, Ayr, Clarkston, Hamilton, Lanark and Clydebank. The expansion plans include new stores in Port Glasgow and Irvine, while other locations are in negotiation. Existing stores in Clarkston, Ayr and Livingston will also be enlarged. The family-owned business, which is currently celebrating its centenary, has forecast a turnover of between Â£25m and Â£30m in 2015. Its sells a wide range of products, including clothing and accessories, housewares, jewellery, watches, books, toys and confectionery. Owner Willie Watt said: Our business has changed dramatically since it first opened its doors in Glasgows Sauchiehall Street, when the focus was exclusively on high-end ladies fashions. We have evolved as a business and recognised that a retail offer including a wider range of great value products in a department store-style setting has greater appeal to todays consumer. Weve seen positive growth in terms of both turnover and customer numbers, even in the recent recessionary period, and weve continued our expansion in terms of new store openings. My long-term plan is to have a total of 16 stores throughout Scotland, and were constantly looking at new opportunities to secure ideal locations.",Retailer Watt Brothers has announced plans to create 350 new jobs by opening six new stores and expanding three others over the next few years.,32591214
2,"Amber Rudd told the BBC the manifesto was not going to be identical to the last one and said things had changed since 2015 because of Brexit. The target, set by David Cameron in 2010, has never been met and recent figures put net migration at 273,000. The PM indicated in April that she would stick with the aim. Speaking on a campaign visit last month, Theresa May, who was Ms Rudds predecessor as home secretary, told the BBC: We want to see sustainable net migration in this country. I believe that sustainable net migration is in the tens of thousands. Questions had been raised about whether the commitment would feature in the Conservative manifesto after Culture Secretary Karen Bradley said that immigration was not about putting numbers on it but about ensuring Britain had the skilled workers it needed. Asked whether she agreed with her colleague, Ms Rudd told BBC Radio 5lives Pienaars Politics: Its too early to say. I appreciate you want to push me on this but we are going to have to wait until the manifesto comes out. Pressed on the issue again, she added: Thats why were having a new manifesto. Its not going to be identical to the last one. Were setting it out for hopefully for a five year term, weve got a lot to think through to work out whats the best way to deliver on our priorities. She added: My personal view is, we need to continue to bring immigration down. I want to make sure that we do it in a way that supports businesses, you know were ending freedom of movement when we leaving the European Union. So the situation from that time the [2015] manifesto... has changed because were leaving the European Union, so its right that we look at it again. Ms Rudd also played down the potential impact of excluding students from net migration figures, saying: Its a complete red herring to talk about taking students out of those numbers and it making a big impact. 
This was because, in theory, roughly the same numbers of students should be leaving the UK at the end of their courses as are arriving each year. Official figures out last month show EU migrants make up more than one in 10 manufacturing sector workers in the UK. The government has promised new migration controls after the UK leaves the EU, when freedom of movement rules will no longer apply, but it has yet to set out the precise model it will adopt. Labour says it accepts that the principle of the free movement of people - which EU leaders say goes hand-in-hand with single market membership - would have to end after Brexit. But shadow Brexit secretary Sir Keir Starmer has said new immigration controls should not be the overarching priority as the UK leaves. UKIP has said that Mrs Mays failure to reduce net migration to less than 100,000 while she was home secretary suggests that she could yet back slide on delivering Brexit.","The home secretary has refused to say whether the Conservative manifesto will repeat their 2015 pledge to cut net migration to the ""tens of thousands"".",39837199




## We can view the column names and data types of our dataset using .features

In [11]:
xsum['test'].features

{'document': Value(dtype='string', id=None),
 'summary': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None)}

In [12]:
print(xsum['test'].info)

DatasetInfo(description='\nExtreme Summarization (XSum) Dataset.\n\nThere are three features:\n  - document: Input news article.\n  - summary: One sentence summary of the article.\n  - id: BBC ID of the article.\n\n', citation="\n@article{Narayan2018DontGM,\n  title={Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization},\n  author={Shashi Narayan and Shay B. Cohen and Mirella Lapata},\n  journal={ArXiv},\n  year={2018},\n  volume={abs/1808.08745}\n}\n", homepage='https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset', license='', features={'document': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=SupervisedKeysData(input='document', output='summary'), task_templates=None, builder_name='xsum', config_name='default', version=1.2.0, splits={'train': SplitInfo(name='train', num_bytes=479206615, num_examples=204045, data

# Preparing XSUM Data
Before we can pass the text to a model, we need to convert it into a format that the transformer can understand. Encoders and decoders only operate on numerical values, so we need to tokenize the text and then convert the tokens into numbers. The tokenizer splits text into tokens (for BART these are subword units rather than whole words) and adds any special tokens the model expects based on its pretraining. It then maps each token to a unique id in its vocabulary. Inside the model, each id is looked up as a vector of numerical values, and the self-attention mechanism turns these into contextualized vectors. For example, the vector representation of the word "to" isn't just "to"; it also takes into account the words around it, which are called its context (left and right context). In the sentence "Welcome to NYC", the left context of "to" is "Welcome" and the right context is "NYC", and the output vector for "to" depends on both.

We could do all of this with the AutoTokenizer.from_pretrained method, which returns a tokenizer that corresponds to the model architecture we want to use (facebook/bart-large-cnn). However, we explicitly use BartTokenizer and BartForConditionalGeneration with the same checkpoint so that the tokenizer and the model were trained using the same methodology, which helps us avoid unexpected summaries.
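To make the token-to-id step concrete, here is a minimal toy sketch. It is not BART's tokenizer: it splits on whitespace and uses a hand-made vocabulary, whereas the real tokenizer uses subword units and a far larger vocabulary. The names `TOY_VOCAB`, `toy_tokenize`, and `toy_convert_tokens_to_ids` are hypothetical, and only the ids 0 and 2 for the start and end tokens are chosen to mirror `bos_token_id` and `eos_token_id` in the model config; the remaining ids are arbitrary.

```python
# Toy illustration only: a whitespace tokenizer with a hand-made vocabulary.
# The ids for <pad>, <unk>, and the three words are assumptions for this sketch.
TOY_VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
             "Welcome": 4, "to": 5, "NYC": 6}


def toy_tokenize(text):
    """Split on whitespace and wrap the tokens in start/end special tokens."""
    return ["<s>"] + text.split() + ["</s>"]


def toy_convert_tokens_to_ids(tokens, vocab=TOY_VOCAB):
    """Look up each token's id, falling back to the <unk> id if it is missing."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


tokens = toy_tokenize("Welcome to NYC")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)  # ['<s>', 'Welcome', 'to', 'NYC', '</s>']
print(ids)     # [0, 4, 5, 6, 2]
```

The ids themselves carry no context; contextualization only happens once the model processes the whole sequence.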

In [13]:
checkpoint = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

## We now write a function that preprocesses the data by passing it to the tokenizer. We use the argument truncation=True so that any input longer than the model can handle is truncated to the maximum length allowed. BART accepts at most 1024 tokens per sequence, which we can see under max_position_embeddings in the model config.

In [14]:
model.config

BartConfig {
  "_name_or_path": "facebook/bart-large-cnn",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "force_bos_token_to_be_generated": true,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "L

## We can now create the function with the maximum input length allowed as per the config and a maximum target length of 60, which is explained in the section where we compare human summaries and machine summaries to each other and the original articles

In [15]:
max_input_length = 1024
max_target_length = 60


def preperation_function(examples):
    # Tokenize the articles, truncating to the model's maximum input length
    # and padding each batch to its longest sequence
    inputs = [doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True)

    # Tokenize the reference summaries; "as_target_tokenizer" temporarily
    # switches the tokenizer to its target-side settings for the labels
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
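
The effect of `truncation=True` together with `max_length` can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual code: the helper `toy_truncate` is hypothetical, and it assumes that the last id of a truncated sequence remains the end-of-sequence id (2 in the config above).

```python
def toy_truncate(ids, max_length, eos_id=2):
    """Cut a sequence to max_length ids, keeping the end-of-sequence id last."""
    if len(ids) <= max_length:
        return list(ids)
    return list(ids[: max_length - 1]) + [eos_id]


# A made-up sequence of 1502 ids, longer than the 1024 positions BART accepts.
long_ids = [0] + list(range(10, 1510)) + [2]
truncated = toy_truncate(long_ids, max_length=1024)

print(len(long_ids), len(truncated))  # 1502 1024
print(truncated[0], truncated[-1])    # 0 2
```

Anything past the cut-off is simply dropped, so for very long articles the model never sees the final paragraphs.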

## We can apply this function to our dataset using map

In [16]:
tokenized_xsum = xsum.map(preperation_function, batched=True)

Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-d006ce488ae4d44a.arrow
Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-e5f53a81412bec57.arrow
100%|██████████| 12/12 [00:22<00:00,  1.90s/ba]
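
With `batched=True`, the mapped function receives a batch as a dictionary of lists (one list per column) rather than a single row, which is why the function above indexes `examples["document"]` and gets a list of articles. The following is a rough sketch of that calling convention only; `toy_batched_map` and `add_lengths` are hypothetical helpers and do not reflect how the `datasets` library is implemented.

```python
def toy_batched_map(rows, function, batch_size=2):
    """Call `function` on column-wise batches and merge its output into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict of lists: one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_cols.items()})
            out.append(merged)
    return out


def add_lengths(examples):
    # Receives lists of values, one entry per row in the batch.
    return {"n_words": [len(doc.split()) for doc in examples["document"]]}


rows = [{"document": "a b c"}, {"document": "d e"}, {"document": "f"}]
result = toy_batched_map(rows, add_lengths)
print(result[0])  # {'document': 'a b c', 'n_words': 3}
```

The new columns returned by the function are added alongside the existing ones, which matches the extra `attention_mask`, `input_ids`, and `labels` features that appear after the map.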


In [17]:
tokenized_xsum

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'document', 'id', 'input_ids', 'labels', 'summary'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['attention_mask', 'document', 'id', 'input_ids', 'labels', 'summary'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['attention_mask', 'document', 'id', 'input_ids', 'labels', 'summary'],
        num_rows: 11334
    })
})

In [18]:
tokenized_xsum['test'].features

{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'document': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'summary': Value(dtype='string', id=None)}

## The attention mask tells the model which tokens to attend to: a value of 1 marks a token to consider and a value of 0 marks a token to ignore (such as padding). The input ids are the numerical mapping of tokens to BART's vocabulary; each token in the vocabulary is assigned a unique integer id.
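
As a small illustration of how padding and the attention mask relate, the sketch below pads a batch of id sequences to the length of the longest one and builds the matching masks. The helper `toy_pad_batch` is hypothetical, the ids are short made-up sequences, and the padding id of 1 is an assumption for this example.

```python
def toy_pad_batch(batch_ids, pad_id=1):
    """Pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 for real tokens, 0 for the padding positions added above.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_pad_batch([[0, 250, 4252, 2], [0, 32743, 2]])
print(batch["input_ids"])       # [[0, 250, 4252, 2], [0, 32743, 2, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```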

In [19]:
display_function(tokenized_xsum['test'])

Unnamed: 0,attention_mask,document,id,input_ids,labels,summary
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Marc Carters plea to find a replacement sippy cup for son Ben was retweeted more than 12,000 times. The 14-year-old has only drunk from the double-handled vessels, which are no longer produced, since the age of two. Tommee Tippee said it will produce 500 cups after it searched factories worldwide and found the original mould. The firms attention was drawn to the familys plight when Mr Carter launched the Twitter appeal to find a replacement. More on a dads desperate search for a cup, and other stories His original plea prompted offers of help from as far away as Australia. Mr Carter, 42, said the response from well-wishers had been incredible and it was a huge surprise to be contacted by the manufacturer. Mr Carter said: For me its massive. Some people think Im exaggerating but without it he doesnt drink so personally Im very relieved. Tommee Tippee will send the cups on demand for free to the Carter family. Mr Carter said: I would not be happier if I won the lottery. Weve moved down to the middle of nowhere and dont want much. Just knowing he has got these cups gives us peace of mind. Northumberland-based Tommee Tippee does not normally keep the moulds but had been searching factories around the world in the hope of finding the original plans. A spokesman said: We are delighted to confirm that we are able to start production on a run of the original cup. This will ensure that Ben has a lifetime supply and that his family wont ever have to worry about finding another cup for Ben. Mr Carter, from Devon, told the BBC his son has had his current blue cup for three years, but it is now falling apart and may only last a few more weeks. 
He said: This tiny blue cup dictates our life.",38141319,"[0, 28987, 1653, 2696, 6221, 7, 465, 10, 5010, 579, 31177, 4946, 13, 979, 1664, 21, 24352, 196, 55, 87, 316, 6, 151, 498, 4, 20, 501, 12, 180, 12, 279, 34, 129, 10789, 31, 5, 1457, 12, 42536, 9048, 6, 61, 32, 117, 1181, 2622, 6, 187, 5, 1046, 9, 80, 4, 1560, 1794, 242, 255, 5600, 1942, 26, 24, 40, 2592, 1764, 12988, 71, 24, 10593, 12126, 3612, 8, 303, 5, 1461, 27421, 4, 20, 2566, 1503, 21, 4777, 7, 5, 284, 29, 18318, 77, 427, 5306, 1660, 5, 599, 2868, 7, 465, 10, 5010, 4, 901, 15, ...]","[0, 250, 4252, 18, 7764, 1707, 7, 3190, 39, 33329, 979, 18, 8055, 22, 27635, 2440, 4946, 113, 34, 1249, 111, 71, 5, 7508, 4425, 11, 7, 146, 10, 7370, 18, 1787, 4, 2]","A dad's desperate search to replace his autistic son's beloved ""little blue cup"" has ended - after the manufacturer stepped in to make a lifetime's supply."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Media playback is not supported on this device The world number two inspired Great Britain to win the Davis Cup for the first time in 79 years with victory over Belgium in Ghent over the weekend. Captain Leon Smith urged the LTA to use that triumph to inspire future players, but Murray, 28, said he did not know where the next generation are. Nothing ever gets done and I dont like wasting my time, he said. The Scot added he has not discussed the lack of young British players competing in Grand Slams with LTA chief executive Michael Downey. Media playback is not supported on this device I dont speak to any of the people who are in a high-up position about that, Murray revealed. I havent really spoken to them about anything. Its concerning not to have any juniors in the Grand Slams because that is something we were always very good at. Its not ideal. Downey earlier said Britains Davis Cup win was a special, emotional moment that could drive interest in the sport. Before the final in Ghent, Murray was criticised by former Great Britain Davis Cup captain David Lloyd for not putting enough back into the game. Id rather concentrate on my own stuff and when Ive finished playing, Ill have a lot more time to try and help or give back to the game, Murray added. Just now, Ive just got to concentrate on trying to win as much as possible. Murray said one of his main frustrations was a lack of players to practise with whenever he is in the UK. After returning from the Shanghai Masters in October, Murray said he arrived at the National Training Centre in London to find no other players present. 
I was there on a Monday at about 3pm and then on Tuesday, at the same time, he said. There was not one person using any of the indoor courts and not one person in the gym. I took photos of it because the place cost like £40m and there are no people. Prior to Murrays comments, Smith said the LTA needed to quickly create a long-term strategy to capitalise on his teams victory. Smith also praised Judy Murrays tennis programmes but said the mother of British number one Andy and doubles specialist Jamie needs a lot of help. She cant keep doing it on her own, he added. Media playback is not supported on this device Murray leads the LTAs Miss-Hits programme - an introductory course for girls aged between five and eight - and a Scottish-based scheme, Tennis on the Road. Smith, who became Davis Cup captain five years ago with the team a play-off away from relegation to the events lowest tier, added: At the end of the day, we all care about British tennis a lot. What we want to see is more people playing, so there should be a bigger talent pool in years to come. It really is an important time to get strategies rolled out as quickly as possible, not only to get people on the court but to keep them on the court. We need to offer them good clubs and good coaches that turn up in all weather and bang out great sessions. Lets hope it has a positive influence, because it should do. The LTA was criticised for failing to capitalise on Murrays Wimbledon victory in 2013 with participation levels falling in the aftermath. But LTA chief Downey said the coverage created by Britains successful weekend in Belgium should help encourage participation. 
In the most recent figures released by Sport England, for the six months up to March 2015, tennis participation was up.",34970935,"[0, 18801, 20083, 16, 45, 2800, 15, 42, 2187, 20, 232, 346, 80, 4083, 2860, 1444, 7, 339, 5, 2505, 968, 13, 5, 78, 86, 11, 7589, 107, 19, 1124, 81, 7320, 11, 272, 37754, 81, 5, 983, 4, 8977, 9213, 1259, 2966, 5, 226, 3847, 7, 304, 14, 10121, 7, 9769, 499, 472, 6, 53, 4479, 6, 971, 6, 26, 37, 222, 45, 216, 147, 5, 220, 2706, 32, 4, 10385, 655, 1516, 626, 8, 38, 33976, 101, 21025, 127, 86, 6, 37, 26, 4, 20, 10400, 355, 37, 34, 45, 3373, 5, 1762, 9, 664, 1089, 472, 5468, ...]","[0, 32743, 4479, 161, 1686, 7, 5, 23970, 14731, 1544, 59, 5, 499, 9, 1089, 5919, 16, 10, 3844, 9, 39, 86, 4, 2]",Andy Murray says talking to the Lawn Tennis Association about the future of British tennis is a waste of his time.
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","The 25-year old striker is at Euro 2016 with Belgium but has fallen out of favour at Liverpool since boss Jurgen Klopp took over last October. The forward joined the Reds for £32.5m in July 2015 under former boss Brendan Rodgers and has scored 10 goals. He has started only eight league games under Klopp and last month said: Id like to stay if I remain in the coachs plans. If not, itll become difficult. Palace, who completed the signings of England winger Andros Townsend from Newcastle and France international goalkeeper Steve Mandanda from Marseille on Friday, were also reportedly interested in Bentekes Belgium team-mate Michy Batshuayi. But the 22-year-old forward has been linked with a £33m move to Chelsea.",36689057,"[0, 133, 564, 12, 180, 793, 5955, 16, 23, 5122, 336, 19, 7320, 53, 34, 4491, 66, 9, 5976, 23, 3426, 187, 3504, 344, 7150, 225, 11116, 362, 81, 94, 779, 4, 20, 556, 1770, 5, 9269, 13, 984, 2881, 4, 245, 119, 11, 550, 570, 223, 320, 3504, 13015, 9122, 8, 34, 1008, 158, 1175, 4, 91, 34, 554, 129, 799, 1267, 426, 223, 11116, 8, 94, 353, 26, 35, 10367, 101, 7, 1095, 114, 38, 1091, 11, 5, 704, 29, 708, 4, 318, 45, 6, 24, 890, 555, 1202, 4, 5928, 6, 54, 2121, 5, 21769, 9, 1156, ...]","[0, 42904, 5928, 33, 156, 10, 984, 1244, 119, 2311, 7, 1203, 2412, 15464, 242, 1071, 31, 3426, 4, 2]",Crystal Palace have made a £25m bid to sign Christian Benteke from Liverpool.




Unnamed: 0,attention_mask,document,id,input_ids,labels,summary
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Marc Carters plea to find a replacement sippy cup for son Ben was retweeted more than 12,000 times. The 14-year-old has only drunk from the double-handled vessels, which are no longer produced, since the age of two. Tommee Tippee said it will produce 500 cups after it searched factories worldwide and found the original mould. The firms attention was drawn to the familys plight when Mr Carter launched the Twitter appeal to find a replacement. More on a dads desperate search for a cup, and other stories His original plea prompted offers of help from as far away as Australia. Mr Carter, 42, said the response from well-wishers had been incredible and it was a huge surprise to be contacted by the manufacturer. Mr Carter said: For me its massive. Some people think Im exaggerating but without it he doesnt drink so personally Im very relieved. Tommee Tippee will send the cups on demand for free to the Carter family. Mr Carter said: I would not be happier if I won the lottery. Weve moved down to the middle of nowhere and dont want much. Just knowing he has got these cups gives us peace of mind. Northumberland-based Tommee Tippee does not normally keep the moulds but had been searching factories around the world in the hope of finding the original plans. A spokesman said: We are delighted to confirm that we are able to start production on a run of the original cup. This will ensure that Ben has a lifetime supply and that his family wont ever have to worry about finding another cup for Ben. Mr Carter, from Devon, told the BBC his son has had his current blue cup for three years, but it is now falling apart and may only last a few more weeks. 
He said: This tiny blue cup dictates our life.",38141319,"[0, 28987, 1653, 2696, 6221, 7, 465, 10, 5010, 579, 31177, 4946, 13, 979, 1664, 21, 24352, 196, 55, 87, 316, 6, 151, 498, 4, 20, 501, 12, 180, 12, 279, 34, 129, 10789, 31, 5, 1457, 12, 42536, 9048, 6, 61, 32, 117, 1181, 2622, 6, 187, 5, 1046, 9, 80, 4, 1560, 1794, 242, 255, 5600, 1942, 26, 24, 40, 2592, 1764, 12988, 71, 24, 10593, 12126, 3612, 8, 303, 5, 1461, 27421, 4, 20, 2566, 1503, 21, 4777, 7, 5, 284, 29, 18318, 77, 427, 5306, 1660, 5, 599, 2868, 7, 465, 10, 5010, 4, 901, 15, ...]","[0, 250, 4252, 18, 7764, 1707, 7, 3190, 39, 33329, 979, 18, 8055, 22, 27635, 2440, 4946, 113, 34, 1249, 111, 71, 5, 7508, 4425, 11, 7, 146, 10, 7370, 18, 1787, 4, 2]","A dad's desperate search to replace his autistic son's beloved ""little blue cup"" has ended - after the manufacturer stepped in to make a lifetime's supply."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Media playback is not supported on this device The world number two inspired Great Britain to win the Davis Cup for the first time in 79 years with victory over Belgium in Ghent over the weekend. Captain Leon Smith urged the LTA to use that triumph to inspire future players, but Murray, 28, said he did not know where the next generation are. Nothing ever gets done and I dont like wasting my time, he said. The Scot added he has not discussed the lack of young British players competing in Grand Slams with LTA chief executive Michael Downey. Media playback is not supported on this device I dont speak to any of the people who are in a high-up position about that, Murray revealed. I havent really spoken to them about anything. Its concerning not to have any juniors in the Grand Slams because that is something we were always very good at. Its not ideal. Downey earlier said Britains Davis Cup win was a special, emotional moment that could drive interest in the sport. Before the final in Ghent, Murray was criticised by former Great Britain Davis Cup captain David Lloyd for not putting enough back into the game. Id rather concentrate on my own stuff and when Ive finished playing, Ill have a lot more time to try and help or give back to the game, Murray added. Just now, Ive just got to concentrate on trying to win as much as possible. Murray said one of his main frustrations was a lack of players to practise with whenever he is in the UK. After returning from the Shanghai Masters in October, Murray said he arrived at the National Training Centre in London to find no other players present. 
I was there on a Monday at about 3pm and then on Tuesday, at the same time, he said. There was not one person using any of the indoor courts and not one person in the gym. I took photos of it because the place cost like £40m and there are no people. Prior to Murrays comments, Smith said the LTA needed to quickly create a long-term strategy to capitalise on his teams victory. Smith also praised Judy Murrays tennis programmes but said the mother of British number one Andy and doubles specialist Jamie needs a lot of help. She cant keep doing it on her own, he added. Media playback is not supported on this device Murray leads the LTAs Miss-Hits programme - an introductory course for girls aged between five and eight - and a Scottish-based scheme, Tennis on the Road. Smith, who became Davis Cup captain five years ago with the team a play-off away from relegation to the events lowest tier, added: At the end of the day, we all care about British tennis a lot. What we want to see is more people playing, so there should be a bigger talent pool in years to come. It really is an important time to get strategies rolled out as quickly as possible, not only to get people on the court but to keep them on the court. We need to offer them good clubs and good coaches that turn up in all weather and bang out great sessions. Lets hope it has a positive influence, because it should do. The LTA was criticised for failing to capitalise on Murrays Wimbledon victory in 2013 with participation levels falling in the aftermath. But LTA chief Downey said the coverage created by Britains successful weekend in Belgium should help encourage participation. 
In the most recent figures released by Sport England, for the six months up to March 2015, tennis participation was up.",34970935,"[0, 18801, 20083, 16, 45, 2800, 15, 42, 2187, 20, 232, 346, 80, 4083, 2860, 1444, 7, 339, 5, 2505, 968, 13, 5, 78, 86, 11, 7589, 107, 19, 1124, 81, 7320, 11, 272, 37754, 81, 5, 983, 4, 8977, 9213, 1259, 2966, 5, 226, 3847, 7, 304, 14, 10121, 7, 9769, 499, 472, 6, 53, 4479, 6, 971, 6, 26, 37, 222, 45, 216, 147, 5, 220, 2706, 32, 4, 10385, 655, 1516, 626, 8, 38, 33976, 101, 21025, 127, 86, 6, 37, 26, 4, 20, 10400, 355, 37, 34, 45, 3373, 5, 1762, 9, 664, 1089, 472, 5468, ...]","[0, 32743, 4479, 161, 1686, 7, 5, 23970, 14731, 1544, 59, 5, 499, 9, 1089, 5919, 16, 10, 3844, 9, 39, 86, 4, 2]",Andy Murray says talking to the Lawn Tennis Association about the future of British tennis is a waste of his time.
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","The 25-year old striker is at Euro 2016 with Belgium but has fallen out of favour at Liverpool since boss Jurgen Klopp took over last October. The forward joined the Reds for £32.5m in July 2015 under former boss Brendan Rodgers and has scored 10 goals. He has started only eight league games under Klopp and last month said: Id like to stay if I remain in the coachs plans. If not, itll become difficult. Palace, who completed the signings of England winger Andros Townsend from Newcastle and France international goalkeeper Steve Mandanda from Marseille on Friday, were also reportedly interested in Bentekes Belgium team-mate Michy Batshuayi. But the 22-year-old forward has been linked with a £33m move to Chelsea.",36689057,"[0, 133, 564, 12, 180, 793, 5955, 16, 23, 5122, 336, 19, 7320, 53, 34, 4491, 66, 9, 5976, 23, 3426, 187, 3504, 344, 7150, 225, 11116, 362, 81, 94, 779, 4, 20, 556, 1770, 5, 9269, 13, 984, 2881, 4, 245, 119, 11, 550, 570, 223, 320, 3504, 13015, 9122, 8, 34, 1008, 158, 1175, 4, 91, 34, 554, 129, 799, 1267, 426, 223, 11116, 8, 94, 353, 26, 35, 10367, 101, 7, 1095, 114, 38, 1091, 11, 5, 704, 29, 708, 4, 318, 45, 6, 24, 890, 555, 1202, 4, 5928, 6, 54, 2121, 5, 21769, 9, 1156, ...]","[0, 42904, 5928, 33, 156, 10, 984, 1244, 119, 2311, 7, 1203, 2412, 15464, 242, 1071, 31, 3426, 4, 2]",Crystal Palace have made a £25m bid to sign Christian Benteke from Liverpool.
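
The id columns in the preview above are easier to read with the tokenizer's special-token conventions in mind. The sketch below assumes the ids come from a BART tokenizer, where 0, 1 and 2 are conventionally the `<s>`, `<pad>` and `</s>` tokens. The helper names `is_wrapped` and `content_ids` are illustrative and not part of the notebook. It checks that a `labels` sequence is wrapped in start and end tokens, and counts how many ids carry actual summary text.

```python
# Assumed special-token ids (BART/RoBERTa vocabulary convention).
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2

# `labels` for row 2, copied from the preview above.
labels = [0, 42904, 5928, 33, 156, 10, 984, 1244, 119, 2311,
          7, 1203, 2412, 15464, 242, 1071, 31, 3426, 4, 2]


def is_wrapped(ids):
    """Return True if the sequence starts with <s> and ends with </s>."""
    return len(ids) >= 2 and ids[0] == BOS_ID and ids[-1] == EOS_ID


def content_ids(ids):
    """Drop special tokens, keeping only ids that map to text."""
    specials = {BOS_ID, PAD_ID, EOS_ID}
    return [i for i in ids if i not in specials]


print(is_wrapped(labels))
print(len(labels), len(content_ids(labels)))
```

An `attention_mask` made up entirely of 1s, as in the rows shown, would under the same assumption mean that the visible part of each input holds real tokens rather than padding.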



In [20]:
tokenized_xsum['test'].features

{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'document': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'summary': Value(dtype='string', id=None)}

# Compare Machine Summaries to Professional Human-Written Summaries
To score our machine-generated summaries against professional human-written ones, we compute the cosine similarity between sentence embeddings, which measures the semantic similarity between two texts. We make three comparisons: human summary to machine summary, human summary to the original document, and machine summary to the original document. Initially, we wanted to cap each machine summary at the same length as the summaries in XSUM. However, because the XSUM summaries are so short (hence the name extreme summarization), the model only returned the first words of each article. This makes sense because BART's pretraining likely taught it that the start of a text often contains valuable summarization information. As a result, we opted for a maximum length of 60 (the max_length argument to generate, which counts tokens rather than words) to keep the output brief but allow the model to produce enough context to be meaningful. The word counts of the human summaries for our 10 selected articles are computed below (about 19 words per human summary on average).
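As a minimal sketch of the metric itself (not the notebook's implementation, which relies on util.pytorch_cos_sim from sentence-transformers), cosine similarity can be computed directly from two vectors. The function name and the toy vectors here are illustrative assumptions, not embeddings of real articles.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made up for illustration only).
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction, ten times the magnitude
unrelated = [0.0, 0.0, 3.0]    # orthogonal to short_doc

print(cosine_similarity(short_doc, long_doc))   # close to 1: same orientation
print(cosine_similarity(short_doc, unrelated))  # 0: no shared direction
```

The first pair shows why the metric is insensitive to vector magnitude: scaling a vector does not change its direction, so the similarity stays at (approximately) 1.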

We focus on 10 articles and run the summarizer on each one separately (labelled Model 1 through Model 10 below) so that we can inspect each pair individually.
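The three comparisons repeated in every Model section below can be expressed as one helper. This is only a sketch: score_triplet and toy_embed are hypothetical names introduced here, and toy_embed is a bag-of-words stand-in used so the example runs without downloading a model. It is not the sentence embedding the notebook actually uses.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_triplet(embed, human_summary, machine_summary, article):
    """Embed the three texts together and return the three pairwise similarities."""
    h, m, a = embed([human_summary, machine_summary, article])
    return {
        "human_vs_machine": cosine(h, m),
        "human_vs_article": cosine(h, a),
        "machine_vs_article": cosine(m, a),
    }

def toy_embed(texts):
    """Hypothetical stand-in embedder: word-count vectors over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vectors

scores = score_triplet(toy_embed, "cats sleep", "cats sleep a lot", "dogs bark loudly")
for name, value in scores.items():
    print(name, round(value, 4))
```

With the notebook's objects, a call such as score_triplet(token_model.encode, summary1, machineSummary1, article1) would be expected to produce the same three comparisons as the cells below, returned as plain floats instead of tensors; that usage is an assumption and was not run here.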

In [21]:
def listToString(s):
    # Concatenate the elements of s (a list of strings, or a single string) into one string
    return "".join(s)

In [22]:
article1 = tokenized_xsum['test']['document'][0]
article2 = tokenized_xsum['test']['document'][123]
article3 = tokenized_xsum['test']['document'][99]
article4 = tokenized_xsum['test']['document'][1100]
article5 = tokenized_xsum['test']['document'][1118]
article6 = tokenized_xsum['test']['document'][45]
article7 = tokenized_xsum['test']['document'][13]
article8 = tokenized_xsum['test']['document'][69]
article9 = tokenized_xsum['test']['document'][27]
article10 = tokenized_xsum['test']['document'][9]

summary1 = tokenized_xsum['test']['summary'][0]
summary2 = tokenized_xsum['test']['summary'][123]
summary3 = tokenized_xsum['test']['summary'][99]
summary4 = tokenized_xsum['test']['summary'][1100]
summary5 = tokenized_xsum['test']['summary'][1118]
summary6 = tokenized_xsum['test']['summary'][45]
summary7 = tokenized_xsum['test']['summary'][13]
summary8 = tokenized_xsum['test']['summary'][69]
summary9 = tokenized_xsum['test']['summary'][27]
summary10 = tokenized_xsum['test']['summary'][9]


In [23]:
summaryList = [summary1.split(),
summary2.split(), 
summary3.split(), 
summary4.split(),
summary5.split(),
summary6.split(),
summary7.split(), 
summary8.split(),
summary9.split(), 
summary10.split()]

count = sum( [ len(listElem) for listElem in summaryList])

print('The total number of words in these summaries is: ', count)
print('The average words per summary is: ', count / len(summaryList))

The total number of words in these summaries is:  186
The average words per summary is:  18.6


## We ran 50% of our models without passing early_stopping (Models 1-5) and 50% with early_stopping=False (Models 6-10) to see if this would make any meaningful difference

# Model 1

In [54]:
input1 = tokenizer(article1, return_tensors='pt', truncation=True)
summary_ids1 = model.generate(input1['input_ids'], max_length=60)
machineSummary1 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids1])

In [25]:
machineSummary1 = listToString(machineSummary1)
original1 = listToString(article1)

comparison1 = [summary1, machineSummary1, original1]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings1 = token_model.encode(comparison1)
print(util.pytorch_cos_sim(comparison_embeddings1[0], comparison_embeddings1[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings1[0], comparison_embeddings1[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings1[1], comparison_embeddings1[2])) # machine summary to original article

tensor([[0.7313]])
tensor([[0.7645]])
tensor([[0.9574]])


In [26]:
comparison1

['There is a "chronic" need for more housing for prison leavers in Wales, according to a charity.',
 'Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation. Workers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders. Welsh Government said',
 'Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation. Workers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders. The Welsh Government said more people than ever were getting help to address housing problems. Changes to the Housing Act in Wales, introduced in 2015, removed the right for prison leavers to be given priority for accommodation. Prison Link Cymru, which helps people find accommodation after their release, said things were generally good for women because

# Model 2

In [27]:
input2 = tokenizer(article2, return_tensors='pt', truncation=True)
summary_ids2 = model.generate(input2['input_ids'], max_length=60)
machineSummary2 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids2])

In [28]:
machineSummary2 = listToString(machineSummary2)
original2 = listToString(article2)

comparison2 = [summary2, machineSummary2, original2]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings2 = token_model.encode(comparison2)
print(util.pytorch_cos_sim(comparison_embeddings2[0], comparison_embeddings2[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings2[0], comparison_embeddings2[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings2[1], comparison_embeddings2[2])) # machine summary to original article

tensor([[0.7189]])
tensor([[0.5850]])
tensor([[0.6048]])


In [29]:
comparison2

["For a man often described as capricious, Tyson Fury's chaotic reign as world heavyweight champion was strangely predictable.",
 'Fury has been speaking about his mental health struggles for years. The repeated claims from Furys camp that his victory was downplayed by the British media, and that they had an agenda against him from the outset, are delusional. Fury is not the first boxer to lose motivation having reached',

# Model 3

In [30]:
input3 = tokenizer(article3, return_tensors='pt', truncation=True)
summary_ids3 = model.generate(input3['input_ids'], max_length=60)
machineSummary3 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids3])

In [31]:
machineSummary3 = listToString(machineSummary3)
original3 = listToString(article3)

comparison3 = [summary3, machineSummary3, original3]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings3 = token_model.encode(comparison3)
print(util.pytorch_cos_sim(comparison_embeddings3[0], comparison_embeddings3[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings3[0], comparison_embeddings3[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings3[1], comparison_embeddings3[2])) # machine summary to original article

tensor([[0.5551]])
tensor([[0.7642]])
tensor([[0.8500]])


In [32]:
comparison3

['A barrister who was due to move into his own chambers in Huddersfield has pleaded guilty to supplying cocaine.',
 'Omar Khan, 31, had worked at The Johnson Partnership in Nottingham for five years. Partner Digby Johnson said he did not represent Khan, who had set up his own office and was set to leave the company. Erlin Manahasa, Albert Dibra and Naza',
 'Omar Khan, 31, had worked at The Johnson Partnership in Nottingham for five years before he was arrested. Erlin Manahasa, Albert Dibra and Nazaquat Ali joined Khan in admitting the same charge, between 1 October  and 4 December last year, at Nottingham Crown Court. They are due to be sentenced on 15 April. Updates on this story and more from Nottinghamshire The court heard the case involved the recovery of 1kg (2.2lb) of cocaine. Digby Johnson, a partner at the Johnson firm, confirmed they did not represent Khan - who had set up his own office and was set to leave the company. I still find it hard to believe he could do something as

# Model 4

In [33]:
input4 = tokenizer(article4, return_tensors='pt', truncation=True)
summary_ids4 = model.generate(input4['input_ids'], max_length=60)
machineSummary4 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids4])

In [34]:
machineSummary4 = listToString(machineSummary4)
original4 = listToString(article4)

comparison4 = [summary4, machineSummary4, original4]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings4 = token_model.encode(comparison4)
print(util.pytorch_cos_sim(comparison_embeddings4[0], comparison_embeddings4[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings4[0], comparison_embeddings4[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings4[1], comparison_embeddings4[2])) # machine summary to original article

tensor([[0.5436]])
tensor([[0.6342]])
tensor([[0.8264]])


In [35]:
comparison4

['Star Wars fans are being given the opportunity to become Jedi Knights and learn how to wield lightsabers in combat.',
 'The sport began eight years ago in Italy but has only just come to England with the first classes in Cheltenham. Instructor Jordan Court said people were already hooked. The lightsabers used in the sport are all hand-made and are provided for use during the classes.',
 'LudoSport has opened its first academy teaching seven forms of combat from the Star Wars world using flexible blades mounted on weighted hilts. The sport began eight years ago in Italy but has only just come to England with the first classes in Cheltenham. Instructor Jordan Court said people were already hooked. The classes in Cheltenham began last month. So far there are six pupils, but this number is expected to increase. Mr Court attended an international boot camp to learn the different stages of the sport which range in characteristics from defensive in stage one to aggressive and flamboyant in 

# Model 5

In [36]:
input5 = tokenizer(article5, return_tensors='pt', truncation=True)
summary_ids5 = model.generate(input5['input_ids'], max_length=60)
machineSummary5 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids5])

In [37]:
machineSummary5 = listToString(machineSummary5)
original5 = listToString(article5)

comparison5 = [summary5, machineSummary5, original5]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings5 = token_model.encode(comparison5)
print(util.pytorch_cos_sim(comparison_embeddings5[0], comparison_embeddings5[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings5[0], comparison_embeddings5[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings5[1], comparison_embeddings5[2])) # machine summary to original article

tensor([[0.5847]])
tensor([[0.6152]])
tensor([[0.9742]])


In [38]:
comparison5

['Awareness rides are taking place to try and cut the number of people on horseback injured or killed on roads.',
 'The Pass Wide and Slow Wales campaign has collected 1,300 signatures on the assemblys e-petition website. It wants an annual road safety awareness campaign explaining to motorists how to react around horses. The British Horse Society found that since 2010 there have been 2,000 road accidents in',
 'The Pass Wide and Slow Wales campaign has collected 1,300 signatures on the assemblys e-petition website. It wants an annual road safety awareness campaign explaining to motorists how to react around horses. The British Horse Society found that since 2010 there have been 2,000 road accidents in the UK, with 1,500 because of cars passing too closely. As a result of these, 180 horses and 36 riders have died. Awareness rides were planned for Penarth, Vale of Glamorgan, Swansea, Neyland in Pembrokeshire, Machynlleth, Powys, Flintshire and Porthmadog in Gwynedd. Any petition with ov

# Model 6

In [39]:
input6 = tokenizer(article6, return_tensors='pt', truncation=True)
summary_ids6 = model.generate(input6['input_ids'], max_length=60, early_stopping=False)
machineSummary6 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids6])

In [40]:
machineSummary6 = listToString(machineSummary6)
original6 = listToString(article6)

comparison6 = [summary6, machineSummary6, original6]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings6 = token_model.encode(comparison6)
print(util.pytorch_cos_sim(comparison_embeddings6[0], comparison_embeddings6[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings6[0], comparison_embeddings6[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings6[1], comparison_embeddings6[2])) # machine summary to original article

tensor([[0.7071]])
tensor([[0.7340]])
tensor([[0.9464]])


In [41]:
comparison6

['Two new councillors have been elected in a by-election in the City of Edinburgh.',
 'SNP topped the vote in the Leith Walk by-election. Scottish Labour won the second seat from the Greens. Deidre Brock of the SNP and Maggie Chapman of the Scottish Greens stood down. It was the first time the Single Transferable Vote (STV) system had',
 'It was the first time the Single Transferable Vote (STV) system had been used to select two members in the same ward in a by-election. The SNP topped the vote in the Leith Walk by-election, while Scottish Labour won the second seat from the Greens. The by-election was called after Deidre Brock of the SNP and Maggie Chapman of the Scottish Greens stood down. The SNPs John Lewis Ritchie topped the Leith Walk poll with 2,290 votes. He was elected at stage one in the STV process with a swing in first-preference votes of 7.6% from Labour. Labours Marion Donaldson received 1,623 votes, ahead of Susan Jane Rae of the Scottish Greens on 1,381. Ms Donaldson wa

# Model 7

In [42]:
input7 = tokenizer(article7, return_tensors='pt', truncation=True)
summary_ids7 = model.generate(input7['input_ids'], max_length=60, early_stopping=False)
machineSummary7 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids7])

In [43]:
machineSummary7 = listToString(machineSummary7)
original7 = listToString(article7)

comparison7 = [summary7, machineSummary7, original7]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings7 = token_model.encode(comparison7)
print(util.pytorch_cos_sim(comparison_embeddings7[0], comparison_embeddings7[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings7[0], comparison_embeddings7[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings7[1], comparison_embeddings7[2])) # machine summary to original article

tensor([[0.7054]])
tensor([[0.6673]])
tensor([[0.9054]])


In [44]:
comparison7

["Torquay United boss Kevin Nicholson says none of the money from Eunan O'Kane's move to Leeds from Bournemouth will go to the playing squad.",
 ' OKane moved for an undisclosed fee, but Nicholson says any money will go to help the cash-strapped club. The Gulls are still looking for new owners having been taken over by a consortium of local business people last summer. They were forced to close down the clubs academy',
 'The National League sold the Republic of Ireland midfielder to the Cherries for £175,000 in 2012 and had a 15% sell-on clause included in the deal. OKane moved for an undisclosed fee, but Nicholson says any money will go to help the cash-strapped club. I dont think Ill be getting anything, Nicholson told BBC Devon. Theres more important things. The Gulls are still looking for new owners having been taken over by a consortium of local business people last summer. They were forced to close down the clubs academy and drastically reduce the playing budget after millionaire

# Model 8

In [45]:
input8 = tokenizer(article8, return_tensors='pt', truncation=True)
summary_ids8 = model.generate(input8['input_ids'], max_length=60, early_stopping=False)
machineSummary8 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids8])

In [46]:
machineSummary8 = listToString(machineSummary8)
original8 = listToString(article8)

comparison8 = [summary8, machineSummary8, original8]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings8 = token_model.encode(comparison8)
print(util.pytorch_cos_sim(comparison_embeddings8[0], comparison_embeddings8[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings8[0], comparison_embeddings8[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings8[1], comparison_embeddings8[2])) # machine summary to original article

tensor([[0.5923]])
tensor([[0.6410]])
tensor([[0.9681]])


In [47]:
comparison8

['Manufacturers have reported positive business trends, in the latest survey from the Scottish Chambers of Commerce.',
 'Manufacturers reported their highest growth in new orders for nearly three years. In retail, there was also a return to optimism - though only just. In tourism, firms reported improving visitor numbers in the final quarter of the year, but falling sales revenues. Construction is expecting an investment dip.',

# Model 9

In [48]:
input9 = tokenizer(article9, return_tensors='pt', truncation=True)
summary_ids9 = model.generate(input9['input_ids'], max_length=60, early_stopping=False)
machineSummary9 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids9])

In [49]:
machineSummary9 = listToString(machineSummary9)
original9 = listToString(article9)

comparison9 = [summary9, machineSummary9, original9]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings9 = token_model.encode(comparison9)
print(util.pytorch_cos_sim(comparison_embeddings9[0], comparison_embeddings9[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings9[0], comparison_embeddings9[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings9[1], comparison_embeddings9[2])) # machine summary to original article

tensor([[0.8161]])
tensor([[0.8348]])
tensor([[0.8977]])


In [50]:
comparison9

['Of his last 30 matches in 2016, Andy Murray won 28 and lost just two.',
 'The world number one has won 21 of his first 30 matches in 2017. Murray has had shingles and an elbow problem, and now his left hip is proving cause for concern. Opting out of two scheduled exhibition matches at the Hurlingham Club in London may not be too',
 'Media playback is not supported on this device Of his first 30 matches in 2017, the world number one has won 21 and lost nine. Winning his last five tournaments of 2016 to pip Novak Djokovic to the year-end number one position in the final match of the season at Londons O2 Arena was astonishing, dramatic and unforgettable. And yet it appears that relentless run of success, and the 87 matches he played over a season, has come at a price. Murrays straight-set defeat by world number 90 Jordan Thompson in the first round at Queens Club was the sixth time he has lost to a player outside the top 20 this year. He has had shingles and an elbow problem, and now hi

# Model 10

In [51]:
input10 = tokenizer(article10, return_tensors='pt', truncation=True)
summary_ids10 = model.generate(input10['input_ids'], max_length=20, early_stopping=False)
machineSummary10 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids10])

In [52]:
machineSummary10 = listToString(machineSummary10)
summary10 = listToString(summary10)
original10 = listToString(article10)

comparison10 = [summary10, machineSummary10, original10]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings10 = token_model.encode(comparison10)
print(util.pytorch_cos_sim(comparison_embeddings10[0], comparison_embeddings10[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings10[0], comparison_embeddings10[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings10[1], comparison_embeddings10[2])) # machine summary to original article

tensor([[0.8084]])
tensor([[0.7987]])
tensor([[0.5793]])


In [53]:
comparison10

["Manager Brendan Rodgers is sure Celtic can exploit the wide open spaces of Hampden when they meet Rangers in Sunday's League Cup semi-final.",
 'Celtic face Rangers in the Scottish Cup semi-final at Hampden Park.',
 'Im really looking forward to it - the home of Scottish football, said Rodgers ahead of his maiden visit. I hear the pitch is good, a nice big pitch suits the speed in our team and our intensity. The technical area goes right out to the end of the pitch, but you might need a taxi to get back to your staff. This will be Rodgers second taste of the Old Firm derby and his experience of the fixture got off to a great start with a 5-1 league victory at Celtic Park last month. It was a brilliant performance by the players in every aspect, he recalled. Obviously this one is on a neutral ground, but well be looking to have a similar performance. Well be prepared and focused. We know its going to be a tough game. We anticipated that the last time. Rodgers is also aware Celtics vis

# Conclusion

In 9 of the 10 runs shown above (every run except Model 10), the machine summary had a higher cosine similarity to the original article than the human summary did. However, this is likely influenced by length: at a maximum length of 60, the machine summary was about 3x the length of the average human summary, and much of its text is copied directly from the article (compare the outputs above).

The argument early_stopping=True/False did not appear to have any real effect on cosine similarity at the max length of 60 (we compared the 10 models with and without it and obtained similar results).

The pre-trained transformers do provide relevant summaries of these articles, so there appears to be a clear use case for providing news article snippets in products like Bloomberg First Word or other content editors. In 20% of the models, the machine and human summaries had roughly equivalent cosine similarities.

It was also interesting that, although the machine summary was generally more similar to the article than the human summary was, the human summary was much shorter and still generally scored relatively high. This suggests that the human-written summaries are more concise, convey more meaningful information through less text, and are therefore better summaries. This makes sense, since the summaries are generally written by the authors of the articles.

Human summaries appear to hold up best against machine summaries for articles about sports and athletes: the only run in which the human summary was more similar to the article than the machine summary (Model 10) and the two runs with the narrowest gaps (Models 2 and 9) were all sports articles. This may be an area that Hugging Face could focus on when pretraining new pipelines, transformers, and models in the future to expand their use cases.
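The comparison between the human and machine summaries can be checked by tallying the scores printed in the Model cells above. The sketch below simply copies those printed values into a dictionary (the variable names are introduced here for illustration) and counts the runs in which the machine summary was closer to the article than the human summary was.

```python
from statistics import mean

# Scores copied from the printed tensors above, per model:
# (human vs machine, human vs article, machine vs article)
scores = {
    1: (0.7313, 0.7645, 0.9574),
    2: (0.7189, 0.5850, 0.6048),
    3: (0.5551, 0.7642, 0.8500),
    4: (0.5436, 0.6342, 0.8264),
    5: (0.5847, 0.6152, 0.9742),
    6: (0.7071, 0.7340, 0.9464),
    7: (0.7054, 0.6673, 0.9054),
    8: (0.5923, 0.6410, 0.9681),
    9: (0.8161, 0.8348, 0.8977),
    10: (0.8084, 0.7987, 0.5793),
}

# Runs where the machine summary is closer to the article than the human summary.
machine_closer = [k for k, (_, human_art, machine_art) in scores.items()
                  if machine_art > human_art]

print("Machine summary closer to article in models:", machine_closer)
print("Share of runs:", len(machine_closer) / len(scores))
print("Mean human vs article:  ", round(mean(s[1] for s in scores.values()), 4))
print("Mean machine vs article:", round(mean(s[2] for s in scores.values()), 4))
```

Because Model 10 was generated with a shorter max_length than the others, its row is not strictly comparable with the rest, so the tally should be read with that caveat in mind.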