# Text Summary & Scoring Project
##### Michael Creegan, Yungfeng Dai, Hong Gyu Ji, Ziling Zeng
##### Python for Data Analysis
##### Columbia University

# Abstract

Summarization has become an increasingly common problem as the world grows more data-driven. A summary lets a reader quickly judge whether a document is relevant or worth reading in full. Summaries of articles can also be stored in a backend and used for downstream tasks. Finally, how well a summary preserves the meaning of its source can serve as an indicator of quality.

To explore this topic, we will use the extreme summarization dataset (XSUM), which consists of BBC articles accompanied by single-sentence summaries. Each article is prefaced with a professionally written introductory sentence, typically by the author of the article, which serves as its summary.

To summarize articles, we will use an encoder-decoder (sequence-to-sequence) transformer, because the task requires both reading text and generating a summary. The encoder accepts the input text and computes a high-level representation of it, which is then passed to the decoder to generate the output summary. This has advantages over a standalone encoder such as BERT, ALBERT, ELECTRA, RoBERTa, or DistilBERT: encoders are pre-trained by filling in randomly masked words in sentences, so they are better suited to understanding input text than to generating output. A standalone decoder such as GPT-2 would also not be optimal: decoders are trained to predict the next word in a sequence and see context on only one side of each position, so they are well suited to generating text but not necessarily to taking in a full document.

Our scoring will compare the output of the BART encoder-decoder model to the professionally written summaries in the XSUM dataset to see how semantically similar a machine-generated summary is to a professional one, as well as to its source article. Our scoring methodology focuses on semantic textual similarity, computed as the cosine similarity between the professional human-written summary and the machine-generated one. Cosine similarity measures how similar two documents are irrespective of their size. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. This is advantageous because two similar documents can be far apart by Euclidean distance (due to a difference in size) and still point in nearly the same direction. The smaller the angle, the higher the cosine similarity.
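
To make the size-invariance point concrete, here is a minimal sketch in plain Python. The vectors are made-up term counts rather than real sentence embeddings, and the helper names `cosine_similarity` and `euclidean_distance` are illustrative, not part of the project code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical term-count vectors: long_doc is short_doc scaled by ten,
# i.e. the same content repeated, so only the "size" differs.
short_doc = [1.0, 2.0, 0.0, 3.0]
long_doc = [10.0 * x for x in short_doc]
unrelated = [0.0, 0.0, 5.0, 0.0]   # shares no terms with short_doc

print(cosine_similarity(short_doc, long_doc))    # approximately 1: same direction
print(euclidean_distance(short_doc, long_doc))   # large, driven by size alone
print(cosine_similarity(short_doc, unrelated))   # 0.0: orthogonal vectors
```

The scaled vector is far away in Euclidean terms yet has a cosine similarity of essentially 1, which is the property the scoring relies on.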

# Importing Transformers & Dependencies

In [55]:
import pandas as pd
import numpy as np
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
from datasets import load_dataset, load_metric
from sentence_transformers import SentenceTransformer, util
import random
from IPython.display import display, HTML

# Load XSUM Dataset

In [56]:
xsum = load_dataset('xsum')

Using custom data configuration default
Reusing dataset xsum (C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934)
100%|██████████| 3/3 [00:00<00:00, 56.21it/s]


### We can see that the dataset is a "DatasetDict" whose keys are strings naming each split and whose values are the corresponding dataset objects. In the XSUM dataset, the keys are "train", "validation", and "test", and each dataset has the columns "document", "summary", and "id".

In [57]:
xsum

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

# View Underlying Data

In [58]:
xsum['test'][0]

{'document': 'Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation.\nWorkers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders.\nThe Welsh Government said more people than ever were getting help to address housing problems.\nChanges to the Housing Act in Wales, introduced in 2015, removed the right for prison leavers to be given priority for accommodation.\nPrison Link Cymru, which helps people find accommodation after their release, said things were generally good for women because issues such as children or domestic violence were now considered.\nHowever, the same could not be said for men, the charity said, because issues which often affect them, such as post traumatic stress disorder or drug dependency, were often viewed as less of a priority.\nAndrew Stevens, who works in Welsh prisons trying to secure housing for prison leavers, said the

## We can use a function to view a random selection of articles and summaries in tabular form to get a better sense of what the data looks like

In [59]:
def display_function(xsum, num_examples=3):
    assert num_examples <= len(xsum)                # cannot sample more rows than the split contains
    
    selections = []                                 # indices of the rows picked so far
    
    for _ in range(num_examples):                   # we can use _ here because the loop variable itself is never used
        selection = random.randint(0, len(xsum) - 1)
        while selection in selections:              # redraw until the index is new, so no row is shown twice
            selection = random.randint(0, len(xsum) - 1)
        selections.append(selection)

    xsumPd = pd.DataFrame(xsum[selections])         # indexing with a list of indices returns a dict of columns
    display(HTML(xsumPd.to_html()))                 # render the table once
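
As an alternative sketch (not the notebook's own code), the index-picking loop above can be expressed with `random.sample`, which draws distinct values in one call. The function name `pick_indices` and the `seed` parameter are hypothetical, added here only so the example is reproducible.

```python
import random

def pick_indices(num_rows, num_examples=3, seed=None):
    """Return `num_examples` distinct row indices in the range [0, num_rows)."""
    if num_examples > num_rows:
        raise ValueError("cannot sample more rows than the split contains")
    rng = random.Random(seed)          # private generator, leaves global state alone
    # sample() draws without replacement, so no index can repeat
    return rng.sample(range(num_rows), num_examples)

# 11334 is the number of rows in the test split shown earlier
indices = pick_indices(11334, num_examples=3, seed=0)
print(indices)
```

The resulting list could be passed to the dataset in place of `selections`.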

# Cleaning
Our end goal is to create accurate summaries using this model; therefore, we need to remove the text characters that do not provide any contextual value. Some characters in the article are not present in the summary and could cause discrepancies between our machine-generated summary and the professional human-written one. Newline characters appear in the document column but not in the summary column, so they need to be removed, along with quotation marks, because they could present challenges when summarizing and scoring.

In [60]:
display_function(xsum["test"])

Unnamed: 0,document,summary,id
0,"The sixth form at Halewood Academy in Knowsley will shut in August 2017 after the Department for Education agreed it could stop providing A-levels.\nPrincipal Gary Evans said it was ""sad"" but left the academy in a stronger financial position.\nEducation chiefs pledged to get an another A level plan in place by 2017.\nMr Evans said: ""We shall continue to work extensively with other post-16 providers to ensure that all of our students remain in education or training once they leave the academy.\n""Discussions are also taking place for a future potential post-16 joint venture across Knowsley,"" he said.\nKnowsley has the lowest proportion of students taking A-levels in England at 2% and has among the lowest university entry rates in England.\nA letter to the school from parliamentary undersecretary of state for schools, Lord Nash, outlined the plan.\nHe said after considering the quality of provision, the impact on existing students and the availability of post-16 education in the area ""I have agreed their request to close the sixth form"".\nKnowsley councillor Gary See said the local authority was ""naturally disappointed with this outcome"" but pleased there was ""some clarity for the Academy and its students"".\nHe said due to the school's academy status, the council had ""no powers to intervene"" but had committed to working with the government to establish ""new sixth form provision from September 2017"".\nParents at the school had protested against the closure, arguing it ""is letting down the children of this community"" and could block their ambitions.\nStudents who are part-way through their studies will be able to continue at the sixth form.",A Merseyside borough will have no A-level provision after the government approved the closure of the area's only sixth form offering the qualification.,36674622
1,"Ednane Mahmood, from Blackburn, is also alleged to have provided internet links to others with speeches and propaganda that encouraged acts of terrorism.\nHe was stopped by police at Manchester Airport after returning from Turkey on 21 September and charged on Wednesday.\nMr Mahmood was released on conditional bail when he appeared at Westminster Magistrates' Court.\nThe conditions include sleeping at his family home, reporting to a police station in Blackburn twice a week, not applying for travel documents and a ban from using the internet.\nDuring the brief hearing before magistrates, the 18-year-old spoke only to confirm his name, date of birth and address.\nMr Mahmood will next appear at the Old Bailey on 24 April.\nHis arrest followed an investigation by the North West Counter Terrorism Unit and Lancashire Constabulary.",An 18-year-old man has appeared in court charged with attempting to travel to Syria to commit acts of terrorism.,32232373
2,"Professional hunter Theo Bronkhorst and farm owner Honest Ndlovu, were charged with poaching offences and for not having the required hunting permit.\nThe pair were granted bail of $1,000 each (Â£638) and ordered to appear in court again on 5 August.\nWalter Palmer, the US dentist who shot the animal known as Cecil, has left Zimbabwe but could also face charges.\nMr Palmer said he paid for the hunt, but was not aware of the lion's identity.\nHe said he regretted shooting the animal and believed he was on a legal hunt. He relied on professional guides to find a lion and obtain the necessary permits, he said.\nMr Bronkhorst and Mr Ndlovu could face up to 15 years in prison if found guilty.\nCecil is believed to have died on 1 July, but the carcass was not discovered until a few days later.\nMr Palmer is said to have shot and injured the animal with a bow. The group did not find the wounded lion until 40 hours later, when he was shot dead with a gun.\nSeparately, court records have shown that Mr Palmer has a felony record in the US after killing a black bear in the state of Wisconsin in 2006.\nThe dentist was given a one-year probation and fined $3,000, having shot the creature outside an authorised zone and then tried to pass it off as having been killed elsewhere.\nRecords from the Minnesota Board of Dentistry also show that Mr Palmer was the subject of a sexual harassment complaint which was settled in 2006.\nA receptionist alleged that he had made indecent comments to her. 
Mr Palmer admitted no wrongdoing and agreed to pay out more than $127,000.\nThe American tourist is believed to have paid about $50,000 to go on the hunt in Zimbabwe.\nMore than 265,000 people have signed an online ""Justice for Cecil"" petition, calling on Zimbabwe's government to stop issuing hunting permits for endangered animals.\nAs news of the killing and details about the perpetrator spread online, there was a slew of comments on social media condemning Walter Palmer, with some people calling for him to face justice.\nHow the internet descended on the man who killed Cecil the lion\nMr Palmer insists that he believed his guides had secured ""all proper permits"" for the hunt.\n""I relied on the expertise of my local professional guides to ensure a legal hunt,"" he said in a statement on Tuesday.\n""I deeply regret that my pursuit of an activity I love and practice responsibly and legally resulted in the taking of this lion.""\nHe said he had not been contacted by authorities in Zimbabwe or the US but would ""assist them in any inquiries they may have"".\nThe dentist is believed to be back in the US, although his exact whereabouts are unknown.\nHis dental practice was closed on Tuesday and a note was placed on the door referring visitors to a public relations firm.\nCecil the lion was skinned and beheaded, according to the Zimbabwe Conservation Task Force (ZCTF), a local charity.\nThe ZCTF said the hunters had used bait to lure him outside Hwange National Park during a night-time pursuit.\n35,000\nMax estimated lion population\n12,000\nMax lion population in southern Africa\n665 Approx number of 'trophy' lions killed for export from Africa per year\n49 Lion 'trophies' exported from Zimbabwe in 2013\n0.29% Contribution to GDP of Zimbabwe from trophy hunting\n17% Of Zimbabwe's land given to trophy hunting\nThe animal had a GPS collar fitted for a research project by UK-based Oxford University that allowed authorities to track its movements. 
The hunters tried to destroy it, but failed, according to the ZCTF.\nOn Monday, the head of the ZCTF told the BBC that Cecil ""never bothered anybody"" and was ""one of the most beautiful animals to look at"".",Two men accused of helping a US tourist hunt and kill Zimbabwe's most famous lion have been released on bail.,33699346


## We can address the problem mentioned above by defining a cleaning function that replaces newlines with a space and removes single and double quotation marks.

In [61]:
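def clean(row):
    # Replace newlines with a space, then drop single and double quotation marks
    row['document'] = row['document'].replace('\n', ' ')\
                                     .replace('\'', '').replace('\"','')
    return row                                      # only the document column is changed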
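
To see the effect on a single record, here is a small self-contained sketch. The `sample` row is invented for illustration, and this version of `clean` copies the row before editing it, which is an assumption made so the before and after can be compared side by side.

```python
def clean(row):
    """Replace newlines with a space and drop single and double quotation marks."""
    row = dict(row)  # work on a copy so the caller's row is left untouched
    row["document"] = (
        row["document"]
        .replace("\n", " ")
        .replace("'", "")
        .replace('"', "")
    )
    return row

# Hypothetical row shaped like an XSUM record
sample = {
    "document": 'He said: "It\'s sad."\nStudents will continue.',
    "summary": "A toy summary.",
    "id": "0",
}

cleaned = clean(sample)
print(cleaned["document"])  # He said: Its sad. Students will continue.
```

Note that dropping apostrophes turns "It's" into "Its", a trade-off of this simple approach.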

## We can now map the cleaning function we created over our data (it is applied to the train, validation, and test splits)

In [62]:
xsum = xsum.map(clean)

Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-fd36b556705cbe4d.arrow
Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-edb3a2dc2f06b92c.arrow
Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-a4042da98a2992a2.arrow


### Voila!

In [63]:
display_function(xsum["test"])

Unnamed: 0,document,summary,id
0,"In an open letter, he said he loved Russia, calling it a great democracy. Mr Depardieu had recently announced he would give up his French passport after the government criticised his decision to move abroad to avoid higher taxes. Moscow earlier said President Vladimir Putin had personally signed the decree granting the actor Russian citizenship. In December, Mr Putin had said he would be happy to welcome the actor in his new country. If hed like to have a Russian passport, consider it settled, Mr Putin said. Prime Minister Jean-Marc Ayrault had called Mr Depardieus decision to leave the country shabby. In the letter, broadcast on Thursday on Russian TV station Pervyi Kanal, Mr Depardieu said: I filed a passport application and I am pleased that it was accepted. I love your country, Russia - its people, its history, its writers. I love your culture, your intelligence. He said that he had spoken to French President Francois Hollande and told him Russia was a great democracy, and not a country where the prime minister calls one of its citizens shabby. Under Frances civil code, dual citizenship is permitted but it is unlawful to be stateless. A person must obtain another nationality before giving up French citizenship. Mr Depardieus highly publicised tax row began last year after Mr Hollande said he would raise taxes to 75% for those earning more than 1m euros (Â£817,400). The actor accused the new socialist government of punishing success, creation and talent, and announced in early December that he would move to Belgium. Although the Constitutional Council struck down the tax rise proposal on Sunday, Mr Depardieu said this did not change the situation one bit. The BBCs Hugh Schofield in Paris says the series of events would be amusingly eccentric, were it not also serious in its implications for Frances international image. 
Mr Depardieu, described by Mr Putin as a successful businessman and friend, has developed close ties with Russia, which has a flat 13% personal income tax rate. He currently appears in an advertisement for Sovietsky Banks credit card and is prominently featured on the banks home page. In 2011, he played the lead role in the film Rasputin, a Franco-Russian production about the life of eccentric monk Grigory Rasputin. In addition, Mr Depardieu has also helped raise funds for a childrens hospital in St Petersburg.","Actor Gerard Depardieu has hailed Russia's decision to grant him citizenship, following a tax row with the government in his native France.",20896894
1,"She became Kenyas first high-profile athlete to fail a test, when she tested positive for performance-enhancing drugs in September. Jeptoo, 33, says she may have been prescribed some banned substances at a local hospital after a road accident. She has become the 45th Kenyan athlete to have failed a doping test. David Rudisha, the Olympic 800 metres champion, said he fears for Kenyas hard-won reputation after repeat allegations of doping. Athletics Kenya followed due process in her matter and it was appropriate that she serves a two-year ban, said the governing bodys chief executive Isaac Kamande. The ban comes only a few days after Athletics Kenya announced that eight more Kenyan athletes have been suspended for between one to four years for taking performance-enhancing drugs. Over the last two years Kenya has been in the spotlight after a German television programme claimed that many Kenyan athletes are doping. Jeptoo, one of most successful runners in Kenyan history, was due to be crowned world Marathon Major Champion for the year 2014 but the ceremony was called off soon after news of her failed test. She has won the previous three Boston and two Chicago marathons and also previously won the Stockholm, Paris, Milan and Lisbon marathons.","Kenya's Rita Jeptoo, winner of the Boston and Chicago marathons, has been banned for two years after failing a drugs test.",31062211
2,"That steady stream of stories has led to the launch of a major public inquiry into their activities. The breadth and nature of what is being alleged is almost too big to grasp, but it fundamentally comes down to a simple question of whether elements of the police were out of control. So, here are seven key themes and allegations that lie in the road ahead - and some of the real practical and legal problems the inquiry faces. Some police officers had relationships with women whom they met within the protest movements they had been deployed to infiltrate. Last year, the Metropolitan Police paid one woman who had a child with an officer Â£425,000 in compensation. There are approximately a dozen civil claims for damages before the courts amid allegations that officers were expected to have relationships as part of their cover identity. But how many did so and under what circumstances? This is a huge challenge for the inquiry. How will it find out and inform the public if the undercover officer involved remains unknown, there are no records and, crucially, the partner never had any suspicions? During the 40-year history of the Special Demonstration Squad (SDS) - the police unit at the heart of many of the allegations - officers used 106 covert identities. According to a published police review, some 42 of them were almost certainly taken from children who had died - and the parents did not know about it. In 2013, a senior officer said the practice wasnt sanctioned by Scotland Yard - yet it seemed to have gone on for years. How many names were used? Who authorised it? Should the parents have known? If the names of the dead children are revealed, will that identify the officers the police want to protect? The undercover affair has so far led to more than 50 convictions being quashed after a failure to disclose that officers had infiltrated protest groups later accused of criminality. 
The two largest cases relate to environmental protests at power stations, both of which involved Mark Kennedy, an officer with the National Public Order Intelligence Unit. He would drive protesters around, effectively facilitating demonstrations later found to have broken the law. A review for the Home Office said there could be a possible further 83 miscarriages of justice - although its author, Mark Ellison QC, couldnt be sure there were not more. So will the inquiry look at allegations that officers lied in court? John Jordan was convicted over his role in a protest in 1996 - but was cleared on appeal in 2013 after it emerged that his co-defendant was Jim Boyling, an undercover officer. The officer even gave evidence in character. Jordan has been taking legal action for a full explanation of what happened. Peter Francis, the only former SDS officer speaking publicly, says that Scotland Yard kept intelligence files on MPs during the 1990s. During his time in Special Branch, he says he saw files on 10 Labour MPs which he and others would regularly update. So what did that monitoring amount to? Was the information on MPs incidental, gathered as part of watching campaign groups? Or did some Scotland Yard chiefs want deeper intelligence on the MPs? Separate allegations have emerged that undercover officers also gathered information on some trade union activists. The most toxic allegation so far has been that Scotland Yard had a spy in the Lawrence family camp. He later had a meeting with a senior officer helping to prepare Scotland Yard for the public inquiry into the London teenagers murder. The exact nature of what information was gathered, why it was gathered and how it was used remains unclear. The then Metropolitan Police Commissioner and now peer, Lord Condon, has said that had he known of the existence of such undercover action in relation to the Lawrences, he would have stopped it. 
Peter Francis spent four years deep undercover and he eventually became mentally ill, suffering post-traumatic stress disorder. Today, he says some of what he was asked to do was wrong - and he wants senior officers to account for the way they deployed officers like him. He is not the only officer to have had concerns about the ethics of their work. Phase two of the inquiry is expected to look at the operational governance and oversight of undercover operations, including how officers are selected, trained, managed and cared for. The most important acronym in this inquiry stands for Neither Confirm Nor Deny. Its a legal position adopted by the police and other security agencies in cases involving protection of undercover officers or sensitive sources. The first potential legal battle will come if police will refuse to admit whether or not they had officers deployed in specific circumstances. Official reports have already revealed the existence of some of these undercover officers - such as the one who was in a campaign group close to the Lawrence family - but they remain anonymous. If officers remain in the shadows because, quite simply, they were incredibly good at their job, police chiefs will almost certainly argue that the public interest lies in protecting their anonymity because of their legal duty of care.",The allegations of wrongdoing by undercover police officers that have emerged since 2011 have been extraordinary.,33682769




## We can view the column names and data types of our dataset using .features

In [64]:
xsum['test'].features

{'document': Value(dtype='string', id=None),
 'summary': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None)}

In [65]:
print(xsum['test'].info)

DatasetInfo(description='\nExtreme Summarization (XSum) Dataset.\n\nThere are three features:\n  - document: Input news article.\n  - summary: One sentence summary of the article.\n  - id: BBC ID of the article.\n\n', citation="\n@article{Narayan2018DontGM,\n  title={Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization},\n  author={Shashi Narayan and Shay B. Cohen and Mirella Lapata},\n  journal={ArXiv},\n  year={2018},\n  volume={abs/1808.08745}\n}\n", homepage='https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset', license='', features={'document': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=SupervisedKeysData(input='document', output='summary'), task_templates=None, builder_name='xsum', config_name='default', version=1.2.0, splits={'train': SplitInfo(name='train', num_bytes=479206615, num_examples=204045, data

# Preparing XSUM Data
Before we can pass the text to a model, we need to convert it into a format the transformer can understand. Encoders and decoders only operate on numerical values, so we need to split the text into tokens and then convert those tokens into numbers. The tokenizer splits the text into tokens (words or subword pieces) and adds any special tokens the model expects from pretraining. It then matches each token to a unique id in its vocabulary, and the model looks up a vector of numerical values for each id. Once these vectors pass through the model's self-attention layers they become contextualized: the vector for the word "to" does not represent "to" in isolation, it also takes into account the surrounding words, called the left and right context. For example, in the sentence "Welcome to NYC", the left context of "to" is "Welcome" and the right context is "NYC". We could do all of this with the AutoTokenizer.from_pretrained method, which returns a tokenizer that corresponds to the model architecture we want to use (facebook/bart-large-cnn); however, we will reference BartTokenizer explicitly and load the tokenizer and the model from the same checkpoint, so that both were trained with the same methodology and we avoid unexpected summaries.

In [66]:
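from transformers import BartForConditionalGeneration, BartTokenizer

# Load the tokenizer and the model from the same checkpoint so they share one vocabulary
checkpoint = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)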
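
The tokenization steps described above (split the text, add special tokens, map each token to an id) can be illustrated with a toy sketch. Everything here is hypothetical: the whitespace splitting, the tiny vocabulary, and the helper names are stand-ins, whereas the real BartTokenizer works on subword pieces with a far larger vocabulary. The special-token ids are chosen to mirror the 0 and 2 that open and close the id sequences shown later in this notebook.

```python
# Toy illustration only: a hypothetical whitespace tokenizer with a tiny vocabulary.
# The real BartTokenizer uses subword pieces and a much larger vocabulary.
TOY_VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "welcome": 4, "to": 5, "nyc": 6}


def toy_tokenize(text):
    """Split text into lowercase whitespace tokens and wrap them in special tokens."""
    return ["<s>"] + text.lower().split() + ["</s>"]


def toy_encode(text):
    """Map each token to its vocabulary id, falling back to <unk> for unknown tokens."""
    return [TOY_VOCAB.get(token, TOY_VOCAB["<unk>"]) for token in toy_tokenize(text)]


tokens = toy_tokenize("Welcome to NYC")
ids = toy_encode("Welcome to NYC")
print(tokens)  # ['<s>', 'welcome', 'to', 'nyc', '</s>']
print(ids)     # [0, 4, 5, 6, 2]
```

A word missing from the toy vocabulary, such as "Paris", would map to the `<unk>` id; subword tokenizers largely avoid this by breaking rare words into smaller known pieces.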

## We now write a function that preprocesses the data by passing it to the tokenizer. We need the argument truncation=True so that any input longer than the model can handle is truncated to the maximum length allowed. This limit is stored in the model config: BART accepts at most 1024 tokens per sequence, as given by max_position_embeddings

In [67]:
model.config

BartConfig {
  "_name_or_path": "facebook/bart-large-cnn",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "force_bos_token_to_be_generated": true,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "L

## We can now create the function with the maximum input length allowed by the config and a maximum target length of 60 tokens for the summaries (max_target_length); the choice of 60 is explained in the section where we compare human summaries and machine summaries to each other and to the original articles

In [68]:
max_input_length = 1024  # BART's input limit (max_position_embeddings in the config)
max_target_length = 60   # cap on the tokenized reference summary


def preperation_function(examples):
    # Tokenize the articles: truncate to the model limit and pad to the longest article in the batch
    inputs = list(examples["document"])
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True)

    # Tokenize the reference summaries in target mode so their ids can serve as labels
    # (newer transformers releases offer the text_target argument for the same purpose)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
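
To make the effect of truncation=True and padding=True more concrete, here is a minimal pure-Python sketch of what happens to a batch of id sequences. It is a simplification under stated assumptions: the helper name is hypothetical, the pad id of 1 is assumed for illustration, and the sketch simply slices each sequence, whereas the real tokenizer also keeps the closing special token when it truncates.

```python
PAD_ID = 1  # assumed padding id for this sketch


def truncate_and_pad(batch_ids, max_length, pad_id=PAD_ID):
    """Truncate each sequence to max_length, then pad all to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[0, 10, 11, 12, 13, 14, 2], [0, 20, 2]]
out = truncate_and_pad(batch, max_length=5)
print(out["input_ids"])       # [[0, 10, 11, 12, 13], [0, 20, 2, 1, 1]]
print(out["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Because padding only extends to the longest sequence in the batch, batches of short articles stay short instead of always being padded to 1024 tokens.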

## We can apply this function to our dataset using map

In [69]:
tokenized_xsum = xsum.map(preperation_function, batched=True)

Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-d006ce488ae4d44a.arrow
Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-e5f53a81412bec57.arrow
Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934\cache-92dafe1aca587ba8.arrow
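
With batched=True, the function passed to map receives a dictionary of columns (each value is a list covering a whole batch of examples) and returns new columns of the same length. The sketch below imitates that contract in plain Python; `toy_batched_map` and `add_lengths` are hypothetical helpers written for illustration and are not part of the datasets library.

```python
def toy_batched_map(rows, function, batch_size=2):
    """Apply function to column-wise batches and merge the returned columns into each row."""
    result = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn the list of rows into a dict of columns, the shape a batched function receives
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            result.append(merged)
    return result


def add_lengths(examples):
    """Hypothetical batched function: returns one value per input example."""
    return {"n_words": [len(doc.split()) for doc in examples["document"]]}


rows = [
    {"document": "a b c", "summary": "a"},
    {"document": "d e", "summary": "d"},
    {"document": "f", "summary": "f"},
]
mapped = toy_batched_map(rows, add_lengths)
print(mapped[0])  # {'document': 'a b c', 'summary': 'a', 'n_words': 3}
```

This is why the preprocessing function indexes examples["document"] and examples["summary"] as lists, and why the original columns remain next to the new input_ids, attention_mask, and labels columns.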


In [70]:
tokenized_xsum

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'document', 'id', 'input_ids', 'labels', 'summary'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['attention_mask', 'document', 'id', 'input_ids', 'labels', 'summary'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['attention_mask', 'document', 'id', 'input_ids', 'labels', 'summary'],
        num_rows: 11334
    })
})

In [71]:
tokenized_xsum['test'].features

{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'document': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'summary': Value(dtype='string', id=None)}

## The attention mask tells the model which tokens to attend to: a value of 1 marks a token to consider and a value of 0 marks a token to ignore, such as padding. The input ids are the numerical mapping of tokens to BART's vocabulary; each token in BART's vocabulary is assigned a unique numerical id. The labels are the input ids of the tokenized reference summary.

In [72]:
display_function(tokenized_xsum['test'])

Unnamed: 0,attention_mask,document,id,input_ids,labels,summary
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It will be the first time that the tournament has been held in England since 1993, when the home side beat New Zealand in the final at Lords. The tournament, which starts on 26 June next year, consists of 31 matches, with Lords hosting the final on 23 July. It will feature eight teams and will be played in a round-robin format. Steve Elworthy, the ECBs director of events, said the tournament will help drive interest and participation in womens cricket at every level. He added: Its critical we use this event to reach out to young children in particular, so weve moved the tournament start date to earlier in the summer, a decision which will help our host venues encourage attendance by engaging with schools in the build-up to the event.",35521829,"[0, 243, 40, 28, 5, 78, 86, 14, 5, 1967, 34, 57, 547, 11, 1156, 187, 9095, 6, 77, 5, 184, 526, 1451, 188, 3324, 11, 5, 507, 23, 26608, 4, 20, 1967, 6, 61, 2012, 15, 973, 502, 220, 76, 6, 10726, 9, 1105, 2856, 6, 19, 26608, 5162, 5, 507, 15, 883, 550, 4, 85, 40, 1905, 799, 893, 8, 40, 28, 702, 11, 10, 1062, 12, 1001, 9413, 7390, 4, 2206, 1448, 17328, 6, 5, 6899, 29, 736, 9, 1061, 6, 26, 5, 1967, 40, 244, 1305, 773, 8, 5740, 11, 38085, 1290, 5630, 23, 358, 672, ...]","[0, 495, 31679, 2459, 6867, 6, 1063, 6355, 2696, 6867, 6, 24011, 6, 25296, 4643, 2696, 6867, 8, 5736, 18, 33, 57, 1440, 25, 10141, 13, 5, 14305, 2691, 18, 623, 968, 11, 193, 4, 2]","Derbyshire, Leicestershire, Somerset, Gloucestershire and Lord's have been named as venues for the ICC Women's World Cup in 2017."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It was the third day of hefty falls, sparked by news on Wednesday that the company admitted falsifying fuel economy data for more than 600,000 vehicles sold in Japan. Government officials raided a company office and authorities want a full report from the company in weeks. The shares are 40% cheaper than before news of the false data emerged. Elsewhere on the Asian markets, shares of consumer electronics giant Sony also traded lower and closed down 1.7%. The company trimmed nearly 10% off its previous profit estimate for the full year to March 2016, due to a one-off charge. Sony is scheduled to report its financial results next week. On the broader Japanese market, the benchmark Nikkei 225 index reversed earlier losses and ended the Friday session higher by 1.2% - or 208.87 points - at 17,572.49. Other Asian markets traded lower on Friday, mirroring how US markets performed overnight. South Koreas Kospi closed down 0.33% at 2,015.49. In Australia the S&P ASX 200 ended the week down 0.69% at 5,236.39. Chinas Shanghai composite ended up 0.2% to 2,959.24. 
Meanwhile in Hong Kong the Hang Seng index dropped 0.7% to trade at 21,467.",36108480,"[0, 243, 21, 5, 371, 183, 9, 15234, 5712, 6, 6246, 30, 340, 15, 307, 14, 5, 138, 2641, 22461, 4945, 2423, 866, 414, 13, 55, 87, 5594, 6, 151, 1734, 1088, 11, 1429, 4, 1621, 503, 18000, 10, 138, 558, 8, 1247, 236, 10, 455, 266, 31, 5, 138, 11, 688, 4, 20, 327, 32, 843, 207, 7246, 87, 137, 340, 9, 5, 3950, 414, 4373, 4, 13487, 8569, 15, 5, 3102, 1048, 6, 327, 9, 2267, 8917, 3065, 6366, 67, 2281, 795, 8, 1367, 159, 112, 4, 406, 2153, 20, 138, 20856, 823, 158, 207, 160, 63, 986, ...]","[0, 12494, 11, 2898, 10948, 4218, 14228, 1792, 12839, 8484, 12662, 508, 4, 245, 207, 11, 273, 721, 7, 593, 23, 37311, 4796, 4, 2]",Shares in Japanese automaker Mitsubishi Motors plunged 13.5% in Friday trade to close at 504 yen.
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Roedd Jason Cooper wedi bod o flaen llys or blaen ar gyhuddiad o geisio llofruddio Laura Jayne Stuart, ond clywodd y llys ddydd Gwener bod Ms Stuart wedi marw. Ymddangosodd Mr Cooper, 27 oed o Ddinbych, ar gyswllt fideo o garchar Altcourse yn Lerpwl. Maen wynebu cyhuddiadau o lofruddio Ms Stuart yn Ninbych ar 12 Awst, ac o glwyfo David Roberts gydar bwriad o achosi niwed corfforol difrifol iddo. Doedd dim cais am fechnïaeth ac fe gafodd y diffynnydd ei gadw yn y ddalfa nes ir achos yn ei erbyn ddechrau ym mis Chwefror, 2018. Er bod dyn wedi ei gyhuddo, mae Heddlu Gogledd Cymru yn pwysleisio fod eu hymchwiliad yn parhau, a bod swyddogion yn dal i chwilio am y gyllell gafodd ei defnyddio i drywanu Ms Stuart. 
Maer heddlu yn gofyn i unrhyw un sydd â gwybodaeth i gysylltu drwy ffonio 101 neu 0800 555 111 gan ddefnyddior cyfeirnod RC 1712 2068.",41035472,"[0, 27110, 13093, 3262, 5097, 885, 19237, 28072, 1021, 2342, 102, 225, 19385, 2459, 50, 3089, 102, 225, 4709, 18124, 298, 7027, 118, 625, 1021, 5473, 354, 1020, 19385, 1116, 338, 7027, 1020, 6939, 3309, 858, 10125, 6, 15, 417, 740, 352, 605, 13533, 1423, 19385, 2459, 385, 7180, 16134, 17822, 5777, 28072, 2135, 10125, 885, 19237, 4401, 605, 4, 854, 119, 16134, 1097, 366, 13533, 427, 5097, 6, 974, 1021, 196, 1021, 211, 27228, 1409, 611, 6, 4709, 821, 2459, 605, 890, 90, 856, 44234, 1021, 821, 13161, 271, 7330, 21282, 1423, 282, 23813, 642, 42448, 4, 3066, 225, ...]","[0, 448, 4791, 38481, 885, 19237, 1423, 119, 16134, 1097, 366, 1021, 2342, 102, 225, 226, 32142, 1423, 12627, 261, 854, 338, 12449, 16134, 571, 12476, 939, 885, 20706, 9519, 19258, 298, 7027, 118, 625, 1021, 784, 1116, 338, 7027, 493, 4774, 4, 2]",Mae dyn wedi ymddangos o flaen Llys y Goron Yr Wyddgrug i wynebu cyhuddiad o lofruddiaeth.


Unnamed: 0,attention_mask,document,id,input_ids,labels,summary
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It will be the first time that the tournament has been held in England since 1993, when the home side beat New Zealand in the final at Lords. The tournament, which starts on 26 June next year, consists of 31 matches, with Lords hosting the final on 23 July. It will feature eight teams and will be played in a round-robin format. Steve Elworthy, the ECBs director of events, said the tournament will help drive interest and participation in womens cricket at every level. He added: Its critical we use this event to reach out to young children in particular, so weve moved the tournament start date to earlier in the summer, a decision which will help our host venues encourage attendance by engaging with schools in the build-up to the event.",35521829,"[0, 243, 40, 28, 5, 78, 86, 14, 5, 1967, 34, 57, 547, 11, 1156, 187, 9095, 6, 77, 5, 184, 526, 1451, 188, 3324, 11, 5, 507, 23, 26608, 4, 20, 1967, 6, 61, 2012, 15, 973, 502, 220, 76, 6, 10726, 9, 1105, 2856, 6, 19, 26608, 5162, 5, 507, 15, 883, 550, 4, 85, 40, 1905, 799, 893, 8, 40, 28, 702, 11, 10, 1062, 12, 1001, 9413, 7390, 4, 2206, 1448, 17328, 6, 5, 6899, 29, 736, 9, 1061, 6, 26, 5, 1967, 40, 244, 1305, 773, 8, 5740, 11, 38085, 1290, 5630, 23, 358, 672, ...]","[0, 495, 31679, 2459, 6867, 6, 1063, 6355, 2696, 6867, 6, 24011, 6, 25296, 4643, 2696, 6867, 8, 5736, 18, 33, 57, 1440, 25, 10141, 13, 5, 14305, 2691, 18, 623, 968, 11, 193, 4, 2]","Derbyshire, Leicestershire, Somerset, Gloucestershire and Lord's have been named as venues for the ICC Women's World Cup in 2017."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It was the third day of hefty falls, sparked by news on Wednesday that the company admitted falsifying fuel economy data for more than 600,000 vehicles sold in Japan. Government officials raided a company office and authorities want a full report from the company in weeks. The shares are 40% cheaper than before news of the false data emerged. Elsewhere on the Asian markets, shares of consumer electronics giant Sony also traded lower and closed down 1.7%. The company trimmed nearly 10% off its previous profit estimate for the full year to March 2016, due to a one-off charge. Sony is scheduled to report its financial results next week. On the broader Japanese market, the benchmark Nikkei 225 index reversed earlier losses and ended the Friday session higher by 1.2% - or 208.87 points - at 17,572.49. Other Asian markets traded lower on Friday, mirroring how US markets performed overnight. South Koreas Kospi closed down 0.33% at 2,015.49. In Australia the S&P ASX 200 ended the week down 0.69% at 5,236.39. Chinas Shanghai composite ended up 0.2% to 2,959.24. 
Meanwhile in Hong Kong the Hang Seng index dropped 0.7% to trade at 21,467.",36108480,"[0, 243, 21, 5, 371, 183, 9, 15234, 5712, 6, 6246, 30, 340, 15, 307, 14, 5, 138, 2641, 22461, 4945, 2423, 866, 414, 13, 55, 87, 5594, 6, 151, 1734, 1088, 11, 1429, 4, 1621, 503, 18000, 10, 138, 558, 8, 1247, 236, 10, 455, 266, 31, 5, 138, 11, 688, 4, 20, 327, 32, 843, 207, 7246, 87, 137, 340, 9, 5, 3950, 414, 4373, 4, 13487, 8569, 15, 5, 3102, 1048, 6, 327, 9, 2267, 8917, 3065, 6366, 67, 2281, 795, 8, 1367, 159, 112, 4, 406, 2153, 20, 138, 20856, 823, 158, 207, 160, 63, 986, ...]","[0, 12494, 11, 2898, 10948, 4218, 14228, 1792, 12839, 8484, 12662, 508, 4, 245, 207, 11, 273, 721, 7, 593, 23, 37311, 4796, 4, 2]",Shares in Japanese automaker Mitsubishi Motors plunged 13.5% in Friday trade to close at 504 yen.
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Roedd Jason Cooper wedi bod o flaen llys or blaen ar gyhuddiad o geisio llofruddio Laura Jayne Stuart, ond clywodd y llys ddydd Gwener bod Ms Stuart wedi marw. Ymddangosodd Mr Cooper, 27 oed o Ddinbych, ar gyswllt fideo o garchar Altcourse yn Lerpwl. Maen wynebu cyhuddiadau o lofruddio Ms Stuart yn Ninbych ar 12 Awst, ac o glwyfo David Roberts gydar bwriad o achosi niwed corfforol difrifol iddo. Doedd dim cais am fechnïaeth ac fe gafodd y diffynnydd ei gadw yn y ddalfa nes ir achos yn ei erbyn ddechrau ym mis Chwefror, 2018. Er bod dyn wedi ei gyhuddo, mae Heddlu Gogledd Cymru yn pwysleisio fod eu hymchwiliad yn parhau, a bod swyddogion yn dal i chwilio am y gyllell gafodd ei defnyddio i drywanu Ms Stuart. 
Maer heddlu yn gofyn i unrhyw un sydd â gwybodaeth i gysylltu drwy ffonio 101 neu 0800 555 111 gan ddefnyddior cyfeirnod RC 1712 2068.",41035472,"[0, 27110, 13093, 3262, 5097, 885, 19237, 28072, 1021, 2342, 102, 225, 19385, 2459, 50, 3089, 102, 225, 4709, 18124, 298, 7027, 118, 625, 1021, 5473, 354, 1020, 19385, 1116, 338, 7027, 1020, 6939, 3309, 858, 10125, 6, 15, 417, 740, 352, 605, 13533, 1423, 19385, 2459, 385, 7180, 16134, 17822, 5777, 28072, 2135, 10125, 885, 19237, 4401, 605, 4, 854, 119, 16134, 1097, 366, 13533, 427, 5097, 6, 974, 1021, 196, 1021, 211, 27228, 1409, 611, 6, 4709, 821, 2459, 605, 890, 90, 856, 44234, 1021, 821, 13161, 271, 7330, 21282, 1423, 282, 23813, 642, 42448, 4, 3066, 225, ...]","[0, 448, 4791, 38481, 885, 19237, 1423, 119, 16134, 1097, 366, 1021, 2342, 102, 225, 226, 32142, 1423, 12627, 261, 854, 338, 12449, 16134, 571, 12476, 939, 885, 20706, 9519, 19258, 298, 7027, 118, 625, 1021, 784, 1116, 338, 7027, 493, 4774, 4, 2]",Mae dyn wedi ymddangos o flaen Llys y Goron Yr Wyddgrug i wynebu cyhuddiad o lofruddiaeth.


Unnamed: 0,attention_mask,document,id,input_ids,labels,summary
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It will be the first time that the tournament has been held in England since 1993, when the home side beat New Zealand in the final at Lords. The tournament, which starts on 26 June next year, consists of 31 matches, with Lords hosting the final on 23 July. It will feature eight teams and will be played in a round-robin format. Steve Elworthy, the ECBs director of events, said the tournament will help drive interest and participation in womens cricket at every level. He added: Its critical we use this event to reach out to young children in particular, so weve moved the tournament start date to earlier in the summer, a decision which will help our host venues encourage attendance by engaging with schools in the build-up to the event.",35521829,"[0, 243, 40, 28, 5, 78, 86, 14, 5, 1967, 34, 57, 547, 11, 1156, 187, 9095, 6, 77, 5, 184, 526, 1451, 188, 3324, 11, 5, 507, 23, 26608, 4, 20, 1967, 6, 61, 2012, 15, 973, 502, 220, 76, 6, 10726, 9, 1105, 2856, 6, 19, 26608, 5162, 5, 507, 15, 883, 550, 4, 85, 40, 1905, 799, 893, 8, 40, 28, 702, 11, 10, 1062, 12, 1001, 9413, 7390, 4, 2206, 1448, 17328, 6, 5, 6899, 29, 736, 9, 1061, 6, 26, 5, 1967, 40, 244, 1305, 773, 8, 5740, 11, 38085, 1290, 5630, 23, 358, 672, ...]","[0, 495, 31679, 2459, 6867, 6, 1063, 6355, 2696, 6867, 6, 24011, 6, 25296, 4643, 2696, 6867, 8, 5736, 18, 33, 57, 1440, 25, 10141, 13, 5, 14305, 2691, 18, 623, 968, 11, 193, 4, 2]","Derbyshire, Leicestershire, Somerset, Gloucestershire and Lord's have been named as venues for the ICC Women's World Cup in 2017."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It was the third day of hefty falls, sparked by news on Wednesday that the company admitted falsifying fuel economy data for more than 600,000 vehicles sold in Japan. Government officials raided a company office and authorities want a full report from the company in weeks. The shares are 40% cheaper than before news of the false data emerged. Elsewhere on the Asian markets, shares of consumer electronics giant Sony also traded lower and closed down 1.7%. The company trimmed nearly 10% off its previous profit estimate for the full year to March 2016, due to a one-off charge. Sony is scheduled to report its financial results next week. On the broader Japanese market, the benchmark Nikkei 225 index reversed earlier losses and ended the Friday session higher by 1.2% - or 208.87 points - at 17,572.49. Other Asian markets traded lower on Friday, mirroring how US markets performed overnight. South Koreas Kospi closed down 0.33% at 2,015.49. In Australia the S&P ASX 200 ended the week down 0.69% at 5,236.39. Chinas Shanghai composite ended up 0.2% to 2,959.24. 
Meanwhile in Hong Kong the Hang Seng index dropped 0.7% to trade at 21,467.",36108480,"[0, 243, 21, 5, 371, 183, 9, 15234, 5712, 6, 6246, 30, 340, 15, 307, 14, 5, 138, 2641, 22461, 4945, 2423, 866, 414, 13, 55, 87, 5594, 6, 151, 1734, 1088, 11, 1429, 4, 1621, 503, 18000, 10, 138, 558, 8, 1247, 236, 10, 455, 266, 31, 5, 138, 11, 688, 4, 20, 327, 32, 843, 207, 7246, 87, 137, 340, 9, 5, 3950, 414, 4373, 4, 13487, 8569, 15, 5, 3102, 1048, 6, 327, 9, 2267, 8917, 3065, 6366, 67, 2281, 795, 8, 1367, 159, 112, 4, 406, 2153, 20, 138, 20856, 823, 158, 207, 160, 63, 986, ...]","[0, 12494, 11, 2898, 10948, 4218, 14228, 1792, 12839, 8484, 12662, 508, 4, 245, 207, 11, 273, 721, 7, 593, 23, 37311, 4796, 4, 2]",Shares in Japanese automaker Mitsubishi Motors plunged 13.5% in Friday trade to close at 504 yen.
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Roedd Jason Cooper wedi bod o flaen llys or blaen ar gyhuddiad o geisio llofruddio Laura Jayne Stuart, ond clywodd y llys ddydd Gwener bod Ms Stuart wedi marw. Ymddangosodd Mr Cooper, 27 oed o Ddinbych, ar gyswllt fideo o garchar Altcourse yn Lerpwl. Maen wynebu cyhuddiadau o lofruddio Ms Stuart yn Ninbych ar 12 Awst, ac o glwyfo David Roberts gydar bwriad o achosi niwed corfforol difrifol iddo. Doedd dim cais am fechnïaeth ac fe gafodd y diffynnydd ei gadw yn y ddalfa nes ir achos yn ei erbyn ddechrau ym mis Chwefror, 2018. Er bod dyn wedi ei gyhuddo, mae Heddlu Gogledd Cymru yn pwysleisio fod eu hymchwiliad yn parhau, a bod swyddogion yn dal i chwilio am y gyllell gafodd ei defnyddio i drywanu Ms Stuart. 
Maer heddlu yn gofyn i unrhyw un sydd â gwybodaeth i gysylltu drwy ffonio 101 neu 0800 555 111 gan ddefnyddior cyfeirnod RC 1712 2068.",41035472,"[0, 27110, 13093, 3262, 5097, 885, 19237, 28072, 1021, 2342, 102, 225, 19385, 2459, 50, 3089, 102, 225, 4709, 18124, 298, 7027, 118, 625, 1021, 5473, 354, 1020, 19385, 1116, 338, 7027, 1020, 6939, 3309, 858, 10125, 6, 15, 417, 740, 352, 605, 13533, 1423, 19385, 2459, 385, 7180, 16134, 17822, 5777, 28072, 2135, 10125, 885, 19237, 4401, 605, 4, 854, 119, 16134, 1097, 366, 13533, 427, 5097, 6, 974, 1021, 196, 1021, 211, 27228, 1409, 611, 6, 4709, 821, 2459, 605, 890, 90, 856, 44234, 1021, 821, 13161, 271, 7330, 21282, 1423, 282, 23813, 642, 42448, 4, 3066, 225, ...]","[0, 448, 4791, 38481, 885, 19237, 1423, 119, 16134, 1097, 366, 1021, 2342, 102, 225, 226, 32142, 1423, 12627, 261, 854, 338, 12449, 16134, 571, 12476, 939, 885, 20706, 9519, 19258, 298, 7027, 118, 625, 1021, 784, 1116, 338, 7027, 493, 4774, 4, 2]",Mae dyn wedi ymddangos o flaen Llys y Goron Yr Wyddgrug i wynebu cyhuddiad o lofruddiaeth.


Unnamed: 0,attention_mask,document,id,input_ids,labels,summary
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It will be the first time that the tournament has been held in England since 1993, when the home side beat New Zealand in the final at Lords. The tournament, which starts on 26 June next year, consists of 31 matches, with Lords hosting the final on 23 July. It will feature eight teams and will be played in a round-robin format. Steve Elworthy, the ECBs director of events, said the tournament will help drive interest and participation in womens cricket at every level. He added: Its critical we use this event to reach out to young children in particular, so weve moved the tournament start date to earlier in the summer, a decision which will help our host venues encourage attendance by engaging with schools in the build-up to the event.",35521829,"[0, 243, 40, 28, 5, 78, 86, 14, 5, 1967, 34, 57, 547, 11, 1156, 187, 9095, 6, 77, 5, 184, 526, 1451, 188, 3324, 11, 5, 507, 23, 26608, 4, 20, 1967, 6, 61, 2012, 15, 973, 502, 220, 76, 6, 10726, 9, 1105, 2856, 6, 19, 26608, 5162, 5, 507, 15, 883, 550, 4, 85, 40, 1905, 799, 893, 8, 40, 28, 702, 11, 10, 1062, 12, 1001, 9413, 7390, 4, 2206, 1448, 17328, 6, 5, 6899, 29, 736, 9, 1061, 6, 26, 5, 1967, 40, 244, 1305, 773, 8, 5740, 11, 38085, 1290, 5630, 23, 358, 672, ...]","[0, 495, 31679, 2459, 6867, 6, 1063, 6355, 2696, 6867, 6, 24011, 6, 25296, 4643, 2696, 6867, 8, 5736, 18, 33, 57, 1440, 25, 10141, 13, 5, 14305, 2691, 18, 623, 968, 11, 193, 4, 2]","Derbyshire, Leicestershire, Somerset, Gloucestershire and Lord's have been named as venues for the ICC Women's World Cup in 2017."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It was the third day of hefty falls, sparked by news on Wednesday that the company admitted falsifying fuel economy data for more than 600,000 vehicles sold in Japan. Government officials raided a company office and authorities want a full report from the company in weeks. The shares are 40% cheaper than before news of the false data emerged. Elsewhere on the Asian markets, shares of consumer electronics giant Sony also traded lower and closed down 1.7%. The company trimmed nearly 10% off its previous profit estimate for the full year to March 2016, due to a one-off charge. Sony is scheduled to report its financial results next week. On the broader Japanese market, the benchmark Nikkei 225 index reversed earlier losses and ended the Friday session higher by 1.2% - or 208.87 points - at 17,572.49. Other Asian markets traded lower on Friday, mirroring how US markets performed overnight. South Koreas Kospi closed down 0.33% at 2,015.49. In Australia the S&P ASX 200 ended the week down 0.69% at 5,236.39. Chinas Shanghai composite ended up 0.2% to 2,959.24. 
Meanwhile in Hong Kong the Hang Seng index dropped 0.7% to trade at 21,467.",36108480,"[0, 243, 21, 5, 371, 183, 9, 15234, 5712, 6, 6246, 30, 340, 15, 307, 14, 5, 138, 2641, 22461, 4945, 2423, 866, 414, 13, 55, 87, 5594, 6, 151, 1734, 1088, 11, 1429, 4, 1621, 503, 18000, 10, 138, 558, 8, 1247, 236, 10, 455, 266, 31, 5, 138, 11, 688, 4, 20, 327, 32, 843, 207, 7246, 87, 137, 340, 9, 5, 3950, 414, 4373, 4, 13487, 8569, 15, 5, 3102, 1048, 6, 327, 9, 2267, 8917, 3065, 6366, 67, 2281, 795, 8, 1367, 159, 112, 4, 406, 2153, 20, 138, 20856, 823, 158, 207, 160, 63, 986, ...]","[0, 12494, 11, 2898, 10948, 4218, 14228, 1792, 12839, 8484, 12662, 508, 4, 245, 207, 11, 273, 721, 7, 593, 23, 37311, 4796, 4, 2]",Shares in Japanese automaker Mitsubishi Motors plunged 13.5% in Friday trade to close at 504 yen.
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Roedd Jason Cooper wedi bod o flaen llys or blaen ar gyhuddiad o geisio llofruddio Laura Jayne Stuart, ond clywodd y llys ddydd Gwener bod Ms Stuart wedi marw. Ymddangosodd Mr Cooper, 27 oed o Ddinbych, ar gyswllt fideo o garchar Altcourse yn Lerpwl. Maen wynebu cyhuddiadau o lofruddio Ms Stuart yn Ninbych ar 12 Awst, ac o glwyfo David Roberts gydar bwriad o achosi niwed corfforol difrifol iddo. Doedd dim cais am fechnïaeth ac fe gafodd y diffynnydd ei gadw yn y ddalfa nes ir achos yn ei erbyn ddechrau ym mis Chwefror, 2018. Er bod dyn wedi ei gyhuddo, mae Heddlu Gogledd Cymru yn pwysleisio fod eu hymchwiliad yn parhau, a bod swyddogion yn dal i chwilio am y gyllell gafodd ei defnyddio i drywanu Ms Stuart. 
Maer heddlu yn gofyn i unrhyw un sydd â gwybodaeth i gysylltu drwy ffonio 101 neu 0800 555 111 gan ddefnyddior cyfeirnod RC 1712 2068.",41035472,"[0, 27110, 13093, 3262, 5097, 885, 19237, 28072, 1021, 2342, 102, 225, 19385, 2459, 50, 3089, 102, 225, 4709, 18124, 298, 7027, 118, 625, 1021, 5473, 354, 1020, 19385, 1116, 338, 7027, 1020, 6939, 3309, 858, 10125, 6, 15, 417, 740, 352, 605, 13533, 1423, 19385, 2459, 385, 7180, 16134, 17822, 5777, 28072, 2135, 10125, 885, 19237, 4401, 605, 4, 854, 119, 16134, 1097, 366, 13533, 427, 5097, 6, 974, 1021, 196, 1021, 211, 27228, 1409, 611, 6, 4709, 821, 2459, 605, 890, 90, 856, 44234, 1021, 821, 13161, 271, 7330, 21282, 1423, 282, 23813, 642, 42448, 4, 3066, 225, ...]","[0, 448, 4791, 38481, 885, 19237, 1423, 119, 16134, 1097, 366, 1021, 2342, 102, 225, 226, 32142, 1423, 12627, 261, 854, 338, 12449, 16134, 571, 12476, 939, 885, 20706, 9519, 19258, 298, 7027, 118, 625, 1021, 784, 1116, 338, 7027, 493, 4774, 4, 2]",Mae dyn wedi ymddangos o flaen Llys y Goron Yr Wyddgrug i wynebu cyhuddiad o lofruddiaeth.


Unnamed: 0,attention_mask,document,id,input_ids,labels,summary
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It will be the first time that the tournament has been held in England since 1993, when the home side beat New Zealand in the final at Lords. The tournament, which starts on 26 June next year, consists of 31 matches, with Lords hosting the final on 23 July. It will feature eight teams and will be played in a round-robin format. Steve Elworthy, the ECBs director of events, said the tournament will help drive interest and participation in womens cricket at every level. He added: Its critical we use this event to reach out to young children in particular, so weve moved the tournament start date to earlier in the summer, a decision which will help our host venues encourage attendance by engaging with schools in the build-up to the event.",35521829,"[0, 243, 40, 28, 5, 78, 86, 14, 5, 1967, 34, 57, 547, 11, 1156, 187, 9095, 6, 77, 5, 184, 526, 1451, 188, 3324, 11, 5, 507, 23, 26608, 4, 20, 1967, 6, 61, 2012, 15, 973, 502, 220, 76, 6, 10726, 9, 1105, 2856, 6, 19, 26608, 5162, 5, 507, 15, 883, 550, 4, 85, 40, 1905, 799, 893, 8, 40, 28, 702, 11, 10, 1062, 12, 1001, 9413, 7390, 4, 2206, 1448, 17328, 6, 5, 6899, 29, 736, 9, 1061, 6, 26, 5, 1967, 40, 244, 1305, 773, 8, 5740, 11, 38085, 1290, 5630, 23, 358, 672, ...]","[0, 495, 31679, 2459, 6867, 6, 1063, 6355, 2696, 6867, 6, 24011, 6, 25296, 4643, 2696, 6867, 8, 5736, 18, 33, 57, 1440, 25, 10141, 13, 5, 14305, 2691, 18, 623, 968, 11, 193, 4, 2]","Derbyshire, Leicestershire, Somerset, Gloucestershire and Lord's have been named as venues for the ICC Women's World Cup in 2017."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It was the third day of hefty falls, sparked by news on Wednesday that the company admitted falsifying fuel economy data for more than 600,000 vehicles sold in Japan. Government officials raided a company office and authorities want a full report from the company in weeks. The shares are 40% cheaper than before news of the false data emerged. Elsewhere on the Asian markets, shares of consumer electronics giant Sony also traded lower and closed down 1.7%. The company trimmed nearly 10% off its previous profit estimate for the full year to March 2016, due to a one-off charge. Sony is scheduled to report its financial results next week. On the broader Japanese market, the benchmark Nikkei 225 index reversed earlier losses and ended the Friday session higher by 1.2% - or 208.87 points - at 17,572.49. Other Asian markets traded lower on Friday, mirroring how US markets performed overnight. South Koreas Kospi closed down 0.33% at 2,015.49. In Australia the S&P ASX 200 ended the week down 0.69% at 5,236.39. Chinas Shanghai composite ended up 0.2% to 2,959.24. 
Meanwhile in Hong Kong the Hang Seng index dropped 0.7% to trade at 21,467.",36108480,"[0, 243, 21, 5, 371, 183, 9, 15234, 5712, 6, 6246, 30, 340, 15, 307, 14, 5, 138, 2641, 22461, 4945, 2423, 866, 414, 13, 55, 87, 5594, 6, 151, 1734, 1088, 11, 1429, 4, 1621, 503, 18000, 10, 138, 558, 8, 1247, 236, 10, 455, 266, 31, 5, 138, 11, 688, 4, 20, 327, 32, 843, 207, 7246, 87, 137, 340, 9, 5, 3950, 414, 4373, 4, 13487, 8569, 15, 5, 3102, 1048, 6, 327, 9, 2267, 8917, 3065, 6366, 67, 2281, 795, 8, 1367, 159, 112, 4, 406, 2153, 20, 138, 20856, 823, 158, 207, 160, 63, 986, ...]","[0, 12494, 11, 2898, 10948, 4218, 14228, 1792, 12839, 8484, 12662, 508, 4, 245, 207, 11, 273, 721, 7, 593, 23, 37311, 4796, 4, 2]",Shares in Japanese automaker Mitsubishi Motors plunged 13.5% in Friday trade to close at 504 yen.
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Roedd Jason Cooper wedi bod o flaen llys or blaen ar gyhuddiad o geisio llofruddio Laura Jayne Stuart, ond clywodd y llys ddydd Gwener bod Ms Stuart wedi marw. Ymddangosodd Mr Cooper, 27 oed o Ddinbych, ar gyswllt fideo o garchar Altcourse yn Lerpwl. Maen wynebu cyhuddiadau o lofruddio Ms Stuart yn Ninbych ar 12 Awst, ac o glwyfo David Roberts gydar bwriad o achosi niwed corfforol difrifol iddo. Doedd dim cais am fechnïaeth ac fe gafodd y diffynnydd ei gadw yn y ddalfa nes ir achos yn ei erbyn ddechrau ym mis Chwefror, 2018. Er bod dyn wedi ei gyhuddo, mae Heddlu Gogledd Cymru yn pwysleisio fod eu hymchwiliad yn parhau, a bod swyddogion yn dal i chwilio am y gyllell gafodd ei defnyddio i drywanu Ms Stuart. 
Maer heddlu yn gofyn i unrhyw un sydd â gwybodaeth i gysylltu drwy ffonio 101 neu 0800 555 111 gan ddefnyddior cyfeirnod RC 1712 2068.",41035472,"[0, 27110, 13093, 3262, 5097, 885, 19237, 28072, 1021, 2342, 102, 225, 19385, 2459, 50, 3089, 102, 225, 4709, 18124, 298, 7027, 118, 625, 1021, 5473, 354, 1020, 19385, 1116, 338, 7027, 1020, 6939, 3309, 858, 10125, 6, 15, 417, 740, 352, 605, 13533, 1423, 19385, 2459, 385, 7180, 16134, 17822, 5777, 28072, 2135, 10125, 885, 19237, 4401, 605, 4, 854, 119, 16134, 1097, 366, 13533, 427, 5097, 6, 974, 1021, 196, 1021, 211, 27228, 1409, 611, 6, 4709, 821, 2459, 605, 890, 90, 856, 44234, 1021, 821, 13161, 271, 7330, 21282, 1423, 282, 23813, 642, 42448, 4, 3066, 225, ...]","[0, 448, 4791, 38481, 885, 19237, 1423, 119, 16134, 1097, 366, 1021, 2342, 102, 225, 226, 32142, 1423, 12627, 261, 854, 338, 12449, 16134, 571, 12476, 939, 885, 20706, 9519, 19258, 298, 7027, 118, 625, 1021, 784, 1116, 338, 7027, 493, 4774, 4, 2]",Mae dyn wedi ymddangos o flaen Llys y Goron Yr Wyddgrug i wynebu cyhuddiad o lofruddiaeth.


Unnamed: 0,attention_mask,document,id,input_ids,labels,summary
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It will be the first time that the tournament has been held in England since 1993, when the home side beat New Zealand in the final at Lords. The tournament, which starts on 26 June next year, consists of 31 matches, with Lords hosting the final on 23 July. It will feature eight teams and will be played in a round-robin format. Steve Elworthy, the ECBs director of events, said the tournament will help drive interest and participation in womens cricket at every level. He added: Its critical we use this event to reach out to young children in particular, so weve moved the tournament start date to earlier in the summer, a decision which will help our host venues encourage attendance by engaging with schools in the build-up to the event.",35521829,"[0, 243, 40, 28, 5, 78, 86, 14, 5, 1967, 34, 57, 547, 11, 1156, 187, 9095, 6, 77, 5, 184, 526, 1451, 188, 3324, 11, 5, 507, 23, 26608, 4, 20, 1967, 6, 61, 2012, 15, 973, 502, 220, 76, 6, 10726, 9, 1105, 2856, 6, 19, 26608, 5162, 5, 507, 15, 883, 550, 4, 85, 40, 1905, 799, 893, 8, 40, 28, 702, 11, 10, 1062, 12, 1001, 9413, 7390, 4, 2206, 1448, 17328, 6, 5, 6899, 29, 736, 9, 1061, 6, 26, 5, 1967, 40, 244, 1305, 773, 8, 5740, 11, 38085, 1290, 5630, 23, 358, 672, ...]","[0, 495, 31679, 2459, 6867, 6, 1063, 6355, 2696, 6867, 6, 24011, 6, 25296, 4643, 2696, 6867, 8, 5736, 18, 33, 57, 1440, 25, 10141, 13, 5, 14305, 2691, 18, 623, 968, 11, 193, 4, 2]","Derbyshire, Leicestershire, Somerset, Gloucestershire and Lord's have been named as venues for the ICC Women's World Cup in 2017."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","It was the third day of hefty falls, sparked by news on Wednesday that the company admitted falsifying fuel economy data for more than 600,000 vehicles sold in Japan. Government officials raided a company office and authorities want a full report from the company in weeks. The shares are 40% cheaper than before news of the false data emerged. Elsewhere on the Asian markets, shares of consumer electronics giant Sony also traded lower and closed down 1.7%. The company trimmed nearly 10% off its previous profit estimate for the full year to March 2016, due to a one-off charge. Sony is scheduled to report its financial results next week. On the broader Japanese market, the benchmark Nikkei 225 index reversed earlier losses and ended the Friday session higher by 1.2% - or 208.87 points - at 17,572.49. Other Asian markets traded lower on Friday, mirroring how US markets performed overnight. South Koreas Kospi closed down 0.33% at 2,015.49. In Australia the S&P ASX 200 ended the week down 0.69% at 5,236.39. Chinas Shanghai composite ended up 0.2% to 2,959.24. 
Meanwhile in Hong Kong the Hang Seng index dropped 0.7% to trade at 21,467.",36108480,"[0, 243, 21, 5, 371, 183, 9, 15234, 5712, 6, 6246, 30, 340, 15, 307, 14, 5, 138, 2641, 22461, 4945, 2423, 866, 414, 13, 55, 87, 5594, 6, 151, 1734, 1088, 11, 1429, 4, 1621, 503, 18000, 10, 138, 558, 8, 1247, 236, 10, 455, 266, 31, 5, 138, 11, 688, 4, 20, 327, 32, 843, 207, 7246, 87, 137, 340, 9, 5, 3950, 414, 4373, 4, 13487, 8569, 15, 5, 3102, 1048, 6, 327, 9, 2267, 8917, 3065, 6366, 67, 2281, 795, 8, 1367, 159, 112, 4, 406, 2153, 20, 138, 20856, 823, 158, 207, 160, 63, 986, ...]","[0, 12494, 11, 2898, 10948, 4218, 14228, 1792, 12839, 8484, 12662, 508, 4, 245, 207, 11, 273, 721, 7, 593, 23, 37311, 4796, 4, 2]",Shares in Japanese automaker Mitsubishi Motors plunged 13.5% in Friday trade to close at 504 yen.
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","Roedd Jason Cooper wedi bod o flaen llys or blaen ar gyhuddiad o geisio llofruddio Laura Jayne Stuart, ond clywodd y llys ddydd Gwener bod Ms Stuart wedi marw. Ymddangosodd Mr Cooper, 27 oed o Ddinbych, ar gyswllt fideo o garchar Altcourse yn Lerpwl. Maen wynebu cyhuddiadau o lofruddio Ms Stuart yn Ninbych ar 12 Awst, ac o glwyfo David Roberts gydar bwriad o achosi niwed corfforol difrifol iddo. Doedd dim cais am fechnïaeth ac fe gafodd y diffynnydd ei gadw yn y ddalfa nes ir achos yn ei erbyn ddechrau ym mis Chwefror, 2018. Er bod dyn wedi ei gyhuddo, mae Heddlu Gogledd Cymru yn pwysleisio fod eu hymchwiliad yn parhau, a bod swyddogion yn dal i chwilio am y gyllell gafodd ei defnyddio i drywanu Ms Stuart. 
Maer heddlu yn gofyn i unrhyw un sydd â gwybodaeth i gysylltu drwy ffonio 101 neu 0800 555 111 gan ddefnyddior cyfeirnod RC 1712 2068.",41035472,"[0, 27110, 13093, 3262, 5097, 885, 19237, 28072, 1021, 2342, 102, 225, 19385, 2459, 50, 3089, 102, 225, 4709, 18124, 298, 7027, 118, 625, 1021, 5473, 354, 1020, 19385, 1116, 338, 7027, 1020, 6939, 3309, 858, 10125, 6, 15, 417, 740, 352, 605, 13533, 1423, 19385, 2459, 385, 7180, 16134, 17822, 5777, 28072, 2135, 10125, 885, 19237, 4401, 605, 4, 854, 119, 16134, 1097, 366, 13533, 427, 5097, 6, 974, 1021, 196, 1021, 211, 27228, 1409, 611, 6, 4709, 821, 2459, 605, 890, 90, 856, 44234, 1021, 821, 13161, 271, 7330, 21282, 1423, 282, 23813, 642, 42448, 4, 3066, 225, ...]","[0, 448, 4791, 38481, 885, 19237, 1423, 119, 16134, 1097, 366, 1021, 2342, 102, 225, 226, 32142, 1423, 12627, 261, 854, 338, 12449, 16134, 571, 12476, 939, 885, 20706, 9519, 19258, 298, 7027, 118, 625, 1021, 784, 1116, 338, 7027, 493, 4774, 4, 2]",Mae dyn wedi ymddangos o flaen Llys y Goron Yr Wyddgrug i wynebu cyhuddiad o lofruddiaeth.



In [73]:
tokenized_xsum['test'].features

{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'document': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'summary': Value(dtype='string', id=None)}

# Compare Machine Summaries to Professional Human Written Summaries
To score our machine-generated summaries against professional human-written ones, we compute the cosine similarity between sentence embeddings, which measures the semantic similarity between two texts. We make three comparisons: human summary to machine summary, human summary to the original document, and machine summary to the original document. Initially, we wanted to cap each machine summary at the same length as the XSUM summaries. However, because the XSUM summaries are so short (hence the name extreme summarization), the model only returned the first few words of each article; Model 1 below uses max_length=20 and illustrates this truncation. This makes sense because BART's pretraining likely taught it that the start of a text often contains valuable summarization information. As a result, we opted for a maximum length of 60 (the max_length argument passed to model.generate, which counts tokens rather than words) to keep the output brief while allowing enough context to be meaningful. The word counts of the human summaries for the selected articles are computed below (about 19 words per human summary).
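As a minimal sketch of the scoring arithmetic (not the notebook's own code, which relies on util.pytorch_cos_sim), the three comparisons can be written out in plain Python. The vectors are made-up stand-ins for sentence embeddings, and `cosine_similarity` and `score_triplet` are hypothetical helper names introduced only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def score_triplet(human_vec, machine_vec, article_vec):
    """Return the three pairwise similarities used in this section."""
    return {
        "human_vs_machine": cosine_similarity(human_vec, machine_vec),
        "human_vs_article": cosine_similarity(human_vec, article_vec),
        "machine_vs_article": cosine_similarity(machine_vec, article_vec),
    }

# Toy 3-dimensional "embeddings" (illustrative values only, not real model output).
human = [1.0, 0.0, 1.0]
machine = [2.0, 0.0, 2.0]   # same direction as `human`, twice the magnitude
article = [1.0, 1.0, 0.0]

scores = score_triplet(human, machine, article)
print(scores)
```

Because the toy `machine` vector points in the same direction as `human`, their similarity is 1 regardless of magnitude, which is the size-independence property that motivates using cosine similarity here.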

We focus on 10 articles and run the model on each one (labelled Model 1 through Model 10 below) so that we can inspect each pair individually.

In [74]:
def listToString(s):
    # Concatenate the elements of s (a list of strings, or a single string) into one string
    return "".join(s)

In [75]:
article1 = tokenized_xsum['test']['document'][0]
article2 = tokenized_xsum['test']['document'][123]
article3 = tokenized_xsum['test']['document'][99]
article4 = tokenized_xsum['test']['document'][1100]
article5 = tokenized_xsum['test']['document'][1118]
article6 = tokenized_xsum['test']['document'][45]
article7 = tokenized_xsum['test']['document'][13]
article8 = tokenized_xsum['test']['document'][69]
article9 = tokenized_xsum['test']['document'][27]
article10 = tokenized_xsum['test']['document'][9]

summary1 = tokenized_xsum['test']['summary'][0]
summary2 = tokenized_xsum['test']['summary'][123]
summary3 = tokenized_xsum['test']['summary'][99]
summary4 = tokenized_xsum['test']['summary'][1100]
summary5 = tokenized_xsum['test']['summary'][1118]
summary6 = tokenized_xsum['test']['summary'][45]
summary7 = tokenized_xsum['test']['summary'][13]
summary8 = tokenized_xsum['test']['summary'][69]
summary9 = tokenized_xsum['test']['summary'][27]
summary10 = tokenized_xsum['test']['summary'][9]
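
The twenty assignments above could also be collapsed into a single lookup. The following is only a sketch under assumptions: `select_pairs` and `INDICES` are hypothetical names, and `toy_split` is stand-in data so the example runs without downloading XSUM. In the notebook, the first argument would be tokenized_xsum['test'].

```python
# Row indices used in this section, in the order article1..article10.
INDICES = [0, 123, 99, 1100, 1118, 45, 13, 69, 27, 9]

def select_pairs(split, indices):
    """Return (document, summary) tuples for the given row indices.

    `split` is anything indexable by column name and then by row,
    for example a plain dict of lists.
    """
    documents = split["document"]
    summaries = split["summary"]
    return [(documents[i], summaries[i]) for i in indices]

# Stand-in data (assumption): 1200 fake rows, enough to cover index 1118.
toy_split = {
    "document": [f"article text {i}" for i in range(1200)],
    "summary": [f"summary text {i}" for i in range(1200)],
}

pairs = select_pairs(toy_split, INDICES)
print(len(pairs), pairs[1])
```

Keeping the indices in one list makes it harder for an article and its summary to drift out of sync when a row number is changed.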


In [76]:
summaryList = [summary1.split(),
summary2.split(), 
summary3.split(), 
summary4.split(),
summary5.split(),
summary6.split(),
summary7.split(), 
summary8.split(),
summary9.split(), 
summary10.split()]

count = sum( [ len(listElem) for listElem in summaryList])

print('The total number of words in these summaries is: ', count)
print('The average words per summary is: ', count / len(summaryList))

The total number of words in these summaries is:  186
The average words per summary is:  18.6


## Half of the runs used early_stopping=True and the other half early_stopping=False, to see whether the setting made a meaningful difference

# Model 1

In [77]:
input1 = tokenizer(article1, return_tensors='pt', truncation=True)
summary_ids1 = model.generate(input1['input_ids'], max_length=20)
machineSummary1 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids1])

In [78]:
machineSummary1 = listToString(machineSummary1)
original1 = listToString(article1)

comparison1 = [summary1, machineSummary1, original1]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings1 = token_model.encode(comparison1)
print(util.pytorch_cos_sim(comparison_embeddings1[0], comparison_embeddings1[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings1[0], comparison_embeddings1[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings1[1], comparison_embeddings1[2])) # machine summary to original article

tensor([[0.4147]])
tensor([[0.7645]])
tensor([[0.5997]])


In [79]:
comparison1

['There is a "chronic" need for more housing for prison leavers in Wales, according to a charity.',
 'Prison Link Cymru had 1,099 referrals in 2015-16 and',
 'Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation. Workers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders. The Welsh Government said more people than ever were getting help to address housing problems. Changes to the Housing Act in Wales, introduced in 2015, removed the right for prison leavers to be given priority for accommodation. Prison Link Cymru, which helps people find accommodation after their release, said things were generally good for women because issues such as children or domestic violence were now considered. However, the same could not be said for men, the charity said, because issues which often affect them, such as post traumatic stress disorder or drug dependency

# Model 2

In [80]:
input2 = tokenizer(article2, return_tensors='pt', truncation=True)
summary_ids2 = model.generate(input2['input_ids'], max_length=60)
machineSummary2 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids2])

In [81]:
machineSummary2 = listToString(machineSummary2)
original2 = listToString(article2)

comparison2 = [summary2, machineSummary2, original2]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings2 = token_model.encode(comparison2)
print(util.pytorch_cos_sim(comparison_embeddings2[0], comparison_embeddings2[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings2[0], comparison_embeddings2[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings2[1], comparison_embeddings2[2])) # machine summary to original article

tensor([[0.7189]])
tensor([[0.5850]])
tensor([[0.6048]])


In [82]:
comparison2

["For a man often described as capricious, Tyson Fury's chaotic reign as world heavyweight champion was strangely predictable.",
 'Fury has been speaking about his mental health struggles for years. The repeated claims from Furys camp that his victory was downplayed by the British media, and that they had an agenda against him from the outset, are delusional. Fury is not the first boxer to lose motivation having reached',

# Model 3

In [83]:
input3 = tokenizer(article3, return_tensors='pt', truncation=True)
summary_ids3 = model.generate(input3['input_ids'], max_length=60)
machineSummary3 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids3])

In [84]:
machineSummary3 = listToString(machineSummary3)
original3 = listToString(article3)

comparison3 = [summary3, machineSummary3, original3]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings3 = token_model.encode(comparison3)
print(util.pytorch_cos_sim(comparison_embeddings3[0], comparison_embeddings3[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings3[0], comparison_embeddings3[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings3[1], comparison_embeddings3[2])) # machine summary to original article

tensor([[0.5551]])
tensor([[0.7642]])
tensor([[0.8500]])


In [85]:
comparison3

['A barrister who was due to move into his own chambers in Huddersfield has pleaded guilty to supplying cocaine.',
 'Omar Khan, 31, had worked at The Johnson Partnership in Nottingham for five years. Partner Digby Johnson said he did not represent Khan, who had set up his own office and was set to leave the company. Erlin Manahasa, Albert Dibra and Naza',
 'Omar Khan, 31, had worked at The Johnson Partnership in Nottingham for five years before he was arrested. Erlin Manahasa, Albert Dibra and Nazaquat Ali joined Khan in admitting the same charge, between 1 October  and 4 December last year, at Nottingham Crown Court. They are due to be sentenced on 15 April. Updates on this story and more from Nottinghamshire The court heard the case involved the recovery of 1kg (2.2lb) of cocaine. Digby Johnson, a partner at the Johnson firm, confirmed they did not represent Khan - who had set up his own office and was set to leave the company. I still find it hard to believe he could do something as

# Model 4

In [86]:
input4 = tokenizer(article4, return_tensors='pt', truncation=True)
summary_ids4 = model.generate(input4['input_ids'], max_length=60)
machineSummary4 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids4])

In [87]:
machineSummary4 = listToString(machineSummary4)
original4 = listToString(article4)

comparison4 = [summary4, machineSummary4, original4]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings4 = token_model.encode(comparison4)
print(util.pytorch_cos_sim(comparison_embeddings4[0], comparison_embeddings4[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings4[0], comparison_embeddings4[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings4[1], comparison_embeddings4[2])) # machine summary to original article

tensor([[0.5436]])
tensor([[0.6342]])
tensor([[0.8264]])


In [88]:
comparison4

['Star Wars fans are being given the opportunity to become Jedi Knights and learn how to wield lightsabers in combat.',
 'The sport began eight years ago in Italy but has only just come to England with the first classes in Cheltenham. Instructor Jordan Court said people were already hooked. The lightsabers used in the sport are all hand-made and are provided for use during the classes.',
 'LudoSport has opened its first academy teaching seven forms of combat from the Star Wars world using flexible blades mounted on weighted hilts. The sport began eight years ago in Italy but has only just come to England with the first classes in Cheltenham. Instructor Jordan Court said people were already hooked. The classes in Cheltenham began last month. So far there are six pupils, but this number is expected to increase. Mr Court attended an international boot camp to learn the different stages of the sport which range in characteristics from defensive in stage one to aggressive and flamboyant in 

# Model 5

In [89]:
input5 = tokenizer(article5, return_tensors='pt', truncation=True)
summary_ids5 = model.generate(input5['input_ids'], max_length=60)
machineSummary5 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids5])

In [90]:
machineSummary5 = listToString(machineSummary5)
original5 = listToString(article5)

comparison5 = [summary5, machineSummary5, original5]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings5 = token_model.encode(comparison5)
print(util.pytorch_cos_sim(comparison_embeddings5[0], comparison_embeddings5[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings5[0], comparison_embeddings5[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings5[1], comparison_embeddings5[2])) # machine summary to original article

tensor([[0.5847]])
tensor([[0.6152]])
tensor([[0.9742]])


In [91]:
comparison5

['Awareness rides are taking place to try and cut the number of people on horseback injured or killed on roads.',
 'The Pass Wide and Slow Wales campaign has collected 1,300 signatures on the assemblys e-petition website. It wants an annual road safety awareness campaign explaining to motorists how to react around horses. The British Horse Society found that since 2010 there have been 2,000 road accidents in',
 'The Pass Wide and Slow Wales campaign has collected 1,300 signatures on the assemblys e-petition website. It wants an annual road safety awareness campaign explaining to motorists how to react around horses. The British Horse Society found that since 2010 there have been 2,000 road accidents in the UK, with 1,500 because of cars passing too closely. As a result of these, 180 horses and 36 riders have died. Awareness rides were planned for Penarth, Vale of Glamorgan, Swansea, Neyland in Pembrokeshire, Machynlleth, Powys, Flintshire and Porthmadog in Gwynedd. Any petition with ov

# Model 6

In [92]:
input6 = tokenizer(article6, return_tensors='pt', truncation=True)
summary_ids6 = model.generate(input6['input_ids'], max_length=60, early_stopping=False)
machineSummary6 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids6])

In [93]:
machineSummary6 = listToString(machineSummary6)
original6 = listToString(article6)

comparison6 = [summary6, machineSummary6, original6]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings6 = token_model.encode(comparison6)
print(util.pytorch_cos_sim(comparison_embeddings6[0], comparison_embeddings6[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings6[0], comparison_embeddings6[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings6[1], comparison_embeddings6[2])) # machine summary to original article

tensor([[0.7071]])
tensor([[0.7340]])
tensor([[0.9464]])


In [94]:
comparison6

['Two new councillors have been elected in a by-election in the City of Edinburgh.',
 'SNP topped the vote in the Leith Walk by-election. Scottish Labour won the second seat from the Greens. Deidre Brock of the SNP and Maggie Chapman of the Scottish Greens stood down. It was the first time the Single Transferable Vote (STV) system had',
 'It was the first time the Single Transferable Vote (STV) system had been used to select two members in the same ward in a by-election. The SNP topped the vote in the Leith Walk by-election, while Scottish Labour won the second seat from the Greens. The by-election was called after Deidre Brock of the SNP and Maggie Chapman of the Scottish Greens stood down. The SNPs John Lewis Ritchie topped the Leith Walk poll with 2,290 votes. He was elected at stage one in the STV process with a swing in first-preference votes of 7.6% from Labour. Labours Marion Donaldson received 1,623 votes, ahead of Susan Jane Rae of the Scottish Greens on 1,381. Ms Donaldson wa

# Model 7

In [95]:
input7 = tokenizer(article7, return_tensors='pt', truncation=True)
summary_ids7 = model.generate(input7['input_ids'], max_length=60, early_stopping=False)
machineSummary7 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids7])

In [96]:
machineSummary7 = listToString(machineSummary7)
original7 = listToString(article7)

comparison7 = [summary7, machineSummary7, original7]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings7 = token_model.encode(comparison7)
print(util.pytorch_cos_sim(comparison_embeddings7[0], comparison_embeddings7[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings7[0], comparison_embeddings7[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings7[1], comparison_embeddings7[2])) # machine summary to original article

tensor([[0.7054]])
tensor([[0.6673]])
tensor([[0.9054]])


In [97]:
comparison7

["Torquay United boss Kevin Nicholson says none of the money from Eunan O'Kane's move to Leeds from Bournemouth will go to the playing squad.",
 ' OKane moved for an undisclosed fee, but Nicholson says any money will go to help the cash-strapped club. The Gulls are still looking for new owners having been taken over by a consortium of local business people last summer. They were forced to close down the clubs academy',
 'The National League sold the Republic of Ireland midfielder to the Cherries for £175,000 in 2012 and had a 15% sell-on clause included in the deal. OKane moved for an undisclosed fee, but Nicholson says any money will go to help the cash-strapped club. I dont think Ill be getting anything, Nicholson told BBC Devon. Theres more important things. The Gulls are still looking for new owners having been taken over by a consortium of local business people last summer. They were forced to close down the clubs academy and drastically reduce the playing budget after millionaire

# Model 8

In [98]:
input8 = tokenizer(article8, return_tensors='pt', truncation=True)
summary_ids8 = model.generate(input8['input_ids'], max_length=60, early_stopping=False)
machineSummary8 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids8])

In [99]:
machineSummary8 = listToString(machineSummary8)
original8 = listToString(article8)

comparison8 = [summary8, machineSummary8, original8]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings8 = token_model.encode(comparison8)
print(util.pytorch_cos_sim(comparison_embeddings8[0], comparison_embeddings8[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings8[0], comparison_embeddings8[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings8[1], comparison_embeddings8[2])) # machine summary to original article

tensor([[0.5923]])
tensor([[0.6410]])
tensor([[0.9681]])


In [100]:
comparison8

['Manufacturers have reported positive business trends, in the latest survey from the Scottish Chambers of Commerce.',
 'Manufacturers reported their highest growth in new orders for nearly three years. In retail, there was also a return to optimism - though only just. In tourism, firms reported improving visitor numbers in the final quarter of the year, but falling sales revenues. Construction is expecting an investment dip.',

# Model 9

In [101]:
input9 = tokenizer(article9, return_tensors='pt', truncation=True)
summary_ids9 = model.generate(input9['input_ids'], max_length=60, early_stopping=False)
machineSummary9 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids9])

In [102]:
machineSummary9 = listToString(machineSummary9)
original9 = listToString(article9)

comparison9 = [summary9, machineSummary9, original9]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings9 = token_model.encode(comparison9)
print(util.pytorch_cos_sim(comparison_embeddings9[0], comparison_embeddings9[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings9[0], comparison_embeddings9[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings9[1], comparison_embeddings9[2])) # machine summary to original article

tensor([[0.8161]])
tensor([[0.8348]])
tensor([[0.8977]])


In [103]:
comparison9

['Of his last 30 matches in 2016, Andy Murray won 28 and lost just two.',
 'The world number one has won 21 of his first 30 matches in 2017. Murray has had shingles and an elbow problem, and now his left hip is proving cause for concern. Opting out of two scheduled exhibition matches at the Hurlingham Club in London may not be too',
 'Media playback is not supported on this device Of his first 30 matches in 2017, the world number one has won 21 and lost nine. Winning his last five tournaments of 2016 to pip Novak Djokovic to the year-end number one position in the final match of the season at Londons O2 Arena was astonishing, dramatic and unforgettable. And yet it appears that relentless run of success, and the 87 matches he played over a season, has come at a price. Murrays straight-set defeat by world number 90 Jordan Thompson in the first round at Queens Club was the sixth time he has lost to a player outside the top 20 this year. He has had shingles and an elbow problem, and now hi

# Model 10

In [104]:
input10 = tokenizer(article10, return_tensors='pt', truncation=True)
summary_ids10 = model.generate(input10['input_ids'], max_length=60, early_stopping=False)
machineSummary10 = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids10])

In [105]:
machineSummary10 = listToString(machineSummary10)
summary10 = listToString(summary10)
original10 = listToString(article10)

comparison10 = [summary10, machineSummary10, original10]
token_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
comparison_embeddings10 = token_model.encode(comparison10)
print(util.pytorch_cos_sim(comparison_embeddings10[0], comparison_embeddings10[1])) # human summary to machine summary similarity
print(util.pytorch_cos_sim(comparison_embeddings10[0], comparison_embeddings10[2])) # human summary to original article
print(util.pytorch_cos_sim(comparison_embeddings10[1], comparison_embeddings10[2])) # machine summary to original article

tensor([[0.7916]])
tensor([[0.7987]])
tensor([[0.7452]])


In [106]:
comparison10

["Manager Brendan Rodgers is sure Celtic can exploit the wide open spaces of Hampden when they meet Rangers in Sunday's League Cup semi-final.",
 "Celtic face Rangers in the Scottish Cup semi-final at Hampden Park. Brendan Rodgers' side beat Rangers 5-1 at Celtic Park last month. Rodgers lost two semi-finals in his time at Liverpool and is aiming to make it third time lucky at the club he joined",
 'Im really looking forward to it - the home of Scottish football, said Rodgers ahead of his maiden visit. I hear the pitch is good, a nice big pitch suits the speed in our team and our intensity. The technical area goes right out to the end of the pitch, but you might need a taxi to get back to your staff. This will be Rodgers second taste of the Old Firm derby and his experience of the fixture got off to a great start with a 5-1 league victory at Celtic Park last month. It was a brilliant performance by the players in every aspect, he recalled. Obviously this one is on a neutral ground, but

# Conclusion

The machine summary had a higher cosine similarity to the original article than the human summary did in 70% of the runs. However, this may be influenced by length: the machine summaries were about 3x the size of the average human summary. The early_stopping=True/False argument did not appear to have any real effect on cosine similarity at a max length of 60 (we compared the 10 models with and without it and obtained similar results). In 20% of the runs, the machine and human summaries had roughly equivalent cosine similarities.

The pre-trained transformers do provide relevant summaries of these articles, so there appears to be a real use case for generating news article snippets in products like Bloomberg First Word or other content editors.

It was also interesting that the machine summary was generally more similar to the article than the human summary was; however, the human summary was much shorter and still generally scored relatively high. This suggests that the human-written summaries are more concise and convey more meaningful information in less text, and are therefore better summaries. This makes sense, since the summaries are generally written by the authors of the articles. It also appears that, for articles about sports and athletes, human summaries are shorter and more semantically similar to the articles than machine summaries are. This may be an area where Hugging Face could focus on pretraining new pipelines, transformers, and models in the future to expand their use cases.
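
One way to tally the article-similarity scores is sketched below. The numbers are copied from the printed outputs for Models 1 to 10 above; the `tally` helper and the 0.06 tolerance for "roughly equivalent" are assumptions made for illustration, not the method the authors state they used.

```python
# (human summary vs article, machine summary vs article) for Models 1-10,
# copied from the printed tensors in this section.
SCORES = [
    (0.7645, 0.5997),
    (0.5850, 0.6048),
    (0.7642, 0.8500),
    (0.6342, 0.8264),
    (0.6152, 0.9742),
    (0.7340, 0.9464),
    (0.6673, 0.9054),
    (0.6410, 0.9681),
    (0.8348, 0.8977),
    (0.7987, 0.7452),
]

def tally(scores, tolerance=0.0):
    """Count runs where the machine summary is closer to the article,
    roughly equal (within `tolerance`), or further away than the human one."""
    higher = equal = lower = 0
    for human_vs_article, machine_vs_article in scores:
        diff = machine_vs_article - human_vs_article
        if abs(diff) <= tolerance:
            equal += 1
        elif diff > 0:
            higher += 1
        else:
            lower += 1
    return higher, equal, lower

strict = tally(SCORES)                    # no tolerance
lenient = tally(SCORES, tolerance=0.06)   # assumed threshold, illustration only
print("strict:", strict)
print("with tolerance:", lenient)
```

With a strict comparison the machine summary scores higher in 8 of the 10 runs; with the assumed 0.06 tolerance the split is 7 higher, 2 roughly equal, and 1 lower, so the percentages quoted depend on how "equivalent" is defined.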