# Summarization with Transformers

![](https://i.imgur.com/7SXKckD.png)

Transfer learning is the practice of taking already trained models and tuning or adapting them to our own downstream tasks.

# Automated document summarization methodologies

- __Extractive techniques:__ These methods use mathematical and statistical concepts to extract a key subset of content from the original document, such that this subset contains the core information of the entire document. This content could be words, phrases or even sentences. The end result is a short executive summary of a couple of lines taken, or extracted, from the original document. No new content is generated, hence the name "extractive".


- __Abstractive techniques:__ These methods are more complex and sophisticated. They leverage language semantics to build representations of the text and use natural language generation (NLG) techniques, in which the machine draws on knowledge bases and semantic representations to generate text on its own and create summaries much as a human would write them.
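
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top-scoring sentences verbatim. The function name `extractive_summary`, the naive regex sentence splitter and the toy document are all assumptions made for illustration. Real extractive systems use far more careful scoring (stop-word handling, TF-IDF, graph ranking and so on).

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences of `text`, kept in original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)


doc = (
    "The council approved the new sign. "
    "The sign cost a lot of money. "
    "Critics said the sign was a waste. "
    "It rained on Tuesday."
)
summary = extractive_summary(doc, num_sentences=2)
print(summary)
```

Every sentence in the output is copied unchanged from the input, which is exactly what separates this family of methods from the abstractive approach used in the rest of this notebook.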

# Abstractive Summarization by Fine-tuning Transformers

In this section we’ll take a look at how Transformer models can be used to condense long documents into summaries, a task known as abstractive text summarization.

This is one of the most challenging NLP tasks as it requires a range of abilities, such as understanding long passages and generating coherent text that captures the main topics in a document.

However, when done well, text summarization is a powerful tool that can speed up various business processes by relieving domain experts of the burden of reading long documents in detail.

Remember that this is a sequence-to-sequence problem: the model needs past documents in both their long form and their short (summary) form to learn enough patterns to take in a new long document and summarize it.



## Install Relevant Libraries



In [None]:
!pip install datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [None]:
!pip install transformers


You will be leveraging 🤗 Transformers and 🤗 Datasets, as well as other dependencies.
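
The `huggingface/tokenizers` warning printed by the install cells suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch of one way to do that from inside the notebook is shown below. Choosing `"false"` here is an assumption that avoiding the fork warning matters more than parallel tokenization speed.

```python
import os

# Set this before any tokenizer is used, so the value is already in place
# when the notebook forks a subprocess (for example via `!pip install ...`).
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```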

## Load Dataset

Here we load the CNN / DailyMail dataset, an English-language dataset containing just over 300k unique news articles written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.

You can find the dataset on [HuggingFace Datasets](https://huggingface.co/datasets/cnn_dailymail).

For each instance, there is a string for the article, a string for the highlights, and a string for the id.

![](rhRwLI1.png)
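
As a rough illustration of that record layout, the sketch below builds one record by hand and measures how much shorter the highlights are than the article. The `describe` helper and the `"example-id"` value are hypothetical. The text is abridged from a sample shown later in this notebook, where the highlights appear as newline-separated bullet sentences.

```python
# Hand-built record with the same three string fields as the dataset.
record = {
    "article": (
        "A bomb squad has been called to a residential street when police "
        "discovered possible explosives in the back of a vehicle. However, "
        "after 10 hours of investigations the squad was still unable to "
        "confirm whether or not the suspicious parcel contained explosives. "
        "Police cordoned off Lawley Crescent and for much of the day "
        "residents were advised to stay inside."
    ),
    "highlights": (
        "Police discovered possible explosives during a routine RBT in Perth .\n"
        "The bomb squad was called in and the residential street was cordoned off ."
    ),
    "id": "example-id",  # hypothetical; real ids are hash-like strings
}


def describe(record):
    """Split highlights into bullet sentences and compare word counts."""
    bullets = [line.strip() for line in record["highlights"].split("\n") if line.strip()]
    article_words = len(record["article"].split())
    summary_words = len(record["highlights"].split())
    return {
        "bullets": bullets,
        "article_words": article_words,
        "summary_words": summary_words,
        "compression": article_words / max(summary_words, 1),
    }


stats = describe(record)
print(stats["bullets"])
print(stats["article_words"], stats["summary_words"])
```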

In [None]:
from datasets import load_dataset, load_metric

cnn_data = load_dataset("cnn_dailymail", '3.0.0')

100%|██████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 238.15it/s]


In [None]:
cnn_data.keys()

dict_keys(['train', 'validation', 'test'])

In [None]:
cnn_data['train'][:2]

{'article': ['LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office cha

Given infrastructure constraints, we subset the dataset and limit training to 10,000 records, with 2,000 records each for validation and testing.

In [None]:
cnn_data['train'] = cnn_data['train'].shuffle(seed=42).select(range(10000))
cnn_data['validation'] = cnn_data['validation'].shuffle(seed=42).select(range(2000))
cnn_data['test'] = cnn_data['test'].shuffle(seed=42).select(range(2000))

Loading cached shuffled indices for dataset at /home/hp/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de/cache-380ef232430d96da.arrow
Loading cached shuffled indices for dataset at /home/hp/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de/cache-fe8ce4e4283768ec.arrow
Loading cached shuffled indices for dataset at /home/hp/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de/cache-914896ee6a47e4aa.arrow
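
Fixing the seed before selecting the first `n` rows is what makes the subset reproducible from run to run. The plain-Python sketch below illustrates that idea only. The `seeded_subset` helper is hypothetical and is not how 🤗 Datasets implements `shuffle` or `select`, so it will not pick the same indices as the cell above.

```python
import random


def seeded_subset(records, n, seed=42):
    """Shuffle indices with a fixed seed, then keep the first n records."""
    indices = list(range(len(records)))
    # A private RNG instance: the same seed always yields the same order.
    random.Random(seed).shuffle(indices)
    return [records[i] for i in indices[:n]]


records = [f"article-{i}" for i in range(100)]
first = seeded_subset(records, 10)
second = seeded_subset(records, 10)
print(first == second)
```

Because the subset is random rather than the first 10,000 rows in file order, it is less likely to inherit any ordering bias present in the original split.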


The `cnn_data` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets:

In [None]:
cnn_data

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 2000
    })
})

To access an actual element, you need to select a split first, then give an index:

In [None]:
cnn_data["train"][0]

{'article': "By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 March 2013 . | . UPDATED: . 08:07 EST, 2 March 2013 . Three members of the same family who died in a static caravan from carbon monoxide poisoning would have been unconscious 'within minutes', investigators said today. The bodies of married couple John and Audrey Cook were discovered alongside their daughter, Maureen, at the mobile home they shared on Tremarle Home Park in Camborne, west Cornwall. The inquests have now opened into the deaths last Saturday, with investigators saying the three died along with the family's pet dog, of carbon monoxide poisoning from a cooker. Tragic: The inquests have opened into the deaths of three members of the same family who were found in their static caravan last weekend. John and Audrey Cook are pictured . Awful: The family died following carbon monoxide poisoning at this caravan at the Tremarle Home Park in Camborne, Cornwall . It is also believed there was no working carbon monoxide detect

To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(cnn_data["train"])

Unnamed: 0,article,highlights,id
0,"The United States has no plans to send troops back into Iraq despite the bloody resurgence of an al-Qaeda faction that has captured major cities, seized hundreds of millions of dollars and forced more than a half-million people to flee their homes this week. America has 35,000 troops station around the Middle East, a Pentagon official confirmed on Wednesday, including 10,000 in nearby Kuwait. A State Department official told MailOnline on background that there are no plans to use them. The Sunni-led group called the Islamic State of Iraq and the Levant (ISIL) – also known as the Islamic State of Iraq in Syria (ISIS) – was formerly known as Al-Qaeda in Iraq. On Tuesday the White House touted its past support of the Iraqi government with copious amounts of military hardware, but hinted that a return of armed personnel would be out of the question. SCROLL DOWN FOR VIDEO . U.S.Special Forces participated in the 'Eager Lion' joint military exercises in the Gulf of Aqaba on June 5, along with Kuwait, Jordan and France; the US has 2,000 troops stations there, along with a contingent of F-16 fighter jets, and left a Patriot missile battery behind after the war games concluded . Warlord Abu Bakr al-Baghdadi has seized control of the Iraqi provincial capital of Tikrit just a day of gaining power in the country's second biggest city Mosul. ISIS militants gathered in this photo in Iraq's Nineveh province . Iraqi soldiers were no match for the group formerly known as Al-Qaeda in Iraq as Jihadists seized all of Mosul and Nineveh province and also took areas in Kirkuk province, to its east, and Salaheddin to the south . 
'Our shipments, in terms of assistance to Iraq, have included the delivery of 300 Hellfire missiles, millions of rounds of small arms fire, thousands of rounds of tank ammunition, helicopter-fired rockets, machine guns, grenades, flares, sniper rifles, M16s and M4 rifles to the Iraqi security forces,' Deputy White House Press Secretary Joshua Earnest eagerly cataloged. And a Pentagon source told MailOnline that the U.S. expected to speed up a planned transfer of Apache helicopters – a sale that was put on the back burner after Iraq made a large purchase of guns and ammunition from neighboring Iran. America's support of Nouri al-Maliki's Shia Muslim-led government, Earnest said Tuesday, has been 'rapid, comprehensive, and is continuing.' But Maliki, he said, must 'step up to the plate' and 'better meet the needs of the Iraqi people,' rather than counting on America to ride to the rescue. Separately, Earnest praised former Secretary of State Hillary Clinton for 'ending the war in Iraq, responsibly winding down the war in Afghanistan, and decimating and destroying core al-Qaeda.' Clinton is a likely Democratic front-runner for the presidency in 2016. Aboard Air Force One en route to Massachusetts on Wednesday, Earnest stayed away from any suggestion that President Obama and Defense Secretary Chuck Hagel might intervene with boots on the ground. 'The United States is deeply concerned about the continued aggression of ISIL in Iraq,' he said, referring to the deterioration of security there as a humanitarian issue. 'The situation in Iraq is grave,' Earnest said, according to a White House pool reporter. There is no doubt that the situation has deteriorated over the last 24 hours.' Earnest, the pool reporter wrote, 'said Washington was continuing to work with the Iraqi government to see how it could help.' The United States has considerable forces at its disposal that could be sent into Iraq, or maneuvered to patrol its waters, if Obama should decide to intervene. 
A Pentagon official told MailOnline that the U.S. maintains a fighting force of approximately 10,000 troops in Kuwait and 2,000 in Jordan. The units in Jordan include a detachment of F-16 fighter jets and a Patriot missile battery that remained behind after the 2013 joint 'Eager Lion' drills with the Jordanian military. The U.S. also maintains a Combined Air Operations Center in Qatar, and the Navy's 5th fleet in Bahrain. Asked for a comprehensive list of forces in the area, . U.S. Central Command spokesman Commander Bill Speaks told MailOnline: . 'That's kind of a loaded question. Certainly we have a significant . military presence there.' 'Of course, we will not provide details of military assets,' Cmdr. Speaks said in a followup email, 'but there are roughly 35,000 total US forces in the Middle East region.' The U.S. Navy's Fifth Fleet is based in the island nation of Bahrain, just 300 miles from Iraq's shores on the Persian Gulf . America has an estimated 35,000 troops in the Middle East, many of whom could quickly reach Iraq if the White House should decide to intervene against ISIS . EXODUS: As many as 500,000 Iraqis have been forced to flee the country's second biggest city of Mosul after militants from an al-Qaeda splinter group seized control . The State Department said Tuesday that the U.S. 'supports a strong, coordinated response to push back against this aggression in Mosul,' Iraq's oil-rich and second largest city, which is now in ISIS hands. Abu Bakr al-Baghdadi, the head of the so called Islamic State of Iraq and the Levant, is a warlord considered more dangerous than the late Osama bin Laden . But it made no suggestion that American troops should be part of that response. On Wednesday, ISIS and its terrorist warlord Abu Bakr al-Baghdadi took control of Tikrit, another Iraqi city. Maliki has asked his parliament to declare martial law throughout the country. But the U.S. 
government, like Britain's, has signaled that moral support and armaments will be the limit of its help. The U.S. pulled its last ground troops out of Iraq in December 2011, following nearly nine years of costly and controversial deployments involving 1.5 million troops. More than 30,000 Americans were wounded in the conflict, and nearly 4,500 were killed. The Obama administration took a victory lap at the time of the final pullout, with the president declaring in a speech at Fort Bragg, N.C. that from that day forward 'Iraqis future will be in the hands of its people. America's war in Iraq will be over.' 'It's harder to end a war than begin one,' Obama said, presaging the slogan that has marked his military withdrawal from Afghanistan. 'Indeed, everything that American troops have done in Iraq – all the fighting and all the dying, the bleeding and the building, and the training and the partnering – all of it has led to this moment of success.' ISIS now controls Mosul, Tikrit and parts of Syria . He also claimed then that U.S. forces had 'broken the momentum of the Taliban' in Afghanistan, and had 'gone after al-Qaeda so that terrorists who threaten America will have no safe haven and Osama bin Laden will never again walk the face of this Earth.' But according to a senior U.S. intelligence official who spoke with The Washington Post, Abu Bakr al-Baghdadi, the ISIS leader who also goes by the moniker 'Abu Dua,' is 'more violent, more virulent, [and] more anti-American' than bin Laden. He claims to be a direct descendant of the Muslim prophet Muhammad. The U.S. currently has a $10 million bounty on his head. Republican Senators John McCain, Lindsey Graham and Kelly Ayotte said Tuesday that a 'growing threat to our national security interests is the cost of President Obama’s decision to withdraw all of our troops from Iraq in 2011, against the advice of our commanders and regardless of conditions on the ground.' 
'Unfortunately,' they said in a statement, the president is now making the same disastrous mistake in Afghanistan, increasing the risk that al-Qaeda and its terrorist allies will return there just as they are in Iraq.'","The US has 35,000 troops stationed in the Middle East including 10,000 in Kuwait – plus 10,000 troops, an F-16 detachment and a .\nPatriot missile battery in Jordan .\nPresident Obama completed his troop withdrawal from Iraq in December 2011, leaving the country in the .\nhands of government security forces .\nThe White House said Tuesday that Hillary Clinton deserves credit for 'ending the war in Iraq, responsibly winding down the war in Afghanistan, and decimating and destroying core al-Qaeda'\nThree GOP senators warned that the Iraq mess is a preview of Afghanistan once the U.S. completes the Obama-led troop draw-down there .\nThe Islamic State of Iraq in Syria (ISIS), formerly known as Al-Qaeda in Iraq Islamic State of Iraq, is capturing cities, seizing money and oil, and displacing hundreds of thousands of people .\nAmerica has provided the Iraqi government with copious military materiel but doesn't plan to respond to ISIS's advances with troops .\nInstead, Washington has told Baghdad to 'step up to the plate' and help its people in ways that freeze out terror groups .",27e95991703b8af7e4ab4cf003792c5c33a35851
1,"A bomb squad has been called to a residential street when police discovered possible explosives in the back of a vehicle. However, after 10 hours of investigations the squad was still unable to confirm whether or not the suspicious parcel contained explosives. A woman had been stopped by a Random Breath and Drug Testing Unit at 2.50am in inner northern Perth suburb Mount Lawley. The driver 'aroused suspicion' leading to a search of the car and police discovering the questionable package, according to ABC News. The female driver has been detained following the discovery of 'possible explosives' in her vehicle . It was at this point that the possible explosives were detected. Police cordoned off Lawley Crescent and for much of the day residents were advised to stay inside. The woman in question has been detained and is assisting police with their enquiries. The vehicle has now been taken away for testing as police continue investigations to determine the contents of the package.","Police discovered possible explosives during a routine RBT in Perth .\nThe bomb squad was called in and the residential street was cordoned off .\nThe female driver's behaviour 'aroused suspicion', leading police to search her vehicle and uncover the suspicious package .\nAfter 10 hours, squad was still unable to determine the package's contents .\nThe driver has been detained and is helping police with investigations .\nThe car has been taken away for testing as police investigations continue .",dcdc1d0bcdc21ddd0ed9c7b362ba9d714661b914
2,"A homeless man was shot dead by police after he hit two officers with rocks and refused to put down other stones, authorities have said. Police Chief Bob Metzger told a news conference that officers had used a stun gun on Antonio Zambrano-Montes in Pasco, Washington, but it had no effect. He added because of Zambrano-Montes's 'threatening behavior', police fired their guns. Metzger said he did not know whether a weapon was found. But multiple witnesses say the man was running away from the scene when he was killed at about 5pm on Tuesday. Scroll down for video . Erika Zambrano holds a photo of shooting victim Antonio Zambrano-Montes outside the city hall building in Pasco, Washington. He was shot and killed by Pasco police officers during a confrontation on Tuesday . They told the Tri-City Herald the man had run about half a block when he was killed about 5pm on Tuesday near the Fiesta Foods store. The 35-year-old's last address was a Pasco homeless shelter, according to Franklin County Coroner Dan Blasdel. He was an orchard worker raised in Michoacan, Mexico, who had lived in Pasco for the last ten years and didn't speak any English, the Tri-City Herald reported. His cousin Blanca Zambrano told the newspaper: 'He was a kind person, family-oriented. He was hardworking.' The shooting occurred after officers responded to a report of a man throwing rocks at cars at a busy intersection near a grocery store. Dario Infante, 21, of Pasco, recorded video from a vehicle about 50 feet away as the scene unfolded. He said he decided to start recording when he saw an officer trying to use a stun gun on the man. Infante said he saw the man throw a few rocks at police officers but he didn't see him hit any officers. Five 'pops' are audible shortly after the video begins, and the man can be seen running away, across a street and down a sidewalk, pursued by three officers. 
Police investigate the scene of an officer involved shooting at the intersection of 10th Avenue and Lewis Street in Pasco, Washington . As the officers draw closer to the running man, he stops, turns around and faces them. Multiple 'pops' are heard and the man falls to the ground. 'He didn't throw any rocks after he started running,' Infante said. Several dozen people gathered at Pasco City Hall yesterday afternoon to raise concerns about the shooting. The ACLU of Washington also issued a statement, calling the incident 'very disturbing.' The group's executive director, Kathleen Taylor said: 'Fleeing from police and not following an officer's command should not be sufficient for a person to get shot,' She added deadly force should be used only as a last resort. Pasco residents, pictured from left, Angel Morgan, five, and his brother Jose Morgan, six, and Alex Gonzalez, four, and his brother Angel Gonzalez, eight, gather around a candlelit vigil yesterday in memory of Antonio Zambrano-Montes . Ben Patrick told the newspaper police fired at the man as his back was turned. 'I really thought they were just going walk up and tackle or tase him,' he said. 'But they opened fire. His back was turned.' Patrick's wife, Shannon, also said the man was running away. The shooting happened in front of her young children. 'He turned around to take off running and the cops just shot him,' she said. 'All he was trying to do was walk away.' Other witnesses heard officers give the man orders to stop and drop the rock. They said the man refused to listen. Metzger has identified the three officers involved in the shooting. They were placed on leave for the investigation, a standard practice. The Tri-City Special Investigation Unit, which will not include Pasco police, will investigate. Investigators are looking at cellphone video of the scene that has been posted online. 
Carlos Sanchez, who witnessed the shooting from the grocery store parking lot, also said it looked like the man was running away from officers when he was killed. 'They started shooting and they kept on shooting him,' he said. The case is the fourth fatal shooting involving a Tri-City police officer in Pasco in the last six months. Officers have been cleared of any wrongdoing in all three previous cases.","Antonio Zambrano-Montes, 35, was shot dead for his 'threatening behavior'\nBut multiple witnesses report he was running away when he was killed .\nIt's the fourth fatal shooting involving a Tri-City police officer in Pasco in the last six months .\nOfficers have been cleared of any wrongdoing in all three previous cases .",07e6e604c46097159d01a4d8a4b32fe96ec467eb
3,"By . Sophie Jane Evans . A Chinese meat factory has been shut down following allegations that it supplied out-of-date meat to American fast-food chains across the country. Shanghai Husi Food Co, a unit of U.S.-based food supplier OSI Group, was temporarily closed after allegedly selling expired chicken and beef to Chinese branches of McDonald's and KFC. A TV report showed workers apparently picking up meat from the factory floor, as well as mixing meat beyond its expiration date with fresh produce. Scroll down for video . Shut down: Shanghai Husi Food Co has been shut down following allegations that it supplied out-of-date meat to American fast-food chains across China. Above, employees work at the factory prior to its closure . Shockingly, employees were even heard saying that if their clients knew what they were doing, the firm would lose its contracts. McDonald's and Yum Brands Inc - owner of KFC, Pizza Hut and Taco Bell, with over 6,200 Chinese branches collectively - immediately stopped using the supplier after the Dragon TV report aired. Meanwhile, the Shanghai office of China's food and drug agency said it was . investigating the allegations, and told customers to suspend use of the supplier's . products. 'At present, the company has been sealed and suspect products seized,' the Shanghai Municipal Food and Drug Administration said on its website. Under investigation: The factory allegedly sold expired chicken and beef to branches of McDonald's and KFC . Angry: McDonald's (pictured) and Yum Brands Inc - owner of KFC, Pizza Hut and Taco Bell, with over 6,200 Chinese branches collectively - immediately stopped using the supplier after the allegations became public . McDonald's . sealed 4,500 cases of beef, pork, chicken and other products supplied . by Husi for investigation, the city government said in a statement. The Communist Party secretary of . Shanghai, Han Zheng, has reportedly called for 'severe punishment' of any wrongdoing. 
It is the latest food safety scare for McDonald's and KFC, which were hurt by a safety scandal in 2012 involving chicken allegedly pumped with unapproved antibiotic drugs and growth hormones. Today, the chains apologised to customers following the TV report, adding that the factory had served restaurants in the Shanghai area. 'We will not tolerate any violations of government laws and regulations from our suppliers,' said Yum China, which ordered all of its KFC and Pizza Hut restaurants to seal up and stop using all meat materials supplied by the Husi factory. Meanwhile a spokesman for McDonald's, which was provided with chicken, beef and lettuce by Husi, told Reuters: 'If proven, the practices outlined in the . reports are completely unacceptable to McDonald's anywhere in the . world. The fast-food branches also said they were conducting their . own investigations. China is McDonald's third-biggest market as measured in number of restaurants, while Yum's KFC, based in Louisville, Kentucky, is China's biggest . restaurant chain, with more than 4,000 outlets and plans to open 700 . more this year. 'I think this is going to be really challenging for both these firms,' said Benjamin Cavender, Shanghai-based principal at China Market Research Group. Scandal: It is the latest food safety scare for McDonald's and KFC (pictured), which were hurt by a safety scandal in 2012 involving chicken allegedly pumped with unapproved antibiotic drugs and growth hormones . 'I don't know that this is something an apology can fix so easily, because at this point people don't have a whole lot of trust that they have good systems in place.' Yum shares were down 3.5 percent at $74.72 and McDonald's shares were down 0.9 percent at $98.13 on Monday afternoon on the New York Stock Exchange. The Shanghai Municipal Food and Drug Administration shut down Husi on Sunday after the local Chinese TV broadcast aired. OSI said on its Chinese website that management was 'appalled by the report.' 
The company has formed its own investigation team, is fully cooperating with government inspectors and will take all necessary actions based on results of the investigation. 'Management believes this to be an isolated event, but takes full responsibility for the situation,' OSI said. OSI, which has close to 60 manufacturing facilities worldwide and had revenue of more than $5 billion in 2012, has been supplying McDonald's in China since 1992 and KFC and Pizza Hut parent Yum since 2008, according to its website. News of the scare spread quickly to diners negotiating Shanghai's lunch-hour rush today. 'For now I won't go to eat at McDonald's or KFC, at least until this whole thing settles down,' said Xu Xinyu, 24, a financial services worker, eating at a noodle shop near a McDonald's outlet in downtown Shanghai. Yet some Chinese consumers appear to have developed a comparatively thick skin when it comes to food scandals. 'Isn't everywhere like this?' asked student Li Xiaoye, 20, eating a beef burger in a Shanghai McDonald's outlet. 'I'll keep going because wherever I eat, the issues are all the same.' The incident highlights the difficulty in ensuring quality and safety along the supply chain in China. Wal-Mart Stores Inc came under the spotlight this year after a supplier's donkey meat product was found to contain fox meat. It also came under fire for selling expired duck meat in 2011. OSI is one of McDonald's key meat suppliers and has a good reputation, according to an industry insider speaking on condition of anonymity. He added the incident highlighted the issue firms faced enforcing strict processes with local staff. As well as Yum and McDonald's, OSI listed Starbucks Corp , Japan's Saizeriya Co Ltd, Papa John's International Inc, Burger King Worldwide Inc and Doctor's Associates Inc's Subway brand as clients in China, according to a 2012 press release. 
A Starbucks spokesman told Reuters that the company does not now have any direct business dealings with Husi Food. Burger King, Subway, Papa John's and Saizeriya did not immediately respond to requests for comment. A woman who answered the phone at Husi's headquarters said no one was available to comment. But a company manager, Yang Liqun, told Xinhua News Agency that Husi has a strict . quality control system and will cooperate in the investigation.",Shanghai Husi Food Co Ltd temporarily shut down by Chinese authorities .\nAllegedly supplied out-of-date meat to U.S. fast food chains across China .\nTV report also showed workers apparently picking up meat from the floor .\nMcDonald's and KFC immediately stopped using supplier following report .\nChina's food and drug agency is investigating allegations against factory .\nShanghai Husi is the Chinese unit of U.S.-based food supplier OSI Group .,7a2c0ba18336842e1137d24c04338b6c67a4d724
4,"By . Steve Robson . PUBLISHED: . 03:37 EST, 12 August 2013 . | . UPDATED: . 04:57 EST, 12 August 2013 . A cost-cutting local council has sparked anger after spending £76,000 on a bespoke 3D sign welcoming visitors to the town. Bournemouth Borough Council, which has cut millions from its budget and shed scores of staff, believes the new signage will 'promote a sense of arrival for visitors'. The 'Welcome to Bournemouth' sign, which sits above the A338 road, has been criticised by councillors as a waste of taxpayers' money. Costly: The £76,000 'Welcome to Bournemouth' sign which has been erected on the A338 . Controversial: The council said the new sign will make visitors feel more welcome and more likely to return to Bournemouth . Labour councillor Ben Grower described it as a 'flight of fancy'. He told The Sun newspaper: 'Jobs are being cut and services not expanded. This is the biggest two fingers to the people I have seen in many years.' Tourism bosses at the Conservative-led authority, which has plans to save £76million over five years, believe it will make visitors feel more welcome and more likely to return to the coastal town. Councillor Lawrence Williams said: . 'First impressions are everything and this is why it's important to make . the gateway into our town as welcoming as possible. This type of . signage makes a statement about our town and the community it represents . as well as a significant contribution to the way an area is perceived. 'Tourism is worth over £600m to the . local economy, so the more welcoming our town is, the more the . likelihood is that visitors will make return trips in the future, . hopefully staying for longer and spending more time and money in the . Borough. Hot spot: Tourists enjoy warm weather on Bournemouth beach over the weekend . 'This in turn continues to boost the local economy. We are proud of our town and the new signage is a great way to greet people coming to spend time here.' 
Mike Francis, president of the Bournemouth Tourism Management Board, also backed the project. 'It demonstrates that we value the immeasurable contribution tourism makes to the local economy, and the signage is something the industry has been working with the Council to provide for a long time, to further reflect the significance of Bournemouth as one of the UK’s premier resorts,' he said. A spokesman for the council added that . the £76,000 accounts for the design, installation, traffic management . and power supply for the sign.",Bournemouth Borough Council erects costly sign above A338 road .\nCritics brand it a 'flight of fancy' and a waste of taxpayers' money .\nAuthority says it will make visitors feel more welcome .,8d34a2d497a007435170115f043f38dfcc3eb7c1


We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the function `load_metric`.  

In [None]:
!pip install rouge_score



In [None]:
metric = load_metric("rouge")
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric).

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings.

For summarization, one of the most commonly used metrics is the ROUGE score (short for Recall-Oriented Understudy for Gisting Evaluation).

The basic idea behind this metric is to compare a generated summary against a set of reference summaries that are typically created by humans.

To make this more precise, suppose we want to compare the following two summaries:

```
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"
```

One way to compare them could be to count the number of overlapping words, which in this case would be 6.

However, this is a bit crude, so instead ROUGE is based on computing the precision and recall scores for the overlap.

For ROUGE, recall measures how much of the reference summary is captured by the generated one. If we are just comparing words, recall can be calculated according to the following formula:

![](iYgPhYB.png)

For our simple example above, this formula gives a perfect recall of 6/6 = 1; i.e., all the words in the reference summary have been produced by the model.


This may sound great, but imagine if our generated summary had been “I really really loved reading the Hunger Games all night”. This would also have perfect recall, but is arguably a worse summary since it is verbose.

To deal with these scenarios we also compute the precision, which in the ROUGE context measures how much of the generated summary was relevant:

![](4aadAXM.png)


Applying this to our verbose summary gives a precision of 6/10 = 0.6, which is considerably worse than the precision of 6/7 = 0.86 obtained by our shorter one.

In practice, both precision and recall are usually computed, and then the F1-score (the harmonic mean of precision and recall) is reported.
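To make the arithmetic above concrete, here is a minimal sketch that computes word-level precision, recall and F1 by hand for the two example summaries. The helper name `unigram_scores` and the lowercase/whitespace tokenization are our own simplifications for illustration; the real ROUGE implementation tokenizes more carefully.

```python
from collections import Counter

def unigram_scores(generated: str, reference: str):
    """Return (precision, recall, f1) based on overlapping words."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word,
    # so a repeated word is only credited as often as it occurs in both texts.
    overlap = sum((gen_counts & ref_counts).values())
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "I loved reading the Hunger Games"
short = unigram_scores("I absolutely loved reading the Hunger Games", reference)
verbose = unigram_scores(
    "I really really loved reading the Hunger Games all night", reference
)

print(short)    # precision 6/7, recall 6/6
print(verbose)  # precision 6/10, recall 6/6
```

Both summaries reach a recall of 1, but the verbose one is penalized through its lower precision, which is exactly why the F1-score is the number usually reported.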



In [1]:
#!wget http://i.imgur.com/rhRwLI1.png

In [None]:
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"
scores = metric.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
 'rouge2': AggregateScore(low=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), mid=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), high=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272)),
 'rougeL': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
 'rougeLsum': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.92307692307

🤗 Datasets actually computes confidence intervals for precision, recall, and F1-score; these are the low, mid, and high attributes you can see here.
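In the output above, low, mid and high are identical because we scored a single prediction. The sketch below illustrates why, assuming the interval comes from bootstrap resampling of the per-example scores (our understanding of the underlying `rouge_score` aggregator). The function name, the number of resamples and the percentile rule are illustrative assumptions, not the library's code.

```python
import random

def bootstrap_interval(scores, n_samples=1000, confidence=0.95, seed=0):
    """Return (low, mid, high) percentiles of the resampled mean score."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        # Draw as many scores as we have, with replacement, and average them.
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()

    def percentile(p):
        return means[min(int(p * n_samples), n_samples - 1)]

    tail = (1 - confidence) / 2
    return percentile(tail), percentile(0.5), percentile(1 - tail)

# With one example every resample is the same, so the interval collapses.
single = bootstrap_interval([12 / 13])
# With several (hypothetical) per-example scores the three values spread out.
many = bootstrap_interval([0.2, 0.5, 0.9, 0.4])
print(single)
print(many)
```

On a real evaluation set with many examples, the gap between low and high gives a rough sense of how stable the reported score is.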

Moreover, 🤗 Datasets computes a variety of ROUGE scores which are based on different types of text granularity when comparing the generated and reference summaries.

- rouge1 is the overlap of unigrams — this is just a fancy way of saying the overlap of words

- rouge2 measures the overlap between bigrams (think the overlap of pairs of words)

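- rougeL and rougeLsum are based on the longest common subsequence (LCS) of words shared by the generated and reference summaries; unlike an n-gram, the matching words must appear in the same order but need not be adjacent

- The “sum” in rougeLsum refers to the fact that this metric is computed at the summary level: the text is split into sentences on newlines and the LCS is computed sentence by sentence, while rougeL ignores newlines and treats the whole text as a single sequence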
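The rouge2 and rougeL numbers printed earlier can be reproduced by hand on the same example. This is a minimal sketch with our own helper names (`ngrams`, `lcs_length`) and simple whitespace tokenization; it is not the library's implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

gen = "I absolutely loved reading the Hunger Games".lower().split()
ref = "I loved reading the Hunger Games".lower().split()

# rouge2: overlap of bigrams
gen_bi, ref_bi = Counter(ngrams(gen, 2)), Counter(ngrams(ref, 2))
bigram_overlap = sum((gen_bi & ref_bi).values())
rouge2_precision = bigram_overlap / sum(gen_bi.values())
rouge2_recall = bigram_overlap / sum(ref_bi.values())

# rougeL: longest common subsequence of words
lcs = lcs_length(gen, ref)
rougeL_precision = lcs / len(gen)
rougeL_recall = lcs / len(ref)

print(rouge2_precision, rouge2_recall)  # 4/6 and 4/5
print(rougeL_precision, rougeL_recall)  # 6/7 and 6/6
```

Inserting "absolutely" breaks two bigrams of the reference ("I loved" is lost) but leaves the subsequence intact, which is why rouge2 drops while rougeL matches rouge1 here.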

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which (as the name indicates) tokenizes the inputs, converts the tokens to their corresponding IDs in the pretrained vocabulary, puts them in the format the model expects, and generates the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library.

Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

![](MFE2vfu.png)

T5 can be used for a variety of tasks and we will fine-tune it for summarization.

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [None]:
tokenizer("Hello, this is a sentence!")

{'input_ids': [8774, 6, 48, 19, 3, 9, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [None]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this is a sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 19, 3, 9, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}




If you are using a T5 checkpoint, as we are here, you have to prefix the inputs with "summarize:" (the model can also perform other tasks, such as translation, and it needs the prefix to know which task it has to perform).

In [None]:
prefix = "summarize: "

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`.

This will ensure that an input longer than what the selected model can handle will be truncated to the maximum length accepted by the model.

The padding will be dealt with later on (in a data collator) so we pad examples to the longest length in the batch and not the whole dataset.
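To see why padding per batch is preferable to padding the whole dataset to one length, the sketch below counts the token slots each strategy would allocate. The article lengths are made-up numbers, and `padded_size` is a hypothetical helper written only for this illustration.

```python
def padded_size(lengths, batch_size, per_batch=True):
    """Total number of token slots once every sequence has been padded."""
    if not per_batch:
        # Pad everything to the single longest sequence in the dataset.
        return max(lengths) * len(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        # Pad only to the longest sequence inside this batch.
        total += max(chunk) * len(chunk)
    return total

max_input_length = 1024
raw_lengths = [300, 1500, 120, 90, 800, 60]  # hypothetical token counts
# Truncation caps each sequence at max_input_length.
lengths = [min(n, max_input_length) for n in raw_lengths]

per_batch = padded_size(lengths, batch_size=2, per_batch=True)
whole_dataset = padded_size(lengths, batch_size=2, per_batch=False)
print(per_batch, whole_dataset)
```

With these numbers, per-batch padding allocates 3888 slots instead of 6144, and the gap tends to grow when a few very long documents sit in a dataset of mostly short ones.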

In [None]:
max_input_length = 1024
max_target_length = 128

In [None]:
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["highlights"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
preprocess_function(cnn_data["train"][:2])

{'input_ids': [[21603, 10, 938, 3, 5, 11016, 12528, 3, 5, 3, 10744, 8775, 20619, 2326, 10, 3, 5, 10668, 10, 4928, 3, 6038, 6, 204, 1332, 2038, 3, 5, 1820, 3, 5, 3, 6880, 4296, 11430, 10, 3, 5, 12046, 10, 4560, 3, 6038, 6, 204, 1332, 2038, 3, 5, 5245, 724, 13, 8, 337, 384, 113, 3977, 16, 3, 9, 14491, 22133, 45, 4146, 1911, 6778, 15, 14566, 53, 133, 43, 118, 25429, 3, 31, 4065, 77, 676, 31, 6, 16273, 7, 243, 469, 5, 37, 5678, 13, 4464, 1158, 1079, 11, 31423, 6176, 130, 3883, 5815, 70, 3062, 6, 7758, 60, 35, 6, 44, 8, 1156, 234, 79, 2471, 30, 4691, 1635, 109, 1210, 1061, 16, 5184, 12940, 6, 4653, 26334, 5, 37, 16, 10952, 7, 43, 230, 2946, 139, 8, 14319, 336, 1856, 6, 28, 16273, 7, 2145, 8, 386, 3977, 590, 28, 8, 384, 31, 7, 3947, 1782, 6, 13, 4146, 1911, 6778, 15, 14566, 53, 45, 3, 9, 21859, 5, 21902, 447, 10, 37, 16, 10952, 7, 43, 2946, 139, 8, 14319, 13, 386, 724, 13, 8, 337, 384, 113, 130, 435, 16, 70, 14491, 22133, 336, 1851, 5, 1079, 11, 31423, 6176, 33, 3, 22665, 3, 5, 71, 210, 1329

To apply this function to every article/summary pair in our dataset, we just use the `map` method of the `cnn_data` object we created earlier.

This will apply the function to all the elements of all the splits in `cnn_data`, so our training, validation and testing data will be preprocessed in one single command.

In [None]:
tokenized_datasets = cnn_data.map(preprocess_function, batched=True)



Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook.

The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and that the cached data should therefore not be reused).

For instance, it will properly detect if you change the task in the first cell and rerun the notebook.

## Fine-tuning the Transformer Model

Now that our data is ready, we can download the pretrained model and fine-tune it.

Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class.

Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

To instantiate a `Seq2SeqTrainer`, we will need to define three more things.

The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training.

It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
#!pip install --upgrade accelerate
#!pip uninstall -y transformers accelerate
#!pip install transformers accelerate

In [None]:
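batch_size = 16
# Keep only the last path component of the checkpoint name (here "t5-small")
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    # Output folder for the checkpoints; the "xsum" suffix is only a label
    f"{model_name}-finetuned-xsum",
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,  # keep at most three checkpoints on disk
    num_train_epochs=3,
    predict_with_generate=True,  # generate summaries during evaluation
    fp16=True,  # mixed precision training
    push_to_hub=False,
)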

Here,

- we set the evaluation to be done at the end of each epoch
- tweak the learning rate
- use the `batch_size` defined at the top of the cell
- customize the weight decay

Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to make three saves maximum.

Lastly, we use the `predict_with_generate` option (to properly generate summaries) and activate mixed precision training (to go a bit faster).

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
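Conceptually, the collator pads inputs with the tokenizer's pad token and pads labels with -100, the value the loss function ignores. The following is a plain-Python sketch of that idea, not the actual `DataCollatorForSeq2Seq` code; the function name `collate` and the pad id of 0 (the value we assume for the T5 tokenizer) are illustrative.

```python
IGNORE_INDEX = -100  # label positions with this value do not contribute to the loss

def collate(features, pad_id=0):
    """Pad inputs and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * (max_in - n_in))
        # 1 marks a real token, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * (max_lab - n_lab))
    return batch

features = [
    {"input_ids": [8774, 6, 48, 19, 3, 9, 7142, 55, 1], "labels": [100, 19, 1]},
    {"input_ids": [100, 19, 430, 7142, 5, 1], "labels": [8774, 1]},
]
batch = collate(features)
print(batch["labels"])
```

This is also why the metric function later has to replace -100 in the labels before decoding them.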

In [None]:
import numpy as np
import nltk

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/hp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions.

We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
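Two steps in `compute_metrics` are easy to misread: swapping -100 back to the pad token before decoding, and counting non-pad tokens to get the generated length. Here is a toy illustration of both on plain lists rather than NumPy arrays, with hypothetical token IDs and a pad id of 0 assumed for T5.

```python
IGNORE_INDEX = -100  # value used to pad the labels
PAD_TOKEN_ID = 0     # assumption: pad id of the T5 tokenizer

def unmask_labels(labels):
    """Replace the ignore index with the pad id so the labels can be decoded."""
    return [
        [tok if tok != IGNORE_INDEX else PAD_TOKEN_ID for tok in row]
        for row in labels
    ]

def generated_lengths(predictions):
    """Number of non-pad tokens in each generated sequence."""
    return [sum(tok != PAD_TOKEN_ID for tok in row) for row in predictions]

labels = [[100, 19, 430, 7142, 5, 1, -100, -100]]
predictions = [[0, 100, 19, 430, 1, 0, 0, 0]]  # hypothetical generated IDs

clean_labels = unmask_labels(labels)
lengths = generated_lengths(predictions)
mean_len = sum(lengths) / len(lengths)
print(clean_labels, lengths, mean_len)
```

Because the padding is then dropped by `skip_special_tokens=True` during decoding, the replaced positions never show up in the decoded reference text.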

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by just calling the `train` method:

In [None]:
import time

In [None]:
%%time
# train
trainer.train()



| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.8728 | 1.694398 | 24.3691 | 11.7584 | 20.0911 | 22.9119 | 19.0 |
| 2 | 1.875 | 1.692535 | 24.3472 | 11.719 | 20.0292 | 22.9016 | 19.0 |
| 3 | 1.8652 | 1.690754 | 24.4292 | 11.8036 | 20.1069 | 22.9629 | 19.0 |


CPU times: user 26min 37s, sys: 12.7 s, total: 26min 50s
Wall time: 26min 49s


TrainOutput(global_step=1875, training_loss=1.8704330403645832, metrics={'train_runtime': 1609.1057, 'train_samples_per_second': 18.644, 'train_steps_per_second': 1.165, 'total_flos': 8120055539171328.0, 'train_loss': 1.8704330403645832, 'epoch': 3.0})
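The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below uses only values printed above and assumes a single device with no gradient accumulation, so that one optimizer step consumes one batch of 16 examples.

```python
global_step = 1875
num_train_epochs = 3
batch_size = 16
train_runtime = 1609.1057        # seconds
train_samples_per_second = 18.644
train_steps_per_second = 1.165

# Steps per epoch, and the largest training set that fits in that many batches.
steps_per_epoch = global_step // num_train_epochs
max_examples_per_epoch = steps_per_epoch * batch_size

# Throughput figures should multiply back to the totals.
samples_processed = train_runtime * train_samples_per_second
steps_processed = train_runtime * train_steps_per_second

print(steps_per_epoch, max_examples_per_epoch)
print(round(samples_processed), round(steps_processed))
```

Under those assumptions the run covered 625 steps per epoch, which corresponds to roughly 10,000 training examples seen three times, consistent with the reported throughput.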

# Using your fine-tuned model for Summarization

Once you’ve fine-tuned the model, you can use it for inference with a pipeline object as follows:

In [None]:
from transformers import pipeline

In [None]:
model.to('cpu')

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [None]:
summarize = pipeline(task='summarization', model=model, tokenizer=tokenizer)

In [None]:
# An official announcement https://blog.google/technology/ai/gemini-collection/

document = """
Learn more about Gemini, our most capable AI model
Dec 06, 2023 3 articles
Gemini is a multimodal AI model, meaning it can process and generate different formats of data, including text, code, audio, images, and video.
This sets it apart from previous models, which were primarily focused on text-based tasks
Today we introduced Gemini, our largest and most capable AI model — and the next step on our journey toward making AI helpful for everyone. Built from the ground up to be multimodal, Gemini can generalize and seamlessly understand, operate across and combine different types of information, including text, images, audio, video and code. This means it has sophisticated multimodal reasoning and advanced coding capabilities. And with three different sizes — Ultra, Pro and Nano — Gemini has the flexibility to run on everything from data centers to mobile devices. We trained Gemini at scale on our AI-optimized infrastructure using Google's Tensor Processing Units (TPUs) v4 and v5e. Today, we also announced our most powerful and scalable TPU system to date, Cloud TPU v5p.
Gemini is available in some of our core products starting today: Bard is using a fine-tuned version of Gemini Pro for more advanced reasoning, planning, understanding and more. Pixel 8 Pro is the first smartphone engineered for Gemini Nano, using it in features like Summarize in Recorder and Smart Reply in Gboard. And we’re already starting to experiment with Gemini in Search, where it's making our Search Generative Experience (SGE) faster. Early next year, we’ll bring Gemini Ultra to a new Bard Advanced experience. And in the coming months, Gemini will power features in more of our products and services like Ads, Chrome and Duet AI.
Android developers who want to build Gemini-powered apps on-device can now sign up for an early preview of Gemini Nano, our most efficient model, via Android AICore. Starting December 13, developers and enterprise customers will be able to access Gemini Pro via the Gemini API in Vertex AI or Google AI Studio, our free web-based developer tool. And as we continue to refine Gemini Ultra, including completing extensive trust and safety checks, we’ll make it available to select groups before opening it up broadly to developers and enterprise customers early next year.
Explore the collection to learn more about our newest model, and the start of the Gemini era.
"""

In [None]:
document

"\nLearn more about Gemini, our most capable AI model\nDec 06, 2023 3 articles\nGemini is a multimodal AI model, meaning it can process and generate different formats of data, including text, code, audio, images, and video. \nThis sets it apart from previous models, which were primarily focused on text-based tasks\nToday we introduced Gemini, our largest and most capable AI model — and the next step on our journey toward making AI helpful for everyone. Built from the ground up to be multimodal, Gemini can generalize and seamlessly understand, operate across and combine different types of information, including text, images, audio, video and code. This means it has sophisticated multimodal reasoning and advanced coding capabilities. And with three different sizes — Ultra, Pro and Nano — Gemini has the flexibility to run on everything from data centers to mobile devices. We trained Gemini at scale on our AI-optimized infrastructure using Google's Tensor Processing Units (TPUs) v4 and v5e

In [None]:
authors_summary = """Gemini is a multimodal AI model, meaning it can process and generate different formats of data, including text, code, audio, images, and video.
This sets it apart from previous models, which were primarily focused on text-based tasks"""


In [None]:
summary = summarize(document)[0]['summary_text']

In [None]:
print('\n'.join(nltk.sent_tokenize(summary)))

Gemini is a multimodal AI model, meaning it can process and generate different formats of data, including text, code, audio, images, and video .
Built from the ground up to be multimodal, Gemini can generalize and seamlessly understand, operate across and combine different types of information .
This means it has sophisticated multimodal reasoning and advanced coding capabilities .


We can feed some examples from the test set (which the model has not seen) to our pipeline to get a feel for the quality of the summaries.

In [None]:
for item in cnn_data['test'].shuffle(seed=42).select(range(10)):
    print('Actual Headline:-\n', item['highlights'])
    print()
    summary = summarize(item['article'])[0]['summary_text']
    print('Summarized Headline:-\n', '\n'.join(nltk.sent_tokenize(summary)))
    print('\n')



Actual Headline:-
 Battle between lenders has intensified recently, causing rates to plummet .
HSBC have now announced 1.99% interest deal on a five-year fix mortgage .
Offer expected to spark flood of rate cuts by banks and building societies .
Experts have described cheapest deal ever of its kind as 'astonishing'

Summarized Headline:-
 The battle between lenders has intensified in recent months, plunging home loan rates to their lowest in history .
But the mortgage wars will erupt again next week after HSBC announced a 1.99 per cent interest rate on a five-year fix .
Fifteen lenders had already cut rates across their ranges in the past week, but more are likely to follow .


Actual Headline:-
 BBC's Back In Time For Dinner claims that grubs are the future of food .
The Robshaw family dig into cricket tacos, worm tarts and insect burgers .
Meat will become scarce or more expensive as demand for it grows .
Insects are full of protein, low in fat and packed full of nutrients .

Summari