## Installing the required Hugging Face libraries

In [1]:
pip install transformers datasets evaluate rouge_score

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)

In [2]:
pip install sentencepiece



## Logging in to the Hugging Face account so the model can be shared with the community

In [3]:
from huggingface_hub import notebook_login

notebook_login()


### Mounting the Google Drive

In [8]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


## TODO Recording:

- Go to the BBC folder
- Show the subfolders
- Click through "News Articles" and show the categories
- Click through to a category and show the files


In [9]:
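import re

def extract(filepath):
    # Capture the category folder name and the numeric file ID from a path
    # ending in "<category>/<id>.txt".
    pattern = r"(\w+)/(\d+)\.txt$"

    category, file_id = re.search(pattern, str(filepath)).groups()

    # NOTE: "unicode_escape" decodes the bytes as Latin-1, so UTF-8 characters
    # such as the pound sign show up as "Â£" in the outputs below.
    with open(filepath, "r", encoding = "unicode_escape") as f:
        text = f.read()

    return category, file_id, text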
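
To make the path parsing concrete, here is a small sketch of how the regular expression in `extract` splits a path into a category and a file ID, and of what the `unicode_escape` codec does to a UTF-8 encoded pound sign. The sample path and `sample_bytes` are made up for illustration and are not taken from the dataset.

```python
import re

# Same pattern as in extract(): "<category>/<digits>.txt" at the end of the path.
pattern = r"(\w+)/(\d+)\.txt$"

# Hypothetical path, used only for illustration.
sample_path = "BBC News Summary/News Articles/business/001.txt"
category, file_id = re.search(pattern, sample_path).groups()
print(category, file_id)  # business 001

# "unicode_escape" decodes bytes as Latin-1 (plus backslash escapes), so the
# two UTF-8 bytes of a pound sign come out as two separate characters.
sample_bytes = "£46m".encode("utf-8")
decoded = sample_bytes.decode("unicode_escape")
print(decoded)  # Â£46m
```

Note that the file ID stays a string (`"001"`), which is why the merge on `("Category", "ID")` lines up articles and summaries that share the same file name.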

## **Creating the dataframe of Articles and Summaries**

In [10]:
import pandas as pd
from pathlib import Path

dataset_path = Path("/content/drive/MyDrive/2. Generative AI/3. Codes/1. Text Summarization using Hugging Face Models/BBC/BBC News Summary")

articles_data = list(map(extract, dataset_path.glob("News Articles/*/*.txt")))
summaries_data = list(map(extract, dataset_path.glob("Summaries/*/*.txt")))

articles_df = pd.DataFrame(articles_data, columns = ("Category", "ID", "Article"))
summaries_df = pd.DataFrame(summaries_data, columns = ("Category", "ID", "Summary"))

news_summary_df = articles_df.merge(summaries_df, how = "inner", on = ("Category", "ID"))

news_summary_df.head()

| | Category | ID | Article | Summary |
|---|---|---|---|---|
| 0 | business | 1 | Ad sales boost Time Warner profit\n\nQuarterly... | TimeWarner said fourth quarter sales rose 2% t... |
| 1 | business | 2 | Dollar gains on Greenspan speech\n\nThe dollar... | The dollar has hit its highest level against t... |
| 2 | business | 3 | Yukos unit buyer faces loan claim\n\nThe owner... | Yukos' owner Menatep Group says it will ask Ro... |
| 3 | business | 4 | High fuel prices hit BA's profits\n\nBritish A... | Rod Eddington, BA's chief executive, said the ... |
| 4 | business | 5 | Pernod takeover talk lifts Domecq\n\nShares in... | Pernod has reduced the debt it took on to fund... |


### Looking at an example article and its summary

In [11]:
news_summary_df["Article"].iloc[10]

"Ask Jeeves tips online ad revival\n\nAsk Jeeves has become the third leading online search firm this week to thank a revival in internet advertising for improving fortunes.\n\nThe firm's revenue nearly tripled in the fourth quarter of 2004, exceeding $86m (Â£46m). Ask Jeeves, once among the best-known names on the web, is now a relatively modest player. Its $17m profit for the quarter was dwarfed by the $204m announced by rival Google earlier in the week. During the same quarter, Yahoo earned $187m, again tipping a resurgence in online advertising.\n\nThe trend has taken hold relatively quickly. Late last year, marketing company Doubleclick, one of the leading providers of online advertising, warned that some or all of its business would have to be put up for sale. But on Thursday, it announced that a sharp turnaround had brought about an unexpected increase in profits. Neither Ask Jeeves nor Doubleclick thrilled investors with their profit news, however. In both cases, their shares f

In [12]:
news_summary_df["Summary"].iloc[10]

'Ask Jeeves has become the third leading online search firm this week to thank a revival in internet advertising for improving fortunes.Its $17m profit for the quarter was dwarfed by the $204m announced by rival Google earlier in the week.During the same quarter, Yahoo earned $187m, again tipping a resurgence in online advertising.Neither Ask Jeeves nor Doubleclick thrilled investors with their profit news, however.Ask Jeeves, once among the best-known names on the web, is now a relatively modest player.'

## Representing the data as a Hugging Face Dataset object built from the pandas DataFrame

In [13]:
from datasets import Dataset, DatasetDict

news_summary_ds = Dataset.from_pandas(news_summary_df, preserve_index = False)

news_summary_ds

Dataset({
    features: ['Category', 'ID', 'Article', 'Summary'],
    num_rows: 2225
})

## **Cleaning the Text**

In [14]:
def clean_txt(example):
    # Normalize both text fields: lowercase them, then replace backslashes,
    # slashes, newlines, possessive "'s" and double quotes with spaces.
    for txt in ["Article", "Summary"]:
        example[txt] = example[txt].lower()
        example[txt] = example[txt].replace("\\", " ")
        example[txt] = example[txt].replace("/", " ")
        example[txt] = example[txt].replace("\n", " ")
        example[txt] = example[txt].replace("'s", " ")
        example[txt] = example[txt].replace('"', ' ')
    return example
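
As a quick check of what these replacements do, the sketch below applies a copy of `clean_txt` to a tiny hand-written record. The `sample` dictionary is hypothetical and only mimics the shape of one dataset row.

```python
def clean_txt(example):
    # Copy of the cleaning function above so this snippet runs on its own.
    for txt in ["Article", "Summary"]:
        example[txt] = example[txt].lower()
        example[txt] = example[txt].replace("\\", " ")
        example[txt] = example[txt].replace("/", " ")
        example[txt] = example[txt].replace("\n", " ")
        example[txt] = example[txt].replace("'s", " ")
        example[txt] = example[txt].replace('"', ' ')
    return example

# Hypothetical record shaped like one row of the dataset.
sample = {
    "Article": 'Japan\'s economy "teetered"\n\non the brink',
    "Summary": "BA's profits/costs",
}

cleaned = clean_txt(dict(sample))  # pass a copy so `sample` stays untouched
print(repr(cleaned["Article"]))
print(repr(cleaned["Summary"]))  # 'ba  profits costs'
```

Each replaced character becomes a space, so runs of two or three spaces are left behind (visible in the cleaned example further down). If that matters for a downstream step, one option is to collapse whitespace afterwards with `" ".join(text.split())`.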

In [15]:
cleaned_news_summary_ds = news_summary_ds.map(clean_txt)

cleaned_news_summary_ds

Dataset({
    features: ['Category', 'ID', 'Article', 'Summary'],
    num_rows: 2225
})

### Observing one example of raw data

In [16]:
news_summary_ds[0]

{'Category': 'business',
 'ID': '001',
 'Article': 'Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (Â£600m) for the three months to December, from $639m year-earlier.\n\nThe firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.\n\nTime Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL\'s underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free 

### Observing the cleaned version of the same example

In [17]:
cleaned_news_summary_ds[0]

{'Category': 'business',
 'ID': '001',
 'Article': "ad sales boost time warner profit  quarterly profits at us media giant timewarner jumped 76% to $1.13bn (â£600m) for the three months to december, from $639m year-earlier.  the firm, which is now one of the biggest investors in google, benefited from sales of high-speed internet connections and higher advert sales. timewarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. its profits were buoyed by one-off gains which offset a profit dip at warner bros, and less users for aol.  time warner said on friday that it now owns 8% of search-engine google. but its own internet business, aol, had has mixed fortunes. it lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. however, the company said aol  underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. it hopes to increase subscribers by offering the online service free to timew

# **Using Google's Pegasus**

## Preprocessing

In [18]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# PegasusForConditionalGeneration gives us access to the Pegasus model
# PegasusTokenizer gives us access to the model's tokenizer

### Instantiating the Pegasus tokenizer and model

In [19]:
PEGASUS_MODEL = "google/pegasus-cnn_dailymail"  #pegasus model

tokenizer = PegasusTokenizer.from_pretrained(PEGASUS_MODEL)
model = PegasusForConditionalGeneration.from_pretrained(PEGASUS_MODEL)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Observing Pegasus tokenizer info

In [20]:
tokenizer

PegasusTokenizer(name_or_path='google/pegasus-cnn_dailymail', vocab_size=96103, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'mask_token': '<mask_2>', 'additional_special_tokens': ['<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>'

### Observing the Pegasus model, which contains both an Encoder and a Decoder

In [21]:
model

PegasusForConditionalGeneration(
  (model): PegasusModel(
    (shared): Embedding(96103, 1024, padding_idx=0)
    (encoder): PegasusEncoder(
      (embed_tokens): Embedding(96103, 1024, padding_idx=0)
      (embed_positions): PegasusSinusoidalPositionalEmbedding(1024, 1024)
      (layers): ModuleList(
        (0-15): 16 x PegasusEncoderLayer(
          (self_attn): PegasusAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): ReLU()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_no

## **Applying Model for summarization of a text**

In [22]:
EXAMPLE_TEXT_INDEX = 5

example_text = cleaned_news_summary_ds["Article"][EXAMPLE_TEXT_INDEX]

example_text

'japan narrowly escapes recession  japan  economy teetered on the brink of a technical recession in the three months to september, figures show.  revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter. on an annual basis, the data suggests annual growth of just 0.2%, suggesting a much more hesitant recovery than had previously been thought. a common technical definition of a recession is two successive quarters of negative growth.  the government was keen to play down the worrying implications of the data.  i maintain the view that japan  economy remains in a minor adjustment phase in an upward climb, and we will monitor developments carefully,  said economy minister heizo takenaka. but in the face of the strengthening yen making exports less competitive and indications of weakening economic conditions ahead, observers were less sanguine.  it  painting a picture of a recovery... much patchier than previously thought,  said paul sheard, e

> ### The summary of the text is generated below.
> ### **`truncation=True`** is set because some articles are long enough to exceed the model's maximum input length of 1024 tokens, which would otherwise raise an index error
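
Conceptually, truncation keeps the first `max_length` tokens of an input and drops the rest. The sketch below illustrates that idea only: whitespace splitting stands in for the real subword tokenizer (an assumption), and `truncate_tokens` is a hypothetical helper, not part of `transformers`. The real tokenizer works on subword IDs and also accounts for special tokens such as the end-of-sequence marker.

```python
def truncate_tokens(tokens, max_length):
    """Keep only the first `max_length` tokens and drop everything after them."""
    return tokens[:max_length]

# Build an artificial 1500-"token" text; whitespace splitting is a stand-in
# for the model's actual tokenizer.
long_text = " ".join(f"word{i}" for i in range(1500))
tokens = long_text.split()

kept = truncate_tokens(tokens, max_length=1024)
print(len(tokens), len(kept))  # 1500 1024
print(kept[-1])                # word1023
```

One consequence is that anything an article says after the cut-off cannot influence the summary.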

### Generating a summary using the Pegasus model

In [23]:
from transformers import pipeline

summarizer = pipeline("summarization", model = PEGASUS_MODEL, truncation = True)
summary_txt = summarizer(example_text)

summary_txt

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'summary_text': 'revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter .<n>on an annual basis, the data suggests annual growth of just 0.2% .<n>The government was keen to play down the worrying implications of the data .'}]

### Referring to the actual summary from the dataset

In [24]:
ref_txt = cleaned_news_summary_ds["Summary"][EXAMPLE_TEXT_INDEX]

ref_txt

'on an annual basis, the data suggests annual growth of just 0.2%, suggesting a much more hesitant recovery than had previously been thought.a common technical definition of a recession is two successive quarters of negative growth.revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter.japan  economy teetered on the brink of a technical recession in the three months to september, figures show.'

## Evaluating the Model Performance

In [25]:
import evaluate

rouge = evaluate.load("rouge")

The ROUGE score for this summary is computed below.

In [26]:
summary_result = rouge.compute(predictions = [summary_txt[0]["summary_text"]],
                               references = [ref_txt], use_stemmer = True)
summary_result

{'rouge1': 0.5762711864406779,
 'rouge2': 0.4827586206896552,
 'rougeL': 0.3728813559322034,
 'rougeLsum': 0.3728813559322034}
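
To give a feel for what these numbers measure, here is a simplified sketch of a ROUGE-1 F-score: the overlap of single words between a prediction and a reference, combined from precision and recall. It assumes plain whitespace tokenization and no stemming, so it will not reproduce the values from `rouge.compute(..., use_stemmer = True)` exactly. `rouge1_f1` and the two sample strings are hypothetical and used only here.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F-score based on whitespace-separated unigrams."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each word counts at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted words found in the reference
    recall = overlap / len(ref_tokens)      # share of reference words covered by the prediction
    return 2 * precision * recall / (precision + recall)

pred = "the economy grew by 0.1%"
ref = "the economy grew by just 0.1% last quarter"
print(round(rouge1_f1(pred, ref), 3))  # 0.769
```

ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L uses the longest common subsequence instead of fixed-size n-grams.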

## Collecting all article texts and summaries as lists

In [27]:
article_texts = cleaned_news_summary_ds["Article"]

article_summaries = cleaned_news_summary_ds["Summary"]

Generating zero-shot summaries for the first 50 rows only, since summarizing the whole dataset would take a very long time.

In [28]:
from tqdm import tqdm # To show progress bar

candidate_summaries = []

for text in tqdm(article_texts[:50]):
    candidate = summarizer(text)
    candidate_summaries.append(candidate[0]["summary_text"])

100%|██████████| 50/50 [29:49<00:00, 35.80s/it]
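
The progress bar makes the cost of a full run easy to estimate. The short calculation below uses only two figures from this notebook (about 35.80 seconds per article and 2225 rows) and assumes the per-article time stays roughly constant across the dataset.

```python
seconds_per_article = 35.80  # taken from the progress bar above
num_articles = 2225          # number of rows in the dataset

total_seconds = seconds_per_article * num_articles
total_hours = total_seconds / 3600
print(f"{total_hours:.1f} hours")  # 22.1 hours
```

At that rate the full dataset would need roughly a day of compute, which is why only the first 50 rows are summarized here.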


The aggregated ROUGE scores are computed below.

In [29]:
result_agg = rouge.compute(predictions = candidate_summaries, references = article_summaries[:50],
                           use_stemmer = True)
result_agg

{'rouge1': 0.3755433863144791,
 'rouge2': 0.2802476305511823,
 'rougeL': 0.28773605451884543,
 'rougeLsum': 0.2880554842899956}

In [30]:
result_unagg = rouge.compute(predictions = candidate_summaries, references = article_summaries[:50],
                             use_stemmer = True, use_aggregator = False)

The indices of the maximum and minimum rougeLsum scores identify the best and worst summaries.

In [31]:
import numpy as np

result_unagg_rsum = np.array(result_unagg["rougeLsum"])

np.argmax(result_unagg_rsum), np.argmin(result_unagg_rsum)

(15, 48)

> #### The row at index 15 got the best summary and the row at index 48 got the worst summary (by rougeLsum)

### Building a dataframe of predicted vs. reference summaries

In [32]:
act_vs_pred_summaries_df = pd.DataFrame(list(zip(candidate_summaries, article_summaries[:50])),
                                        columns = ["Predicted_Summaries", "Reference_summaries"])
act_vs_pred_summaries_df.head()

| | Predicted_Summaries | Reference_summaries |
|---|---|---|
| 0 | Fourth quarter sales rose 2% to $11.1bn from $... | timewarner said fourth quarter sales rose 2% t... |
| 1 | The dollar has hit its highest level against t... | the dollar has hit its highest level against t... |
| 2 | Russian state-owned rosneft bought the yugansk... | yukos' owner menatep group says it will ask ro... |
| 3 | british airways blames high fuel prices for a ... | rod eddington, ba chief executive, said the r... |
| 4 | allied domecq shares in London rose 4% by 1200... | pernod has reduced the debt it took on to fund... |


Looking at the predicted and reference summaries for the highest- and lowest-scoring cases.

In [34]:
print("Predicted Summary")
print(act_vs_pred_summaries_df.loc[15, "Predicted_Summaries"])

print("Reference Summary")
print(act_vs_pred_summaries_df.loc[15, "Reference_summaries"])

Predicted Summary
curbs were introduced earlier this year to ward off the risk that rapid expansion might lead to soaring prices .<n>growth in china remains at a breakneck 9.1%, and corporate investment is growing at more than 25% a year .<n>Government has a 7% growth target, but continues to insist that the overshoot does not mean a hard landing in the shape of an overbalancing economy .
Reference Summary
the breakneck pace of economic expansion has kept growth above 9% for more than a year.rapid tooling-up of china  manufacturing sector means a massive demand for energy - one of the factors which has kept world oil prices sky-high for most of this year.growth in china remains at a breakneck 9.1%, and corporate investment is growing at more than 25% a year.the curbs were introduced earlier this year to ward off the risk that rapid expansion might lead to soaring prices.in theory, the government has a 7% growth target, but continues to insist that the overshoot does not mean a  hard landi

In [35]:
print("Predicted Summary")
print(act_vs_pred_summaries_df.loc[48, "Predicted_Summaries"])

print("Reference Summary")
print(act_vs_pred_summaries_df.loc[48, "Reference_summaries"])

Predicted Summary
200 new jobs to be created at the oxford factory, including modernised machinery and a new body shell production building .<n>The rise, from 189,000 last year, is a response to rapidly-rising demand and could help wipe out waiting lists .
Reference Summary
less than four years after the new mini was launched, german car maker bmw has announced â£100m of new investment.last year, almost one in six cars sold by the bmw group was a mini.initially, bmw said it would produce 100,000 mini models a year at its vast cowley factory on the outskirts of oxford, but the target was quickly reached, then raised, time and time again.when it was launched, the cheapest mini cost just more than â£10,000.these days, buyers will have to fork out almost â£11,500 to own a new mini one, or even more for the cooper s which costs up to â£17,730.the mini convertible, which was launched last spring, costs up to â£15,690 for the top model, and there is even a waiting list.last year, mg rover, which

## **Using BART**
***
### BART is a sequence-to-sequence model with a bidirectional encoder and an autoregressive decoder
### Because the encoder is bidirectional, it attends to the entire input text at once rather than reading it word by word
### The autoregressive decoder generates the summary one token at a time, conditioning on the tokens it has already generated
### This combination makes BART well suited to text summarization
***

### Loading a `BartTokenizer` to process `text` and `summary`:

In [36]:
from transformers import BartTokenizer, BartForConditionalGeneration

The BART tokenizer and the `bart-large-cnn` model are instantiated.

In [37]:
BART_MODEL = "facebook/bart-large-cnn"

tokenizer = BartTokenizer.from_pretrained(BART_MODEL)
model = BartForConditionalGeneration.from_pretrained(BART_MODEL)

In [38]:
tokenizer

BartTokenizer(name_or_path='facebook/bart-large-cnn', vocab_size=50265, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True),
}

In [39]:
model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
    

### The summary of the text is generated below.
### **`truncation=True`** is set because some articles are long enough to raise an index error

In [40]:
example_text

'japan narrowly escapes recession  japan  economy teetered on the brink of a technical recession in the three months to september, figures show.  revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter. on an annual basis, the data suggests annual growth of just 0.2%, suggesting a much more hesitant recovery than had previously been thought. a common technical definition of a recession is two successive quarters of negative growth.  the government was keen to play down the worrying implications of the data.  i maintain the view that japan  economy remains in a minor adjustment phase in an upward climb, and we will monitor developments carefully,  said economy minister heizo takenaka. but in the face of the strengthening yen making exports less competitive and indications of weakening economic conditions ahead, observers were less sanguine.  it  painting a picture of a recovery... much patchier than previously thought,  said paul sheard, e

In [41]:
summarizer = pipeline("summarization", model = BART_MODEL, truncation = True)

summary_txt = summarizer(example_text)

summary_txt

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'summary_text': ' revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter. on an annual basis, the data suggests annual growth of 0.2%, suggesting a much more hesitant recovery than had previously been thought. A common technical definition of a recession is two successive quarters of negative growth.'}]

In [42]:
ref_txt

'on an annual basis, the data suggests annual growth of just 0.2%, suggesting a much more hesitant recovery than had previously been thought.a common technical definition of a recession is two successive quarters of negative growth.revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter.japan  economy teetered on the brink of a technical recession in the three months to september, figures show.'

The ROUGE score for this summary from the BART model is computed below.

In [43]:
summary_result = rouge.compute(predictions = [summary_txt[0]["summary_text"]],
                               references = [ref_txt], use_stemmer = True)
summary_result

{'rouge1': 0.8503937007874016,
 'rouge2': 0.816,
 'rougeL': 0.5826771653543307,
 'rougeLsum': 0.5826771653543307}

### Generating zero-shot summaries for the first 50 rows only, since summarizing the whole dataset would take a very long time

In [44]:
candidate_summaries = []

for text in tqdm(article_texts[:50]):
    candidate = summarizer(text)
    candidate_summaries.append(candidate[0]["summary_text"])

100%|██████████| 50/50 [14:01<00:00, 16.83s/it]


### Computing the aggregated ROUGE scores

In [45]:
result_agg = rouge.compute(predictions = candidate_summaries, references = article_summaries[:50],
                           use_stemmer = True)
result_agg

{'rouge1': 0.4043996044852478,
 'rouge2': 0.29212576149903624,
 'rougeL': 0.29742628172956975,
 'rougeLsum': 0.29655007960610336}

> ### BART scores higher than Pegasus on all four ROUGE metrics for these 50 articles
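
The size of the gap can be read off the two aggregated results. The sketch below copies the scores reported in this notebook, rounded to four decimals, and prints the difference per metric. The dictionary names `pegasus` and `bart` are chosen here for illustration.

```python
# Aggregated scores from the two runs above, rounded to four decimals.
pegasus = {"rouge1": 0.3755, "rouge2": 0.2802, "rougeL": 0.2877, "rougeLsum": 0.2881}
bart = {"rouge1": 0.4044, "rouge2": 0.2921, "rougeL": 0.2974, "rougeLsum": 0.2966}

# Positive differences mean BART scored higher on that metric.
differences = {name: bart[name] - pegasus[name] for name in pegasus}

for name, diff in differences.items():
    print(f"{name}: {diff:+.4f}")
```

The differences are between roughly one and three ROUGE points and come from 50 articles only, so they are best read as an indication rather than a firm ranking.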

### Computing per-example ROUGE scores

In [46]:
result_unagg = rouge.compute(predictions = candidate_summaries, references = article_summaries[:50],
                             use_stemmer = True, use_aggregator = False)

The indices of the maximum and minimum rougeLsum scores identify the best and worst summaries.

In [47]:
result_unagg_rsum = np.array(result_unagg["rougeLsum"])

np.argmax(result_unagg_rsum), np.argmin(result_unagg_rsum)

(5, 3)

> #### The row at index 5 got the best summary and the row at index 3 got the worst summary (by rougeLsum)

A dataframe of predicted vs. reference summaries is built below.

In [48]:
act_vs_pred_summaries_df = pd.DataFrame(list(zip(candidate_summaries, article_summaries[:50])),
                                        columns = ["Predicted_Summaries", "Reference_summaries"])
act_vs_pred_summaries_df.head()

| | Predicted_Summaries | Reference_summaries |
|---|---|---|
| 0 | Timewarner profits up 76% to $1.13bn for the t... | timewarner said fourth quarter sales rose 2% t... |
| 1 | Dollar hits its highest level against the euro... | the dollar has hit its highest level against t... |
| 2 | State-owned rosneft bought the yugansk unit fo... | yukos' owner menatep group says it will ask ro... |
| 3 | High fuel prices hit ba profits british airw... | rod eddington, ba chief executive, said the r... |
| 4 | Pernod ricard said it was seeking acquisitions... | pernod has reduced the debt it took on to fund... |


Looking at the predicted and reference summaries for the highest- and lowest-scoring cases.

In [50]:
print("Predicted Summary")
print(act_vs_pred_summaries_df.loc[5, "Predicted_Summaries"])

print("Reference Summary")
print(act_vs_pred_summaries_df.loc[5, "Reference_summaries"])

Predicted Summary
 revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter. on an annual basis, the data suggests annual growth of 0.2%, suggesting a much more hesitant recovery than had previously been thought. A common technical definition of a recession is two successive quarters of negative growth.
Reference Summary
on an annual basis, the data suggests annual growth of just 0.2%, suggesting a much more hesitant recovery than had previously been thought.a common technical definition of a recession is two successive quarters of negative growth.revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter.japan  economy teetered on the brink of a technical recession in the three months to september, figures show.


In [51]:
print("Predicted Summary")
print(act_vs_pred_summaries_df.loc[3, "Predicted_Summaries"])

print("Reference Summary")
print(act_vs_pred_summaries_df.loc[3, "Reference_summaries"])

Predicted Summary
High fuel prices hit ba  profits  british airways has blamed high fuel prices for a 40% drop in profits. The airline made a pre-tax profit of â£75m ($141m) compared with â £125m a year earlier. ba last year introduced a fuel surcharge for passengers to help offset the increased price of aviation fuel.
Reference Summary
rod eddington, ba  chief executive, said the results were  respectable  in a third quarter when fuel costs rose by â£106m or 47.3%.to help offset the increased price of aviation fuel, ba last year introduced a fuel surcharge for passengers.ba had previously forecast a 2% to 3% rise in full-year revenue. it is quite good on the revenue side and it shows the impact of fuel surcharges and a positive cargo development, however, operating margins down and cost impact of fuel are very strong,  he said.yet aviation analyst mike powell of dresdner kleinwort wasserstein says ba  estimated annual surcharge revenues - â£160m - will still be way short of its additiona

## **Conclusion:**

> ### BART performs better than Pegasus in this zero-shot comparison, and fine-tuning it on this dataset is likely to improve the results further