**BART Trained on XL SUM**

**Packages**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import math

from datasets import load_dataset
import evaluate

import inspect

#let's make longer output readable without horizontal scrolling
from pprint import pprint

import warnings

import os
import re
import time

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# These auto classes load the right type of tokenizer and model based on a model name
from transformers import AutoTokenizer, TFAutoModel
from transformers import pipeline
from transformers import AutoModel



**Necessary Functions**

In [2]:
rouge = evaluate.load('rouge')

In [3]:
chrf = evaluate.load("chrf")

In [4]:
def get_default_args(func):
    signature = inspect.signature(func)
    return {
        k: v.default
        for k, v in signature.parameters.items()
        if v.default is not inspect.Parameter.empty
    }

In [23]:
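def data_organize(sample_index):
    """Return (articles, summaries) for the sampled rows of the XL-Sum train split."""
    article = []
    summary = []

    for i in sample_index["index"]:
        row = dataset["train"][i]
        # keep at most the first 128 space-separated words of each reference summary
        summary.append(" ".join(row['summary'].split(" ")[:128]))
        article.append(row['text'])

    return article, summary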
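
The summary truncation inside `data_organize` can be looked at in isolation. The sketch below is a minimal illustration, assuming a hypothetical helper name `truncate_words` that is not part of the notebook; it applies the same `split(" ")[:128]` rule to a single string.

```python
def truncate_words(text, max_words=128):
    """Keep at most max_words space-separated words (hypothetical helper, not in the notebook)."""
    return " ".join(text.split(" ")[:max_words])


# a short string shows the effect of a small limit
print(truncate_words("one two three four five", max_words=3))
```

Because the split is on a single space, runs of consecutive spaces produce empty strings that count toward the limit; `text.split()` would avoid that if exact word counts matter.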

**Data**

In [6]:
dataset = load_dataset("csebuetnlp/xlsum", "english")

Found cached dataset xlsum (/home/ubuntu/.cache/huggingface/datasets/csebuetnlp___xlsum/english/2.0.0/518ab0af76048660bcc2240ca6e8692a977c80e384ffb18fdddebaca6daebdce)


**Sampling the Data**

**Train, Val, and Test sets for all XL Sum**

In [7]:
index = pd.DataFrame({"index": list(range(len(dataset['train'])))})
sample_index = index.sample(n=2000, replace=False, random_state=1004)
sample_index[:5]

Unnamed: 0,index
235420,235420
172024,172024
253546,253546
224954,224954
214134,214134


In [24]:
article, summary = data_organize(sample_index)

In [25]:
d = {'text': article[:1000],  'summary': summary[:1000]}
df = pd.DataFrame(data=d)
df.to_csv('../w266_project/xl_sum_sample_train.csv', index = False)
#df.head(5)

In [26]:
d = {'text': article[1000:1100],  'summary': summary[1000:1100]}
df = pd.DataFrame(data=d)
df.to_csv('../w266_project/xl_sum_sample_val.csv', index = False)
df.head(5)

Unnamed: 0,text,summary
0,Anthony ZurcherNorth America reporter@awzurche...,On day three of public hearings in the impeach...
1,It made a net profit of $281m (£185m) in the t...,"Yum Brands, owner of KFC and Pizza Hut restaur..."
2,Police sources told local media that the boy h...,Four members of the same family have been arre...
3,Zelda Perkins told the Financial Times she sig...,A British former assistant of Harvey Weinstein...
4,Bus workers walked out on Monday over changes ...,Bus drivers in Jersey have agreed to meet with...


In [27]:
d = {'text': article[1100:1200],  'summary': summary[1100:1200]}
df = pd.DataFrame(data=d)
df.to_csv('../w266_project/xl_sum_sample_test.csv', index = False)
df.head(5)

Unnamed: 0,text,summary
0,"In a statement, the White House said more than...",Hackers who breached US government networks st...
1,Fourteen fire engines were sent to tackle the ...,"A ""major"" fire on Bognor Regis seafront has be..."
2,The 240-turbine Atlantic Array would be double...,The European Commission (EC) has launched an i...
3,It is believed the patient contracted the infe...,A patient has been diagnosed with the rare vir...
4,By Chris JohnstonBusiness reporter The rise in...,UK inflation will quadruple to about 4% in the...
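
The three cells above draw 2000 row indices without replacement and then cut the first 1200 into consecutive slices of 1000, 100 and 100 rows, so the train, validation and test files cannot share an article. A minimal sketch of that pattern follows, assuming a hypothetical helper `split_indices` and using the standard library's `random` module rather than `DataFrame.sample`, so the indices it picks will differ from the notebook's.

```python
import random


def split_indices(n_rows, sizes=(1000, 100, 100), seed=1004):
    """Sample indices without replacement, then cut consecutive non-overlapping slices."""
    rng = random.Random(seed)
    picked = rng.sample(range(n_rows), sum(sizes))  # no index appears twice
    splits, start = [], 0
    for size in sizes:
        splits.append(picked[start:start + size])
        start += size
    return splits


# n_rows=5000 is an arbitrary illustrative size, not the real dataset length
train_idx, val_idx, test_idx = split_indices(5000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Sampling only as many indices as the splits need (1200 here) gives the same guarantee as sampling 2000 and discarding the last 800.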


**Train, Val, and Test sets for each category**

In [None]:
def find_indices(list_to_check, item_to_find):
    indices = []
    for idx, value in enumerate(list_to_check):
        if value == item_to_find:
            indices.append(idx)
    return indices

In [None]:
categories = []

for i in range(len(dataset['train'])):
    cat = dataset['train'][i]['id']
    result = re.sub(r'\d', '', cat)[:-1]  # drop the digits, then the trailing separator
    result = result.split('-')[0].split('.')[0]
    categories.append(result)
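
The category rule in the loop above can be wrapped in a small function and tried on a few strings. This is a sketch only: the function name `category_from_id` is hypothetical, and the example IDs are made up to resemble a `section-number` pattern; they are not taken from the dataset.

```python
import re


def category_from_id(article_id):
    """Mirror the notebook's rule: drop digits, drop the last character, keep the first segment."""
    no_digits = re.sub(r"\d", "", article_id)[:-1]
    return no_digits.split("-")[0].split(".")[0]


# hypothetical IDs, for illustration only
for example in ["uk-12345678", "world-europe-12345678"]:
    print(example, "->", category_from_id(example))
```

The `[:-1]` step assumes every ID ends in digits preceded by a separator. For an ID that does not, it would remove a real character, so it may be worth checking the distinct values in `categories` before sampling from them.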

**Category 1: uk**

In [None]:
uk = find_indices(categories, 'uk')
index = pd.DataFrame({"index": uk})
sample_index = index.sample(n=2000, replace=False, random_state=1004)

article, summary = data_organize(sample_index)

d = {'text': article[:1000],  'summary': summary[:1000]}
df = pd.DataFrame(data=d)
df.to_csv('xl_sum_sample_train_uk.csv', index = False)
df.head(5)

d = {'text': article[1000:1100],  'summary': summary[1000:1100]}
df = pd.DataFrame(data=d)
df.to_csv('xl_sum_sample_val_uk.csv', index = False)
df.head(5)

d = {'text': article[1100:1200],  'summary': summary[1100:1200]}
df = pd.DataFrame(data=d)
df.to_csv('xl_sum_sample_test_uk.csv', index = False)
df.head(5)

**Training the Model**

https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization.py

https://www.databricks.com/blog/2023/03/20/fine-tuning-large-language-models-hugging-face-and-deepspeed.html

**All Categories**

All Categories: Model 0

In [None]:
!python3 transformers/examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path=facebook/bart-base \
    --do_train \
    --do_eval \
    --train_file 'w266_project/xl_sum_sample_train.csv' \
    --validation_file 'w266_project/xl_sum_sample_val.csv' \
    --text_column text \
    --summary_column summary \
    --push_to_hub=True \
    --max_source_length 128 \
    --max_target_length 32 \
    --num_train_epochs 1 \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=32 \
    --output_dir='finetuned-BART-all-categories' \
    --overwrite_output_dir=True \
    --predict_with_generate 

In [None]:
# Push the newly trained model to the Hugging Face Hub after running the previous cell; otherwise this cell loads an older checkpoint.

summarizer = pipeline("summarization", model="arisanguyen/finetuned-BART-all-categories")



In [35]:
df = pd.read_csv('../w266_project/xl_sum_sample_val.csv')
df.head(5)

Unnamed: 0,text,summary
0,Anthony ZurcherNorth America reporter@awzurche...,On day three of public hearings in the impeach...
1,It made a net profit of $281m (£185m) in the t...,"Yum Brands, owner of KFC and Pizza Hut restaur..."
2,Police sources told local media that the boy h...,Four members of the same family have been arre...
3,Zelda Perkins told the Financial Times she sig...,A British former assistant of Harvey Weinstein...
4,Bus workers walked out on Monday over changes ...,Bus drivers in Jersey have agreed to meet with...


In [None]:
bart_r1 = []
bart_r2 = []
bart_rL = []
bart_rLs = []
bart_chrf = []

for i in range(len(df)):

    candidate = summarizer(df['text'][i],
                           truncation = True, # truncate the input to the model's maximum input length (1024 tokens for BART), not 1024 words
                             #max_length=130, min_length=30, do_sample=False
                            )[0]
    candidate = [candidate['summary_text']]
    #pprint(candidate[0], compact=True)
    
    ref = [df['summary'][i]]
    
    results = rouge.compute(predictions=candidate,
                            references=ref)
    
    bart_r1.append(results['rouge1'])
    bart_r2.append(results['rouge2'])
    bart_rL.append(results['rougeL'])
    bart_rLs.append(results['rougeLsum'])
    
    results = chrf.compute(predictions=candidate,
                            references=ref)
    
    bart_chrf.append(results['score'])
    
    # checkpoint the scores every 100 articles so an interrupted run keeps partial results
    if i % 100 == 0:
        data = {'rouge1': bart_r1, 'rouge2': bart_r2, 'rougeL': bart_rL, 'rougeLsum': bart_rLs, 'chrf': bart_chrf}
        scores = pd.DataFrame(data)
        scores.to_csv('BART_trained_0_scores.csv', index=False)
        print(i)

data = {'rouge1': bart_r1, 'rouge2': bart_r2, 'rougeL': bart_rL, 'rougeLsum': bart_rLs, 'chrf': bart_chrf}
scores = pd.DataFrame(data)
scores.to_csv('BART_trained_0_scores.csv', index=False)

In [37]:
print('Last Article', df['text'][i])
print('Last Reference Summary', ref)
print('Last Candidate Summary', candidate)

Last Article She tweeted that the clothes would be "archived & expertly cared for in the spirit & love of Michael Jackson, his bravery, & fans worldwide". The auction included a jacket worn during Jackson's Bad tour, that went for $240,000 (£148,000) and two crystal gloves. The items were all made by designers Dennis Tompkins and Michael Bush. Lady Gaga also tweeted a picture of herself and her bidding paddle at the auction. More than $5m (£3.1m) was raised by the sale, according to LA-based Julien's Auctions. Other items that went under the hammer included jackets from Michael Jackson's Dangerous and Thriller tours and a pair of jeans that went for $50,000 (£31,000). Some of the money raised by the auction is being donated to a guide dogs charity and a hospice in Las Vegas. American costume designers Michael Bush and Dennis Tompkins created thousands of original pieces for Michael Jackson during his career. However, despite Lady Gaga's assurances, some fans expressed their anger onlin

In [38]:
print('rouge1 average :', np.mean(bart_r1))
print('rouge2 average :', np.mean(bart_r2))
print('rougeL average :', np.mean(bart_rL))
print('rougeLs average :', np.mean(bart_rLs))
print('chrf average :', np.mean(bart_chrf))

rouge1 average : 0.3192639042917483
rouge2 average : 0.10466960084268205
rougeL average : 0.24501046685959502
rougeLs average : 0.24501046685959502
chrf average : 27.774860738085867
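
To make the ROUGE-1 average above easier to interpret, here is a simplified sketch of a unigram-overlap F1 score. It is not the implementation behind `evaluate.load('rouge')`: it assumes lowercasing and whitespace tokenization only, with no stemming or punctuation handling, so its numbers will not match the library exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # clipped overlap: each word counts at most as often as it occurs in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# made-up sentences, for illustration only
print(rouge1_f1("the cat", "the cat sat down"))
```

In this example every predicted word appears in the reference (precision 1.0) but only half of the reference words are covered (recall 0.5), which is why a short but accurate summary can still score well below 1.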
