## 410 Final Project: Generating Summaries for News Articles
Aaron Kuhstoss, Shalin Mehta, and Aleksandra Grigortsuk

### Imports

In [42]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from langdetect import detect
import random 

from rouge import Rouge
from bert_score import score

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

 ### Data Preprocessing
 

In [43]:
# Import the dataset
df = pd.read_csv("Latest_News.csv")
print(len(df))

# Filtering dataset with English articles and non-NA 
def detect_language(text):
    try:
        return detect(text)
    except:
        return None

# Subset df to make it workable, since python is computationally slow
random.seed(410)
df_subset = df.sample(n=15000)

# Apply the language detection function to df
df_subset['detected_language'] = df_subset['content'].apply(detect_language)
english_articles = df_subset[df_subset['detected_language'] == 'en']

86560


In [44]:
# Print the total number of articles in the 'english_articles' DataFrame
print(len(english_articles))

# Calculate and print the average length of the 'content' field in the 'english_articles' DataFrame
print(english_articles['content'].str.len().mean())

# Calculate and print the average length of the 'description' field in the 'english_articles' DataFrame
print(english_articles['description'].str.len().mean())

# Count and print the number of missing (null) values in the 'content' field of the 'english_articles' DataFrame
print(english_articles['content'].isnull().sum())

# Count and print the number of missing (null) values in the 'title' field of the 'english_articles' DataFrame
print(english_articles['title'].isnull().sum())

# Display the first few rows of the 'english_articles' DataFrame to get a quick overview of the data
(english_articles.head())

1213
2222.113767518549
233.90956979806847
0
0


Unnamed: 0,title,link,keywords,creator,video_url,description,content,pubDate,full_description,image_url,source_id,detected_language
41141,Analyzing How Injury To Zach Wilson Will Alter...,https://www.forbes.com/sites/jppelzman/2021/10...,"['SportsMoney', '/sportsmoney', 'Business', '/...","['J.P. Pelzman', ' Senior Contributor']",,This year for the Jets was supposed to be all ...,This year for the Jets was supposed to be all ...,2021-10-25 10:03:16,Share to Facebook Share to Twitter Share to Li...,https://thumbor.forbes.com/thumbor/fit-in/0x0/...,forbes,en
7771,"Lil Bean Comes Through With ""Care Package For ...",https://www.hotnewhiphop.com/lil-bean-comes-th...,,['Aron A.'],,Lil Bean drops off a brand new package to foll...,Lil Bean has been a buzzing force out of the B...,2021-10-26 01:14:29,Lil Bean has been a buzzing force out of the B...,https://www.hotnewhiphop.com/image/620x412c/co...,realhotnewhiphop,en
38551,Why It Makes Sense to Buy Apple Stock Right Now,https://www.fool.com/investing/2021/10/25/why-...,,['newsfeedback@fool.com (Harsh Chauhan)'],,The iPhone maker could soon regain its mojo.,Apple (NASDAQ: AAPL) will release its fiscal 2...,2021-10-25 11:02:00,Apple ( NASDAQ:AAPL ) will release its fiscal ...,,fool,en
75968,"Moyes warns Man Utd, Chelsea as Rice value gro...",https://www.teamtalk.com/west-ham-united/moyes...,"['English Premier League', 'All The News', 'Pr...",['Neil Foster'],https://content.jwplatform.com/videos/Exe5Nc4M...,The midfielder may leave the London Stadium at...,West Ham United boss David Moyes has made it c...,2021-10-24 14:04:43,West Ham United boss David Moyes has made it c...,,teamtalk,en
22286,Six reasons why you may be experiencing freque...,https://www.hola.com/us/lifestyle/202110253058...,,['Shirley Gómez'],,"Leg cramps are very common, and probably you h...","Leg cramps are very common, and probably you h...",2021-10-25 17:06:34,"By Shirley Gómez - New York October 25, 2021 1...",https://www.hola.com/us/images/026e-139072af71...,hola,en


### Pipeline Construction
##### Summarization pipeline

A detailed description of what the function does and how it achives its summaries is found in the comments:

In [45]:
def summarize(text, per):
    # Load the English language model from Spacy
    nlp = spacy.load('en_core_web_sm')

    # Process the input text and tokenize it
    doc = nlp(text)
    tokens = [token.text for token in doc]

    # Initialize a dictionary to hold word frequencies
    word_frequencies = {}

    # Calculate the frequency of each word in the text, excluding stopwords and punctuation
    for word in doc:
        if word.text.lower() not in list(STOP_WORDS):
            if word.text.lower() not in punctuation:
                if word.text not in word_frequencies.keys():
                    word_frequencies[word.text] = 1
                else:
                    word_frequencies[word.text] += 1

    # Normalize word frequencies by dividing each by the maximum frequency
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies.keys():
        word_frequencies[word] = word_frequencies[word] / max_frequency

    # Break the text into sentences
    sentence_tokens = [sent for sent in doc.sents]
    sentence_scores = {}

    # Score sentences based on the frequency of the words they contain
    for sent in sentence_tokens:
        for word in sent:
            if word.text.lower() in word_frequencies.keys():
                if sent not in sentence_scores.keys():                            
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]

    # Determine the number of sentences to include in the summary
    select_length = int(len(sentence_tokens) * per)

    # Select the sentences with the highest scores
    summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

    # Combine the selected sentences into a final summary
    final_summary = [word.text for word in summary]
    summary = ''.join(final_summary)

    # Return the summarized text
    return summary

### Creating Summaries

The first cell directly below prints out 10 summaries from the summaries list using our algorithm. It tends to be a little time consuming (taking about 10 minutes), but does indeed create the summaries for each article in our pre-preprocessed data. 

The second cell selectively generates summaries and stores them along with the corresponding articles but only does so for 5 articles starting from the 500th article. This makes the generation time alot quicker, and is what we use in our model evaluation.

Both cells use a summary ratio of 0.1 for the creation of the summaries. 

In [46]:
# Initialize an empty list to store the generated summaries
summaries_1 = []

# Iterate over each article in the 'english_articles' DataFrame
for i in range(len(english_articles)):
    # Generate a summary of the article's content, summarized to 10% of its original length
    summary = summarize(english_articles.iloc[i,6], 0.1)

    # If a summary is successfully generated, add it to the summaries list
    if summary:
        summaries_1.append(summary)

# Retrieve and display the first 10 summaries from the generated list
summaries_1[:10]

['West Ham United boss David Moyes has made it clear that any interested suitors will now need to pay well in excess of £100m if they want to sign star midfielder Declan Rice.But Moyes has now stressed that the youngster’s value has skyrocketed and any thoughts of ‘just’ £100m are over.',
 'RELATED:Plant-based diets: 5 plants that have more protein than meat5 Holistic ways to promote a better night sleepNutritionist’s top four ways to incorporate more vegetables into your diet Low Sodium Levels We tend to vilify salt, believing it will cause high blood pressure and heart disease.GettyImages “Often muscle cramps occur during rest and can awaken you due to the pain and spasming of the muscle affected, often in your legs or feet.Aging With age can come decreased muscle mass which can lead to muscle cramps.',
 'Image Source: The Verge Davison did not explicitly name the types of links that wouldn’t be allowed, but he suggested links to OnlyFans wouldn’t be accepted because porn links are b

In [47]:
# Initialize lists to store generated summaries and corresponding raw articles
summaries_2 = []
raw_articles = []

# Start at the 500th article in the 'english_articles' DataFrame
index = 500

# Continue generating summaries until we have 5 of them
while len(summaries_2) < 5:
    # Retrieve the article at the current index
    article = english_articles.iloc[index,:]
    # Extract the content of the article
    article_content = english_articles.iloc[index,6]

    # Generate a summary of the article content, summarized to 10% of its original length
    summary = summarize(article_content, 0.1)

    # If a summary is successfully generated, add it to the summaries list and the corresponding article to the raw articles list
    if summary:
        summaries_2.append(summary)
        raw_articles.append(article)

    # Move to the next article
    index += 1

### Model Evaluation

In [48]:
"""
NOTE: A fundamental issue with using ROUGE and BLEU metrics is their dependence on high-quality reference summaries.
Using the "description" field as a pseudo summary provides a good amount of reference data. However, the quality of these descriptions varies, and some are null.
Another approach is to write summaries by hand for comparison, but this is time-consuming and limits the data points.
"""

# Obtain original descriptions for use as reference "summaries"
descriptions = [row.iloc[5] for row in raw_articles]

# Get raw content of articles for BERTscoring
orig_articles = [row.iloc[6] for row in raw_articles]

# Uncomment the below lines to display raw article content and descriptions
# print('Raw Articles:', '\n'.join(orig_articles))
# print('Descriptions:', '\n'.join(descriptions))

# Uncomment the below line to display summaries for qualitative evaluation
# print('Summaries:', '\n'.join(summaries))

# Hand-written summaries for evaluation
own_summaries = [
    "Bryan Cranstons portrayal of the morally gray character Walter White in the hit series Breaking Bad has been loved by fans. However, AMC had other choices for the lead, with Breaking Bad creator Vince Gilligan ultimately persuading executives to choose Cranston.",
    "Recently on TikTok, the term weaponized incompetence is gaining a lot of attention. According to psychotherapist and writer, Emily Mendez, M.S. EdS, “Weaponized incompetence refers to pretending not to know how to do something when you do really know how to do it.” The term has 21.8M views on TikTok as example, mostly of women, whose colleagues, partners, and family members use weaponized incompetence to get out of work.",
    "The All India Congress is going to launch a country wide protest from November 14 against the abnormal rise of fuel. The massive protest against the high fuel price will start from November 14 and will continue till November 29, after five consecutive days of rising fuel prices across the country.",
    "Numerous artists across multiple genres, such as Lil Nas X, Ariana Grande, and Olivia Rodrigo are entering songs for consideration in the upcoming Grammy award season. This includes Justin Bieber, whose smash hit “Peaches” (featuring Daniel Caesar and Giveon) is vying for a Grammy nomination as best R&B performance.",
    "Singer-songwriter Ed Sheeran announced Sunday he had tested positive for COVID-19 and would be self-isolating in his home five days before he is scheduled to release his fourth studio album. Sheeran's upcoming album, titled '=' is scheduled to be released on October 29."
]


##### To see the evaluation metrics we have for each of the 5 articles using BERT and ROUGE, please select the "scrollable element" option in the output below (because it is truncated by default)

##### We decided to compare our summaries using both metrics on our own hand-written samples as well as the description given in the dataset. 

In [49]:
# Function to evaluate summaries using Rouge and BERTscores
def eval_summaries(generated_summaries, hand_written_summaries, descriptions, content):
    """
    This function evaluates the generated summaries using ROUGE and BERTscores,
    comparing them with hand-written summaries and article descriptions.
    It prints the precision, recall, and F1 scores for each summary,
    indicating the type of reference used.
    """
    # Evaluate against hand-written summaries
    print("===== Evaluation Against Hand-Written Summaries =====")
    _evaluate_each_summary(generated_summaries, hand_written_summaries, content, "Hand-Written")

    # Evaluate against article descriptions
    print("\n===== Evaluation Against Article Descriptions =====")
    _evaluate_each_summary(generated_summaries, descriptions, content, "Article Description")

def _evaluate_each_summary(summaries, references, content, reference_type):
    # Initialize the ROUGE object
    rouge = Rouge()
    rouge_scores = rouge.get_scores(summaries, references)

    # Displaying ROUGE scores
    for index, score_set in enumerate(rouge_scores):
        print(f"--- Summary {index + 1} (Compared with {reference_type}) ---")
        print("ROUGE Scores:")
        for rouge_key, values in score_set.items():
            print(f"  {rouge_key.upper()}:")
            for metric, value in values.items():
                print(f"    {metric.capitalize()}: {value:.4f}")
        print()

    # Get BERTscore
    P, R, F1 = score(summaries, content, lang='en')

    # Display BERTscore results
    print("BERTscore Results:")
    for i in range(len(summaries)):
        print(f"Summary {i+1} (Compared with {reference_type}):")
        print(f"  Precision: {P[i].item():.4f}, Recall: {R[i].item():.4f}, F1: {F1[i].item():.4f}")
        print()

# Evaluating generated summaries
eval_summaries(summaries_2, own_summaries, descriptions, orig_articles)


===== Evaluation Against Hand-Written Summaries =====
--- Summary 1 (Compared with Hand-Written) ---
ROUGE Scores:
  ROUGE-1:
    R: 0.0811
    P: 0.1154
    F: 0.0952
  ROUGE-2:
    R: 0.0513
    P: 0.0645
    F: 0.0571
  ROUGE-L:
    R: 0.0811
    P: 0.1154
    F: 0.0952

--- Summary 2 (Compared with Hand-Written) ---
ROUGE Scores:
  ROUGE-1:
    R: 0.1636
    P: 0.1216
    F: 0.1395
  ROUGE-2:
    R: 0.0000
    P: 0.0000
    F: 0.0000
  ROUGE-L:
    R: 0.1273
    P: 0.0946
    F: 0.1085

--- Summary 3 (Compared with Hand-Written) ---
ROUGE Scores:
  ROUGE-1:
    R: 0.1892
    P: 0.0737
    F: 0.1061
  ROUGE-2:
    R: 0.0000
    P: 0.0000
    F: 0.0000
  ROUGE-L:
    R: 0.1892
    P: 0.0737
    F: 0.1061

--- Summary 4 (Compared with Hand-Written) ---
ROUGE Scores:
  ROUGE-1:
    R: 0.1556
    P: 0.1061
    F: 0.1261
  ROUGE-2:
    R: 0.0000
    P: 0.0000
    F: 0.0000
  ROUGE-L:
    R: 0.1556
    P: 0.1061
    F: 0.1261

--- Summary 5 (Compared with Hand-Written) ---
ROUGE Scores:
 

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTscore Results:
Summary 1 (Compared with Hand-Written):
  Precision: 0.9295, Recall: 0.7912, F1: 0.8548

Summary 2 (Compared with Hand-Written):
  Precision: 0.9433, Recall: 0.8597, F1: 0.8996

Summary 3 (Compared with Hand-Written):
  Precision: 0.9155, Recall: 0.8255, F1: 0.8682

Summary 4 (Compared with Hand-Written):
  Precision: 0.9129, Recall: 0.8313, F1: 0.8702

Summary 5 (Compared with Hand-Written):
  Precision: 0.8453, Recall: 0.8076, F1: 0.8260


===== Evaluation Against Article Descriptions =====
--- Summary 1 (Compared with Article Description) ---
ROUGE Scores:
  ROUGE-1:
    R: 0.1176
    P: 0.2308
    F: 0.1558
  ROUGE-2:
    R: 0.0536
    P: 0.0968
    F: 0.0690
  ROUGE-L:
    R: 0.0784
    P: 0.1538
    F: 0.1039

--- Summary 2 (Compared with Article Description) ---
ROUGE Scores:
  ROUGE-1:
    R: 0.5278
    P: 0.2568
    F: 0.3455
  ROUGE-2:
    R: 0.3500
    P: 0.1308
    F: 0.1905
  ROUGE-L:
    R: 0.5000
    P: 0.2432
    F: 0.3273

--- Summary 3 (Compared wit

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTscore Results:
Summary 1 (Compared with Article Description):
  Precision: 0.9295, Recall: 0.7912, F1: 0.8548

Summary 2 (Compared with Article Description):
  Precision: 0.9433, Recall: 0.8597, F1: 0.8996

Summary 3 (Compared with Article Description):
  Precision: 0.9155, Recall: 0.8255, F1: 0.8682

Summary 4 (Compared with Article Description):
  Precision: 0.9129, Recall: 0.8313, F1: 0.8702

Summary 5 (Compared with Article Description):
  Precision: 0.8453, Recall: 0.8076, F1: 0.8260

