# Automated News Summarizer

In [3]:
import numpy as np
import pandas as pd

# Read the article csv file
df = pd.read_csv('Articles.csv', encoding='latin-1')
df.head()

Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2692 entries, 0 to 2691
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Article   2692 non-null   object
 1   Date      2692 non-null   object
 2   Heading   2692 non-null   object
 3   NewsType  2692 non-null   object
dtypes: object(4)
memory usage: 84.2+ KB


In [5]:
art = df.loc[10, 'Article']
print(art)

TOKYO: Tokyo stocks opened 0.74 percent lower on Wednesday, hit by the yen´s rise and drops on Wall Street on worries about falling oil prices.The Nikkei 225 index at the Tokyo Stock Exchange lost 125.89 to 16,961.82 at the start.In New York on Tuesday the Dow Jones Industrial Average dropped 0.15 percent and the broad-based S&P 500 fell 0.26 percent, overshadowed by worries about sliding crude oil prices.The yen rose against other currencies on safe-haven buying, a negative for Japanese exporters as the stronger currency makes them less competitive abroad and erodes profits when repatriated.The dollar was at 117.72 yen early Wednesday, down from 117.90 yen in New York Tuesday afternoon and rates above 118 yen seen in Tokyo earlier Tuesday.The euro also fell after a key European central banker expressed support for monetary stimulus.The common European currency bought 138.69 yen and $1.1776 against 138.84 yen and $1.1777 in US trade.The ruble´s drop took a breather early Wednesday afte

In [6]:
import re
import nltk
nltk.download("punkt")
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\angel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
# Clean the article by collapsing runs of whitespace (including newlines) into single spaces
art_clean = re.sub(r'\s+', ' ', art).strip()
# print(art_clean)
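To make the effect of this cleaning step concrete, here is a minimal sketch on a made-up string. The sample text is an assumption for illustration and is not taken from the dataset.

```python
import re

# Hypothetical raw text with a newline, a tab, and repeated spaces
raw = "  TOKYO:   Tokyo stocks opened\nlower on\tWednesday.  "

# \s+ matches any run of whitespace; each run becomes a single space,
# and strip() removes the space left at the start and end
clean = re.sub(r'\s+', ' ', raw).strip()
print(clean)  # TOKYO: Tokyo stocks opened lower on Wednesday.
```

Note that this step only normalizes whitespace. It does not repair missing spaces between sentences (such as `prices.The` in the article printed above), so sentence splitting is left entirely to spaCy.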

In [8]:
# Use spacy to process the text
nlp = spacy.load("en_core_web_sm")
doc = nlp(art_clean)
# print(doc)

## Extractive Summarization

In [10]:
# Keep only sentences longer than 20 characters (drops fragments such as "TOKYO:")
sent_list = []
for s in doc.sents:
    s_clean = s.text.strip()
    print(len(s_clean))
    if len(s_clean) > 20:
        sent_list.append(s_clean)
print([len(a) for a in sent_list])

6
136
87
179
190
151
95
102
147
95
[136, 87, 179, 190, 151, 95, 102, 147, 95]


In [11]:
# Do extractive summarization using TF-IDF

# Initialize a TfidfVectorizer, which converts text into numerical vectors
tfidf_vectorizer = TfidfVectorizer()

# Represent the content of each sentence numerically in a matrix
X_tfidf = tfidf_vectorizer.fit_transform(sent_list)

# Calculate the mean TF-IDF vector across all sentences; this centroid represents the average content of the article
art_vector = np.array(X_tfidf.mean(axis=0))

# Score each sentence by its cosine similarity to the centroid to see how similar it is to the article as a whole
close_art = cosine_similarity(X_tfidf, art_vector)
close_art

array([[0.50927947],
       [0.3623672 ],
       [0.49025453],
       [0.41979145],
       [0.50663382],
       [0.32694746],
       [0.42468947],
       [0.45458952],
       [0.51975167]])
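The scoring idea can also be written out by hand, which may help show what the vectorizer and `cosine_similarity` are doing. The following is a simplified sketch, not scikit-learn's implementation: it uses whitespace tokenization, a smoothed IDF, and L2-normalized rows, so it should not be expected to reproduce the exact numbers above. The helper names (`tfidf_vectors`, `cosine`, `centroid_scores`) and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Simplified TF-IDF: term count times smoothed IDF, L2-normalized per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(sentences)
    # Document frequency: in how many sentences each word appears
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + doc_freq[w])) + 1 for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid_scores(sentences):
    """Score each sentence by its similarity to the mean TF-IDF vector."""
    vecs = tfidf_vectors(sentences)
    centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
    return [cosine(v, centroid) for v in vecs]

# Hypothetical toy sentences: two share vocabulary, one is off-topic
toy = ["stocks fell in tokyo", "stocks fell in new york", "the weather was sunny"]
scores = centroid_scores(toy)
print([round(s, 3) for s in scores])
```

In this toy case the off-topic third sentence receives the lowest score, because it shares no words with the other two and therefore contributes little overlap with the centroid.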

In [12]:
# Find the top 3 most relevant sentences
indices_sort = close_art.flatten().argsort()
top_3 = indices_sort[-3:][::-1]
top_3

array([8, 0, 4], dtype=int64)
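The same ranking can be checked without NumPy. This short sketch reuses the similarity scores printed above and shows an optional variation, restoring the original sentence order, which is not what the notebook does.

```python
# Similarity scores copied from the cosine_similarity output above
scores = [0.50927947, 0.3623672, 0.49025453, 0.41979145, 0.50663382,
          0.32694746, 0.42468947, 0.45458952, 0.51975167]

# Rank sentence indices from most to least similar, then keep the first three
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
top_3 = ranked[:3]
print(top_3)           # [8, 0, 4]

# Optional variation (an assumption, not the notebook's choice):
# present the selected sentences in the order they appear in the article
top_3_in_order = sorted(top_3)
print(top_3_in_order)  # [0, 4, 8]
```

Keeping the ranked order puts the highest-scoring sentence first, while article order may read more naturally because the lead sentence usually comes first.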

In [13]:
# Build summary using top 3 sentences
sum_list = []
for t in top_3:
    sum_list.append(sent_list[t])
sum_art = " ".join(sum_list)
print('Title:', df.loc[10, 'Heading'])
print('Description:', sum_art)

Title: tokyo stocks open 0.74 percent lower
Description: The dollar was at 65.28 against the ruble on Wednesday against levels above 66 seen on Tuesday. Tokyo stocks opened 0.74 percent lower on Wednesday, hit by the yen´s rise and drops on Wall Street on worries about falling oil prices. The dollar was at 117.72 yen early Wednesday, down from 117.90 yen in New York Tuesday afternoon and rates above 118 yen seen in Tokyo earlier Tuesday.


Let's combine everything we did into one function.

In [15]:
def article_summary(article):
    # Collapse runs of whitespace (including newlines) into single spaces
    art_clean = re.sub(r'\s+', ' ', article).strip()

    # Use spacy to process the text (relies on the global nlp pipeline)
    doc = nlp(art_clean)

    # Keep only sentences longer than 20 characters
    sent_list = []
    for s in doc.sents:
        s_clean = s.text.strip()
        if len(s_clean) > 20:
            sent_list.append(s_clean)

    # Nothing to summarize if no sentence survives the length filter
    if not sent_list:
        return ""

    # Do extractive summarization using TF-IDF

    # Initialize a TfidfVectorizer, which converts text into numerical vectors
    tfidf_vectorizer = TfidfVectorizer()

    # Represent the content of each sentence numerically in a matrix
    X_tfidf = tfidf_vectorizer.fit_transform(sent_list)

    # Calculate the mean TF-IDF vector across all sentences; this centroid represents the average content of the article
    art_vector = np.array(X_tfidf.mean(axis=0))

    # Score each sentence by its cosine similarity to the centroid
    close_art = cosine_similarity(X_tfidf, art_vector)

    # Find the top 3 most relevant sentences (fewer if the article is shorter)
    indices_sort = close_art.flatten().argsort()
    top_3 = indices_sort[-3:][::-1]

    # Build summary using top 3 sentences
    sum_list = []
    for t in top_3:
        sum_list.append(sent_list[t])
    summary = " ".join(sum_list)
    return summary

To check whether our summaries are accurate, we measure the semantic similarity between each summary and its article's heading.

In [17]:
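# Load the medium model, which includes word vectors, to measure semantic similarity
# Note: this rebinds nlp, so article_summary also uses this model from here on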
nlp = spacy.load("en_core_web_md")
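By default, spaCy's `Doc.similarity` is the cosine similarity between two document vectors, and with a model such as `en_core_web_md` a document vector is the average of its token vectors. The sketch below imitates that computation with made-up 3-dimensional vectors. The vocabulary, the numbers, and the helper names (`toy_vectors`, `doc_vector`, `cosine`) are all assumptions for illustration; real spaCy vectors have many more dimensions.

```python
import math

# Hypothetical 3-dimensional word vectors (not real spaCy values)
toy_vectors = {
    "stocks": [0.9, 0.1, 0.0],
    "fall":   [0.7, 0.3, 0.1],
    "shares": [0.8, 0.2, 0.0],
    "drop":   [0.6, 0.4, 0.1],
    "rain":   [0.0, 0.2, 0.9],
}

def doc_vector(tokens):
    """Average the vectors of the tokens that have one."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(known) for col in zip(*known)]

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

heading = doc_vector(["stocks", "fall"])
related = doc_vector(["shares", "drop"])
unrelated = doc_vector(["rain"])

print(round(cosine(heading, related), 3))    # high: similar topic
print(round(cosine(heading, unrelated), 3))  # low: different topic
```

Because the vectors are averaged, word order is ignored and long summaries are compared to short headings on overall topic rather than on exact wording, which is worth keeping in mind when interpreting the scores.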

In [18]:
# Sample 1000 random articles to avoid selection bias (fixed seed for reproducibility)
df_sample1 = df.sample(1000, random_state = 42)
index_list = df_sample1.index.tolist()
df_sample1.head()

Unnamed: 0,Article,Date,Heading,NewsType
1784,strong>KATHMANDU: A 36-year-old Dutch climber...,5/21/2016,Dutch climber dies descent Everest summi,sports
2219,"REVEL, France: Michael Matthews completed the ...",7/12/2016,Dream comes true as Matthews completes Grand T...,sports
368,Singapore: Oil prices held above $43 a barrel ...,11/26/2015,oil prices up in asi,business
535,strong>TOKYO: Asian stocks slipped on Tuesday ...,3/22/2016,Asian shares edge lower as Fed rate talk reviv,business
2424,strong>GALLE: Off-spinner Dilruwan Perera bagg...,8/6/2016,Sri Lanka beat Australia 229 runs clinch seri,sports


In [19]:
# Calculate the similarity score between each article's heading and its summary
similar_score = []

article_print = 0
for i in index_list:
    art = df.loc[i, 'Article']
    heading = df.loc[i, 'Heading']
    sum_art = article_summary(art)
    doc1 = nlp(heading)
    doc2 = nlp(sum_art)
    similarity = doc1.similarity(doc2)
    similar_score.append(similarity)
    # Print the first 5 articles to make sure it works
    if (article_print < 5):
        print('Title:', heading)
        print('Description:', sum_art)
        print()
        article_print += 1
len(similar_score)

Title: Dutch climber dies descent Everest summi
Description: strong>KATHMANDU: A 36-year-old Dutch climber died while descending from the summit of Everest, the first to perish this year on the world's highest mountain, officials in Nepal said on Saturday.</strongEric Ary Arnold was among over 40 climbers who reached the 8,850 metre (29,035 feet) summit on Friday, but died later that day while coming down on high-altitude slopes known as the "death zone" because of the prevailing thin air, Tourism Department official Gyanendra Shrestha said. Mingma Sherpa of the Seven Summits Treks company that organised Arnold's expedition said his client complained of weakness while descending above 8,000 metres (26,246 feet) and probably died from altitude sickness. An earthquake last year killed at least 18 people at the Everest Base Camp, situated at some 5,400 metres (17,800 feet) altitude, and forced hundreds of climbers to abandon their expeditions.

Title: Dream comes true as Matthews complete

1000

Now we compute the mean similarity score to see how closely the extractive summaries align with the headlines.

In [21]:
similar_score = np.array(similar_score)
similar_mean = np.mean(similar_score)
print(similar_mean)

0.6437360152676267


The mean semantic similarity score was approximately 0.644, which indicates a moderate level of alignment between the headlines and the extractive summaries. Either the summaries do not always capture what the headlines emphasize, or the headlines omit contextual details that the summaries include. Because a headline is only a proxy for an article's content, this score suggests that the extractive summaries generally capture the core of each article, though they may not fully reflect headline intent.

In [23]:
# Let's try a different dataset and see what results we get
# Let's use a dataset from BBC News
df_bbc = pd.read_csv('bbc_news.csv')
df_bbc.head()

Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...


In [24]:
df_bbc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42115 entries, 0 to 42114
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        42115 non-null  object
 1   pubDate      42115 non-null  object
 2   guid         42115 non-null  object
 3   link         42115 non-null  object
 4   description  42115 non-null  object
dtypes: object(5)
memory usage: 1.6+ MB


In [25]:
# Let's extract the article text from the URL
from newspaper import Article
def url_extract(row):
    try:
        url = df_bbc.loc[row, 'link']
    
        art = Article(url)
        art.download()
        art.parse()
    
        description = art.text
        return description
    # If the article cannot be downloaded or parsed (e.g. a dead link), return an empty string
    except Exception:
        return ""

In [26]:
# Sample 1000 random articles to avoid selection bias (fixed seed for reproducibility)
df_sample2 = df_bbc.sample(1000, random_state = 42)
index_list = df_sample2.index.tolist()
df_sample2.head()

Unnamed: 0,title,pubDate,guid,link,description
24790,US's only Palestinian-American Congresswoman c...,"Wed, 08 Nov 2023 10:19:15 GMT",https://www.bbc.co.uk/news/world-us-canada-673...,https://www.bbc.co.uk/news/world-us-canada-673...,"Michigan Democrat defends pro-Palestinian ""riv..."
29911,Murdered driver's family demand help for couriers,"Sun, 25 Feb 2024 21:52:55 GMT",https://www.bbc.co.uk/news/uk-wales-67726081,https://www.bbc.co.uk/news/uk-wales-67726081,Mark Lang was killed by a man who was stealing...
21220,Cleared pony owner criticises 'trial by social...,"Fri, 25 Aug 2023 18:20:35 GMT",https://www.bbc.co.uk/news/uk-england-leiceste...,https://www.bbc.co.uk/news/uk-england-leiceste...,"Sarah Moulds also criticises the RSPCA, saying..."
16716,Wrexham: Can football fame make City of Cultur...,"Sat, 06 May 2023 07:52:28 GMT",https://www.bbc.co.uk/news/uk-wales-65494230,https://www.bbc.co.uk/news/uk-wales-65494230?a...,Backers who want Wrexham to win City of Cultur...
32850,Gaza ceasefire talks intensify in Cairo,"Sat, 04 May 2024 19:53:46 GMT",https://www.bbc.co.uk/news/world-middle-east-6...,https://www.bbc.co.uk/news/world-middle-east-6...,Hamas said it was sending negotiators to talks...


In [27]:
# Calculate the similarity score between each article's heading and its summary
similar_score = []
article_print = 0
for i in index_list:
    art = url_extract(i)

    if (art.strip() == ""):
        similar_score.append(np.nan)
        continue
    
    heading = df_bbc.loc[i, 'title']
    sum_art = article_summary(art)
    
    doc1 = nlp(heading)
    doc2 = nlp(sum_art)
    if doc1.vector_norm == 0 or doc2.vector_norm == 0:
        similarity = np.nan
    else:
        similarity = doc1.similarity(doc2)
    similar_score.append(similarity)
    if (article_print < 5):
        print('Title:', heading)
        print('Description:', sum_art)
        print()
        article_print += 1
len(similar_score)

Title: US's only Palestinian-American Congresswoman censured over comments
Description: Ms Tlaib posted a video, external to Twitter on Friday that included a clip of protestors using the chant, which critics say calls for Palestinian control of all land between the Jordan River and the Mediterranean Sea, including Israel. The resolution formally condemned her for "calling for the destruction of the state of Israel". Michigan Democrat Rashida Tlaib was rebuked for her defence of the chant "from the river to the sea, Palestine will be free".

Title: Murdered driver's family demand help for couriers
Description: Cara and Elena Lang are calling for better safety measures to protect delivery drivers after their father was killed in March 2023. And police chiefs warned of a rising number of "opportunistic" attacks on delivery drivers and are urging courier companies to take responsibility for their workers' safety. The sisters sat in court as Christopher El Gifari, 31, was sentenced to life

1000

Let's find the mean similarity score again.

In [29]:
similar_score = np.array(similar_score)
# Drop NaN scores, which mark articles that could not be downloaded or had no word vectors
clean_score = similar_score[~np.isnan(similar_score)]

similar_mean = np.mean(clean_score)
print(similar_mean)

0.6375540795604265
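Since the mean is taken only over articles that produced a score, it can help to report how many were dropped alongside it. The sketch below shows the idea with the standard library; the scores are made-up values, not results from this notebook. NumPy's `np.nanmean` performs the same filtering in a single call.

```python
import math

# Hypothetical scores; NaN marks an article that could not be scored
scores = [0.71, float("nan"), 0.58, 0.66, float("nan")]

# Keep only the real numbers and count what was removed
valid = [s for s in scores if not math.isnan(s)]
dropped = len(scores) - len(valid)
mean_score = sum(valid) / len(valid)

print(f"scored: {len(valid)}, dropped: {dropped}, mean: {mean_score:.3f}")
# scored: 3, dropped: 2, mean: 0.650
```

If a large share of the sample were dropped, the mean would describe a smaller and possibly different set of articles than the 1000 that were sampled.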


The mean semantic similarity score was approximately 0.637, computed over the articles that could be downloaded and scored. Compared to the earlier mean similarity score of 0.644, the difference is small, indicating a consistent level of alignment between article headlines and their extractive summaries across both datasets. This suggests that, despite potential editorial or topical differences, both sources generally maintain coherence between headline intent and article content.

## References

News Article Dataset: https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles 

BBC News Dataset: https://www.kaggle.com/datasets/gpreda/bbc-news 