- Analyse the data.
- Handle missing and duplicate values.
- Preprocess the text data by removing stopwords, punctuation, and converting text to lowercase.
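
The missing-value and duplicate handling listed above could look roughly like the sketch below. It runs on a small made-up frame, and the choice of `title` and `content` as the columns that define a duplicate is an assumption, not something the dataset prescribes.

```python
import pandas as pd

# Toy frame standing in for the news data; values are made up for illustration.
toy = pd.DataFrame(
    {
        "title": ["Rates rise", "Rates rise", None, "Oil slips"],
        "content": [
            "Central bank lifts rates.",
            "Central bank lifts rates.",
            "No title here.",
            None,
        ],
    }
)

# Rows that repeat an earlier (title, content) pair.
n_duplicates = int(toy.duplicated(subset=["title", "content"]).sum())

# Drop the repeats, then any row missing a field needed downstream.
cleaned = (
    toy.drop_duplicates(subset=["title", "content"])
    .dropna(subset=["title", "content"])
    .reset_index(drop=True)
)

print(n_duplicates)
print(cleaned)
```

Whether to drop or fill missing rows depends on how much data each option costs, so it is worth checking the counts before committing to either.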

---

### Summarization

- Implement a summarization technique to generate concise summaries for each financial news article.
- Evaluate the quality of the generated summaries based on relevancy and coherence.
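
As a starting point for the two tasks above, here is a minimal sketch of a frequency-based extractive summarizer with a crude relevance score. It is one possible baseline, not the method this notebook settles on. The helper names (`tokenize`, `summarize`, `relevance`) and the tiny stopword list are hypothetical, and coherence is not measured here at all.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK's list would be larger).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "for", "is", "its", "by", "as"}


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def summarize(text: str, n_sentences: int = 1) -> str:
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence: str) -> float:
        tokens = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


def relevance(summary: str, article: str) -> float:
    """Share of distinct article words that also appear in the summary."""
    article_words = set(tokenize(article))
    if not article_words:
        return 0.0
    return len(article_words & set(tokenize(summary))) / len(article_words)


article = (
    "The central bank raised interest rates on Tuesday. "
    "Higher interest rates pushed bank shares up. "
    "Analysts had expected the move."
)
summary = summarize(article, n_sentences=1)
print(summary)
print(round(relevance(summary, article), 2))
```

The relevance score only counts word overlap, so it rewards summaries that reuse the article's vocabulary. A reference-based metric or human review would be needed to judge coherence.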


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("news-data/data.csv")
print(df.shape)
df.head()

(105375, 12)


Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content
0,89541,,International Business Times,Paavan MATHEMA,UN Chief Urges World To 'Stop The Madness' Of ...,UN Secretary-General Antonio Guterres urged th...,https://www.ibtimes.com/un-chief-urges-world-s...,https://d.ibtimes.com/en/full/4496078/nepals-g...,2023-10-30 10:12:35.000000,UN Secretary-General Antonio Guterres urged th...,Nepal,UN Secretary-General Antonio Guterres urged th...
1,89542,,Prtimes.jp,,RANDEBOOよりワンランク上の大人っぽさが漂うニットとベストが新登場。,[株式会社Ainer]\nRANDEBOO（ランデブー）では2023年7月18日(火)より公...,https://prtimes.jp/main/html/rd/p/000000147.00...,https://prtimes.jp/i/32220/147/ogp/d32220-147-...,2023-10-06 04:40:02.000000,"RANDEBOO2023718()WEB2023 Autumn Winter \n""Nepa...",Nepal,
2,89543,,VOA News,webdesk@voanews.com (Agence France-Presse),UN Chief Urges World to 'Stop the Madness' of ...,UN Secretary-General Antonio Guterres urged th...,https://www.voanews.com/a/un-chief-urges-world...,https://gdb.voanews.com/01000000-0a00-0242-60f...,2023-10-30 10:53:30.000000,"Kathmandu, Nepal UN Secretary-General Antonio...",Nepal,
3,89545,,The Indian Express,Editorial,Sikkim warning: Hydroelectricity push must be ...,Ecologists caution against the adverse effects...,https://indianexpress.com/article/opinion/edit...,https://images.indianexpress.com/2023/10/edit-...,2023-10-06 01:20:24.000000,At least 14 persons lost their lives and more ...,Nepal,At least 14 persons lost their lives and more ...
4,89547,,The Times of Israel,Jacob Magid,"200 foreigners, dual nationals cut down in Ham...","France lost 35 citizens, Thailand 33, US 31, U...",https://www.timesofisrael.com/200-foreigners-d...,https://static.timesofisrael.com/www/uploads/2...,2023-10-27 01:08:34.000000,"Scores of foreign citizens were killed, taken ...",Nepal,


## Overview and Info


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105375 entries, 0 to 105374
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   article_id    105375 non-null  int64 
 1   source_id     24495 non-null   object
 2   source_name   105375 non-null  object
 3   author        97156 non-null   object
 4   title         105335 non-null  object
 5   description   104992 non-null  object
 6   url           105375 non-null  object
 7   url_to_image  99751 non-null   object
 8   published_at  105375 non-null  object
 9   content       105375 non-null  object
 10  category      105333 non-null  object
 11  full_content  58432 non-null   object
dtypes: int64(1), object(11)
memory usage: 9.6+ MB


### Null Values


In [4]:
df.isna().sum().truediv(df.shape[0]).mul(100).round(2)

article_id       0.00
source_id       76.75
source_name      0.00
author           7.80
title            0.04
description      0.36
url              0.00
url_to_image     5.34
published_at     0.00
content          0.00
category         0.04
full_content    44.55
dtype: float64

**Conclusions**

- `source_id` is about 77% null, so I am going to drop it.
- `title` is 0.04% null.
- `description` is 0.36% null. I need to inspect its values before deciding what to do.
- `url_to_image` is 5.3% null. I don't expect to use this column.
- `full_content` is about 45% null, so the decision depends on what the project requires.
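
To make the "drop if mostly null" decision repeatable, a small helper can list the columns whose null share exceeds a cutoff. This is a sketch on made-up data. The function name `columns_above_null_share` and the 0.5 threshold are illustrative assumptions.

```python
import pandas as pd


def columns_above_null_share(frame: pd.DataFrame, threshold: float) -> list[str]:
    """Names of columns whose fraction of missing values exceeds `threshold`."""
    null_share = frame.isna().mean()  # per-column fraction in [0, 1]
    return null_share[null_share > threshold].index.tolist()


# Made-up frame for illustration only.
toy = pd.DataFrame(
    {
        "source_id": [None, None, None, "bbc"],
        "title": ["a", "b", None, "d"],
        "content": ["w", "x", "y", "z"],
    }
)

mostly_empty = columns_above_null_share(toy, threshold=0.5)
print(mostly_empty)
```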


## Data Exploration and Data Cleaning


### `full_content`, `content`


In [5]:
df[df["full_content"].isna()]["source_name"].value_counts()

source_name
Biztoc.com                     3968
Forbes                         1656
BBC News                       1216
AllAfrica - Top Africa News    1114
Business Insider                718
                               ... 
Abnormalreturns.com               1
Bonterms.com                      1
Darnell.day                       1
Platformer.news                   1
Omnigroup.com                     1
Name: count, Length: 2370, dtype: int64

In [6]:
df[~df["full_content"].isna()]["source_name"].value_counts()

source_name
ETF Daily News                  16631
The Times of India               7579
GlobeNewswire                    5480
Globalsecurity.org               3093
Forbes                           2767
BBC News                         2126
ABC News                         2102
Business Insider                 2028
The Punch                        1798
Al Jazeera English               1642
Marketscreener.com               1439
Phys.Org                         1253
International Business Times     1202
The Indian Express               1180
RT                               1124
NPR                               980
Deadline                          927
Digital Trends                    787
CNA                               710
Boing Boing                       708
Time                              598
Android Central                   523
Gizmodo.com                       382
ReadWrite                         319
Euronews                          291
Wired                             268


**Conclusions**

- `full_content` column's description: "Article Extracted from its respected URL" _(from Kaggle)_.
- Filtering to the rows where `full_content` is null leaves **2370** distinct `source_name` values.
- Only **29** `source_name` values have any non-null `full_content`.

Keep in mind that even where `full_content` is missing, the `content` column is 100% non-null
(`content`: "The unformatted content of the article, where available. This is truncated to 200 chars" _(from Kaggle)_).
That is why we cannot remove rows on the basis of the `full_content` column alone.

I can copy the `full_content` value into the `content` column wherever `full_content` is not null, and after
that remove the `full_content` column.


In [7]:
full_content_present_index = df[~df["full_content"].isna()].index
full_content_present_index

Index([     0,      3,      6,      7,     12,     15,     18,     21,     22,
           31,
       ...
       105365, 105366, 105367, 105368, 105369, 105370, 105371, 105372, 105373,
       105374],
      dtype='int64', length=58432)

In [8]:
df.loc[full_content_present_index, "content"] = df.loc[
    full_content_present_index, "full_content"
]

In [9]:
df = df.drop(columns=["full_content"])
df.shape

(105375, 11)
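
The index-based substitution above can also be written in one step with `Series.combine_first`, which takes values from the calling series and falls back to the argument where the caller is null. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Made-up rows: one with full text available, one without.
toy = pd.DataFrame(
    {
        "content": ["Short teaser A...", "Short teaser B..."],
        "full_content": ["Full text of article A.", None],
    }
)

# Prefer full_content where it exists, otherwise keep the truncated content.
toy["content"] = toy["full_content"].combine_first(toy["content"])
toy = toy.drop(columns=["full_content"])
print(toy)
```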

### `url`, `url_to_image` and `published_at`

These columns are not relevant to our analysis, so we can drop them.


In [10]:
df = df.drop(columns=["url", "url_to_image", "published_at"])
df.shape

(105375, 8)

### `source_id`


In [11]:
df["source_id"].notna().sum()

24495

Since `source_name` is already in the dataset and only about 23% of `source_id` values are non-null, we can drop `source_id`.


In [12]:
df = df.drop(columns=["source_id"])
df.shape

(105375, 7)

In [13]:
df.sample(8)

Unnamed: 0,article_id,source_name,author,title,description,content,category
53899,122072,Marketscreener.com,,"EMBRAER S A : displays C-390, Super Tucano, E1...",(marketscreener.com) \n - Eve Air Mobility wil...,- Eve Air Mobility will showcase a full-size e...,Travel
77433,81790,GoNintendo,znbashi,The Trotties Adventure launching on November 1...,Visit the site to view the full article.,"Emma, Mia, Lucy and Sophie are the Trotties - ...",Madagascar
72958,269178,ETF Daily News,MarketBeat News,Kovitz Investment Group Partners LLC Buys Shar...,Kovitz Investment Group Partners LLC bought a ...,Kovitz Investment Group Partners LLC bought a ...,Real estate
48742,77502,Marketscreener.com,,Gulf Investment Fund net asset value falls in ...,(marketscreener.com) \nGulf Investment Fund PL...,Gulf Investment Fund PLC - Isle of Man-based i...,Kuwait
101355,676040,ETF Daily News,MarketBeat News,StockNews.com Lowers Nathan’s Famous (NASDAQ:N...,StockNews.com lowered shares of Nathan’s Famou...,StockNews.comlowered shares ofNathan’s Famous ...,Stock
76784,314357,ETF Daily News,MarketBeat News,"Spotlight Asset Group Inc. Sells 2,276 Shares ...",Spotlight Asset Group Inc. cut its position in...,Spotlight Asset Group Inc. cut its position in...,Canada
20978,17185,CoinDesk,Shaurya Malwa,"Crypto Market Stable, Oil Prices Surge as Hama...",The U.S. pre-market futures slid Monday mornin...,Bitcoin (BTC) and ether (ETH) showed signs of ...,Asia
52967,120452,The Times of Israel,,"Presumed captive: Ofir Tzarfati, celebrated 27...",Tzarfati got his girlfriend into a car and out...,Ofir Tzarfati was celebrating his 27th birthda...,Music


### `category`


In [14]:
df["category"].value_counts()

category
Stock          3999
Health         2594
Finance        2402
Technology     2371
Real estate    2352
               ... 
Eritrea          14
Martinique       13
Cabo Verde       11
Réunion           9
Guadeloupe        4
Name: count, Length: 257, dtype: int64

### `author`


In [15]:
df["author"].value_counts()

author
MarketBeat News                                                                                                                                        16627
John Pike                                                                                                                                               3093
https://www.facebook.com/bbcnews                                                                                                                        2040
Reuters                                                                                                                                                 1348
PTI                                                                                                                                                     1231
                                                                                                                                                       ...  
Bethan Ackerley                                    

In [16]:
df = df.drop(columns=["author"])
df.shape

(105375, 6)

In [17]:
df.sample(8)

Unnamed: 0,article_id,source_name,title,description,content,category
68929,217120,GlobeNewswire,Global Fuel Delivery Systems Strategic Busines...,"Dublin, Nov. 08, 2023 (GLOBE NEWSWIRE) -- The ...","Dublin, Nov. 08, 2023 (GLOBE NEWSWIRE) -- Th...",Australia
23194,21138,ReadWrite,How influencers and Riot Games made Valorant a...,Riot Games’ Valorant has carved a niche for it...,Riot Games’Valoranthas carved a niche for itse...,YouTube
56424,128414,Sky.com,'Hero' astronaut who helped save Apollo 13 cre...,An astronaut who orbited the moon and helped r...,An astronaut who orbited the moon and helped r...,Space
86876,424677,BBC News,George Santos to face new expulsion vote after...,The move comes one day after a damning ethics ...,George Santos will face a new expulsion vote a...,Jobs
86369,413662,The Times of India,Risk rally stalls as bullish investors take br...,Global stocks fell and the dollar slightly ros...,LONDON: World stocks fell for the first time i...,Japan
92203,87219,Biztoc.com,Acclaimed Animator Anca Damian Lines Up Live-A...,Acclaimed Romanian animator Anca Damian has li...,Acclaimed Romanian animator Anca Damian has li...,Namibia
48196,76450,Globalsecurity.org,UN rights expert urges key reforms in Cambodia,A UN independent human rights expert on Monday...,9 October 2023 - A UN independent human right...,Cambodia
20800,16854,HuffPost,G.0.A.T.: Simone Biles Wins 6th Title At World...,Simone Biles has won the individual all-around...,"ANTWERP, Belgium (AP) After a two-year absence...",world


## Data Preprocessing


In [18]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

In [None]:
from string import punctuation

In [20]:
import nltk

nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /Users/iarv/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/iarv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Remove stopwords, punctuation and convert to lowercase


In [70]:
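import re

# Escape the punctuation so "\" and "]" stay literal inside the character class.
PUNCT_PATTERN = f"[{re.escape(punctuation)}]"


def text_preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # leave the caller's frame untouched
    for col in df.columns:
        df[col] = df[col].str.lower().str.replace(PUNCT_PATTERN, "", regex=True)
    return df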
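
A note on the punctuation character class: `string.punctuation` contains `\` directly followed by `]`, so when it is interpolated into `[...]` unescaped, the backslash is read as an escape for the bracket and is not itself matched. The sketch below compares the unescaped and escaped patterns on a made-up string.

```python
import re
from string import punctuation

sample = r"path\to [file], done!"

# Unescaped: the "\" in `punctuation` only escapes the "]" that follows it,
# so the backslash itself is not part of the character class.
unescaped = re.sub(f"[{punctuation}]", "", sample)

# Escaped: every punctuation character is matched literally.
escaped = re.sub(f"[{re.escape(punctuation)}]", "", sample)

print(repr(unescaped))
print(repr(escaped))
```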

In [75]:
# A set gives constant-time membership checks when filtering tokens.
en_stopwords = set(stopwords.words("english"))

In [None]:
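def remove_stopword(df: pd.DataFrame) -> pd.DataFrame:
    """Tokenize each column and drop English stopwords."""
    df = df.copy()  # leave the caller's frame untouched
    for col in df.columns:
        # fillna("") guards word_tokenize against missing values (NaN is a float).
        df[col] = df[col].fillna("").apply(
            lambda text: [w for w in word_tokenize(text) if w not in en_stopwords]
        )
    return df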

In [79]:
pipe = Pipeline(
    [
        ("text_preprocess", FunctionTransformer(text_preprocess)),
        ("remove_stopword", FunctionTransformer(remove_stopword)),
    ]
)

col_trf = ColumnTransformer(
    [
        ("data_preprocessing", pipe, ["title", "description", "content"]),
    ],
    remainder="passthrough",
)

col_trf

In [81]:
df.head()

Unnamed: 0,article_id,source_name,title,description,content,category
0,89541,International Business Times,UN Chief Urges World To 'Stop The Madness' Of ...,UN Secretary-General Antonio Guterres urged th...,UN Secretary-General Antonio Guterres urged th...,Nepal
1,89542,Prtimes.jp,RANDEBOOよりワンランク上の大人っぽさが漂うニットとベストが新登場。,[株式会社Ainer]\nRANDEBOO（ランデブー）では2023年7月18日(火)より公...,"RANDEBOO2023718()WEB2023 Autumn Winter \n""Nepa...",Nepal
2,89543,VOA News,UN Chief Urges World to 'Stop the Madness' of ...,UN Secretary-General Antonio Guterres urged th...,"Kathmandu, Nepal UN Secretary-General Antonio...",Nepal
3,89545,The Indian Express,Sikkim warning: Hydroelectricity push must be ...,Ecologists caution against the adverse effects...,At least 14 persons lost their lives and more ...,Nepal
4,89547,The Times of Israel,"200 foreigners, dual nationals cut down in Ham...","France lost 35 citizens, Thailand 33, US 31, U...","Scores of foreign citizens were killed, taken ...",Nepal


In [None]:
col_trf.fit_transform(df.head(10)[["title", "description", "content"]])