# Global Article Summarization and Link Analysis
**PROBLEM STATEMENT 1**
https://drive.google.com/file/d/1rhS3kvKaegBEo9il9ijX-k0QhdFdsgPr/view

### Key tasks:
- Article classification - e.g. Virat Kohli -> sports, songs, etc.
- **Article summarization** - could be done using open-source models (30%)
- Natural-language **information retrieval system** - cosine similarity? (30%)
- Keyword extraction - bold the keywords or use them to find related articles
- **Related articles** - we could do this by computing a vector embedding for every article and finding the closest n articles in the vector space (30%)
- **Interactive, intuitive UI** - with a feature to tweak the summary size (10%)
- Bonus - translation of the summary
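
The related-articles idea in the list above (one embedding per article, then the closest n articles in the vector space) could look roughly like the sketch below. It assumes the embeddings already exist as rows of a NumPy array; the function name `nearest_articles` and the toy 2-D vectors are made up for illustration and are not part of the notebook.

```python
import numpy as np


def nearest_articles(embeddings: np.ndarray, index: int, n: int = 3) -> list:
    """Return the row indices of the n articles closest to row `index` by cosine similarity."""
    # Normalize every row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[index]
    scores[index] = -np.inf  # never return the query article itself
    return np.argsort(scores)[::-1][:n].tolist()


# Toy 2-D "embeddings" (hypothetical values, one row per article).
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
print(nearest_articles(toy, index=0, n=2))  # → [1, 3]
```

For a corpus of this size, a full sort per lookup may be slow; an approximate nearest-neighbour index would be a natural next step, but that is beyond this sketch.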

[Dataset 1](https://www.kaggle.com/datasets/everydaycodings/global-news-dataset?resource=download)

## Architecture Diagram
![Architecture Diagram](imgs/UI.png)

### Loading the required modules

In [1]:
import pandas as pd
import numpy as np
from sklearnex import patch_sklearn
patch_sklearn()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline
from transformers import AutoTokenizer

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)

### Preparing the dataset

In [4]:
data = pd.read_csv("./data.csv")
data.head(2)

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content
0,89541,,International Business Times,Paavan MATHEMA,UN Chief Urges World To 'Stop The Madness' Of ...,UN Secretary-General Antonio Guterres urged th...,https://www.ibtimes.com/un-chief-urges-world-s...,https://d.ibtimes.com/en/full/4496078/nepals-g...,2023-10-30 10:12:35.000000,UN Secretary-General Antonio Guterres urged th...,Nepal,UN Secretary-General Antonio Guterres urged th...
1,89542,,Prtimes.jp,,RANDEBOOよりワンランク上の大人っぽさが漂うニットとベストが新登場。,[株式会社Ainer]\nRANDEBOO（ランデブー）では2023年7月18日(火)より公...,https://prtimes.jp/main/html/rd/p/000000147.00...,https://prtimes.jp/i/32220/147/ogp/d32220-147-...,2023-10-06 04:40:02.000000,"RANDEBOO2023718()WEB2023 Autumn Winter \n""Nepa...",Nepal,


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105375 entries, 0 to 105374
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   article_id    105375 non-null  int64 
 1   source_id     24495 non-null   object
 2   source_name   105375 non-null  object
 3   author        97156 non-null   object
 4   title         105335 non-null  object
 5   description   104992 non-null  object
 6   url           105375 non-null  object
 7   url_to_image  99751 non-null   object
 8   published_at  105375 non-null  object
 9   content       105375 non-null  object
 10  category      105333 non-null  object
 11  full_content  58432 non-null   object
dtypes: int64(1), object(11)
memory usage: 9.6+ MB


### Null Handling

In [6]:
null_counts = data.isnull().sum()
null_counts

article_id          0
source_id       80880
source_name         0
author           8219
title              40
description       383
url                 0
url_to_image     5624
published_at        0
content             0
category           42
full_content    46943
dtype: int64

In [7]:
# Null values in title
data[data['title'].isnull()].head(2)

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content
1139,91420,,kottke.org,Jason Kottke,,Is food in America better or worse than in oth...,https://kottke.org/23/10/0043175-is-food-in-am...,,2023-10-06 22:16:33.000000,Is food in America better or worse than in oth...,Peru,
16575,8362,,Thegospelcoalition.org,Scotty Smith,,“He (the Messiah—Jesus) shall judge between th...,https://www.thegospelcoalition.org/blogs/scott...,https://media.thegospelcoalition.org/wp-conten...,2023-10-08 11:28:57.000000,He (the MessiahJesus) shall judge between the ...,Somalia,


In [8]:
data['title'] = data['title'].fillna(data['content'])
data['title'].isnull().sum()

0

In [9]:
# Null values in full_content
data[data['full_content'].isnull()].head(2)

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content
1,89542,,Prtimes.jp,,RANDEBOOよりワンランク上の大人っぽさが漂うニットとベストが新登場。,[株式会社Ainer]\nRANDEBOO（ランデブー）では2023年7月18日(火)より公...,https://prtimes.jp/main/html/rd/p/000000147.00...,https://prtimes.jp/i/32220/147/ogp/d32220-147-...,2023-10-06 04:40:02.000000,"RANDEBOO2023718()WEB2023 Autumn Winter \n""Nepa...",Nepal,
2,89543,,VOA News,webdesk@voanews.com (Agence France-Presse),UN Chief Urges World to 'Stop the Madness' of ...,UN Secretary-General Antonio Guterres urged th...,https://www.voanews.com/a/un-chief-urges-world...,https://gdb.voanews.com/01000000-0a00-0242-60f...,2023-10-30 10:53:30.000000,"Kathmandu, Nepal UN Secretary-General Antonio...",Nepal,


In [10]:
data['full_content'] = data['full_content'].fillna(data['content'])
data['full_content'].isnull().sum()

0

## Information Retrieval
Return the documents most relevant to a natural-language query, ranked by **cosine similarity** between TF-IDF vectors.

In [11]:
%%time
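def search_articles(query, data, topn=5):
    """Return the topn rows of data whose full_content best matches query."""
    # Note: the vectorizer is refit on the whole corpus on every call,
    # which likely accounts for most of the run time reported below.
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(data['full_content'])
    query_vector = vectorizer.transform([query])
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    # Sort article positions by descending similarity and keep the best topn
    article_indices = cosine_similarities.argsort()[::-1][:topn]
    relevant_articles = data.iloc[article_indices]
    return relevant_articles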

query = "plant based diet options"
results = search_articles(query, data, 10)

CPU times: user 33.6 s, sys: 687 ms, total: 34.3 s
Wall time: 31.5 s
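
Each call to `search_articles` refits the TF-IDF vectorizer on the full corpus, which is probably where most of the roughly 31 s goes. One possible alternative, sketched below, is to fit once and reuse the matrix for every query. The class name `ArticleIndex` and the three-row toy frame are assumptions made for this example, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class ArticleIndex:
    """Hypothetical helper: fit TF-IDF once, then answer many queries."""

    def __init__(self, data: pd.DataFrame, text_column: str = "full_content"):
        self.data = data
        self.vectorizer = TfidfVectorizer(stop_words="english")
        # The expensive step happens a single time, at construction.
        self.matrix = self.vectorizer.fit_transform(data[text_column])

    def search(self, query: str, topn: int = 5) -> pd.DataFrame:
        # Only the short query is vectorized per call.
        query_vector = self.vectorizer.transform([query])
        scores = cosine_similarity(query_vector, self.matrix).flatten()
        top = np.argsort(scores)[::-1][:topn]
        return self.data.iloc[top]


# Toy corpus standing in for the real DataFrame.
toy = pd.DataFrame({"full_content": [
    "Eating a plant based diet can lower cholesterol.",
    "The football team won the championship match.",
    "New gene editing treatment reduces cholesterol.",
]})
index = ArticleIndex(toy)
hits = index.search("plant based diet", topn=1)
print(hits)
```

With the real data, `ArticleIndex(data)` would be built once and `index.search(query, 10)` called for both the user query and the keyword query later in the notebook.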


In [13]:
results.head(3)

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content
84445,390387,,The Indian Express,IE Online,New superdiet to reduce bad cholesterol? How a...,Why portfolio diet has the potential to reduce...,https://indianexpress.com/article/health-welln...,https://images.indianexpress.com/2023/11/portf...,2023-11-16 12:06:14,Diet plays a crucial role in reducing choleste...,Health,Written by By Dr Pradeep Haranahalli Diet play...
31601,35822,,Wealthofgeeks.com,Alison Corey,Health Benefits of Eating Vegan Once a Week,Eating a plant-based diet has become increasin...,https://wealthofgeeks.com/eating-vegan-once-a-...,https://wealthofgeeks.com/wp-content/uploads/2...,2023-10-31 10:48:56.000000,Eating a plant-based diet has become increasin...,Vegan,Eating a plant-based diet has become increasin...
31460,35625,,Erickimphotography.com,ERIC KIM,Food is Fashion,Thus the idea— being vegan or eating a “plant ...,https://erickimphotography.com/blog/2023/10/31...,,2023-11-01 03:12:43.000000,Thus the idea being vegan or eating a plant ba...,Vegan,Thus the idea being vegan or eating a plant ba...


### Top Article

In [14]:
title = results.iloc[0].title
content = results.iloc[0].full_content
url = results.iloc[0].url
print(title.upper(), '-'*len(title), content, url, sep="\n")

NEW SUPERDIET TO REDUCE BAD CHOLESTEROL? HOW AVOCADO, NUTS, AND LEGUMES COULD LOWER YOUR HEART DISEASE RISK
-----------------------------------------------------------------------------------------------------------
Written by By Dr Pradeep Haranahalli Diet plays a crucial role in reducing cholesterol levels. Hence, when we talk about lowering cholesterol, the first therapeutic measure is a therapeutic lifestyle change that includes exercise and diet, followed by medications if the lifestyle changes do not work. The American Heart Association suggests ten lipid-lowering or heart-healthy diets. Among these, the top three are the Mediterranean diet—abundant in vegetables, fruits, whole grains, beans, nuts, and seeds; the DASH diet—emphasizing potassium, calcium, magnesium, fiber, and protein while minimizing saturated fat; and the Pescetarian diet, notable for its emphasis on seafood. Very recently, the nurses’ health study and the Heart Protection Study have put forward the portfolio di

## Text Summarization

In [3]:
tokenizer = AutoTokenizer.from_pretrained("Falconsai/text_summarization")
summarizer = pipeline("summarization", model="Falconsai/text_summarization", tokenizer=tokenizer)


All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [15]:
%%time
#tokenizer = AutoTokenizer.from_pretrained("./text_summarization")
#summarizer = pipeline("summarization", model="./text_summarization", tokenizer=tokenizer)
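max_seq_length = 512  # chunk size limit, measured in characters
chunks = [""]
sentences = content.split('. ')
for sentence in sentences:
    # Start a new chunk only when the current one is non-empty and would overflow;
    # otherwise extend the current chunk, so that no sentence is ever dropped
    if chunks[-1] and len(chunks[-1]) + len(sentence) > max_seq_length:
        chunks.append(sentence + ". ")
    else:
        chunks[-1] += sentence + ". "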

summaries = []
for i, chunk in enumerate(chunks):
    print("Summarizing Chunk (",i+1,"/",len(chunks),")", sep="")
    summary = summarizer(chunk, max_length=50, min_length=30, do_sample=False)
    summaries.append(summary[0]['summary_text'])

Summarizing Chunk (1/7)




Summarizing Chunk (2/7)
Summarizing Chunk (3/7)


Your max_length is set to 50, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)


Summarizing Chunk (4/7)
Summarizing Chunk (5/7)
Summarizing Chunk (6/7)
Summarizing Chunk (7/7)
CPU times: user 1min 8s, sys: 11.2 s, total: 1min 19s
Wall time: 1min 1s
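
The chunking step above measures chunk size in characters, while the model's input limit is expressed in tokens. A minimal sketch of a reusable chunker with a pluggable length function follows; `chunk_sentences` and `length_fn` are names invented for this example, and the splitting rule (`'. '`) simply mirrors the cell above.

```python
from typing import Callable, List


def chunk_sentences(text: str, max_len: int = 512,
                    length_fn: Callable[[str], int] = len) -> List[str]:
    """Greedily pack sentences into chunks whose measured length stays within max_len.

    A single sentence longer than max_len is kept as its own oversized chunk
    rather than being dropped.
    """
    chunks: List[str] = []
    current = ""
    for sentence in text.split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        # Restore the terminator removed by split(), unless one is already present.
        if sentence[-1] not in ".!?":
            sentence += "."
        candidate = f"{current} {sentence}" if current else sentence
        if current and length_fn(candidate) > max_len:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo = "One two three. Four five. Six seven eight nine. Ten."
print(chunk_sentences(demo, max_len=30))
# → ['One two three. Four five.', 'Six seven eight nine. Ten.']
```

To count tokens rather than characters, one could pass something like `length_fn=lambda s: len(tokenizer(s)["input_ids"])` with the `tokenizer` loaded earlier; that variant is not exercised here.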


In [16]:
out = ". ".join("".join(summaries).split(" ."))
print("QUERY:", query.capitalize(), "\n")
print("OUTPUT:", out, sep="\n")
print("\nRead the full article:", url, sep='\n')

QUERY: Plant based diet options 

OUTPUT:
The American Heart Association suggests ten lipid-lowering diets.  The first therapeutic measure is exercise and diet, followed by medications if the lifestyle changes do not work. the nurses’ health study and the Heart Protection Study have put forward the portfolio diet based on the results of a trial study involving about one lakh thirty thousand participants. Portfolio diet focuses on protein derived from soy or legumes, fiber from products like okra or eggplant, nuts, soluble fibers, and plant sterols. Cardiovascular diseases, especially atherosclerotic vascular disease, are strongly associated with cholesterol levels in the body.  What is the advantage of the portfolio diet?high cholesterol levels lead to increased incidence of blockages in the blood supply, leading to heart attacks and strokes The portfolio diet effectively reduces the LDL cholesterol. The portfolio diet showed a 14 percent reduction in cardiovascular death (a composite 
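
The summarization loop uses fixed `max_length=50` and `min_length=30`, which triggers the warning for the 36-token chunk and leaves no room for the "tweak summarization size" feature from the task list. One way this could be handled is to derive both bounds from the chunk's token count and a user-chosen ratio, as in the sketch below. The function `summary_length_bounds`, the `ratio` parameter, and the `floor` default are all assumptions for illustration.

```python
from typing import Tuple


def summary_length_bounds(input_tokens: int, ratio: float = 0.3,
                          floor: int = 5) -> Tuple[int, int]:
    """Derive (min_length, max_length) for one chunk from its token count.

    ratio is the desired summary size as a fraction of the input, for example
    the value of a UI slider. max_length never exceeds the input length.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    wanted = max(floor, int(input_tokens * ratio))
    max_length = max(1, min(wanted, input_tokens))
    min_length = max(1, max_length // 2)
    return min_length, max_length


print(summary_length_bounds(36))        # → (5, 10)
print(summary_length_bounds(120, 0.5))  # → (30, 60)
```

The resulting pair could then be passed as `min_length` and `max_length` in the `summarizer(...)` call for each chunk.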

## Keyword Extraction

In [17]:
%%time
def extract_keywords(document, topn=5):
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform([document])
    feature_names = vectorizer.get_feature_names_out()
    word_scores = [(feature_names[i], tfidf_matrix[0, i]) for i in range(tfidf_matrix.shape[1])]
    word_scores_sorted = sorted(word_scores, key=lambda x: x[1], reverse=True)
    top_keywords = [word for word, score in word_scores_sorted[:topn]]
    return top_keywords

keywords = extract_keywords(out, 6)
keywords

CPU times: user 9.45 ms, sys: 8.4 ms, total: 17.8 ms
Wall time: 96 ms


['diet', 'portfolio', 'heart', 'cholesterol', 'grams', 'study']
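
The task list suggests bolding the extracted keywords in the displayed summary. A minimal sketch of that step, assuming Markdown output and whole-word, case-insensitive matching, is shown below; `bold_keywords` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Iterable


def bold_keywords(text: str, keywords: Iterable[str]) -> str:
    """Wrap whole-word, case-insensitive keyword matches in Markdown bold markers."""
    # Longest keywords first so that longer alternatives win over their prefixes.
    words = sorted({k for k in keywords if k}, key=len, reverse=True)
    if not words:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: f"**{m.group(0)}**", text)


print(bold_keywords("The portfolio diet lowers cholesterol.",
                    ["diet", "portfolio", "cholesterol"]))
# → The **portfolio** **diet** lowers **cholesterol**.
```

In the notebook this would be called as `bold_keywords(out, keywords)` before rendering the summary.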

### Keywords related articles

In [18]:
%%time
query = " ".join(keywords)
results = search_articles(query, data, 10)

CPU times: user 33.7 s, sys: 738 ms, total: 34.4 s
Wall time: 31.7 s


In [19]:
results.head(3)

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content
84445,390387,,The Indian Express,IE Online,New superdiet to reduce bad cholesterol? How a...,Why portfolio diet has the potential to reduce...,https://indianexpress.com/article/health-welln...,https://images.indianexpress.com/2023/11/portf...,2023-11-16 12:06:14,Diet plays a crucial role in reducing choleste...,Health,Written by By Dr Pradeep Haranahalli Diet play...
79455,39223,,Spring.org.uk,Mina Dean,The Common Nut That Lowers Cholesterol Levels,"HDL cholesterol, known as ""good"" cholesterol, ...",https://www.spring.org.uk/?p=97971,https://www.spring.org.uk/images/nuts-1.jpg,2023-10-15 13:00:28.000000,"HDL cholesterol, known as “good” cholesterol, ...",Nutrition,"HDL cholesterol, known as “good” cholesterol, ..."
80155,326870,wired,Wired,Emily Mullin,A Single Infusion of a Gene-Editing Treatment ...,It’s still early days for a novel form of gene...,https://www.wired.com/story/a-single-infusion-...,https://media.wired.com/photos/6552b2a81dc6437...,2023-11-14 13:00:00,"In a small initial test in people, researchers...",News,"In a small initial test in people, researchers..."
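
The first related result is the article the keywords were extracted from (article_id 390387), so it would normally be filtered out before display. A small sketch of that filter follows; `drop_seed_article` is a made-up name, and the toy frame reuses only the ids and source names shown in the output above.

```python
import pandas as pd


def drop_seed_article(results: pd.DataFrame, seed_article_id: int,
                      topn: int = 3) -> pd.DataFrame:
    """Remove the article that produced the keywords, then keep the best topn rows."""
    return results[results["article_id"] != seed_article_id].head(topn)


# Toy stand-in for `results`, already ordered by relevance.
toy = pd.DataFrame({
    "article_id": [390387, 39223, 326870],
    "source_name": ["The Indian Express", "Spring.org.uk", "Wired"],
})
related = drop_seed_article(toy, seed_article_id=390387, topn=2)
print(related["article_id"].tolist())  # → [39223, 326870]
```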


## User Interface / User Experience Design
![User Interface](imgs/UI.png)
