# News Article Summarization. 
This notebook prepares news articles to be inserted into the ChromaDB vector database. First, it creates a summary of the news article, then it identifies important Named Entities such as the names of politicians, locations, and relevant dates. Having a news summary allows us to quickly perform small-to-big retrieval; finding the full article from it's brief overview. This method helps us evaluate the Language Learning Model (LLM) more effectively. Additionally, the identified entities will be used as metadata and embedded with the news articles, aiding in fine-tuning and evaluating the LLM.

For more insight into this approach, check out this YouTube video by Jerry Liu, Founder of LlmamaIndex: https://youtu.be/TRjq7t2Ms5I.

## Config & Install Libraries
Check if Huggingface transformers and required libraries are installed

In [None]:
!pip install -q transformers sentencepiece sentence-transformers datasets 

In [61]:
!pip install -q spacy

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [63]:
!python -m spacy download en_core_web_sm

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m116.7 kB/s[0m eta [36m0:00:00[0m00:01[0m00:04[0m
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


## News Summary Pipeline

In [3]:
import os
import json
from util import utils
from pymongo import MongoClient
from dotenv import load_dotenv, find_dotenv

In [4]:
load_dotenv(find_dotenv())

True

In [5]:
MONGO_CONN_STRING = 'mongodb://admin:7QdQ3v0M50<>@192.168.8.166:27017/'  # os.getenv("MONGO_CONNECTION_STRING")

In [6]:
mongo_client = MongoClient(MONGO_CONN_STRING)
db = mongo_client.get_database(os.getenv("MONGO_DB_NAME"))

### Prepare Dataset

In [7]:
batch_date = {'$gte': '2024-04-08', '$lte': '2024-04-09'}

In [8]:
news_articles = json.loads(json.dumps(list(db.get_collection('raw-news').find({'created_at': batch_date})), cls=utils.CustomMongoDecoder))

In [9]:
for article in news_articles:
    article['processed_content'] = ''.join(art.strip() for art in article['raw_content'])
    article['processed_content'] = article['processed_content'].replace('\xa0', ' ')

In [25]:
article['processed_content']



## Load Summarization Model

In [21]:
MODEL_NAME = 'google/bigbird-roberta-base'

In [57]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Load fine-tuned BART model for summarization
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Text to be summarized
text = article['processed_content']

# Tokenize and encode the text
inputs = tokenizer([text], max_length=1024, return_tensors="pt", truncation=True)

# Generate summary
summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, early_stopping=True)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# print summary
print("Summary with Named Entities: ", summary)

Summary with Named Entities:  South Carolina catapultedto the top of the Democratic primary in 2020, and on Monday, the president returned hoping the state – and its Black voters – can help recharge. The state’s February 3 Democratic primary is not competitive. But with many Black voters saying in polls and Democratic focus groups they feel disengaged and disenchanted with the political process, South Carolina will be the first electoral test of how deep a hole Biden is in.


In [67]:
import spacy
from collections import defaultdict

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Define the input text
text = article['processed_content']

# Process the text with spaCy
doc = nlp(text)

# Extract named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

In [68]:
# Print the named entities
processed_entities = defaultdict(list)

for entity, label in entities:
    processed_entities[label].append(entity)

print(dict(processed_entities))

{'GPE': ['South Carolina', 'South Carolina', 'South Carolina', 'South Carolina', 'Valley Forge', 'Pennsylvania', 'Charleston', 'America', 'South Carolina', 'South Carolina', 'South Carolina', 'South Carolina', 'South Carolina', 'America', 'America', 'America', 'Harboring', 'America', 'America', 'America', 'South Carolina', 'Wisconsin', 'Michigan', 'Pennsylvania', 'Georgia', 'North Carolina', 'Milwaukee', 'Detroit', 'Pittsburgh', 'Philadelphia', 'Atlanta', 'Charlotte', 'North Carolina', 'Israel', 'Gaza', 'Israel', 'Gaza', 'Pennsylvania', 'South Carolina', 'Valley Forge', 'South Carolina', 'South Carolina', 'Haley', 'America', 'South Carolina', 'Myrtle Beach', 'Haley', 'America', 'South Carolina'], 'NORP': ['Democratic', 'Democratic', 'Democratic', 'American', 'Democratic', 'South Carolinian', 'Confederates', 'American', 'Democratic', 'Republican', 'Israeli', 'Black Americans', 'American', 'American', 'Democratic', 'Democrat', 'Republicans'], 'DATE': ['2020', 'Monday', 'February 3', 'Mon