# News Article Summarization
This notebook prepares news articles for insertion into the ChromaDB vector database. It first creates a summary of each article, then identifies important named entities such as the names of politicians, locations, and relevant dates. Summaries enable small-to-big retrieval: a query is matched against the brief summary, and the match is used to fetch the full article. This approach also makes it easier to evaluate the Large Language Model (LLM). The identified entities are stored as metadata and embedded with the news articles, which helps when fine-tuning and evaluating the LLM.

For more insight into this approach, check out this YouTube video by Jerry Liu, Founder of LlamaIndex: https://youtu.be/TRjq7t2Ms5I.
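As a rough illustration of small-to-big retrieval, the sketch below searches over short summaries and returns the full article that the best-matching summary belongs to. The sample records, the field names (`id`, `summary`, `content`), and the token-overlap score are assumptions made for illustration only; in this notebook's setting the matching would be done with embeddings stored in ChromaDB rather than word overlap.

```python
import re
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
articles = [
    {"id": "a1",
     "summary": "Biden returns to South Carolina to re-engage Black voters.",
     "content": "Full text of the South Carolina article."},
    {"id": "a2",
     "summary": "Central bank holds interest rates steady amid inflation worries.",
     "content": "Full text of the interest rate article."},
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, summary):
    """Count shared tokens; a crude stand-in for embedding similarity."""
    q, s = Counter(tokenize(query)), Counter(tokenize(summary))
    return sum((q & s).values())

def small_to_big(query, records):
    """Search the small units (summaries), then return the big unit (full article)."""
    best = max(records, key=lambda r: overlap_score(query, r["summary"]))
    return best["content"]

print(small_to_big("Biden South Carolina voters", articles))
```

The point of the design is that the index only has to hold the short summaries, while the reader or the LLM still receives the full article text.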

## Config & Install Libraries
Install Hugging Face Transformers and the other required libraries.

In [None]:
!pip install -q transformers sentencepiece sentence-transformers datasets

## News Summary Pipeline

In [3]:
import os
import json
from util import utils
from pymongo import MongoClient
from dotenv import load_dotenv, find_dotenv

In [4]:
load_dotenv(find_dotenv())

True

In [5]:
# Read the connection string from the environment (.env); never hard-code credentials in the notebook
MONGO_CONN_STRING = os.getenv("MONGO_CONNECTION_STRING")

In [6]:
mongo_client = MongoClient(MONGO_CONN_STRING)
db = mongo_client.get_database(os.getenv("MONGO_DB_NAME"))

### Prepare Dataset

In [7]:
batch_date = {'$gte': '2024-04-08', '$lte': '2024-04-09'}

In [8]:
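raw_docs = list(db.get_collection('raw-news').find({'created_at': batch_date}))
# Round-trip through JSON so that MongoDB-specific values are serialized by the custom class
news_articles = json.loads(json.dumps(raw_docs, cls=utils.CustomMongoDecoder))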

In [9]:
for article in news_articles:
    article['processed_content'] = ''.join(art.strip() for art in article['raw_content'])
    article['processed_content'] = article['processed_content'].replace('\xa0', ' ')
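Assuming `raw_content` is a list of text fragments, joining the stripped fragments with an empty separator can fuse words at fragment boundaries (for example "catapultedto"). A minimal sketch of a more defensive cleaning helper follows; the function name `clean_fragments` is hypothetical and is not part of the project's `util` module.

```python
import re

def clean_fragments(fragments):
    """Join text fragments with single spaces and normalize whitespace."""
    # Replace non-breaking spaces, then strip each fragment.
    parts = [f.replace("\xa0", " ").strip() for f in fragments]
    # Drop empty fragments and join the rest with a single space.
    text = " ".join(p for p in parts if p)
    # Collapse any remaining runs of whitespace into one space.
    return re.sub(r"\s+", " ", text)

print(clean_fragments(["South Carolina catapulted ", "\xa0to the top", "", " of the primary."]))
# prints: South Carolina catapulted to the top of the primary.
```

If the fragments were split in the middle of a sentence, for example around inline links, a space separator may put a stray space before punctuation, so it is worth spot-checking a few articles either way.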

In [25]:
article['processed_content']



## Load Summarization Model

In [21]:
MODEL_NAME = 'google/bigbird-roberta-base'

In [47]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Load fine-tuned BART model for summarization
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Text to be summarized
text = article['processed_content']

# Tokenize and encode the text
inputs = tokenizer([text], max_length=1024, return_tensors="pt", truncation=True)

# Generate summary
summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, early_stopping=True)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# print summary
print("Summary with Named Entities: ", summary)


Summary with Named Entities:  South Carolina catapultedto the top of the Democratic primary in 2020, and on Monday, the president returned hoping the state – and its Black voters – can help recharge. The state’s February 3 Democratic primary is not competitive. But with many Black voters saying in polls and Democratic focus groups they feel disengaged and disenchanted with the political process, South Carolina will be the first electoral test of how deep a hole Biden is in.
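The BART cell above truncates its input at 1024 tokens, so the tail of a long article never reaches the model. One possible workaround is to split the article into overlapping chunks, summarize each chunk separately, and join the partial summaries. The sketch below covers only the splitting step. It counts whitespace-separated words as a rough proxy for tokens, and the defaults of 400 words per chunk and 50 words of overlap are assumptions, not values tuned for this dataset.

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks

demo = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(demo)
print(len(chunks), [len(c.split()) for c in chunks])
# prints: 3 [400, 400, 300]
```

Each chunk could then be passed through the same tokenizer and `model.generate` call used above, still with `truncation=True` as a safety net, because a word count only approximates the token count.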


In [49]:
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-bigpatent")

# by default encoder-attention is `block_sparse` with num_random_blocks=3, block_size=64
model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-bigpatent")

text = article['processed_content']

inputs = tokenizer(text, return_tensors='pt')
prediction = model.generate(**inputs)
prediction = tokenizer.batch_decode(prediction)

In [50]:
prediction

['<s> A method for energizing voters to re-engage with the fundamentals of the fabric of the fabric of the fabric of the fabric of the fabric of the fabric of the fabric of the fabric of the nation, the method comprising a series of speeches by candidates for and for re-establishing the integrity of the fabric of the fabric of the fabric of the fabric of the nation, the method comprising a series of speeches by candidates for and for re-establishing the integrity of the fabric of the fabric of the fabric of the fabric of the nation, the method comprising a series of speeches by candidates for and for re-establishing the integrity of the fabric of the fabric of the fabric of the fabric of the nation, the method comprising a series of speeches by candidates for and for re-establishing the integrity of the fabric of the fabric of the fabric of the fabric of the fabric of the nation, the method comprising a series of speeches by candidates for and for re-establishing the integrity of the f

In [None]:
# Input text
text = "Your long text here..."

# Tokenize input text
inputs = tokenizer(text, return_tensors='pt', max_length=4096, truncation=True)

# Generate summary
summary_ids = model.generate(inputs.input_ids, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Print the summary
print(summary)

In [19]:
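# NOTE: `summarizer` is not defined in any cell shown here; presumably it was a summarization pipeline created earlier in the session
summarizer(article['processed_content'])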

[{'summary_text': 'in this report we report on the joint investigation of the 2016 , 2016 march 1 protest and the 2016 march in support of @xmath0 and @xmath1 .'}]