# News Article Summarization
This notebook prepares news articles for insertion into the ChromaDB vector database. It first creates a summary of each article, then identifies important named entities such as politicians, locations, and relevant dates. The summary enables small-to-big retrieval: a query is matched against the brief overview, and the full article is then fetched from it. This makes it easier to evaluate the Large Language Model (LLM). The identified entities are stored as metadata alongside the embedded articles, which helps when fine-tuning and evaluating the LLM.

For more insight into this approach, check out this YouTube video by Jerry Liu, Founder of LlamaIndex: https://youtu.be/TRjq7t2Ms5I.
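To make the small-to-big idea concrete, here is a minimal, library-free sketch. It is an illustration under stated assumptions, not the notebook's retrieval code: the `ARTICLES` store, the `small_to_big` helper, and the token-overlap score are all hypothetical stand-ins, and a real pipeline would compare embeddings in ChromaDB instead.

```python
import re

# Hypothetical in-memory store: a short summary plus the full article per ID.
ARTICLES = {
    "a1": {
        "summary": "Senator says he will only back nominees with bipartisan support.",
        "content": "(full article text about Senate nominations)",
    },
    "a2": {
        "summary": "City council approves new budget for public transport.",
        "content": "(full article text about the transport budget)",
    },
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def small_to_big(query, articles):
    """Match the query against the small summaries, then return the big article."""
    query_tokens = tokenize(query)

    def score(article_id):
        # Jaccard overlap stands in for embedding similarity (an assumption).
        summary_tokens = tokenize(articles[article_id]["summary"])
        union = query_tokens | summary_tokens
        return len(query_tokens & summary_tokens) / len(union) if union else 0.0

    best_id = max(articles, key=score)
    return best_id, articles[best_id]["content"]


best_id, full_text = small_to_big("bipartisan nominees senator", ARTICLES)
print(best_id, full_text)
```

The search only ever touches the summaries, while the caller receives the full article, which is the property the summaries produced in this notebook are meant to provide.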

## Config & Install Libraries
Install Hugging Face Transformers and the other required libraries.

In [None]:
!pip install -q transformers sentencepiece sentence-transformers datasets spacy chromadb

In [None]:
!python -m spacy download en_core_web_sm

## News Summary Pipeline

In [2]:
import os
import json
from util import utils
from pymongo import MongoClient
from dotenv import load_dotenv, find_dotenv

In [3]:
load_dotenv(find_dotenv())

True

### Parameters

In [4]:
collection_name = 'raw-news'

In [5]:
batch_date = {'$gte': '2024-04-12', '$lte': '2024-04-13'}

In [6]:
MONGO_CONN_STRING = os.getenv("MONGO_CONNECTION_STRING")  # read from .env; never hard-code credentials

In [7]:
mongo_client = MongoClient(MONGO_CONN_STRING)
db = mongo_client.get_database(os.getenv("MONGO_DB_NAME"))

### Prepare Dataset

In [8]:
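# Fetch the batch, then round-trip it through JSON using the project's custom encoder class
# (presumably so that BSON types such as ObjectId become plain Python values)
cursor = db.get_collection(collection_name).find({'created_at': batch_date})
news_articles = json.loads(json.dumps(list(cursor), cls=utils.CustomMongoDecoder))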

In [9]:
for article in news_articles:
    article['processed_content'] = ''.join(art.strip() for art in article['raw_content'])
    article['processed_content'] = article['processed_content'].replace('\xa0', ' ')

In [10]:
article['processed_content']

'Sen. Joe Manchin, a critical swing vote in the closely divided Senate, said Wednesday that from now on, he will only vote to confirm nominees who have the support of at least one Republican senator.“I’m going to be very honest with everybody, if my Democratic colleagues and friends can’t get one Republican vote, don’t count on me. You can’t make it bipartisan, don’t count on me,” said the West Virginia Democrat who haswill begin in January 2025.“I’m not leaving this place unless I can practice what I preach and I’m preaching, basically bipartisanship,” he said. “This is my little way of doing it.”Manchin’s comments came in response to questions from CNN about President Joe Biden’s nominee for the Third Circuit Court of Appeals, Adeel Mangi, who would be the first Muslin-American on a federal appeals court. Many Republicans are vehemently opposed to him and accuse him having extreme views and part of a group they call antisemitic. Top Democrats strongly defend him and are pressing for 
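In the output above, sentences from adjacent paragraphs run together (for example `senator.“I’m`), because the paragraphs are joined with an empty string. A possible alternative is sketched below, assuming `raw_content` is a list of paragraph strings; the helper name `join_paragraphs` is hypothetical and not part of the original pipeline.

```python
def join_paragraphs(paragraphs):
    """Clean each paragraph and join them with a single space."""
    # Replace non-breaking spaces first, then trim surrounding whitespace.
    cleaned = (p.replace("\xa0", " ").strip() for p in paragraphs)
    # Skip paragraphs that are empty after cleaning.
    return " ".join(p for p in cleaned if p)


sample = ["First sentence.\xa0", "  “Second sentence,” he said. ", ""]
print(join_paragraphs(sample))
# First sentence. “Second sentence,” he said.
```

Keeping a space between paragraphs preserves sentence boundaries, which may matter for both the summarizer and the entity recognizer.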

## Load Summarization Model

In [11]:
MODEL_NAME = 'facebook/bart-large-cnn'

In [12]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Load the BART model fine-tuned on CNN/DailyMail for summarization
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)


def summarize_article(text: str) -> str:
    # Tokenize and encode the text, truncating to the model's 1024-token input limit
    inputs = tokenizer([text], max_length=1024, return_tensors="pt", truncation=True)

    # Generate the summary with beam search
    summary_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True,
    )

    # Decode the generated token IDs back into text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

## Named Entity Recognition

In [13]:
REQUIRED_FIELDS = ['PERSON', 'GPE', 'NORP', 'EVENT', 'ORG']

In [14]:
import spacy
from collections import defaultdict

# Load the English language model once, rather than on every call
nlp = spacy.load("en_core_web_sm")

def perform_ner(text: str):
    # Process the text with spaCy
    doc = nlp(text)

    # Extract named entities as (text, label) pairs
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    return entities

In [15]:
# postprocess the named entities to select the required entity tags
def postprocess_entities(entities):
    processed_entities = defaultdict(set)
    
    for entity, label in entities:
        if label in REQUIRED_FIELDS:
            processed_entities[label].add(entity)
    processed_entities = {key: list(value) for key, value in processed_entities.items()}
    return processed_entities
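To illustrate the filtering and de-duplication step, the sketch below runs the same logic on hand-written `(text, label)` tuples in the shape `perform_ner` returns. The sample tuples are made up for illustration and are not real spaCy output; the `group_entities` name and the sorting of each list are additions so that the printed result is deterministic.

```python
from collections import defaultdict

REQUIRED_FIELDS = ['PERSON', 'GPE', 'NORP', 'EVENT', 'ORG']

# Hand-written sample in the shape returned by perform_ner (not real spaCy output).
sample_entities = [
    ("Joe Manchin", "PERSON"),
    ("Joe Manchin", "PERSON"),    # duplicate mention, collapsed by the set
    ("West Virginia", "GPE"),
    ("Wednesday", "DATE"),        # label not in REQUIRED_FIELDS, dropped
    ("Senate", "ORG"),
]


def group_entities(entities):
    """Group entity texts by label, keeping only the required labels."""
    grouped = defaultdict(set)
    for text, label in entities:
        if label in REQUIRED_FIELDS:
            grouped[label].add(text)
    # Sorting is an addition here so the output order is stable.
    return {label: sorted(texts) for label, texts in grouped.items()}


print(group_entities(sample_entities))
# {'PERSON': ['Joe Manchin'], 'GPE': ['West Virginia'], 'ORG': ['Senate']}
```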

## Perform Summarization and NER on News Articles

In [16]:
from bson import ObjectId

In [None]:
for news in news_articles:
    summary = summarize_article(news['processed_content'])
    entities = postprocess_entities(perform_ner(news['processed_content']))

    # filter criteria
    filter_criteria = {'_id': ObjectId(news['_id'])}
    
    # Define the update operation
    update_data = {
        '$set': {
            'processed_news_content': news['processed_content'],
            'news_summary': summary,
            'entities': entities
        }
    }
    
    # Update the Mongo document
    result = db.get_collection(collection_name).update_one(filter_criteria, update_data)

## Save Content and Metadata in ChromaDB
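The source does not show the code for this step. As a hedged sketch of the preparation it would likely need, the helper below flattens the entity dictionary into scalar metadata values, on the assumption that ChromaDB metadata values must be strings, numbers, or booleans rather than lists. The function name `to_chroma_metadata` and the choice of a comma-separated string are hypothetical.

```python
def to_chroma_metadata(entities, created_at):
    """Flatten {label: [entity, ...]} into a dict of scalar metadata values."""
    metadata = {"created_at": created_at}
    for label, values in entities.items():
        if values:
            # Join each entity list into one string (assumption: lists are not accepted).
            metadata[label.lower()] = ", ".join(sorted(values))
    return metadata


entities = {"PERSON": ["Joe Manchin", "Adeel Mangi"], "GPE": ["West Virginia"], "EVENT": []}
print(to_chroma_metadata(entities, "2024-04-12"))
# {'created_at': '2024-04-12', 'person': 'Adeel Mangi, Joe Manchin', 'gpe': 'West Virginia'}
```

A dictionary like this could then be passed, one per article, in the `metadatas` argument of a ChromaDB collection's `add` call, with the summary or processed content as the document.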