# News Article Summarization
This notebook prepares news articles for insertion into the ChromaDB vector database. It first generates a summary of each article, then identifies important named entities such as politicians, locations, and relevant dates. Storing a summary alongside each article enables small-to-big retrieval: a query is matched against the brief summary, and the full article is then returned. This also makes it easier to evaluate the Large Language Model (LLM). The identified entities are stored as metadata alongside the embedded articles, which helps when fine-tuning and evaluating the LLM.

For more insight into this approach, check out this YouTube video by Jerry Liu, founder of LlamaIndex: https://youtu.be/TRjq7t2Ms5I.
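
To make the small-to-big idea concrete, here is a minimal sketch in plain Python. The `Article` class, the `token_overlap` score, and the sample texts are hypothetical stand-ins; the notebook itself relies on embeddings in ChromaDB, not on word overlap.

```python
from dataclasses import dataclass


@dataclass
class Article:
    article_id: str
    summary: str  # the "small" unit that is searched
    content: str  # the "big" unit that is returned


def token_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of lowercase tokens shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def small_to_big(query: str, articles: list[Article]) -> str:
    """Match the query against the summaries, then return the full article."""
    best = max(articles, key=lambda a: token_overlap(query, a.summary))
    return best.content


articles = [
    Article("a1", "Manchin demands bipartisan support for nominees",
            "Full text of the Manchin article ..."),
    Article("a2", "Super Tuesday primaries explained",
            "Full text of the Super Tuesday article ..."),
]
print(small_to_big("what is super tuesday", articles))
```

The search only ever touches the short summaries, while the caller receives the long article. That split is what the rest of the notebook prepares.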

## Config & Install Libraries
Install Hugging Face Transformers and the other required libraries.

In [None]:
!pip install -q transformers sentencepiece sentence-transformers datasets spacy chromadb

In [None]:
!python -m spacy download en_core_web_sm

## News Summary Pipeline

In [2]:
import os
import json
from util import utils
from pymongo import MongoClient
from dotenv import load_dotenv, find_dotenv

In [3]:
load_dotenv(find_dotenv())

True

### Parameters

In [4]:
collection_name = 'raw-news'

In [5]:
batch_date = {'$gte': '2024-04-12', '$lte': '2024-04-13'}
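
The range filter can also be built from `date` objects instead of typed by hand. The sketch below assumes `created_at` is stored as an ISO 8601 string, which is what the string bounds above imply; `build_batch_filter` is a hypothetical helper name.

```python
from datetime import date, timedelta


def build_batch_filter(start: date, days: int = 1) -> dict:
    """Return a MongoDB range filter covering `days` days starting at `start`."""
    end = start + timedelta(days=days)
    # ISO 8601 strings sort lexicographically in chronological order,
    # so string comparison works for the range check.
    return {"$gte": start.isoformat(), "$lte": end.isoformat()}


batch_date = build_batch_filter(date(2024, 4, 12))
print(batch_date)  # {'$gte': '2024-04-12', '$lte': '2024-04-13'}
```

If `created_at` also carries a time component (for example `2024-04-13T09:30:00`), such a value sorts after the bare date `2024-04-13`, so documents from that final day would fall outside the `$lte` bound.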

In [6]:
MONGO_CONN_STRING = os.getenv("MONGO_CONNECTION_STRING")  # keep credentials out of the notebook
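
Connection strings contain credentials, so they are better read from the environment (loaded from `.env` by `load_dotenv` above) than written into a cell. A small sketch of a helper that fails early when a variable is missing; `require_env` is a hypothetical name, not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Usage (assumes the variable is defined in the environment or the .env file):
# MONGO_CONN_STRING = require_env("MONGO_CONNECTION_STRING")
```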

In [7]:
mongo_client = MongoClient(MONGO_CONN_STRING)
db = mongo_client.get_database(os.getenv("MONGO_DB_NAME"))

### Prepare Dataset

In [8]:
news_articles = json.loads(json.dumps(list(db.get_collection(collection_name).find({'created_at': batch_date})), cls=utils.CustomMongoDecoder))

In [9]:
for article in news_articles:
    article['processed_content'] = ''.join(art.strip() for art in article['raw_content'])
    article['processed_content'] = article['processed_content'].replace('\xa0', ' ')
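
Joining the stripped fragments with an empty string leaves no space between the end of one paragraph and the start of the next, which is why sentences run together in the output below (for example `senator.“I’m`). A possible variant, assuming `raw_content` is a list of paragraph strings; `clean_content` is a hypothetical helper.

```python
def clean_content(raw_content: list[str]) -> str:
    """Join paragraph fragments with single spaces and normalise non-breaking spaces."""
    parts = (part.replace("\xa0", " ").strip() for part in raw_content)
    # Skip empty fragments so they do not produce double spaces.
    return " ".join(part for part in parts if part)


sample = ["  First paragraph.\xa0", "", "“Second paragraph,” he said.  "]
print(clean_content(sample))
```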

In [10]:
article['processed_content']

'Sen. Joe Manchin, a critical swing vote in the closely divided Senate, said Wednesday that from now on, he will only vote to confirm nominees who have the support of at least one Republican senator.“I’m going to be very honest with everybody, if my Democratic colleagues and friends can’t get one Republican vote, don’t count on me. You can’t make it bipartisan, don’t count on me,” said the West Virginia Democrat who haswill begin in January 2025.“I’m not leaving this place unless I can practice what I preach and I’m preaching, basically bipartisanship,” he said. “This is my little way of doing it.”Manchin’s comments came in response to questions from CNN about President Joe Biden’s nominee for the Third Circuit Court of Appeals, Adeel Mangi, who would be the first Muslin-American on a federal appeals court. Many Republicans are vehemently opposed to him and accuse him having extreme views and part of a group they call antisemitic. Top Democrats strongly defend him and are pressing for 

## Load Summarization Model

In [11]:
MODEL_NAME = 'google/bigbird-roberta-base'

In [12]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Load the BART model fine-tuned on CNN/DailyMail for summarization
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)


def summarize_article(text: str) -> str:
    # Tokenize the text; anything beyond 1024 tokens is truncated
    inputs = tokenizer([text], max_length=1024, return_tensors="pt", truncation=True)

    # Generate the summary with beam search
    summary_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True,
    )

    # Decode the generated token ids back into text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
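
Because the tokenizer truncates at 1024 tokens, the end of a long article never reaches the model. One common workaround is to summarize the article in chunks and join the partial summaries. The sketch below is an illustration under assumptions: it splits on words as a rough proxy for tokens, the 600-word limit is arbitrary, and `first_five_words` is a stand-in so the example runs without loading a model.

```python
from typing import Callable


def split_into_chunks(text: str, max_words: int = 600) -> list[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(text: str, summarize: Callable[[str], str], max_words: int = 600) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in split_into_chunks(text, max_words))


def first_five_words(chunk: str) -> str:
    """Stand-in summarizer for demonstration: keeps the first five words."""
    return " ".join(chunk.split()[:5])


print(len(split_into_chunks("word " * 1300)))  # 3 chunks: 600 + 600 + 100 words
print(summarize_long("a b c d e f g", first_five_words, max_words=4))
```

In the notebook, `summarize_article` could be passed as the `summarize` argument in place of the stand-in.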

### Named Entity Recognition

In [13]:
REQUIRED_FIELDS = ['PERSON', 'GPE', 'NORP', 'EVENT', 'ORG']

In [14]:
import spacy
from collections import defaultdict

# Load the English language model once, not on every call
nlp = spacy.load("en_core_web_sm")


def perform_ner(text: str):
    # Process the text with spaCy
    doc = nlp(text)

    # Extract named entities as (text, label) pairs
    return [(ent.text, ent.label_) for ent in doc.ents]

In [15]:
# postprocess the named entities to select the required entity tags
def postprocess_entities(entities):
    processed_entities = defaultdict(set)
    
    for entity, label in entities:
        if label in REQUIRED_FIELDS:
            processed_entities[label].add(entity)
    processed_entities = {key: list(value) for key, value in processed_entities.items()}
    return processed_entities
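
The effect of the post-processing step is easiest to see on a small hand-written input. The sketch below is a self-contained variant (`group_entities` is a hypothetical name) that also sorts each entity list, since converting a set to a list gives no guaranteed order. The sample pairs are made up for illustration and are not spaCy output.

```python
from collections import defaultdict

REQUIRED_FIELDS = {"PERSON", "GPE", "NORP", "EVENT", "ORG"}


def group_entities(entities: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (text, label) pairs by label, dropping duplicates and unwanted labels."""
    grouped = defaultdict(set)
    for text, label in entities:
        if label in REQUIRED_FIELDS:
            grouped[label].add(text)
    # Sorting makes the output stable across runs.
    return {label: sorted(texts) for label, texts in grouped.items()}


sample = [("Joe Manchin", "PERSON"), ("Senate", "ORG"), ("Wednesday", "DATE"),
          ("West Virginia", "GPE"), ("Joe Manchin", "PERSON")]
print(group_entities(sample))
# {'PERSON': ['Joe Manchin'], 'ORG': ['Senate'], 'GPE': ['West Virginia']}
```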

## Perform Summarization and NER on News Articles

In [16]:
from bson import ObjectId

In [None]:
# for news in news_articles:
#     summary = summarize_article(news['processed_content'])
#     entities = postprocess_entities(perform_ner(news['processed_content']))

#     news['news_summary'] = summary
#     news['entities'] = entities

#     # filter criteria
#     filter_criteria = {'_id': ObjectId(news['_id'])}
    
#     # Define the update operation
#     update_data = {
#         '$set': {
#             'processed_news_content': news['processed_content'],
#             'news_summary': summary,
#             'entities': entities
#         }
#     }
    
#     # Update the Mongo document
#     result = db.get_collection('raw-news').update_one(filter_criteria, update_data)

## Save content and metadata in ChromaDB

In [17]:
import chromadb
from chromadb.utils import embedding_functions

In [18]:
chroma_client = chromadb.HttpClient()

In [26]:
# Fetch the collection if it exists, otherwise create it
collection = chroma_client.get_or_create_collection('us-election-gpt')

In [22]:
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

In [None]:
for news in news_articles:
    collection.add(
        documents=[news['processed_content']],
        # The embedding function expects a list of documents; a bare string
        # would be iterated character by character.
        embeddings=sentence_transformer_ef([news['processed_content']]),
        metadatas=[{'entities': json.dumps(news['entities']),
                    'summary': news['news_summary'],
                    'source': news['source'],
                    # 'publication_date': news['publication_date']
                   }],
        ids=[str(news['_id'])]
    )
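
Adding articles one at a time means one request per article. As a sketch of an alternative, the ids, documents, and metadata can be collected first and sent in a single call. `build_chroma_payload` is a hypothetical helper, and the `.get` defaults are an assumption for articles that lack a summary or entities.

```python
import json


def build_chroma_payload(articles: list[dict]) -> dict:
    """Collect ids, documents and metadata for a single batched `collection.add` call."""
    payload = {"ids": [], "documents": [], "metadatas": []}
    for article in articles:
        payload["ids"].append(str(article["_id"]))
        payload["documents"].append(article["processed_content"])
        payload["metadatas"].append({
            # Metadata values are kept as plain strings, so the entity dict is stored as JSON text.
            "entities": json.dumps(article.get("entities", {})),
            "summary": article.get("news_summary", ""),
            "source": article.get("source", ""),
        })
    return payload


sample_articles = [
    {"_id": "1", "processed_content": "Text one", "entities": {"GPE": ["Texas"]},
     "news_summary": "Summary one", "source": "cnn"},
    {"_id": "2", "processed_content": "Text two"},
]
payload = build_chroma_payload(sample_articles)
print(payload["ids"], payload["metadatas"][1])
```

The result could then be passed on with `collection.add(**payload, embeddings=sentence_transformer_ef(payload["documents"]))`.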

### Test ChromaDB Querying

In [41]:
TEST_QUERY = """
    What's the latest in Texas?
"""

In [42]:
query_entities = postprocess_entities(perform_ner(TEST_QUERY))
query_embeddings = sentence_transformer_ef(TEST_QUERY)

In [43]:
query_entities

{'GPE': ['Texas']}
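
The extracted query entities are not used by the query that follows. One way they could be put to work is as a document filter, so that only articles mentioning an entity are returned. The sketch assumes Chroma's `where_document` filter syntax with the `$contains` and `$or` operators; `entities_to_where_document` is a hypothetical helper.

```python
from typing import Optional


def entities_to_where_document(entities: dict[str, list[str]]) -> Optional[dict]:
    """Build a filter matching documents that contain any of the entity strings."""
    names = sorted({name for values in entities.values() for name in values})
    if not names:
        return None  # no entities found: query without a filter
    if len(names) == 1:
        return {"$contains": names[0]}
    return {"$or": [{"$contains": name} for name in names]}


print(entities_to_where_document({"GPE": ["Texas"]}))
# {'$contains': 'Texas'}
```

The returned dict would be passed as `where_document=...` to `collection.query`. Note that `$contains` is a substring match on the stored document text, not on the `entities` metadata field.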

In [49]:
collection.query(query_embeddings=query_embeddings, n_results=1)

{'ids': [['66195e15fd5a60db36d7bf94'],
  ['66195e15fd5a60db36d7bf94'],
  ['66195e15fd5a60db36d7bf94'],
  ['66195e15fd5a60db36d7bf94'],
  ['66195e15fd5a60db36d7bf94'],
  ['66195e19fd5a60db36d7bf96'],
  ['661944e002d805322e3a6a40'],
  ['6619466f02d805322e3a6a5a'],
  ['6619467b02d805322e3a6a5b'],
  ['66195e15fd5a60db36d7bf94'],
  ['6619441802d805322e3a6a29'],
  ['66195e15fd5a60db36d7bf94'],
  ['6619467b02d805322e3a6a5b'],
  ['661944e002d805322e3a6a40'],
  ['66195e00fd5a60db36d7bf91'],
  ['66195e15fd5a60db36d7bf94'],
  ['66195e18fd5a60db36d7bf95'],
  ['6619466f02d805322e3a6a5a'],
  ['6619467b02d805322e3a6a5b'],
  ['66195e00fd5a60db36d7bf91'],
  ['6619441802d805322e3a6a29'],
  ['6619467b02d805322e3a6a5b'],
  ['66195e15fd5a60db36d7bf94'],
  ['66195e2afd5a60db36d7bf99'],
  ['6619468902d805322e3a6a5e'],
  ['66195e15fd5a60db36d7bf94'],
  ['6619467b02d805322e3a6a5b'],
  ['66195e00fd5a60db36d7bf91'],
  ['6619464a02d805322e3a6a48'],
  ['6619466f02d805322e3a6a5a'],
  ['6619441802d805322e3a6a29'],
 

In [46]:
len(collection.query(query_embeddings=query_embeddings, n_results=1)['ids'])

33
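
The count of 33 equals the number of characters in `TEST_QUERY`. This suggests that the embedding function iterated over the bare string one character at a time and returned one embedding, and therefore one result list, per character. A plain-Python sketch of the effect follows. That `sentence_transformer_ef` converts its input with `list(...)` is an assumption about the installed chromadb version.

```python
TEST_QUERY = """
    What's the latest in Texas?
"""

# Iterating over a bare string yields characters, not documents.
print(len(list(TEST_QUERY)))    # 33: one item per character
print(len(list([TEST_QUERY])))  # 1: one item for the whole query

# Wrapping the stripped query in a list gives the embedding function one document.
query_documents = [TEST_QUERY.strip()]
print(query_documents)
```

Under that assumption, calling `sentence_transformer_ef([TEST_QUERY.strip()])` should produce a single embedding, and `collection.query` should then return a single list of ids.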

## LangChain

In [52]:
!pip install langchain langchain-chroma

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting langchain-chroma
  Downloading langchain_chroma-0.1.0-py3-none-any.whl.metadata (1.3 kB)
Downloading langchain_chroma-0.1.0-py3-none-any.whl (8.5 kB)
Installing collected packages: langchain-chroma
Successfully installed langchain-chroma-0.1.0


In [65]:
import chromadb
from langchain_chroma import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

In [56]:
chroma_client = chromadb.HttpClient()

In [66]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

In [67]:
langchain_chroma = Chroma(
    client=chroma_client,
    collection_name="us-election-gpt",
    embedding_function=embedding_function,
)

In [68]:
print("There are", langchain_chroma._collection.count(), "items in the collection")

There are 78 items in the collection


In [72]:
query = "How do you think michigan will vote this coming election"
docs = langchain_chroma.similarity_search(query)
print(docs[0].page_content)

It’s an important new phase of, when the early contests are over and voters from multiple states cast ballots in primaries timed to occur on the same date.It’s called “Super Tuesday,” and it is important even though neither Democratic President Joe Biden nor former President Donald Trump has had to sweat the competition this year. Primaries on Tuesday may offer the final opportunity for former South Carolina Gov. Nikki Haley’s quixotic and lackluster effort to challenge Trump for the Republican presidential nomination.Instead of a single primary or caucus, Super Tuesday lumps together 15 contests for Republicans and 16 contests for Democrats spread across the country.More than a third ofare at stake along with an equally large portion of Democratic delegates. Biden is undefeated in primary contests this year, and Trump has lost only one.A large sample of the country will have contests on Super Tuesday – red states and blue states from the North, South, East and West.The primaries at st
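
Printing only `docs[0]` hides how close the match was and what the runners-up look like. A short sketch that returns the top matches with their scores, assuming the vector store exposes `similarity_search_with_score(query, k=...)` as the LangChain `Chroma` wrapper does; `preview_matches` is a hypothetical helper.

```python
def preview_matches(store, query: str, k: int = 3, width: int = 80) -> list[tuple[float, str]]:
    """Return (score, preview) pairs for the top-k matches of a vector store."""
    results = store.similarity_search_with_score(query, k=k)
    # Each result is a (document, score) pair; keep only the start of the text.
    return [(score, doc.page_content[:width]) for doc, score in results]


# Usage with the store created above:
# for score, preview in preview_matches(langchain_chroma, query):
#     print(f"{score:.3f}  {preview}")
```

With Chroma the score is a distance, so lower values indicate closer matches.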

### Quantization
Quantization reduces the memory footprint and speeds up inference while retaining acceptable model performance. For quantization, we will use bitsandbytes.
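
To show where the memory saving comes from, here is a toy absmax quantization in plain Python. It illustrates the general idea only and is not how bitsandbytes is implemented: each float is mapped to an 8-bit integer plus one shared scale, so a 4-byte float32 weight shrinks to 1 byte at the cost of a small rounding error.

```python
def quantize_absmax(values: list[float]) -> tuple[list[int], float]:
    """Map floats to the int8 range by scaling the largest magnitude to 127."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 if all values are zero
    return [round(v / scale) for v in values], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_absmax(weights)
print(q)                     # [50, -127, 1, 100]
print(dequantize(q, scale))  # approximately the original weights
```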

In [74]:
!pip install bitsandbytes

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl.metadata (9.9 kB)
Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.42.0
