# News Article Summarization
This notebook prepares news articles for insertion into the ChromaDB vector database. It first creates a summary of each article, then identifies important named entities such as the names of politicians, locations, and relevant dates. Storing a summary with each article enables small-to-big retrieval: a query is matched against the brief summary, and the match is then used to fetch the full article. This approach also helps us evaluate the large language model (LLM) more effectively. The identified entities are used as metadata and embedded with the news articles, which aids in fine-tuning and evaluating the LLM.

For more insight into this approach, check out this YouTube video by Jerry Liu, founder of LlamaIndex: https://youtu.be/TRjq7t2Ms5I.
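
To make the small-to-big idea concrete, here is a minimal sketch using only the standard library. It is not the pipeline used in this notebook: the bag-of-words similarity is a toy stand-in for real embeddings and a vector store, and the `articles` and `summaries` records, as well as the `small_to_big` helper, are hypothetical names introduced for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a toy stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical records: each short summary shares an id with its full article.
articles = {
    "a1": "Full article text about the House GOP retreat in West Virginia.",
    "a2": "Full article text about a central bank interest rate decision.",
}
summaries = {
    "a1": "Many Republicans plan to skip the House GOP retreat.",
    "a2": "The central bank kept interest rates unchanged.",
}


def small_to_big(query, top_k=1):
    """Rank the small summaries against the query, then return the big articles."""
    q = bag_of_words(query)
    ranked = sorted(
        summaries,
        key=lambda doc_id: cosine(q, bag_of_words(summaries[doc_id])),
        reverse=True,
    )
    return [(doc_id, articles[doc_id]) for doc_id in ranked[:top_k]]


best_id, full_text = small_to_big("Which Republicans are skipping the retreat?")[0]
```

The search only ever touches the short summaries; the full article is looked up by id once a summary has matched, which is the "small" and "big" of the technique.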

## Config & Install Libraries
Install Hugging Face Transformers and the other required libraries.

In [1]:
!pip install -q transformers sentencepiece sentence-transformers datasets

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scrapydweb 1.5.0 requires click==7.1.2, but you have click 8.1.7 which is incompatible.
scrapydweb 1.5.0 requires idna==2.7, but you have idna 3.6 which is incompatible.
scrapydweb 1.5.0 requires pytz==2018.9, but you have pytz 2024.1 which is incompatible.
scrapydweb 1.5.0 requires w3lib==2.0.0, but you have w3lib 2.1.2 which is incompatible.

## News Summary Pipeline

In [9]:
import os
import json
from util import utils
from pymongo import MongoClient
from dotenv import load_dotenv, find_dotenv

In [4]:
load_dotenv(find_dotenv())

True

In [5]:
MONGO_CONN_STRING = os.getenv("MONGO_CONNECTION_STRING")  # read from .env; do not hard-code credentials in the notebook

In [6]:
mongo_client = MongoClient(MONGO_CONN_STRING)
db = mongo_client.get_database(os.getenv("MONGO_DB_NAME"))

### Prepare Dataset

In [7]:
batch_date = {'$gte': '2024-04-05', '$lte': '2024-04-06'}
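
Because both bounds of `batch_date` are strings, the range filter compares `created_at` lexicographically. The sketch below mimics that comparison in plain Python to show which values fall inside the batch. It assumes `created_at` is stored as a string; the storage format is not shown in this notebook, and `in_batch` and the timestamp values are hypothetical.

```python
def in_batch(created_at, batch_date):
    """Mimic a string range filter with plain lexicographic comparison."""
    return batch_date['$gte'] <= created_at <= batch_date['$lte']


batch_date = {'$gte': '2024-04-05', '$lte': '2024-04-06'}

# Date-only values on either bound are included.
date_only = in_batch('2024-04-06', batch_date)

# A value with a time component sorts after its date-only prefix,
# so anything stamped on the upper-bound day is excluded.
with_time = in_batch('2024-04-06T08:30:00', batch_date)

# The lower-bound day is fully included.
first_day = in_batch('2024-04-05T23:00:00', batch_date)
```

If `created_at` does carry a time component, setting the upper bound to the following day and using `$lt` instead of `$lte` would cover the whole of the intended last day.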

In [10]:
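raw_docs = list(db.get_collection('raw-news').find({'created_at': batch_date}))
# Round-trip through JSON so the documents end up as plain Python dicts and lists
news_articles = json.loads(json.dumps(raw_docs, cls=utils.CustomMongoDecoder))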
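
The `json.dumps`/`json.loads` round trip relies on `utils.CustomMongoDecoder`, whose source is not shown here. Since it is passed as `cls` to `json.dumps`, it presumably acts as a `json.JSONEncoder` subclass that converts MongoDB-specific values into JSON-friendly ones. The sketch below is a guess at what such a class might look like, not the project's actual implementation; `MongoJSONEncoder` and `FakeObjectId` are hypothetical names, and `FakeObjectId` stands in for `bson.ObjectId` so the example runs without a database.

```python
import json
from datetime import datetime


class MongoJSONEncoder(json.JSONEncoder):
    """Hypothetical stand-in for utils.CustomMongoDecoder."""

    def default(self, obj):
        # json.dumps calls default() only for values it cannot serialize itself.
        if isinstance(obj, datetime):
            return obj.isoformat()
        # Fall back to str() for other types, such as an ObjectId.
        return str(obj)


class FakeObjectId:
    """Stand-in for bson.ObjectId so the sketch has no external dependency."""

    def __init__(self, hex_id):
        self.hex_id = hex_id

    def __str__(self):
        return self.hex_id


doc = {
    '_id': FakeObjectId('661000000000000000000001'),
    'created_at': datetime(2024, 4, 5, 12, 0),
}

# Same round trip as in the notebook: encode to JSON text, then parse back.
plain = json.loads(json.dumps([doc], cls=MongoJSONEncoder))
```

After the round trip every value is a plain string, number, list, or dict, so the records can be passed to later steps without importing any MongoDB types.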

In [15]:
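for article in news_articles:
    # raw_content is expected to be a list of text fragments; strip each one and concatenate them without a separator
    article['processed_content'] = ''.join(part.strip() for part in article['raw_content'])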
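
The sample output for this step has sentences running together at fragment boundaries (for example `drama.Fewer`), which is what joining on an empty string produces. The short sketch below contrasts that with joining on a single space. The `raw_content` list here is made-up data, and the assumption that the real field is a list of paragraph strings is inferred from the output rather than confirmed.

```python
# Made-up fragments with stray leading and trailing whitespace
raw_content = ['  First paragraph ends here.', 'Second paragraph starts here.  ']

# Joining on '' fuses the last word of one fragment with the first word of the next.
fused = ''.join(part.strip() for part in raw_content)

# Joining on ' ' keeps a word boundary between fragments.
spaced = ' '.join(part.strip() for part in raw_content)
```

Whether the separator matters depends on the downstream tokenizer; a fused token such as `here.Second` may be split differently from two separate words.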

In [16]:
article['processed_content']

'Many Republicans plan to skip the House GOP retreat as they grumble about both the location and the idea of spending time with one another, with tensions still running high inside the party in the wake of their unprecedented speakership drama.Fewer than 100 Republicans have RSVP’d to attend the retreat, which is less than half of the entire conference, according to a GOP source familiar with the attendance sheet.The retreat is scheduled to take place Wednesday and Thursday in West Virginia.Publicly, Republicans have cited a litany of reasons for not attending: from having to tend to reelection races to scheduling conflicts. GOP Rep. Nancy Mace of South Carolina, for example, is scheduled to appear on “Real Time with Bill Maher” later this week. Meanwhile, when asked if he was attending, GOP Rep. Kelly Armstrong of North Dakota told CNN: “No way, I have to run for governor.” And Rep. Tim Burchett of Tennessee quipped:\xa0“I don’t retreat, I move forward! I got a farm to run.”But privat