# News Summarization and Query RAG Model

Reference: https://medium.com/@vivekschaurasia/a-rag-model-for-summarizing-news-with-langchain-and-openai-301faefae7ef

Install all required libraries

In [None]:
!pip install \
GoogleNews \
bs4 \
sentence-transformers \
faiss-cpu \
groq \
langchain \
pandas \
requests

Import all required libraries

In [47]:
from sentence_transformers import SentenceTransformer
from langchain.schema import SystemMessage, HumanMessage
from GoogleNews import GoogleNews
from bs4 import BeautifulSoup
import re
import requests
import time, random
import pandas as pd
import faiss
import numpy as np

Search for news related to the topic entered by the user

In [17]:
user_request = input("Give me a topic: ")

googlenews = GoogleNews(period='7d')
googlenews.search(user_request)

all_results = []
for i in range(1, 50):
    googlenews.getpage(i)
    result = googlenews.result()

    if result:
        all_results.extend(result)

    if len(all_results) >= 100:
        break

df = pd.DataFrame(all_results).drop_duplicates(subset=['title'], keep='last').head(100)
df.reset_index(drop=True, inplace=True)

Give me a topic: GDP


In [18]:
df.columns

Index(['title', 'media', 'date', 'datetime', 'desc', 'link', 'img'], dtype='object')

In [22]:
df.loc[1,'desc']

'John Healey told papers over the weekend he had "no doubt" that the government would spend 3% on defence by 2034. NATO target: 3.5% By the Sunday.'

In [23]:
data = df.drop(columns = ['media', 'date', 'datetime', 'desc', 'img'])

In [26]:
latest_links = [re.split("&ved", link)[0] for link in df['link']]
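
The link cleanup above can be wrapped in a small helper. The sketch below is only an illustration: `clean_link` is a hypothetical name, and the percent-decoding step is an assumption that is not part of the original notebook. It is included because some Google News links come back percent-encoded (for example `watch%3Fv%3D...`). Decoding the whole URL can be too aggressive when escapes are intentional, so treat it as optional.

```python
from urllib.parse import unquote


def clean_link(link: str) -> str:
    """Drop the '&ved...' tracking suffix and decode percent-escapes (hypothetical helper)."""
    base = link.split("&ved", 1)[0]  # same effect as re.split("&ved", link)[0]
    return unquote(base)             # assumption: the target site expects a decoded URL


print(clean_link("https://www.youtube.com/watch%3Fv%3DmZgxH15STSM&ved=abc123"))
# https://www.youtube.com/watch?v=mZgxH15STSM
```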

Using the links returned by GoogleNews, we fetch each article page and extract more details from it.

In [29]:
description = []
summary_details = []  # text of the first few successfully scraped pages

for idx, link in enumerate(latest_links):

    try:
        response = requests.get(link, timeout=10)

        if response.status_code == 200:
            html_content = response.text
        else:
            print(f"Failed to retrieve: {link} (Status code: {response.status_code})")
            description.append("Failed to retrieve the webpage.")
            continue

        # Keep only the text inside <p> tags
        soup = BeautifulSoup(html_content, "html.parser")
        paragraphs = soup.find_all("p")

        page_description = " ".join([p.get_text() for p in paragraphs])
        description.append(page_description)
        if idx <= 5:
            summary_details.append(page_description)

    except requests.exceptions.RequestException as e:
        print(f"Error retrieving {link}: {e}")
        description.append("Failed to retrieve the webpage.")
        continue

    # Random pause between requests to avoid hammering the sites
    time.sleep(random.uniform(1, 3))

print(description)

data["description"] = description

Failed to retrieve: https://in.investing.com/news/india-markets-weak-in-early-trade-as-global-jitters-offset-gdp-cheer-nifty-slips-below-24600-4856878 (Status code: 403)
Failed to retrieve: https://www.forexfactory.com/news/1344344-swiss-gross-domestic-product-in-the-1st-quarter (Status code: 403)
Failed to retrieve: https://www.businessworld.in/article/capital-gains-consumption-pains-indias-fy25-gdp-reveals-uneven-growth-story-558444 (Status code: 202)
Failed to retrieve: https://www.youtube.com/watch%3Fv%3DmZgxH15STSM (Status code: 429)
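
Pages that fail to download are stored with the placeholder text "Failed to retrieve the webpage.", and that placeholder then flows into the saved files, the chunks, and the vector index. One option is to filter those entries out before saving. The sketch below assumes the placeholder string used in the loop above; `usable_descriptions` is a hypothetical helper, not part of the original notebook.

```python
FAILED_MARKER = "Failed to retrieve the webpage."  # placeholder used by the scraping loop


def usable_descriptions(descriptions):
    """Return (position, text) pairs for pages that were scraped successfully."""
    return [
        (i, text)
        for i, text in enumerate(descriptions)
        if text.strip() and text != FAILED_MARKER
    ]


sample = ["GDP grew 7.4%.", FAILED_MARKER, "   ", "Markets fell."]
print(usable_descriptions(sample))
# [(0, 'GDP grew 7.4%.'), (3, 'Markets fell.')]
```

Keeping the original position alongside the text makes it possible to map each kept description back to its row in `data`.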


In [33]:
data.loc[1,'description']

'John Healey told papers over the weekend he had “no doubt” that the government would spend 3% on defence by 2034. NATO target: 3.5% By the Sunday interviews Healey had reversed his line and said it was only an “ambition” to ramp up spending to that level. This morning on the Today Programme – a couple of hours before the Strategic Defence Review is launched – Starmer could not provide a guarantee: “What I said at the election in 2024 is that we would get to 2.5% and I was pressed time and again ‘what precise date’ and I said ‘as soon as I can be absolutely clear with a firm date, a firm commitment that we will keep to’, because I had seen the previous government make commitments about this percent or that percent with no plan behind it, I’m not going down that road. Therefore, what you can take from this is – yes – that 3% but I am not, as the Prime Minister of a Labour Government, going to make a commitment as to the precise date, until I couldn’t be sure precisely where the money is

Create a folder to store the scraped descriptions

In [35]:
import os

folder_name = "NEWS_data"
os.makedirs(folder_name, exist_ok=True)

for index, row in data.iterrows():
    with open(os.path.join(folder_name, f"description_{index + 1}.txt"), "w", encoding="utf-8") as f:
        f.write(row['description'])

print("Descriptions have been saved in the 'NEWS_data' folder.")

Descriptions have been saved in the 'NEWS_data' folder.


Load the data from the folder and chunk it using RecursiveCharacterTextSplitter

In [42]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

documents = []
for filename in os.listdir("NEWS_data"):
    if filename.endswith(".txt"):
        with open(os.path.join("NEWS_data", filename), 'r', encoding='utf-8') as file:
            documents.append(file.read())

all_chunks = []
for doc in documents:
    chunks = text_splitter.split_text(doc)
    all_chunks.extend(chunks)
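
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which splits on separators such as paragraph and line breaks and can therefore produce chunks shorter than `chunk_size`. `sliding_chunks` is a hypothetical name used only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the same idea as the 100-character overlap used above: context that straddles a boundary still appears intact in at least one chunk.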

In [44]:
all_chunks[1]

'Analysts also said that market participants will be closely monitoring key macroeconomic announcements for further cues.\nRBI’s Monetary Policy Committee (MPC) will begin the deliberations on its next bi-monthly policy on June 4 and the outcome is scheduled to be announced on June 6.\nBesides, PMI (Purchasing Managers’ Index) data for manufacturing and services sectors is also expected to be announced this week.\nAt the interbank foreign exchange, the domestic unit opened at 85.55 and gained further ground to trade at 85.43 against the greenback in initial deals, registering a rise of 12 paise from its previous close.\nThe rupee ended 7 paise lower at 85.55 against the dollar on Friday.\nMeanwhile, the dollar index, which gauges the greenback’s strength against a basket of six currencies, was trading lower by 0.05 per cent at 99.21.\nBrent crude, the global oil benchmark, rose 2.12 per cent to USD 64.11 per barrel in futures trade.'

Load embedding model

In [54]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight & fast

Embed the corpus

In [55]:
corpus_embeddings = embedding_model.encode(all_chunks, convert_to_numpy=True)

Add the embeddings to the FAISS index

In [56]:
embedding_dimension = corpus_embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dimension)  # L2 = Euclidean distance
index.add(corpus_embeddings)
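
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The pure-Python sketch below mirrors that behaviour on toy vectors so the returned `distances` and `indices` are easier to interpret. `flat_l2_search` is a hypothetical function written for illustration, not a FAISS API.

```python
def flat_l2_search(corpus, query, k):
    """Return (distances, indices) of the k corpus vectors closest to query.

    Distances are squared L2 distances, sorted from nearest to farthest.
    """
    scored = [
        (sum((c - q) ** 2 for c, q in zip(vector, query)), position)
        for position, vector in enumerate(corpus)
    ]
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_corpus = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
print(flat_l2_search(toy_corpus, [1.0, 1.0], k=2))
# ([1.0, 2.0], [1, 0])
```

A flat index compares the query against every stored vector, which is fine for a few hundred chunks; larger corpora usually call for an approximate index instead.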

Sample query

In [58]:
# Query example
query = "Summarize the news related to the GDP"
query_embedding = embedding_model.encode([query], convert_to_numpy=True)

In [63]:
len(all_chunks[0])

700

Retrieve the top 3 chunks from the FAISS index that are most similar to the query

In [59]:
# Search for the top 3 most similar chunks
k = 3
distances, indices = index.search(query_embedding, k)

Print the query and the retrieved chunks

In [60]:
# Show each retrieved chunk with its L2 distance to the query
print(f"\nQuery: {query}")
print("Top 3 most similar sentences in the corpus:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {all_chunks[idx]} (Distance: {distances[0][i]:.4f})")


Query: Summarize the news related to the GDP
Top 3 most similar sentences in the corpus:
1. momentum.   Per capita GDP in real terms increased by 5.5 per cent, reaching `1.33 lakh, while per capita Gross National Income stood at `1.31 lakh, marking a 5.4 per cent rise.    These gains suggest broad-based improvements in economic well-being.  Stay in the loop with the latest buzz! Subscribe to our newsletter  © 2025 BizzBuzz. All Rights Reserved.     Powered by  Hocalwire (Distance: 1.0093)
2. clear message that the tariff and trade scenario will continue to be uncertain and turbulent. This headwind will impact markets,” Vijayakumar added. Echoing similar concerns, Prashanth Tapse, Senior VP (Research), Mehta Equities Ltd, warned of possible turbulence on Dalal Street. “Market turbulence looms amid escalating US-China trade tensions, with President Trump accusing China of violating their agreement,” Tapse said. Still, India’s better-than-expected GDP figures offer comfort on the domesti
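
The distances printed above are squared L2 distances, which are harder to read than a similarity score. If the embeddings are unit-normalized (an assumption worth verifying for the model in use), then ||a - b||² = 2 - 2·cos(a, b), so a distance can be converted to cosine similarity. `l2_squared_to_cosine` is a hypothetical helper for that conversion.

```python
def l2_squared_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    # Follows from ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b with ||a|| = ||b|| = 1
    return 1.0 - distance / 2.0


print(f"{l2_squared_to_cosine(1.0093):.3f}")
# 0.495
```

Under that assumption, the top hit's distance of 1.0093 corresponds to a cosine similarity of roughly 0.495, a moderate match. A generic query such as "Summarize the news related to the GDP" may simply not sit close to any single chunk.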

System Prompt to guide the generation model

In [64]:
SYSTEM_PROMPT = """
You are an expert news curator and writer. Based solely on the provided context, perform the specified task (e.g., summarization or question answering).
Do not include any information not present in the context.
Write in an engaging, user-friendly news style: start with a clear, concise title, then present the content in short, factual paragraphs, maintaining a curious and informative tone.
Use factual language, avoid opinions, and maintain objectivity. Ensure clarity and cohesion throughout.
""".strip()

Load the model from Groq

In [71]:
from groq import Groq

In [70]:
os.environ['GROQ_API_KEY'] = "<your-groq-api-key>"  # placeholder: never hard-code or publish a real key
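
A key written directly into a notebook cell is exposed to anyone who can read the file. A safer pattern is to set `GROQ_API_KEY` outside the notebook (for example in the shell) and fail early if it is missing. The sketch below shows that check; `require_env` is a hypothetical helper, not part of the Groq SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell.")
    return value


# Intended usage in the notebook (not executed here):
#   client = Groq(api_key=require_env("GROQ_API_KEY"))
```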

In [89]:
model_id = "llama3-70b-8192"
client = Groq()

Function to run the model

In [80]:
SYSTEM_PROMPT = (
    "You are an expert news curator and writer. "
    "Based solely on the provided context, perform the specified task (e.g., summarization or question answering). "
    "Do not include any information not present in the context. "
    "Write in an engaging, user-friendly news style: start with a clear, concise title, then present the content in short, factual paragraphs, maintaining a curious and informative tone. "
    "Use factual language, avoid opinions, and maintain objectivity. Ensure clarity and cohesion throughout."
)


In [90]:
def run_news_task(query: str, task: str, k: int = 3):
    """
    query: Text used to retrieve the k most similar chunks from the FAISS index.
    task: A short instruction like "Summarize this in 3 paragraphs."
          or "Answer: Who are the key people mentioned?"
    k: Number of chunks to retrieve as context.
    Returns a generator (stream) of incremental chunks from the API.
    """
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)

    # Retrieve the k nearest chunks and join them into a single context string.
    # FAISS pads with -1 when fewer than k vectors are available, so skip those.
    distances, indices = index.search(query_embedding, k)
    context = "\n\n".join(all_chunks[idx] for idx in indices[0] if idx != -1)

    # 1. Build LangChain-style messages (just for readability here). We will convert to plain dicts.
    system_msg_obj = SystemMessage(content=SYSTEM_PROMPT)
    human_msg_obj = HumanMessage(
        content=(
            "Here is the context. Do not hallucinate.\n\n"
            "-----\n"
            f"{context}\n"
            "-----\n\n"
            f"{task}"
        )
    )

    # 2. Convert those into the dict format expected by the low-level client:
    messages = [
        {"role": "system", "content": system_msg_obj.content},
        {"role": "user",   "content": human_msg_obj.content},
    ]

    # 3. Call the streaming endpoint
    stream = client.chat.completions.create(
        model=model_id,
        messages=messages,
        temperature=0.2,
        top_p=1.0,
        stream=True,
    )
    return stream

Call the function and stream the response

In [92]:
stream = run_news_task(
    query="Explain the news",
    task="Summarize this news in two brief paragraphs, with a clear title at the top."
)
print("\n--- Generated Summary ---")
for chunk in stream:
    # delta.content can be None (e.g. on the final chunk), so fall back to ""
    print(chunk.choices[0].delta.content or "", end="")


--- Generated Summary ---
**Indian Stock Market Falls Despite Strong GDP Growth**

India's benchmark stock market indices, Sensex and Nifty, fell sharply in early trade despite the country's GDP growth exceeding expectations in the fourth quarter of FY25, coming in at 7.4%. The BSE Sensex was down 732.71 points at 80,718.30, while the NSE Nifty50 slipped 197.45 points to 24,553.25. Broader markets also mirrored the weakness, with most indices in the red amid a spike in volatility.

The decline in the Indian stock market is attributed to mounting global concerns, including renewed tariff concerns and uncertainty over global trade and capital flows. Despite strong domestic data, investor sentiment appears weighed down by these global headwinds. Experts believe that the threat of a wider impact on global trade and capital flows has made investors cautious, overshadowing the domestic positives.
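
Streamed chunks may carry `None` as `delta.content` (often the final chunk), so printing the field directly can emit a literal `None`. When the full text is needed as a string, the pieces can be collected while skipping empty ones. The sketch below uses stand-in objects built with `SimpleNamespace` so it runs without an API call; `collect_stream_text` and `fake_chunk` are hypothetical names, and the chunk shape (`choices[0].delta.content`) is taken from the loop above.

```python
from types import SimpleNamespace


def collect_stream_text(stream) -> str:
    """Join the text pieces of a chat-completion stream, skipping empty or None pieces."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # defensive: ignore chunks without choices
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (for demonstration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo_stream = [fake_chunk("GDP "), fake_chunk("rises."), fake_chunk(None)]
print(collect_stream_text(demo_stream))
# GDP rises.
```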