# Intro to RAG

We will use RAG to enhance the data we send off to the LLM to assess

## Install:

 pip install nltk spacy textblob scikit-learn


### Download English model for Spacy
 python -m spacy download en_core_web_sm  
 



### Next step:
Let's look at some of our data in news.json

In [45]:
import json
import random

# Load the cleaned dataset
with open("news.json", "r", encoding="utf-8") as f:
    news_articles = json.load(f)

# 🔹 CONFIGURATION: Set to True to remove short articles (otherwise, it's OFF)
remove_short_articles = False  # Change to True to enable filtering
min_content_length = 300  # Change this to filter out articles with content length < X

# Count total articles
total_articles = len(news_articles)
print(f"📰 Total articles in news.json: {total_articles}")

# 🔹 Optionally Remove Short Articles
if remove_short_articles:
    news_articles = [article for article in news_articles if len(article["content"]) >= min_content_length]
    print(f"🗑 Removed short articles. Remaining articles: {len(news_articles)}")

    # Save filtered articles back to news.json
    with open("news.json", "w", encoding="utf-8") as f:
        json.dump(news_articles, f, indent=4)
    print("✅ Filtered news.json saved.")

# Randomly select an article index
if len(news_articles) > 0:
    article_index = random.randint(0, len(news_articles) - 1)

    # Function to format text with line breaks every 100 characters
    def format_text(text, max_chars=100):
        """Inserts line breaks every `max_chars` characters for better readability."""
        return '\n'.join([text[i:i+max_chars] for i in range(0, len(text), max_chars)])

    # Get the title and content
    title = news_articles[article_index]['title']
    content = news_articles[article_index]['content']
    content_length = len(content)

    # Print the formatted article with content length
    print(f"\n🔹 **Title:** {title}\n")
    print(f"🔹 **Content Length:** {content_length} characters\n")
    print(f"🔹 **Content:**\n")
    print(format_text(content))
else:
    print("⚠️ No articles left after filtering.")


📰 Total articles in news.json: 5

🔹 **Title:** Theatre artist calls upon younger generation to nurture theatre

🔹 **Content Length:** 9048 characters

🔹 **Content:**

February 14, 2025e-Paper
Subscribe
Live Now Agriculture
Books
Books
Hindi Belt
The Hindu On Books Bo
oks of the week, reviews, excerpts, new titles and features.
SEE ALL NEWSLETTERS
Business
Business A
gri-Business Economy Industry Markets Budget
Children
Cities
Cities Bengaluru Chennai Coimbatore Del
hi Hyderabad Kochi Kolkata Kozhikode Madurai Mangaluru Mumbai Puducherry Thiruvananthapuram Tiruchir
apalli Vijayawada Visakhapatnam
Data
Data
Data Point Podcast
Data Point Decoding the headlines with 
facts, figures, and numbers
SEE ALL NEWSLETTERS
Ebook
Education
Education Careers Colleges Schools
E
lections
Entertainment
Entertainment Art Dance Movies Music Reviews Theatre
First Day First Show New
s and reviews from the world of cinema and streaming.
SEE ALL NEWSLETTERS
Environment
Food
Food Dini
ng Features Guides Recipe

## Text cleaning
The text inside news.json isn't very clean, let's fix that

In [39]:
import json

# Load the news dataset
with open("news.json", "r", encoding="utf-8") as f:
    news_articles = json.load(f)

# Filter out entries with NULL or empty content
cleaned_articles = [
    article for article in news_articles
    if "content" in article and isinstance(article["content"], str) and article["content"].strip()
]

# Save cleaned dataset
with open("cleaned_news.json", "w", encoding="utf-8") as f:
    json.dump(cleaned_articles, f, indent=4)

print(f"✅ Removed NULL values. {len(cleaned_articles)} valid articles saved in 'cleaned_news.json'.")


✅ Removed NULL values. 5 valid articles saved in 'cleaned_news.json'.


In [41]:
import json
import re

# Load the dataset
with open("cleaned_news.json", "r", encoding="utf-8") as f:
    news_articles = json.load(f)

def clean_text(text):
    """Cleans and preprocesses text by removing unwanted symbols, metadata, and fixing formatting."""
    if not isinstance(text, str):
        return ""

    # Remove "By [Author Name]" patterns (e.g., "By JOSH FUNK AP Business Writer")
    text = re.sub(r"By\s[A-Z\s]+(AP|Reuters|BBC|CNN|Business Writer|Correspondent)?", "", text, flags=re.IGNORECASE)

    # Remove Moon Alerts, Horoscope-like structures
    text = re.sub(r"Moon alert\s.*?\.", "", text, flags=re.IGNORECASE)

    # Remove "\r\n" and excessive whitespace
    text = re.sub(r"\s+", " ", text.strip())

    # Remove text in square brackets [Like this]
    text = re.sub(r"\[.*?\]", "", text)

    # Remove any remaining excessive punctuation
    text = re.sub(r"[^\w\s.,!?'-]", "", text)

    return text.strip()

# Apply cleaning to each article's content
for article in news_articles:
    if "content" in article:
        article["content"] = clean_text(article["content"])

# Save cleaned data back into cleaned_news.json
with open("cleaned_news.json", "w", encoding="utf-8") as f:
    json.dump(news_articles, f, indent=4)

print("✅ Text cleaning complete. `cleaned_news.json` has been updated.")

print(f"✅ We now have {len(cleaned_articles)} valid articles saved in 'cleaned_news.json'.")


✅ Text cleaning complete. `cleaned_news.json` has been updated.
✅ We now have 5 valid articles saved in 'cleaned_news.json'.


Doing Analysis on the texts again before we keep going...

In [42]:
import json
import random

# Load the cleaned dataset
with open("cleaned_news.json", "r", encoding="utf-8") as f:
    news_articles = json.load(f)

# 🔹 CONFIGURATION: Set to True to remove short articles (otherwise, it's OFF)
remove_short_articles = False  # Change to True to enable filtering
min_content_length = 300  # Change this to filter out articles with content length < X

# Count total articles
total_articles = len(news_articles)
print(f"📰 Total articles in cleaned_news: {total_articles}")

# 🔹 Optionally Remove Short Articles
if remove_short_articles:
    news_articles = [article for article in news_articles if len(article["content"]) >= min_content_length]
    print(f"🗑 Removed short articles. Remaining articles: {len(news_articles)}")

    # Save filtered articles back to cleaned_news
    with open("cleaned_news", "w", encoding="utf-8") as f:
        json.dump(news_articles, f, indent=4)
    print("✅ Filtered cleaned_news saved.")

# Randomly select an article index
if len(news_articles) > 0:
    article_index = random.randint(0, len(news_articles) - 1)

    # Function to format text with line breaks every 100 characters
    def format_text(text, max_chars=100):
        """Inserts line breaks every `max_chars` characters for better readability."""
        return '\n'.join([text[i:i+max_chars] for i in range(0, len(text), max_chars)])

    # Get the title and content
    title = news_articles[article_index]['title']
    content = news_articles[article_index]['content']
    content_length = len(content)

    # Print the formatted article with content length
    print(f"\n🔹 **Title:** {title}\n")
    print(f"🔹 **Content Length:** {content_length} characters\n")
    print(f"🔹 **Content:**\n")
    print(format_text(content))
else:
    print("⚠️ No articles left after filtering.")


📰 Total articles in cleaned_news: 5

🔹 **Title:** Theatre artist calls upon younger generation to nurture theatre

🔹 **Content Length:** 8770 characters

🔹 **Content:**

February 14, 2025e-Paper Subscribe Live Now Agriculture Books Books Hindi Belt The Hindu On Books Bo
oks of the week, reviews, excerpts, new titles and features. SEE ALL NEWSLETTERS Business Business A
gri-Business Economy Industry Markets Budget Children Cities Cities Bengaluru Chennai Coimbatore Del
hi Hyderabad Kochi Kolkata Kozhikode Madurai Mangaluru Mumbai Puducherry Thiruvananthapuram Tiruchir
apalli Vijayawada Visakhapatnam Data Data Data Point Podcast Data Point Decoding the headlines with 
facts, figures, and numbers SEE ALL NEWSLETTERS Ebook Education Education Careers Colleges Schools E
lections Entertainment Entertainment Art Dance Movies Music Reviews Theatre First Day First Show New
s and reviews from the world of cinema and streaming. SEE ALL NEWSLETTERS Environment Food Food Dini
ng Features Guides Rec

We need to be able to summarize text. BERT can help us

For state-of-the-art extractive summarization, we can use BERT (transformers).

Install it first:

pip install transformers



In [26]:
from transformers import pipeline

# Load BERT summarizer (once)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")



Device set to use cpu


In [None]:
import json
import nltk
import spacy
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize

from transformers import pipeline

# Run once to download NLTK and Spacy models
#nltk.download("punkt")
#nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\newpc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Toggle: Set to True for summaries, False for full content
summary_mode = True  

# Load cleaned dataset
with open("cleaned_news.json", "r", encoding="utf-8") as f:
    news_articles = json.load(f)

# Extract valid content
documents = [article["content"] for article in news_articles]

### 🔹 TF-IDF Outlier Analysis
def tfidf_outliers(texts, top_n=5):
    """Finds top N high-TF-IDF words per article."""
    texts = [text if isinstance(text, str) else "" for text in texts]  # Ensure valid text
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()

    outlier_words = []
    for row in tfidf_matrix:
        scores = row.toarray()[0]
        top_indices = scores.argsort()[-top_n:]
        outlier_words.append([feature_names[i] for i in top_indices])

    return outlier_words

### 🔹 Grammar & Readability Analysis
def grammar_analysis(text):
    """Analyzes grammatical complexity and readability."""
    if not isinstance(text, str) or not text.strip():
        return {"grammar_errors": 0, "sentence_count": 0}  

    doc = nlp(text)
    errors = sum(1 for token in doc if token.is_oov)  
    sentences = len(list(doc.sents))
    return {"grammar_errors": errors, "sentence_count": sentences}

### 🔹 Sentiment Analysis
def sentiment_analysis(text):
    """Detects sentiment polarity and emotional words."""
    analysis = TextBlob(text)
    return {"polarity": analysis.sentiment.polarity, "subjectivity": analysis.sentiment.subjectivity}

## Long Text summary using BERT

def split_text(text, max_tokens=1024):
    """Splits long text into smaller chunks that fit within BERT's token limit."""
    sentences = text.split('. ')
    chunks, current_chunk = [], ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) < max_tokens:
            current_chunk += sentence + ". "
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "
    
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def generate_summary(text, max_length=80, min_length=20):
    """Generates a summary for large texts efficiently."""
    if not isinstance(text, str) or not text.strip():
        return "No summary available."

    # Step 1: Split the text into manageable chunks
    text_chunks = split_text(text)

    # Step 2: Summarize each chunk separately
    chunk_summaries = [summarizer(chunk, max_length=100, min_length=20, do_sample=False)[0]["summary_text"] for chunk in text_chunks]

    # Step 3: Merge chunk summaries and summarize the full summary
    merged_summary = " ".join(chunk_summaries)

    # Step 4: Final summarization to keep it concise
    final_summary = summarizer(merged_summary, max_length=100, min_length=20, do_sample=False)[0]["summary_text"]

    return final_summary


# Apply linguistic analysis
outliers = tfidf_outliers(documents)
linguistic_features = []

for i, text in enumerate(documents):
    grammar = grammar_analysis(text)
    sentiment = sentiment_analysis(text)
    summary = generate_summary(text) if summary_mode else text  # creates and saves summary to .json

    linguistic_features.append({
        "summary": summary,
        "tfidf_outliers": outliers[i],
        "grammar_errors": grammar["grammar_errors"],
        "sentence_count": grammar["sentence_count"],
        "sentiment_polarity": sentiment["polarity"],
        "sentiment_subjectivity": sentiment["subjectivity"],
    })

# Merge analysis into articles
for i, article in enumerate(news_articles):
    article["linguistic_analysis"] = linguistic_features[i]

# Save the updated dataset
with open("linguistic_news.json", "w", encoding="utf-8") as f:
    json.dump(news_articles, f, indent=4)

print(f"✅ Linguistic analysis completed. Results saved in 'linguistic_news.json'.")
print(f"🔹 Summary Mode: {'ON' if summary_mode else 'OFF'}")


Your max_length is set to 80, but your input_length is only 45. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
Your max_length is set to 80, but your input_length is only 73. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=36)
Your max_length is set to 80, but your input_length is only 19. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=9)
Your max_length is set to 100, but your input_length is only 60. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=30)


✅ Linguistic analysis completed. Results saved in 'linguistic_news.json'.
🔹 Summary Mode: ON


Need to cut down the Ling analysis - separating FULL TEXT to focus on it \
TESTING PHASE

In [14]:
# Load the linguistic analysis dataset
with open("linguistic_news.json", "r", encoding="utf-8") as f:
    news_articles = json.load(f)

# Create a new list without 'content'
filtered_articles = []

for article in news_articles:
    filtered_articles.append({
        "title": article.get("title", "No Title"),
        "linguistic_analysis": article.get("linguistic_analysis", {})
    })

# Save the cleaned data
with open("small_linguistic_news.json", "w", encoding="utf-8") as f:
    json.dump(filtered_articles, f, indent=4)

print(f"✅ Content removed. Saved {len(filtered_articles)} entries in 'small_linguistic_news.json'.")


✅ Content removed. Saved 16 entries in 'small_linguistic_news.json'.
