# Intro to RAG

We will use retrieval-augmented generation (RAG) to enrich the data we send to the LLM for assessment.

## Install:

 pip install chromadb nltk spacy textblob scikit-learn transformers torch


### Download the English model for spaCy
 python -m spacy download en_core_web_sm
 



### Next steps:
Let's have our RAG system go through the entries in ChromaDB.

In [1]:
import chromadb
import json
import re
import nltk
import spacy
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize
from transformers import pipeline


In [2]:
# Download the NLTK tokenizer data and load the spaCy model
nltk.download("punkt")
nlp = spacy.load("en_core_web_sm")

# Load the BART summarizer
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\newpc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Device set to use cpu


### Retrieve & Load Articles from ChromaDB

In [3]:
# ✅ Connect to ChromaDB
CHROMA_DB_PATH = "../chroma_db"
COLLECTION_NAME = "news_articles"

client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
collection = client.get_collection(COLLECTION_NAME)

# ✅ Load stored articles
data = collection.get()
documents = data["documents"]
metadata = data["metadatas"]

# ✅ Convert ChromaDB data into structured format
news_articles = []
for i in range(len(documents)):
    article = metadata[i]
    article["content"] = documents[i]
    article["status"] = "raw"  # Track processing stage
    news_articles.append(article)

print(f"📊 Loaded {len(news_articles)} articles from ChromaDB.")


📊 Loaded 69 articles from ChromaDB.
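
The loading loop above writes `content` and `status` directly into the metadata dicts returned by ChromaDB. A minimal sketch of a non-mutating variant follows, assuming `collection.get()` returns parallel `ids`, `documents`, and `metadatas` lists. The `sample` dict and the `to_articles` helper are hypothetical stand-ins, not part of the notebook.

```python
# Hypothetical stand-in for the dict returned by collection.get():
# parallel lists keyed by "ids", "documents", and "metadatas".
sample = {
    "ids": ["a1", "a2"],
    "documents": ["First article body.", "Second article body."],
    "metadatas": [{"title": "First"}, {"title": "Second"}],
}

def to_articles(data):
    """Pair each document with a copy of its metadata."""
    articles = []
    for doc_id, doc, meta in zip(data["ids"], data["documents"], data["metadatas"]):
        article = dict(meta or {})  # copy, so the original metadata stays untouched
        article["id"] = doc_id
        article["content"] = doc
        article["status"] = "raw"   # track processing stage
        articles.append(article)
    return articles

articles = to_articles(sample)
print(articles[0])
```

Keeping the id next to the content may make a later `collection.update` call easier, since it does not have to assume which metadata field doubles as the id.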


## Clean & Preprocess Text
The stored text may contain stray whitespace and invisible characters, so we clean it first.

In [4]:
def smart_clean_text(text):
    """Cleans text while preserving meaning, structure, and tone."""
    if not isinstance(text, str):
        return ""

    # ✅ 1. Normalize spaces and line breaks
    text = re.sub(r"\s+", " ", text.strip())  # Remove excessive spaces/newlines

    # ✅ 2. Remove unnecessary system characters
    text = re.sub(r"[\x00-\x1F\x7F]", "", text)  # Remove non-printable ASCII characters

    # ✅ 3. Remove zero-width characters and the byte-order mark
    text = re.sub(r"[\u200B-\u200D\uFEFF]", "", text)  # U+200B to U+200D and U+FEFF

    # ✅ 4. Preserve quoted text (speech & sources)
    # Does NOT remove content inside quotation marks to keep context accurate.

    # ✅ 5. Remove a leading byline without deleting valuable text
    text = re.sub(r"^(By\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*,\s*\w+\s*Writer)", "", text)  # e.g. "By Jane Doe, Staff Writer"

    # ✅ 6. Remove special characters outside a small allow-list
    # NOTE: this also strips colons, semicolons, %, $, / and similar symbols.
    text = re.sub(r"[^\w\s.,!?()'\-\"“”‘’]", "", text)

    return text.strip()

# ✅ Apply smart cleaning
for article in news_articles:
    article["content"] = smart_clean_text(article["content"])
    article["status"] = "cleaned"  # Update processing stage

print(f"✅ Smart text cleaning completed for {len(news_articles)} articles.")


✅ Smart text cleaning completed for 69 articles.
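
A quick way to see what the cleaning steps do is to run them on one small string. The sketch below copies the regexes from steps 1 to 3 and step 6 of the cell above into a hypothetical helper, `normalize_whitespace`, and applies them to a made-up sample sentence.

```python
import re

def normalize_whitespace(text):
    """Hypothetical helper mirroring steps 1-3 of the cleaning cell."""
    text = re.sub(r"\s+", " ", text.strip())           # collapse spaces, tabs, newlines
    text = re.sub(r"[\x00-\x1F\x7F]", "", text)        # drop remaining control characters
    text = re.sub(r"[\u200B-\u200D\uFEFF]", "", text)  # drop zero-width characters and BOM
    return text.strip()

sample = "  Breaking\u200b news:\n\n  prices   rose 5%\tto $3.20.  "
cleaned = normalize_whitespace(sample)
print(repr(cleaned))   # 'Breaking news: prices rose 5% to $3.20.'

# Step 6 of the cell above keeps only an allow-list of characters.
stripped = re.sub(r"[^\w\s.,!?()'\-\"“”‘’]", "", cleaned)
print(repr(stripped))  # 'Breaking news prices rose 5 to 3.20.'
```

The second result shows that the allow-list also removes `:`, `%`, and `$`, which can change the meaning of figures in news text.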


In [5]:
# Second option (testing): a lighter cleaner that keeps all punctuation and symbols

import re

def smart_clean_text(text):
    """Cleans text while preserving meaning, structure, and tone."""
    if not isinstance(text, str):
        return ""

    # ✅ 1. Normalize spaces and line breaks
    text = re.sub(r"\s+", " ", text.strip())  

    # ✅ 2. Remove unnecessary system characters
    text = re.sub(r"[\x00-\x1F\x7F]", "", text)  

    # ✅ 3. Remove certain "breaking" symbols but keep meaning intact
    text = re.sub(r"[\u200B-\u200D\uFEFF]", "", text)  

    return text.strip()

# ✅ Apply smart cleaning
for article in news_articles:
    article["content"] = smart_clean_text(article["content"])
    article["status"] = "cleaned"  

print(f"✅ Smart text cleaning completed for {len(news_articles)} articles.")


✅ Smart text cleaning completed for 69 articles.


### Perform Linguistic Analysis

Each metric provides a different perspective on the text:

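* TF-IDF Outliers → Highlights words that are frequent in one article but rare across the collection.
* Grammar Analysis → Counts out-of-vocabulary tokens and sentences as rough proxies for complexity and readability.
* Sentiment Analysis → Scores polarity (negative to positive) and subjectivity, which can hint at bias and emotional intensity.
* Summarization → Reduces lengthy text while retaining meaning.

For summarization, we use BART (`facebook/bart-large-cnn`) through the `transformers` pipeline. Note that BART is an abstractive summarizer, not an extractive one, and it is a different model from BERT.

NOTE: There's room here for improvement!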
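
To make the first metric concrete, here is a dependency-free sketch of TF-IDF scoring on a toy corpus. The function name `top_tfidf_terms`, the whitespace tokenizer, and the three sample sentences are assumptions for illustration. The notebook itself relies on scikit-learn's `TfidfVectorizer`, which additionally normalizes each row and can drop stop words.

```python
import math
from collections import Counter

def top_tfidf_terms(docs, top_n=2):
    """Return the top_n highest-scoring terms per document (toy tokenizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count.
        scores = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results

docs = [
    "markets fell as inflation fears grew",
    "markets rose as inflation fears eased",
    "the team won the final match",
]
top_terms = top_tfidf_terms(docs, top_n=2)
print(top_terms)
```

On this toy corpus the last document ranks `the` first, because nothing filters stop words here. Passing `stop_words="english"` to `TfidfVectorizer`, as the notebook does, is one way to avoid that.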

In [None]:
### 🔹 TF-IDF Outlier Analysis
def tfidf_outliers(texts, top_n=5):
    """Finds top N high-TF-IDF words per article."""
    texts = [text if isinstance(text, str) else "" for text in texts]
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()

    outlier_words = []
    for row in tfidf_matrix:
        scores = row.toarray()[0]
        top_indices = scores.argsort()[-top_n:][::-1]  # highest score first
        outlier_words.append([feature_names[i] for i in top_indices])

    return outlier_words

### 🔹 Grammar & Readability Analysis
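def grammar_analysis(text):
    """Counts out-of-vocabulary tokens and sentences as rough complexity proxies."""
    if not isinstance(text, str) or not text.strip():
        return {"grammar_errors": 0, "sentence_count": 0}

    doc = nlp(text)
    # NOTE: is_oov flags tokens without a word vector; it is not a grammar check.
    # en_core_web_sm ships without static vectors, so this count may be inflated.
    errors = sum(1 for token in doc if token.is_oov)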
    sentences = len(list(doc.sents))
    return {"grammar_errors": errors, "sentence_count": sentences}

### 🔹 Sentiment Analysis
def sentiment_analysis(text):
    """Returns TextBlob polarity (-1 to 1) and subjectivity (0 to 1)."""
    analysis = TextBlob(text)
    return {"polarity": analysis.sentiment.polarity, "subjectivity": analysis.sentiment.subjectivity}

### 🔹 BART Summarization (Handles Long Texts)
def split_text(text, max_chars=1024):
    """Splits long text into chunks of at most roughly max_chars characters.

    The limit is measured in characters, not tokens, so chunks usually stay
    well under the model's 1024-token input limit.
    """
    sentences = text.split('. ')
    chunks, current_chunk = [], ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) < max_chars:
            current_chunk += sentence + ". "
        else:
            if current_chunk:  # avoid an empty chunk when the first sentence is too long
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def generate_summary(text, max_length=100, min_length=20):
    """Generates a summary using BART, chunking long texts first."""
    if not isinstance(text, str) or not text.strip():
        return "No summary available."

    # Summarize each chunk, honoring the max_length/min_length arguments
    text_chunks = split_text(text)
    chunk_summaries = [summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)[0]["summary_text"] for chunk in text_chunks]
    merged_summary = " ".join(chunk_summaries)

    # Summarize the merged chunk summaries to keep the result concise
    return summarizer(merged_summary, max_length=max_length, min_length=min_length, do_sample=False)[0]["summary_text"]

# ✅ Process all articles
documents = [article["content"] for article in news_articles]
outliers = tfidf_outliers(documents)

for i, text in enumerate(documents):
    grammar = grammar_analysis(text)
    sentiment = sentiment_analysis(text)
    summary = generate_summary(text)

    news_articles[i]["linguistic_analysis"] = {
        "summary": summary,
        "tfidf_outliers": outliers[i],
        "grammar_errors": grammar["grammar_errors"],
        "sentence_count": grammar["sentence_count"],
        "sentiment_polarity": sentiment["polarity"],
        "sentiment_subjectivity": sentiment["subjectivity"],
    }
    news_articles[i]["status"] = "processed"

print("✅ Linguistic analysis & summarization completed.")
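
One possible improvement to the chunking step is to budget by token count instead of character count. The sketch below is a minimal version of that idea: `chunk_by_budget` and its `count_tokens` parameter are hypothetical names, and the default counter simply counts whitespace-separated words so the example runs without any model. With the real summarizer, one could pass something like `lambda s: len(summarizer.tokenizer.encode(s))`, which is an assumption about how the pipeline would be wired in rather than something the notebook does.

```python
import re

def chunk_by_budget(text, budget, count_tokens=lambda s: len(s.split())):
    """Group whole sentences into chunks whose token count stays within budget."""
    # Split after sentence-ending punctuation, keeping the punctuation itself.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, used = [], [], 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and used + cost > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five six seven. Eight nine. Ten."
chunks = chunk_by_budget(text, budget=5)
print(chunks)  # ['One two three.', 'Four five six seven.', 'Eight nine. Ten.']
```

A single sentence that exceeds the budget still becomes its own oversized chunk, so a real pipeline would likely need truncation or a finer split as a fallback.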


### Generate Summaries

NOTE: This part still needs to be checked and tested.

In [None]:
import json
from transformers import pipeline
import spacy
import nltk
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer

# ✅ Reload the articles from ChromaDB (instead of news.json)
# NOTE: the earlier cell uses "../chroma_db"; make sure both paths point to the same store
CHROMA_DB_PATH = "./chroma_db"
COLLECTION_NAME = "news_articles"

import chromadb
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
collection = client.get_collection(COLLECTION_NAME)

# ✅ Fetch stored articles from ChromaDB
data = collection.get()
documents = data["documents"]
metadata = data["metadatas"]

# ✅ Convert ChromaDB data into structured format
news_articles = []
for i in range(len(documents)):
    article = metadata[i]
    article["content"] = documents[i]
    article["status"] = "raw"
    news_articles.append(article)

print(f"📊 Loaded {len(news_articles)} articles from ChromaDB.")

# ✅ Load BART Summarizer
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# ✅ Load NLP models
nltk.download("punkt")
nlp = spacy.load("en_core_web_sm")


### Store Processed Articles in ChromaDB

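NOTE: ChromaDB metadata values are expected to be scalars (strings, numbers, or booleans), so the nested `linguistic_analysis` dict is stored as a JSON string. The update also assumes each article's URL was used as its ChromaDB id.

In [None]:
# ✅ Update articles in ChromaDB with processed data
for article in news_articles:
    collection.update(
        ids=[article["url"]],
        metadatas=[{
            "title": article["title"],
            "url": article["url"],
            "published_date": article["published_date"],
            "source_name": article["source_name"],
            "author": article["author"],
            "category": article["category"],
            "status": article["status"],  # Mark as "processed"
            "linguistic_analysis": json.dumps(article["linguistic_analysis"])  # nested dict -> JSON string
        }]
    )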

print("✅ Processed articles updated in ChromaDB.")
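
Because ChromaDB metadata values are expected to be scalars, nested fields such as `linguistic_analysis` may need flattening before the update call. One way to make a whole metadata dict safe to store is sketched below. The `flatten_metadata` helper, the sample `record`, and the choice to store missing values as empty strings are all assumptions for illustration.

```python
import json

def flatten_metadata(meta):
    """Return a copy of meta whose values are all str, int, float, or bool."""
    flat = {}
    for key, value in meta.items():
        if isinstance(value, (str, int, float, bool)):
            flat[key] = value
        elif value is None:
            flat[key] = ""                 # assumption: missing values become empty strings
        else:
            flat[key] = json.dumps(value)  # lists and dicts become JSON strings
    return flat

record = {
    "title": "Example",
    "author": None,
    "linguistic_analysis": {"sentence_count": 3, "tfidf_outliers": ["a", "b"]},
}
flat = flatten_metadata(record)
restored = json.loads(flat["linguistic_analysis"])
print(flat["author"] == "", restored["sentence_count"])  # True 3
```

When reading the articles back, `json.loads` on the stored string recovers the original nested structure.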


### Save to JSON for Later Use

In [None]:
# ✅ Save processed articles to JSON
with open("RAG_output.json", "w", encoding="utf-8") as f:
    json.dump(news_articles, f, indent=4)

print("✅ Processed articles saved to 'RAG_output.json'.")


In [None]:
# Toggle: Set to True for summaries, False for full content
summary_mode = True  

# Load cleaned dataset
with open("cleaned_news.json", "r", encoding="utf-8") as f:
    news_articles = json.load(f)

# Extract valid content
documents = [article["content"] for article in news_articles]

### 🔹 TF-IDF Outlier Analysis
def tfidf_outliers(texts, top_n=5):
    """Finds top N high-TF-IDF words per article."""
    texts = [text if isinstance(text, str) else "" for text in texts]  # Ensure valid text
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()

    outlier_words = []
    for row in tfidf_matrix:
        scores = row.toarray()[0]
        top_indices = scores.argsort()[-top_n:][::-1]  # highest score first
        outlier_words.append([feature_names[i] for i in top_indices])

    return outlier_words

### 🔹 Grammar & Readability Analysis
def grammar_analysis(text):
    """Counts out-of-vocabulary tokens and sentences as rough complexity proxies."""
    if not isinstance(text, str) or not text.strip():
        return {"grammar_errors": 0, "sentence_count": 0}

    doc = nlp(text)
    errors = sum(1 for token in doc if token.is_oov)  # tokens without a word vector, not true grammar errors
    sentences = len(list(doc.sents))
    return {"grammar_errors": errors, "sentence_count": sentences}

### 🔹 Sentiment Analysis
def sentiment_analysis(text):
    """Returns TextBlob polarity (-1 to 1) and subjectivity (0 to 1)."""
    analysis = TextBlob(text)
    return {"polarity": analysis.sentiment.polarity, "subjectivity": analysis.sentiment.subjectivity}

## Long text summary using BART

def split_text(text, max_chars=1024):
    """Splits long text into chunks of at most roughly max_chars characters.

    The limit is measured in characters, not tokens, so chunks usually stay
    well under the model's 1024-token input limit.
    """
    sentences = text.split('. ')
    chunks, current_chunk = [], ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) < max_chars:
            current_chunk += sentence + ". "
        else:
            if current_chunk:  # avoid an empty chunk when the first sentence is too long
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def generate_summary(text, max_length=100, min_length=20):
    """Generates a summary for large texts efficiently."""
    if not isinstance(text, str) or not text.strip():
        return "No summary available."

    # Step 1: Split the text into manageable chunks
    text_chunks = split_text(text)

    # Step 2: Summarize each chunk separately, honoring max_length/min_length
    chunk_summaries = [summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)[0]["summary_text"] for chunk in text_chunks]

    # Step 3: Merge chunk summaries and summarize the full summary
    merged_summary = " ".join(chunk_summaries)

    # Step 4: Final summarization to keep it concise
    final_summary = summarizer(merged_summary, max_length=max_length, min_length=min_length, do_sample=False)[0]["summary_text"]

    return final_summary


# Apply linguistic analysis
outliers = tfidf_outliers(documents)
linguistic_features = []

for i, text in enumerate(documents):
    grammar = grammar_analysis(text)
    sentiment = sentiment_analysis(text)
    summary = generate_summary(text) if summary_mode else text  # summarize, or keep the full text when summary_mode is off

    linguistic_features.append({
        "summary": summary,
        "tfidf_outliers": outliers[i],
        "grammar_errors": grammar["grammar_errors"],
        "sentence_count": grammar["sentence_count"],
        "sentiment_polarity": sentiment["polarity"],
        "sentiment_subjectivity": sentiment["subjectivity"],
    })

# Merge analysis into articles
for i, article in enumerate(news_articles):
    article["linguistic_analysis"] = linguistic_features[i]

# Save the updated dataset
with open("RAG.json", "w", encoding="utf-8") as f:
    json.dump(news_articles, f, indent=4)

print("✅ Linguistic analysis completed. Results saved in 'RAG.json'.")
print(f"🔹 Summary Mode: {'ON' if summary_mode else 'OFF'}")


The cell below trims the linguistic-analysis output by dropping the full article text, so that the analysis fields can be inspected on their own. This step is still in the testing phase.

In [None]:
# Load the linguistic analysis dataset
with open("linguistic_news.json", "r", encoding="utf-8") as f:
    news_articles = json.load(f)

# Create a new list without 'content'
filtered_articles = []

for article in news_articles:
    filtered_articles.append({
        "title": article.get("title", "No Title"),
        "linguistic_analysis": article.get("linguistic_analysis", {})
    })

# Save the trimmed data
with open("small_linguistic_news.json", "w", encoding="utf-8") as f:
    json.dump(filtered_articles, f, indent=4)

print(f"✅ Content removed. Saved {len(filtered_articles)} entries in 'small_linguistic_news.json'.")


✅ Content removed. Saved 16 entries in 'small_linguistic_news.json'.
