# 🦙 RAG Pipeline for Wikipedia Trend Analysis using LlamaIndex 🦙

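The `data` folder contains Wikipedia trending data that I generated in my project [The Wikipedia Chronicle](https://wiki-chronicle.lovable.dev).

Each row is in the format: [`date`, `title`, `summary`, `news summary`, `search summary`] (stored in the CSV columns `trending_date`, `normalized_title`, `summary`, `news_relation`, `search_relation`).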
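
A quick sanity check on the file layout can save some debugging later. Here's a minimal sketch using only the standard library. The column names are the ones that show up in the loaded DataFrame further down in this notebook; the `missing_columns` helper and the sample CSV text are made up for illustration.

```python
import csv
import io

# Column names as they appear in the loaded DataFrame later in this notebook.
EXPECTED_COLUMNS = [
    "trending_date",
    "normalized_title",
    "summary",
    "news_relation",
    "search_relation",
]


def missing_columns(csv_text: str) -> list:
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [col for col in EXPECTED_COLUMNS if col not in header]


# Hypothetical two-line sample, not real data from the project.
sample = (
    "trending_date,normalized_title,summary,news_relation,search_relation\n"
    "2025-10-12,Example Title,Example summary,Example news,Example search\n"
)
print(missing_columns(sample))  # an empty list means every column is present
```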


LlamaIndex allows us to create a Document with both metadata and embeddings to later be queried.

Here are the basic steps:

1. Load trend data from .csv files
2. Convert to `LlamaIndex Documents`
3. Generate a single embedding for each article containing `title`, `summary`, `news summary`, `search summary`
4. Query the embedded data using `gpt-5-mini`

In [23]:
!pip install llama-index llama-index-embeddings-huggingface llama-index-llms-openai pandas


In [2]:
from llama_index.core import Document, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI


## 🏗️ Step 1: Load data from CSV files 🏗️

`glob` and `pandas` make this really easy. We'll load every CSV file in our data directory.

In [None]:
import pandas as pd
import glob

# Folder path (adjust as needed); no trailing slash, so the pattern below stays clean
data_folder = "data"

# Get all CSV file paths in that folder (sorted for a reproducible row order)
csv_files = sorted(glob.glob(f"{data_folder}/*.csv"))

# Read and concatenate them
df_list = [pd.read_csv(f) for f in csv_files]
df = pd.concat(df_list, ignore_index=True)

print(f"✅ Loaded {len(df)} documents from {len(csv_files)} files.")
df.head()

✅ Loaded 174 documents from 4 files.


| Unnamed: 0 | trending_date | normalized_title | summary | news_relation | search_relation |
| --- | --- | --- | --- | --- | --- |
| 0 | 2025-10-12 | Diane Keaton | An American actress known for her work in the ... | Diane Keaton’s reported death at 79 — the Osca... | Major outlets report Diane Keaton has died at ... |
| 1 | 2025-10-12 | Ed Gein | An American murderer known for exhuming corpse... | Netflix's new Monster season about Ed Gein has... | Netflix's new Ryan Murphy episode "Monster: Th... |
| 2 | 2025-10-12 | Ian Watkins (Lostprophets singer) | A Welsh singer and child sex offender, he led ... | Ian Watkins — the disgraced former Lostprophet... | Multiple major outlets report that Ian Watkins... |
| 3 | 2025-10-12 | Google Chrome | A cross-platform web browser developed by Goog... | Chrome is in the news because Google is changi... | A new stable Chrome release (141.0.7390.65/.66... |
| 4 | 2025-10-12 | Lostprophets | A Welsh rock band formed in 1997 by Ian Watkin... | Ian Watkins — the disgraced former Lostprophet... | Most traffic to the Lostprophets page follows ... |


## 💫 Step 2: Convert to LlamaIndex documents ✨

LlamaIndex Documents are really cool! They store data AND context. The metadata comes in handy later, for example when a query should be limited to specific dates. But first, let's build our documents. They consist of two parts:

1. Context (the text that gets embedded):
   - `title`
   - `summary`
   - `news summary`
   - `search summary`

2. Metadata:
   - `date` (the trending date)
   - `title`
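
To make the text/metadata split concrete, here is a plain-Python stand-in (not the LlamaIndex API). A hypothetical `TrendDoc` holds the text that would be embedded plus a metadata dict, and the made-up `in_date_range` helper filters on the metadata alone without touching the text. The first two titles come from the data above; the third entry and all text values are placeholders.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TrendDoc:
    """Hypothetical stand-in for a document: text to embed plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def in_date_range(docs, start: date, end: date):
    """Keep documents whose metadata date falls within [start, end]."""
    return [
        doc for doc in docs
        if start <= date.fromisoformat(doc.metadata["date"]) <= end
    ]


docs = [
    TrendDoc("Title: Diane Keaton ...", {"title": "Diane Keaton", "date": "2025-10-12"}),
    TrendDoc("Title: Google Chrome ...", {"title": "Google Chrome", "date": "2025-10-12"}),
    TrendDoc("Title: Placeholder topic ...", {"title": "Placeholder topic", "date": "2025-10-15"}),
]

# Select only the documents that trended between Oct 13 and Oct 15.
recent = in_date_range(docs, date(2025, 10, 13), date(2025, 10, 15))
print([doc.metadata["title"] for doc in recent])
```

Because the dates are stored as ISO strings, they parse cleanly with `date.fromisoformat`, which keeps the comparison independent of how the text itself is worded.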

In [53]:
def combine_fields(row):
    return (
        f"Title: {row['normalized_title']}\n\n"
        f"Topic Summary: {row['summary']}\n\n"
        f"News Context: {row['news_relation']}\n\n"
        f"Search Context: {row['search_relation']}"
    )

df['combined_text'] = df.apply(combine_fields, axis=1)

documents = [
    Document(
        text=row["combined_text"],
        metadata={
            "title": row["normalized_title"],
            "date": row["trending_date"]
        }
    )
    for _, row in df.iterrows()
]

## ↗ Step 3: Generate Vector Embeddings ↗

We're using the iconic `all-MiniLM-L6-v2` as our embedding model (more about that later).

We store the resulting index (documents plus their embeddings) in `trend_index`.

In [61]:
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
trend_index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
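
Under the hood, a vector index answers a query by comparing the query's embedding with each stored embedding and returning the closest matches. The sketch below shows that idea with cosine similarity on toy 3-dimensional vectors. Real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the vectors, axis meanings, and `top_k` helper here are invented for illustration; they are not LlamaIndex internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k document names most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy embeddings (made up): the axes loosely mean film/TV, tech, crime.
doc_vecs = {
    "Diane Keaton": [0.9, 0.1, 0.0],
    "Google Chrome": [0.0, 1.0, 0.1],
    "Ed Gein": [0.7, 0.0, 0.3],
}
query_vec = [1.0, 0.0, 0.1]  # a query that is mostly about film/TV

print(top_k(query_vec, doc_vecs, k=2))
```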


## ü¶øüîßüë©üèª‚Äçüî¨ Step 4: Initiate RAGBot ü¶æ‚öôÔ∏èüëÅ‚Äçüó®

In [62]:
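# The OpenAI client reads the API key from the OPENAI_API_KEY environment variable
llm = OpenAI(model="gpt-5-mini")
# Wrap the index in a query engine: it retrieves matching documents, then asks the LLM
trend_query_engine = trend_index.as_query_engine(llm=llm)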
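
Roughly speaking, the query engine retrieves the most similar documents and pastes them into a prompt for the LLM. Here's a simplified sketch of that second half. The `build_prompt` helper and the prompt wording are my own illustration, not LlamaIndex's actual template, and the snippets are placeholders.

```python
def build_prompt(question: str, retrieved_texts: list) -> str:
    """Assemble a RAG-style prompt: retrieved context first, then the question."""
    context = "\n\n---\n\n".join(retrieved_texts)
    return (
        "Context information is below.\n"
        f"{context}\n\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder snippets standing in for retrieved documents.
retrieved = [
    "Title: Diane Keaton\n\nTopic Summary: An American actress ...",
    "Title: Google Chrome\n\nTopic Summary: A cross-platform web browser ...",
]
prompt = build_prompt("Give me a summary of the trends", retrieved)
print(prompt)
```

The model only sees whatever lands in `retrieved`, so the retrieval settings shape every answer the engine gives.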


## üßêüîç Step 5: query üïµüèª‚Äç‚ôÇÔ∏è‚ùì

 
2. Specific single topic
3. Meta (about the knowledge) 

### 1. General summary 😐

In [105]:
response = trend_query_engine.query("Give me a summary of the trends")
print(response)

Here are the key trends across the two stories:

Ford vehicles
- A refreshed 2025–2026 lineup (including updates to trucks and electrified models) has driven a spike in shopper and media interest.
- Leadership moves in regional and manufacturing roles suggest possible strategic shifts at the company level.
- New model announcements (example: performance variants) and updates to EV/hybrid offerings are attracting fan and buyer attention.
- Changes to EV tax‑credit eligibility and coverage of lower-priced models are directly affecting buying decisions and perceived value.
- Dealer inventory and manufacturer listings are being checked more closely as shoppers verify availability.

Kantara: Chapter 1
- The film has seen an explosive box‑office run, with very large early-week grosses and later reports putting total worldwide receipts substantially higher.
- Strong word‑of‑mouth, praise for the theatrical experience, and actor/director publicity are sustaining momentum.
- Social me

### 2. Specific single topic 👍

In [106]:
response = trend_query_engine.query("Give specific details about a very surprising trend")
print(response)

A very surprising trend is the rapid, youth-led upheaval in Madagascar that turned a protest wave into a sudden political collapse. Specific details:

- The protests were largely driven by young people (described as Gen Z–led) and centered in Antananarivo.  
- Core grievances included poverty, repeated power cuts and broader governance complaints.  
- The situation escalated quickly: the president said he was sheltering from threats and then fled or went into hiding.  
- The army moved in as Parliament moved to impeach and ultimately ousted the president.  
- An elite military unit seized control, prompting cheers from crowds in some areas.  
- Clashes during the unrest left several people dead.  
- The sequence produced intense international coverage and widespread uncertainty about who now governs the country, and public interest spiked as many turned to information sources to follow developments.


### 3. Dataset self-awareness ⛔

In [107]:
response = trend_query_engine.query("Give me a list of dates you know about")
print(response)

- 2025-10-15
- 2025-10-13 (Oct. 13, 2025)
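
Only two dates come back, most likely because the engine answers from the handful of documents it retrieved rather than from the whole dataset. As far as I know, LlamaIndex's default `similarity_top_k` is 2, which would also explain the "two stories" in the first answer. For questions about the dataset itself, reading the metadata directly is more reliable. A minimal sketch with made-up metadata (the titles and dates below are placeholders, not the real contents of the index, and `known_dates` is a hypothetical helper):

```python
# Hypothetical metadata records, shaped like the ones built in Step 2.
all_metadata = [
    {"title": "Topic A", "date": "2025-10-12"},
    {"title": "Topic B", "date": "2025-10-13"},
    {"title": "Topic C", "date": "2025-10-14"},
    {"title": "Topic D", "date": "2025-10-15"},
]


def known_dates(metadata_records):
    """Distinct dates across the given records, in chronological order."""
    return sorted({record["date"] for record in metadata_records})


# What the LLM can "see" if only two documents were retrieved.
retrieved_only = [all_metadata[3], all_metadata[1]]
print(known_dates(retrieved_only))

# What the full dataset actually covers.
print(known_dates(all_metadata))
```

With the real objects, the same idea would be `sorted(df["trending_date"].unique())`, and passing a larger `similarity_top_k` to `as_query_engine` should widen what the LLM sees.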


# üèÅüèÅüèÅ We're done! üèÅüèÅüèÅ

This was a simple showcase of how easy `pandas` and `LlamaIndex` make working with structured and unstructured data.

The paper path that led us here:

| Date | Paper | Lab | Description |
| --- | --- | --- | --- |
| June 2017 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) | Google | The Transformer paper. It laid the foundation for LLMs |
| October 2018 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | Google | Uses the Transformer architecture to derive the semantic meaning of text chunks |
| August 2019 | [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) | UKPLab | Cuts the time to find the most similar pair among 10,000 sentences from about 65 hours with BERT to about 5 seconds. The `all-MiniLM-L6-v2` model we used builds on this approach |
| May 2020 | [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) | Facebook AI Research, University College London, NYU | On each query, retrieve relevant documents and let the model read them before it generates its response |




## 📦 Package notes 📦

`llama-index`
- Open Source project to help with the RAG Pipeline

`llama-index-embeddings-huggingface`
- creates vector embeddings of text
- wraps embedding model (in this case... `sentence-transformers/all-MiniLM-L6-v2`)

`pandas`
- Data analysis tool
- Handles structured and labeled data

`sentence-transformers`
- Developed by UKPLab (Ubiquitous Knowledge Processing Lab) at TU Darmstadt, Germany
- Turns sentences (or paragraphs or pages) into embeddings
- The model `all-MiniLM-L6-v2` does this incredibly fast; read the Sentence-BERT paper above to see how