# RAG with OpenAI Embeddings + FAISS

This workbook demonstrates a Retrieval-Augmented Generation (RAG) workflow:
- Load a live-news web page, clean the text, split into chunks.
- Embed chunks with OpenAI embeddings and index them in FAISS.
- Retrieve the most relevant chunks for a question and answer with an LLM.
- Show citations (source and chunk index) for transparency.

## What are Embeddings?
Embeddings are dense numeric vectors that represent text. Texts with similar meaning map to nearby vectors, so semantic similarity can be measured with a distance or cosine score. We use OpenAI's `text-embedding-3-small` to convert each chunk into a vector, which enables similarity search.
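
To make this concrete, the sketch below compares toy 3-dimensional vectors with cosine similarity. The vectors and the sentences attached to them are made up for illustration; real `text-embedding-3-small` vectors have 1536 dimensions by default and come from the API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real embeddings)
storm_a = [0.9, 0.1, 0.0]  # "hurricane makes landfall"
storm_b = [0.8, 0.2, 0.1]  # "tropical storm hits the coast"
recipe = [0.0, 0.2, 0.9]   # "how to bake bread"

print(cosine_similarity(storm_a, storm_b))  # close to 1: similar meaning
print(cosine_similarity(storm_a, recipe))   # close to 0: unrelated meaning
```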

## What is a Vector DB (FAISS)?
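FAISS (Facebook AI Similarity Search) is a library for fast nearest-neighbor search over dense vectors. It is an index rather than a full database; here LangChain's `FAISS` wrapper keeps the index and the chunk texts in memory. We store the chunk embeddings in FAISS and query it with the question's embedding to retrieve the top-k most relevant chunks. This is the "retrieve" step that runs before the LLM answers.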
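
Conceptually, an exact (flat) index answers a query by comparing the query vector against every stored vector and keeping the k closest. The pure-Python sketch below shows that idea with squared Euclidean distance. It is not how FAISS is implemented internally (FAISS uses optimized native code and also offers approximate indexes), and the function name `top_k_nearest` and the toy vectors are hypothetical.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k_nearest(query, vectors, k):
    """Return (index, distance) pairs for the k stored vectors closest to query."""
    scored = ((i, squared_l2(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Toy "index" of four stored vectors (hypothetical values)
stored = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.1],
]
query = [1.0, 0.0, 0.0]

# The two vectors nearest to the query are at positions 0 and 2
print(top_k_nearest(query, stored, k=2))
```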

## Workflow Overview
1) Load + clean content
2) Split into chunks (with metadata)
3) Embed chunks and build FAISS index
4) Retrieve top-k chunks for the query
5) Format context and call the LLM
6) Print answer and cite sources

Prerequisite: set `OPENAI_API_KEY` in your environment.
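
A quick check like the one below can catch a missing key before any API call is made. The helper name `has_openai_key` is hypothetical; the LangChain OpenAI classes read `OPENAI_API_KEY` from the environment on their own.

```python
import os

def has_openai_key(env=None):
    """Return True if a non-empty OPENAI_API_KEY is present in the mapping."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before running the cells below.")
```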

In [67]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import WebBaseLoader

In [78]:
_DEFAULT_EMBEDDINGS_MODEL = "text-embedding-3-small"
# Initialize the chat model (LLM) used for text generation
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

In [79]:
question = "what are the most recent udpates about hurricane Helene as of September 2024?"
response = llm.invoke(question)
print(response.content)

I'm sorry, but I don't have access to real-time data or updates beyond October 2023. To get the most recent information about Hurricane Helene or any other current events, I recommend checking reliable news sources or official weather websites like the National Hurricane Center.


In [80]:
url = "https://www.cnn.com/weather/live-news/hurricane-helene-florida-north-carolina-georgia-09-30-24/index.html"

loader = WebBaseLoader(url)

docs = loader.load()

In [81]:
import re
from src.fnUtils import render_markdown
# Clean and preview the loaded content
raw_content = docs[0].page_content
# Remove excessive whitespace: multiple spaces/newlines become single space
clean_content = re.sub(r'\s+', ' ', raw_content).strip()

print(f"Document length: {len(raw_content)} characters (raw), {len(clean_content)} characters (cleaned)\n")
print(f"Preview (first 800 chars):\n{'-'*80}")
# Slice length matches the "first 800 chars" label printed above
print(clean_content[:800] + f"\n{'-'*80}\n... (content continues)")

Document length: 39504 characters (raw), 24602 characters (cleaned)

Preview (first 800 chars):
--------------------------------------------------------------------------------
September 30 news on Hurricane Helene | CNN CNN values your feedback 1. How relevant is this ad to you? 2. Did you encounter any technical issues? Video player was slow to load content Video content never loaded Ad froze or did not finish loading Video content did not start after ad Audio on ad was too loud Other issues Ad never loaded Ad prevented/slowed the page from loading Content moved around while ad loaded Ad was repetitive to ads I've seen previously Other issues Cancel Submit Thank You! Your effort and contribution in providing this feedback is much appreciated. Close Ad Feedback Close icon Weather Video Climate More Video Climate Watch Listen Subscribe Sign in My Account Settings Newsletters Topics you follow Sign out Your CNN account Sign in to your CNN account Sign in My Account Settings Newsletters 

In [82]:
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Build embeddings and FAISS index from cleaned content
embeddings = OpenAIEmbeddings(model=_DEFAULT_EMBEDDINGS_MODEL)

# Create a single Document from the cleaned text
cleaned_doc = Document(page_content=clean_content, metadata={"source": url})

# Split the cleaned document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
documents = text_splitter.split_documents([cleaned_doc])

# Optional: add chunk indices for easier citation
for i, d in enumerate(documents):
    d.metadata["chunk_index"] = i

# Create FAISS vector store (embeds every chunk via the OpenAI API, then indexes the vectors)
vector = FAISS.from_documents(documents, embeddings)
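
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each window repeats the last three characters of the previous one, which is the role `chunk_overlap=50` plays above: a sentence cut at a boundary still appears whole in at least one chunk, provided it is short enough.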

In [83]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Create a prompt template for question answering
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

** Context **
{context}

Question: {input}""")

# Helper to format documents into a single context string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
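
If you want the model to be able to point at specific chunks, one option is to label each chunk in the context string. The variant below is a sketch, not part of the original notebook: `format_docs_with_ids` is a hypothetical helper, and `FakeDoc` is a stand-in with the same `page_content` and `metadata` attributes as a LangChain `Document`, so the example runs without any dependencies.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Minimal stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs_with_ids(docs):
    """Join chunk texts, prefixing each with its chunk_index metadata."""
    parts = []
    for doc in docs:
        idx = doc.metadata.get("chunk_index", "?")
        parts.append(f"[chunk {idx}] {doc.page_content}")
    return "\n\n".join(parts)

example_docs = [
    FakeDoc("First chunk.", {"chunk_index": 0}),
    FakeDoc("Second chunk.", {"chunk_index": 7}),
]
print(format_docs_with_ids(example_docs))
```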


In [84]:
# Build retriever and compose a single QA chain with LCEL
retriever = vector.as_retriever(search_kwargs={"k": 4})

qa_chain = (
    {
        "context": lambda x: format_docs(retriever.invoke(x["input"])),
        "input": lambda x: x["input"],
    }
    | prompt
    | llm
    | StrOutputParser()
)
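
For readers new to LCEL, the pipeline is roughly equivalent to calling each stage in order. The sketch below spells that out with stand-in functions so the data flow can be traced without an API key. `fake_retrieve`, `fake_llm`, and `answer` are hypothetical and return canned text; they are not LangChain APIs.

```python
PROMPT_TEMPLATE = (
    "Answer the following question based only on the provided context:\n\n"
    "** Context **\n{context}\n\nQuestion: {input}"
)

def fake_retrieve(query):
    """Stand-in for retriever.invoke: returns canned chunk texts."""
    return ["Chunk about the storm.", "Chunk about recovery efforts."]

def fake_llm(prompt_text):
    """Stand-in for the chat model: reports how long the prompt was."""
    return f"(model reply to a {len(prompt_text)}-character prompt)"

def answer(inputs):
    # Stage 1: build the context string (the dict stage of the chain)
    context = "\n\n".join(fake_retrieve(inputs["input"]))
    # Stage 2: fill the template (the prompt stage)
    prompt_text = PROMPT_TEMPLATE.format(context=context, input=inputs["input"])
    # Stage 3: call the model; the real chain then extracts the reply text
    # with StrOutputParser
    return fake_llm(prompt_text)

print(answer({"input": "What happened?"}))
```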

In [85]:
# Invoke the QA chain with the user question
response = qa_chain.invoke({"input": question})

In [86]:
# Display the answer
print(f"Prompt/> {question}")
render_markdown(f"Answer: {response}")

Prompt/> what are the most recent udpates about hurricane Helene as of September 2024?


> Answer: As of September 2024, the most recent updates about Hurricane Helene indicate that North Carolina is experiencing "total devastation" in the aftermath of the storm. Mayor Zeb Smathers highlighted the challenges faced in recovery, noting that they are dealing with search and rescue efforts using outdated technology from the 1990s. Additionally, at least 130 people have died due to the storm.

In [87]:
# Show citations: re-run retrieval for the same question and list each chunk's metadata
retrieved_docs = retriever.invoke(question)
print("\nCitations (top-k):")
for d in retrieved_docs:
    src = d.metadata.get("source", "unknown")
    idx = d.metadata.get("chunk_index", "?")
    snippet = d.page_content[:200].replace("\n", " ")
    print(f"- source: {src} | chunk: {idx} | snippet: {snippet}...")


Citations (top-k):
- source: https://www.cnn.com/weather/live-news/hurricane-helene-florida-north-carolina-georgia-09-30-24/index.html | chunk: 0 | snippet: September 30 news on Hurricane Helene | CNN CNN values your feedback 1. How relevant is this ad to you? 2. Did you encounter any technical issues? Video player was slow to load content Video content n...
- source: https://www.cnn.com/weather/live-news/hurricane-helene-florida-north-carolina-georgia-09-30-24/index.html | chunk: 15 | snippet: to help those left in Helene’s aftermath, visit CNN Impact Your World. Bookmark CNN’s lite site for fast connectivity. 39 Posts Our live coverage of the aftermath of Hurricane Helene has moved here. L...
- source: https://www.cnn.com/weather/live-news/hurricane-helene-florida-north-carolina-georgia-09-30-24/index.html | chunk: 53 | snippet: Canton, North Carolina, but Mayor Zeb Smathers said there is a major difference between Hurricane Helene’s aftermath today and three years ago, when floodin