# Retrieval‑Augmented Generation (RAG) Demo
**Eva Paunova – June 2025**

This notebook demonstrates a minimal but complete RAG pipeline:
1. Load unstructured documents from public URLs
2. Chunk the text and embed it with a Hugging Face sentence-transformers model
3. Store the vectors in pgvector (PostgreSQL)
4. Retrieve relevant chunks with LangChain and query an LLM
5. Record latency and cost metrics

Feel free to swap components (e.g. Pinecone instead of pgvector, Mistral instead of OpenAI).

## 1 · Setup

In [None]:
!pip install -q langchain sentence-transformers pgvector psycopg2-binary openai unstructured
import os
import time

### 1.1 Environment variables

In [None]:
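# Keep a key that is already exported in the shell; otherwise use the placeholder (replace it before running)
os.environ.setdefault('OPENAI_API_KEY', 'sk-REPLACE_ME')
# Connection string for a local PostgreSQL instance where the pgvector extension is available
CONN_STR = 'postgresql://rag_user:rag_pass@localhost:5432/ragdemo'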
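
Hardcoding credentials in a notebook cell makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the values are exported in the shell before the notebook starts; the helper name `require_env` and the variable `RAG_DEMO_EXAMPLE` are hypothetical and not part of LangChain or the OpenAI SDK.

```python
import os


def require_env(name: str, placeholder: str = "REPLACE_ME") -> str:
    """Return an environment variable, or raise if it is unset or still a placeholder.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "")
    if not value or placeholder in value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook")
    return value


# Throwaway variable so the snippet runs anywhere; in practice you would
# call require_env('OPENAI_API_KEY') instead.
os.environ["RAG_DEMO_EXAMPLE"] = "some-secret-value"
print(require_env("RAG_DEMO_EXAMPLE"))
```

Failing early with a clear message is usually easier to debug than an authentication error raised deep inside the chain.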

## 2 · Load sample documents

In [None]:
from langchain.document_loaders import UnstructuredURLLoader
# Public articles used as the demo knowledge base
urls = [
    'https://www.nvidia.com/en-us/blog/what-is-generative-ai/',
    'https://huggingface.co/blog/rag'  # any public article
]
loader = UnstructuredURLLoader(urls=urls)
docs = loader.load()[:10]  # cap the number of documents to keep the demo fast
len(docs)

### 2.1 Chunk & embed

In [None]:
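from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Split into chunks of at most 500 characters, with up to 60 characters shared between neighbours
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=60)
chunks = splitter.split_documents(docs)
# Only the embedding model is loaded here; the vectors are computed when the chunks are stored in section 3
embedder = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')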
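
To see what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-width chunker. It is only an illustration: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each printed chunk starts with the last three characters of the previous one. That overlap is what keeps a sentence cut at a boundary partially visible in both chunks.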

## 3 · Vector store (pgvector)

In [None]:
from langchain.vectorstores import PGVector
# Embeds every chunk and writes the vectors to the 'demo_chunks' collection
vectordb = PGVector.from_documents(
    documents=chunks,
    embedding=embedder,
    connection_string=CONN_STR,
    collection_name='demo_chunks'
)
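
Conceptually, retrieval ranks the stored chunk vectors by their similarity to the query vector and returns the best matches. The sketch below does this in plain Python with cosine similarity over made-up 3-dimensional vectors; pgvector performs an equivalent ranking inside PostgreSQL, and the real `all-MiniLM-L6-v2` embeddings have 384 dimensions. The names `cosine_similarity` and `top_k` and the toy data are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" invented for this example
toy_store = {
    "RAG combines retrieval with generation": [0.9, 0.1, 0.0],
    "pgvector adds a vector type to PostgreSQL": [0.1, 0.9, 0.0],
    "Bananas are rich in potassium": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]
for text, score in top_k(query_vec, toy_store, k=2):
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a handful of vectors; a database index is what makes the same ranking practical at scale.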

## 4 · RAG Query

In [None]:
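from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    # gpt-3.5-turbo is a chat model, so it needs the chat wrapper rather than the completion-style OpenAI class
    llm=ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.2),
    chain_type='stuff',
    # the number of chunks to retrieve is passed through search_kwargs
    retriever=vectordb.as_retriever(search_kwargs={'k': 3})
)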
query = 'Explain retrieval‑augmented generation in two sentences.'
start = time.perf_counter()  # monotonic clock, better suited to measuring elapsed time
response = qa_chain.run(query)
latency = time.perf_counter() - start
print(response)
print(f'Latency: {latency:.2f} s')
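
The `stuff` chain type places all retrieved chunks into a single prompt ahead of the question. A rough sketch of that idea follows; the template wording and the function name `build_stuffed_prompt` are assumptions for illustration and do not reproduce LangChain's actual prompt.

```python
from typing import List


def build_stuffed_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Example chunks invented for this sketch
example_chunks = [
    "RAG retrieves relevant passages from a knowledge base.",
    "The passages are added to the prompt before generation.",
]
print(build_stuffed_prompt("What is RAG?", example_chunks))
```

Because every chunk goes into one prompt, the retrieved text has to fit in the model's context window together with the question, which is one reason to keep `k` and `chunk_size` small.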

## 5 · Mini evaluation – cost & latency

In [None]:
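prompt_tokens = 40      # stub value; replace with the count reported for the real call
completion_tokens = 80  # stub value
PRICE_PER_1K_USD = 0.0015  # assumed flat rate for gpt-3.5; check current pricing, since input and output tokens are usually billed differently
cost = (prompt_tokens + completion_tokens) / 1000 * PRICE_PER_1K_USD
print(f'Approx cost: ${cost:.4f}')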
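
Providers commonly bill prompt and completion tokens at different rates, so a slightly more general estimate takes both prices as parameters. This is a sketch only: the function name `estimate_cost_usd` is hypothetical and the example rates are placeholders, not current prices.

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimate the cost of one call from token counts and per-1k-token prices in USD."""
    input_cost = prompt_tokens / 1000 * input_price_per_1k
    output_cost = completion_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Placeholder rates; look up the provider's current price list before relying on the result
example_cost = estimate_cost_usd(40, 80, input_price_per_1k=0.0015, output_price_per_1k=0.002)
print(f'Approx cost: ${example_cost:.5f}')
```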

## 6 · Next steps / TODO
* Replace URLs with your own knowledge base
* Swap pgvector for Pinecone or Chroma
* Add BERTScore evaluation
* Deploy via FastAPI for real-time use
