003_Production level RAG Workshop: Part 1
RAG stands for:
Retrieval-Augmented Generation
It combines:
- Retrieval
- Fetching relevant information from an external knowledge source
- Generation
- Using an LLM to generate a response
Instead of relying only on the LLM’s pretrained knowledge, RAG supplements the LLM with retrieved context from external documents.
- Lovable
- Supabase
Problem: Context Window Limitations
Consider a 1200-page nutrition textbook.
A naive solution:
Question + Entire PDF
↓
LLM
Problems:
- Too many tokens
- High cost
- Context window overflow
- Hallucinations
- Slow responses
Example discussed:
PDF size ≈ 400K tokens
GPT context window ≈ 128K tokens
The entire document cannot fit into the context window at once.
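The arithmetic behind this example can be sketched in a few lines. The two sizes are the approximate figures quoted above; the `RESERVED` budget for the question, instructions, and answer is an assumption added for illustration, not a number from the workshop.

```python
import math

PDF_TOKENS = 400_000      # approximate document size from the example
CONTEXT_WINDOW = 128_000  # approximate context window from the example
RESERVED = 8_000          # assumed budget for question, instructions, and answer


def fits_in_context(doc_tokens, window=CONTEXT_WINDOW, reserved=RESERVED):
    """True if the whole document plus the reserved budget fits in one prompt."""
    return doc_tokens + reserved <= window


def min_passes(doc_tokens, window=CONTEXT_WINDOW, reserved=RESERVED):
    """Smallest number of separate prompts needed to show the model every token."""
    usable = window - reserved
    return math.ceil(doc_tokens / usable)


print(fits_in_context(PDF_TOKENS))  # False
print(min_passes(PDF_TOKENS))       # 4
```

Even under these generous assumptions the document would need at least four separate prompts, which is why sending only the relevant sections is attractive.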
When the relevant information is missing from the prompt:
The LLM may answer from pretrained knowledge rather than the provided document.
This leads to:
- Incorrect answers
- Non-grounded responses
- Hallucinations
RAG helps reduce this issue by supplying only relevant document sections.
The workshop explains RAG using an open-book exam.
Without RAG
A student answers questions using memory only.
Equivalent:
User Question
↓
LLM
↓
Answer
With RAG
A student:
- Searches the book
- Finds relevant pages
- Uses both retrieved information and existing knowledge
Equivalent:
User Question
↓
Retrieval
↓
Relevant Context
↓
LLM
↓
Answer
This is the core intuition behind RAG.
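The flow above can be sketched as a toy example. This is only an illustration of the idea: retrieval here is a simple word-overlap score rather than the embeddings a real system would likely use, the sample chunks are made up for the demo, and the function names (`retrieve`, `build_prompt`) are hypothetical. The call to an actual LLM is left out.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, top_k=2):
    """Rank chunks by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Combine the retrieved context and the question into one prompt."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Made-up sample chunks standing in for pages of the textbook.
chunks = [
    "Vitamin C is found in citrus fruits and supports the immune system.",
    "Protein is built from amino acids and helps repair muscle tissue.",
    "Dietary fiber aids digestion and is found in whole grains.",
]
question = "Which foods contain vitamin C?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Only the best-matching chunk reaches the prompt, which mirrors the student opening the book to the relevant page instead of rereading all of it.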
Main objective:
Reduce hallucinations
Architecture:
Documents
↓
Retrieval
↓
LLM
↓
Answer
RAG is viewed as part of a larger discipline:
Context Engineering
Components include:
- Retrieval
- Prompt Engineering
- Memory
- State Management
- Embeddings
- Vector Databases
- Long Context Windows
Modern RAG is therefore a subset of context engineering.
Document Preprocessing
To process data files such as PDFs, we need a parsing library. Which one to use depends on the kind of content the file contains, as outlined below.
For text-based PDFs we can use **PyMuPDF**, a traditional parsing library. With it we can read the document page by page.
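A minimal sketch of page-by-page extraction with PyMuPDF, assuming the library is installed. The helper names (`extract_pages`, `load_pdf_pages`) and the file name in the usage note are hypothetical; the import is kept inside the loader so the page-handling logic can be read and tested on its own.

```python
def extract_pages(doc):
    """Return (page_number, text) pairs, numbering pages from 1.

    `doc` is any iterable of page objects that offer `get_text()`,
    such as an opened PyMuPDF document.
    """
    return [(i, page.get_text()) for i, page in enumerate(doc, start=1)]


def load_pdf_pages(path):
    """Open a PDF with PyMuPDF and extract the text layer of every page."""
    try:
        import pymupdf  # current import name
    except ImportError:
        import fitz as pymupdf  # import name used by older releases
    with pymupdf.open(path) as doc:
        return extract_pages(doc)
```

Usage would look like `pages = load_pdf_pages("nutrition.pdf")`, where the file name is a placeholder. Keeping the page number next to each text makes it possible to point back to the source page later.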
One question arises: what if the PDF contains images? An image is treated as an image only, so if it contains text (for example, a restaurant bill), that text is not read.
An OCR library can resolve this issue. The OCR options mentioned are Tesseract, which is open source, and Mistral.
Docling can link with an OCR tool called TOCR, so we can combine OCR with Docling.
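One way to combine a text parser with OCR is a fallback: use the embedded text layer when a page has one, and run OCR only when it does not. The sketch below assumes that rule and a made-up threshold of 20 characters; the function names are hypothetical. It does not cover pages that mix real text with text inside images. `tesseract_ocr` uses `pytesseract`, which needs the Tesseract binary installed.

```python
def page_text_with_ocr_fallback(embedded_text, render_image, ocr, min_chars=20):
    """Prefer the embedded text layer; fall back to OCR for image-only pages.

    embedded_text: text the PDF parser found on the page (may be empty)
    render_image:  zero-argument callable returning the page as an image
    ocr:           callable turning an image into a string
    min_chars:     assumed threshold below which the page counts as scanned
    """
    text = embedded_text.strip()
    if len(text) >= min_chars:
        return text
    # Rendering and OCR are slow, so they run only when actually needed.
    return ocr(render_image()).strip()


def tesseract_ocr(image):
    """OCR a PIL image (or an image file path) with Tesseract."""
    import pytesseract  # imported lazily; requires the Tesseract binary
    return pytesseract.image_to_string(image)
```

Passing the renderer as a callable keeps the expensive image rendering off the common path where the page already has a usable text layer.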
Not all data arrives as PDFs.
The workshop discusses:
Website → LLM-ready content
HTML scraping
Browser automation
Best for:
- Thousands of pages
- Automated extraction
Example:
Client with 5000 pages requiring automated ingestion.
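The HTML scraping step can be sketched with the Python standard library alone. This is an assumption-laden illustration, not the tooling used in the workshop: it covers static HTML only (no browser automation), and the set of skipped tags and the sample page are made up for the demo.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping tags that rarely hold content."""

    SKIP = {"script", "style", "nav", "footer"}  # assumed noise tags

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text outside skipped tags, with whitespace collapsed.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(" ".join(data.split()))


def html_to_text(html):
    """Turn an HTML string into newline-separated plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav>Home | About</nav><h1>Fiber</h1>"
    "<p>Fiber aids   digestion.</p><script>track()</script>"
    "</body></html>"
)
print(html_to_text(sample))
```

For thousands of pages the same function would be applied in a loop over fetched pages; sites that render content with JavaScript would need the browser automation route instead.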