003_Production level RAG Workshop: Part 1

What is RAG?

Definition

RAG stands for:

Retrieval-Augmented Generation

It combines:

Retrieval
- Fetching relevant information from an external knowledge source
Generation
- Using an LLM to generate a response

Instead of relying only on the LLM’s pretrained knowledge, RAG supplements the LLM with retrieved context from external documents.

External tools

Lovable
Supabase

Why RAG Exists

Problem: Context Window Limitations

Consider a 1200-page nutrition textbook.

A naive solution:

   Question + Entire PDF
            ↓
           LLM

Problems:

Too many tokens
High cost
Context window overflow
Hallucinations
Slow responses

Example discussed:

PDF size ≈ 400K tokens

GPT context window ≈ 128K tokens

The entire document cannot fit into the context window at once.
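The arithmetic behind this example can be checked with a short sketch. The token counts are the approximate figures quoted above; `RESERVED_TOKENS` is an assumed budget for the question and the answer, not a number from the workshop.

```python
import math

PDF_TOKENS = 400_000       # approximate PDF size from the example above
CONTEXT_WINDOW = 128_000   # approximate context window from the example above
RESERVED_TOKENS = 8_000    # assumption: room kept for the question and the answer


def fits_in_context(doc_tokens: int, window: int, reserved: int = 0) -> bool:
    """Return True if the document plus the reserved tokens fits in the window."""
    return doc_tokens + reserved <= window


def min_passes(doc_tokens: int, window: int, reserved: int = 0) -> int:
    """Minimum number of separate prompts needed to show the model every token."""
    usable = window - reserved
    if usable <= 0:
        raise ValueError("reserved tokens leave no room for the document")
    return math.ceil(doc_tokens / usable)


print(fits_in_context(PDF_TOKENS, CONTEXT_WINDOW, RESERVED_TOKENS))  # False
print(min_passes(PDF_TOKENS, CONTEXT_WINDOW, RESERVED_TOKENS))       # 4
```

Even in the best case the textbook would have to be split across at least four prompts, which is why sending the whole PDF with every question is not practical.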

Hallucination Problem

When the relevant information is missing from the prompt:

The LLM may answer from pretrained knowledge rather than the provided document.

This leads to:

Incorrect answers
Non-grounded responses
Hallucinations

RAG helps reduce this issue by supplying only relevant document sections.
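One way to picture "supplying only relevant document sections" is a prompt template that carries just those sections and tells the model to stay within them. This is a minimal sketch: the function name `build_grounded_prompt` and the instruction wording are assumptions for illustration, not the workshop's prompt.

```python
def build_grounded_prompt(question: str, sections: list[str]) -> str:
    """Assemble a prompt that contains only the retrieved sections."""
    if not sections:
        # With nothing retrieved, say so explicitly instead of sending a bare question.
        context = "(no relevant sections were found)"
    else:
        context = "\n\n".join(
            f"[Section {i}]\n{text.strip()}"
            for i, text in enumerate(sections, start=1)
        )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, reply \"I don't know\".\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_grounded_prompt(
    "What does vitamin C do?",
    ["Vitamin C supports the immune system."],
))
```

The explicit fallback instruction gives the model a way to decline when the document does not cover the question.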

Open-Book Exam Analogy

The workshop explains RAG using an open-book exam.

Without RAG

    A student answers questions using memory only.

Equivalent:

    User Question
          ↓
         LLM
          ↓
       Answer

With RAG

A student:

Searches the book
Finds relevant pages
Uses both retrieved information and existing knowledge

Equivalent:

    User Question
           ↓
     Retrieval
           ↓
    Relevant Context
           ↓
          LLM
           ↓
        Answer

This is the core intuition behind RAG.
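The "with RAG" flow above can be sketched end to end in a few lines. This toy version ranks pages by how many words they share with the question. That scoring is a stand-in chosen to keep the example self-contained, and the `llm` argument is a hypothetical callable supplied by the caller. None of the names here come from the workshop.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, pages: list[str], k: int = 2) -> list[str]:
    """Return up to k pages that share the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(page)), page) for page in pages]
    # Keep only pages with at least one shared word, best score first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [page for _, page in ranked[:k]]


def answer_with_rag(question: str, pages: list[str], llm: Callable[[str], str]) -> str:
    """User question -> retrieval -> relevant context -> LLM -> answer."""
    context = "\n".join(retrieve(question, pages, k=1))
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


pages = [
    "Vitamin C supports the immune system and is found in citrus fruit.",
    "Protein is needed to build and repair muscle tissue.",
    "Iron carries oxygen in the blood.",
]


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model: it returns the prompt it was given."""
    return prompt


print(answer_with_rag("Which vitamin supports the immune system?", pages, echo_llm))
```

Only the vitamin C page reaches the model, which is the open-book behaviour: search the book, find the relevant page, then answer.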

Evolution of RAG

RAG in 2021

Main objective:

Reduce hallucinations

Architecture:

    Documents
       ↓
    Retrieval
       ↓
    LLM
       ↓
    Answer

RAG Today

RAG is viewed as part of a larger discipline:

Context Engineering

Components include:

Retrieval
Prompt Engineering
Memory
State Management
Embeddings
Vector Databases
Long Context Windows

Modern RAG is therefore a subset of context engineering.

Whole App Flow

Step 1:

Document Pre-processing

When we process a data file such as a PDF, we need a library to extract its content. Which library to use depends on the kind of data the file contains, as listed below.

1. PDF can contain text only

    We can use a library such as **PyMuPDF**. It is a traditional PDF library. With it, we can work through the different pages of the document.

This raises a question: if the PDF contains an image, the image is treated as an image only. If that image contains text, such as a restaurant bill, the text will not be read.
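A minimal sketch of page-by-page extraction with PyMuPDF follows. It also flags pages that have images but almost no text layer, since those are the ones where text is probably locked inside an image. The `needs_ocr` helper and its `min_chars` threshold are assumptions for illustration, not part of PyMuPDF or the workshop.

```python
def needs_ocr(text: str, image_count: int, min_chars: int = 20) -> bool:
    """Flag a page that has images but almost no extractable text.

    min_chars is an assumed threshold, not a value from the workshop.
    """
    return image_count > 0 and len(text.strip()) < min_chars


def extract_pages(pdf_path: str) -> list[dict]:
    """Extract the text layer of each page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so needs_ocr works without it installed

    doc = fitz.open(pdf_path)
    try:
        pages = []
        for number, page in enumerate(doc, start=1):
            text = page.get_text()  # text layer only; text inside images is not read
            image_count = len(page.get_images())
            pages.append({
                "page": number,
                "text": text,
                "needs_ocr": needs_ocr(text, image_count),
            })
        return pages
    finally:
        doc.close()
```

The heuristic only catches pages that are mostly scanned. A page with plenty of normal text plus one photographed bill would not be flagged, so treat it as a first filter.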

2. PDF can contain text + images

An OCR library can resolve the issue above. The workshop names Tesseract and Mistral as the best OCR options. Tesseract is open source, while Mistral OCR is offered as a hosted API.

3. PDF can contain text + images + tables

Docling can link with an OCR tool called TOCR, so we can use OCR + Docling.
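The three cases above amount to a small routing decision, sketched here. The tool names come from the notes. The `classify` and `choose_tools` functions, and the rule that tables take precedence, are assumptions made for illustration.

```python
from enum import Enum


class PdfContent(Enum):
    TEXT_ONLY = "text only"
    TEXT_AND_IMAGES = "text + images"
    TEXT_IMAGES_TABLES = "text + images + tables"


# Tool names follow the notes above; a real pipeline may combine them differently.
TOOLS = {
    PdfContent.TEXT_ONLY: ["PyMuPDF"],
    PdfContent.TEXT_AND_IMAGES: ["OCR"],
    PdfContent.TEXT_IMAGES_TABLES: ["OCR", "Docling"],
}


def classify(has_images: bool, has_tables: bool) -> PdfContent:
    """Map what a PDF contains to one of the three cases."""
    if has_tables:
        # Assumption: any table sends the file down the most capable path.
        return PdfContent.TEXT_IMAGES_TABLES
    if has_images:
        return PdfContent.TEXT_AND_IMAGES
    return PdfContent.TEXT_ONLY


def choose_tools(has_images: bool, has_tables: bool) -> list[str]:
    """Return the tools suggested for a PDF with the given content."""
    return TOOLS[classify(has_images, has_tables)]


print(choose_tools(has_images=False, has_tables=False))
print(choose_tools(has_images=True, has_tables=True))
```

Deciding the case per file, or per page, keeps the cheaper text-only path for documents that do not need OCR.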

Data Collection and Scraping

Not all data arrives as PDFs.

The workshop discusses:

Firecrawl

  Website → LLM-ready content

BeautifulSoup

  HTML scraping

Puppeteer

  Browser automation

Best for:

Thousands of pages
Automated extraction

Example:

Client with 5000 pages requiring automated ingestion.
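To make the HTML scraping step concrete, here is a minimal sketch that turns raw HTML into plain text for many pages at once. It uses Python's standard-library `html.parser` as a stand-in so the example runs without extra packages. It is not the BeautifulSoup or Firecrawl API, and the names `TextExtractor`, `html_to_text`, and `ingest` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def ingest(pages: dict[str, str]) -> dict[str, str]:
    """Convert many pages (URL -> raw HTML) to plain text in one pass."""
    return {url: html_to_text(html) for url, html in pages.items()}


sample = "<html><body><h1>Menu</h1><p>Soup of the day</p></body></html>"
print(html_to_text(sample))
```

Fetching the HTML is left out on purpose. For thousands of pages, the download step would typically be handled by a crawler or browser automation tool like those listed above.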
