Amresh Verma edited this page May 28, 2026 · 52 revisions
  • Prompt Engineering
  • RAG (Retrieval Augmented Generation)
  • Embeddings + Vector DB
  • Function Calling / Tools

Prompt Engineering

📌 What is it?

Writing a well-designed input (prompt) to get the correct output from an LLM.

🎯 Example

❌ Bad Prompt: Tell me about milk

✅ Good Prompt: You are a shop assistant.

Extract product name and quantity: Input: "2 milk and 1 bread"

Output JSON:

👉 Output becomes structured:

{ "milk": 2, "bread": 1 }
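The good prompt above can be built and checked in code. The following is a minimal sketch, not a definitive implementation: `PROMPT_TEMPLATE`, `build_prompt`, and `parse_order` are hypothetical names, and `sample_reply` is a hand-written stand-in for a model reply (no LLM is called).

```python
import json

# Template based on the "good prompt" above; {order} is filled in per request.
PROMPT_TEMPLATE = (
    "You are a shop assistant.\n"
    "Extract product name and quantity.\n"
    'Input: "{order}"\n'
    "Output JSON:"
)

def build_prompt(order: str) -> str:
    """Fill the customer's order text into the prompt template."""
    return PROMPT_TEMPLATE.format(order=order)

def parse_order(llm_output: str) -> dict[str, int]:
    """Parse the model's JSON reply and check that every quantity is an integer."""
    data = json.loads(llm_output)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, qty in data.items():
        if not isinstance(qty, int):
            raise ValueError(f"quantity for {name!r} is not an integer")
    return data

prompt = build_prompt("2 milk and 1 bread")
# Stand-in for a real model reply; no LLM is called in this sketch.
sample_reply = '{ "milk": 2, "bread": 1 }'
order = parse_order(sample_reply)
```

Asking for JSON makes the reply machine-readable, so the application can validate it before using it.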

RAG (Retrieval Augmented Generation)

Overview

Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language models (LLMs) by allowing them to retrieve relevant external information before generating a response.

Instead of relying only on pre-trained knowledge, RAG enables models to access up-to-date, domain-specific, and private data sources, making responses more accurate and context-aware.

RAG is widely used in:

  • Customer support chatbots
  • Healthcare report summarization
  • Legal and compliance systems
  • Financial analysis tools
  • Enterprise knowledge search systems

Why RAG is Important

Traditional LLMs like GPT-style models generate responses based only on training data. This creates limitations:

Limitations of standard LLMs

  • Knowledge cutoff (no real-time updates)
  • Hallucinations (false or made-up answers)
  • No access to private company data
  • Lack of personalization/context

RAG solves these problems by:

  • Fetching real-time relevant data
  • Grounding answers in actual documents
  • Reducing hallucinations
  • Keeping knowledge updated without retraining

Real-Life Example

Imagine two students preparing for an exam:

Student 1 (LLM only)

  • Reads books once
  • Answers from memory only
  • Cannot verify facts

Student 2 (RAG system)

  • Reads books
  • Can open books during exam
  • Verifies answers in real-time

Student 2 performs better because they can retrieve information when needed.

Key Benefits of RAG

  1. Reduces Hallucination

    Grounding answers in retrieved documents makes responses more factual. A standalone LLM can be
    overconfident and return wrong answers that cannot be checked against a source. With RAG, each
    answer can be verified against the retrieved data, so responses stay grounded in real information.
    
  2. Keeps Knowledge Updated

     Works with real-time and dynamic data sources. An LLM has a knowledge cutoff date: it only knows
     what existed up to its training date. With RAG, the model can draw on current, up-to-date data.
    
  3. Cost Efficient

     Avoids expensive retraining or fine-tuning of models. When new data arrives, RAG makes it available
     to the model through retrieval, with no training or fine-tuning step.
    
  4. Data Privacy

    Sensitive enterprise data stays within controlled systems. The model never sees the whole dataset at once;
    for each query it receives only the chunks that were retrieved. A large company that does not want to give
    a model full access to its data can use RAG to control what is exposed.
    
  5. Context Awareness

    Personalized responses using user-specific data.
    

Example:

     Airline chatbot knows your booking details (PNR, flight time, delay status)

RAG Architecture Overview

RAG consists of two major pipelines:

1. Ingestion Pipeline

This prepares data for retrieval.

Steps:

  • Data Collection
    • PDFs
    • Web pages
    • Databases
    • Excel files
    • APIs
  • Chunking
    • Large documents are split into smaller pieces (chunks)
    • Types of chunking:
      • Fixed-size chunking
      • Hierarchical chunking
      • Semantic chunking
  • Embedding Generation
    • Text is converted into numerical vectors using embedding models.
    • Popular embedding tools:
      • OpenAI Embeddings
      • Google Gemini Embeddings
      • Sentence Transformers (Hugging Face)
  • Vector Database Storage
    • Embeddings are stored in vector databases such as:
      • Pinecone
      • ChromaDB
      • FAISS
      • Elasticsearch

Vector databases enable semantic search (meaning-based search), not just keyword matching.
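The ingestion steps above (collection, chunking, embedding, storage) can be sketched end to end with the standard library. This is an illustration under stated assumptions: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, `DIM` is an arbitrary toy size, and a plain Python list plays the role of the vector database.

```python
import hashlib
import math

DIM = 64  # assumed toy vector size; real embedding models use hundreds of dimensions

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def split_into_chunks(document: str) -> list[str]:
    """Chunking step: one chunk per non-empty paragraph."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def ingest(documents: dict[str, str]) -> list[dict]:
    """Run chunking -> embedding -> storage; returns the in-memory 'vector store'."""
    store = []
    for source, text in documents.items():
        for position, chunk in enumerate(split_into_chunks(text)):
            store.append({
                "source": source,
                "position": position,
                "text": chunk,
                "vector": toy_embed(chunk),
            })
    return store

# Data collection step: a made-up document keyed by a hypothetical file name.
docs = {
    "faq.txt": "To reset your password, open Settings.\n\nRefunds take five business days.",
}
vector_store = ingest(docs)
```

Each stored record keeps the original text next to its vector, so the retrieval pipeline can return readable chunks, not just numbers.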


2. Retrieval Pipeline

This handles user queries.

Steps:

  • User Query Input

    User asks a question.
    
  • Query Embedding

    Query is converted into vector form.
    
  • Similarity Search

    System finds the most relevant document chunks from the vector database.
    
  • Context Creation

    Retrieved data is used as context.
    
  • Augmentation

    Prompt is enriched with:

    • User query
    • Retrieved context
    
  • Generation

    LLM generates final response using grounded context.
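The retrieval steps above can be sketched as follows. This is an illustration, not a production retriever: `overlap_score` uses word overlap as a stand-in for vector similarity, the chunk texts are made up, and the final LLM call is left out, so the sketch stops at the augmented prompt.

```python
def overlap_score(query: str, chunk: str) -> float:
    """Stand-in for vector similarity: Jaccard overlap between word sets."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Similarity search: return the k chunks that score highest for the query."""
    ranked = sorted(chunks, key=lambda chunk: overlap_score(query, chunk), reverse=True)
    return ranked[:k]

def build_augmented_prompt(query: str, context_chunks: list[str]) -> str:
    """Augmentation: combine retrieved context and the user query into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Made-up chunks standing in for the contents of a vector database.
chunks = [
    "to reset your password open the settings page",
    "refunds are processed within five business days",
    "our office is closed on public holidays",
]
query = "how do i reset my password"
top_chunks = retrieve(query, chunks, k=1)
prompt = build_augmented_prompt(query, top_chunks)
```

The instruction to answer only from the context is one way to keep the generation step grounded; the exact wording is a design choice.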
    

Combined flow

[Figure: combined flow of the ingestion and retrieval pipelines]

Why It is Called RAG

  • Retrieval → Fetch relevant data
  • Augmented → Add context to prompt
  • Generation → Produce final AI response

RAG Technologies & Tools

Frameworks

1. LangChain

     LangChain = LEGO toolkit for AI apps
  • What it does
    • Connects LLMs
    • Connects vector databases
    • Handles prompts
    • Creates AI agents
    • Builds chatbots
  • Features
    • RAG pipelines
    • Memory
    • Tools
    • Agents
    • Chains
  • Best For
    • Production AI apps
    • Advanced workflows
    • AI agents

Example

    from langchain.chains import RetrievalQA

2. LlamaIndex

Focused mainly on connecting data to LLMs.

  • What it does
    • Reads PDFs
    • Reads databases
    • Creates indexes
    • Retrieves data efficiently
  • Best For
    • RAG projects
    • Document Q&A
    • Fast data ingestion

Example

    from llama_index.core import VectorStoreIndex  # llama-index 0.10+; older releases used: from llama_index import VectorStoreIndex

3. Haystack

Enterprise-focused RAG framework.

  • What it does
    • Search pipelines
    • Question answering
    • Document retrieval
  • Best For
    • Enterprise search
    • Large-scale systems

Easy Analogy

   Haystack = Enterprise Google Search for AI

Embedding Models

Embeddings convert text into numbers/vectors.

AI understands meaning using vectors.

Easy Example

  "I love cricket"
         ↓
  [0.23, 0.91, 0.44, ...]

  Similar sentences have similar vectors.
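"Similar vectors" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up vectors, for illustration only (not produced by a real model).
love_cricket = [0.23, 0.91, 0.44]   # "I love cricket"
enjoy_cricket = [0.25, 0.88, 0.40]  # a sentence with a similar meaning
tax_filing = [0.95, 0.05, 0.10]     # an unrelated sentence

similar = cosine_similarity(love_cricket, enjoy_cricket)
different = cosine_similarity(love_cricket, tax_filing)
```

Here `similar` comes out close to 1 while `different` is much lower, which is the property similarity search relies on.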

1. OpenAI text-embedding models

       Popular cloud embedding models from OpenAI.
  • Examples
    • text-embedding-3-small
    • text-embedding-3-large
  • Features
    • High quality
    • Accurate semantic search
    • API-based
  • Best For
    • Production apps
    • High accuracy RAG
  • Drawback
    • Paid API

2. Gemini embeddings

Embedding models from Google.

  • Features
    • Good multilingual support
    • Integrated with Gemini ecosystem
  • Best For
    • Google Cloud users

3. Sentence Transformers

Open-source embedding models.

Built using Transformers.

  • Popular Models

    • all-MiniLM-L6-v2
    • bge-small
    • mpnet
  • Features

    • Free
    • Runs locally
    • Fast
  • Best For

    • Learning RAG
    • Local AI apps
    • Offline systems

Easy Analogy

    Sentence Transformers = Free local embedding engine

Vector Databases

Stores embeddings/vectors. Used for similarity search.

Easy Analogy

Normal Database:

   Search by exact keyword

Vector Database:

   Search by meaning

Example:

Question:

  "How to reset password?"
   It can also find:
  "Forgot password steps"

because meanings are similar.

  • Pinecone: Cloud vector database.
    • Features
      • Fully managed
      • Fast similarity search
      • Scalable
    • Best For
      • Production systems
      • Enterprise AI
    • Drawback
      • Paid service

Easy Analogy

   Pinecone = Cloud storage for AI memory
  • FAISS: Created by Meta.
    • Features
      • Very fast
      • Local vector search
      • Open source
    • Best For
      • Learning
      • Local applications
      • High-performance search
    • Drawback
      • No built-in cloud/database features

Easy Analogy

   FAISS = Fast local vector engine
  • ChromaDB: Simple vector database for beginners.

    • Features
      • Easy setup
      • Python-friendly
      • Lightweight
    • Best For
      • Small projects
      • Beginners
      • Prototypes

Easy Analogy

   ChromaDB = SQLite for vectors
  • Elasticsearch: Traditional search engine with vector search support.
    • Features
      • Keyword search
      • Vector search
      • Hybrid search
    • Best For
      • Enterprise search systems
      • Large-scale applications

Easy Analogy

 Elasticsearch = Google-like search engine with AI support

RAG Chunking Strategies

  1. Fixed Size Chunking
    • Splits text every fixed number of tokens
    • Simple but may break context
  2. Hierarchical Chunking
    • Based on paragraphs, sentences, sections
    • More structured and production-friendly
  3. Semantic Chunking
    • Splits based on meaning/topic changes
    • High quality but computationally expensive
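The first two strategies can be sketched with the standard library. Some assumptions: words stand in for tokens, the `size` and `overlap` values are arbitrary, and sentences are split with a simple regular expression. Semantic chunking is not shown because it needs an embedding model to detect topic changes.

```python
import re

def fixed_size_chunks(text: str, size: int = 5, overlap: int = 1) -> list[str]:
    """Fixed-size chunking: windows of `size` words, each sharing `overlap` words with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # the last window reached the end of the text
            break
    return chunks

def hierarchical_chunks(text: str) -> list[list[str]]:
    """Hierarchical chunking: split into paragraphs, then each paragraph into sentences."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [re.split(r"(?<=[.!?])\s+", p) for p in paragraphs]

sample = "RAG retrieves documents. It then builds a prompt.\n\nThe LLM writes the answer."
fixed = fixed_size_chunks(sample, size=5, overlap=1)
nested = hierarchical_chunks(sample)
```

In `fixed`, the first chunk ends mid-sentence ("... It then"), which shows how fixed-size splitting can break context; the overlap softens this by repeating the boundary word in the next chunk.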

Types of RAG Architectures


1. Standard RAG

Basic retrieval + generation pipeline

Best for:

  • FAQ systems
  • Simple chatbots

2. Hybrid RAG

Combines:

  • Keyword search
  • Semantic vector search

Best for:

  • Enterprise search
  • E-commerce search systems
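This page does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the required method. The document ids and both rankings are made up, and `k = 60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids; a higher combined score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank), so top positions count most.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from the two searches, best match first.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

RRF only needs rank positions, so keyword scores and vector similarities never have to be put on the same scale.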

3. RAG with Memory

Stores conversation history

Best for:

  • Chatbots
  • Virtual assistants
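One simple way to store conversation history is a sliding window over recent turns. The sketch below is an assumption about how memory could be wired in: `ConversationMemory` is a hypothetical class, and the airline-style dialogue is made up for illustration.

```python
from collections import deque

class ConversationMemory:
    """Keeps only the most recent turns so the prompt stays within a size budget."""

    def __init__(self, max_turns: int = 3):
        self.turns = deque(maxlen=max_turns)  # older turns are dropped automatically

    def add(self, user: str, assistant: str) -> None:
        """Record one completed exchange."""
        self.turns.append((user, assistant))

    def as_prompt(self, new_question: str, context: str) -> str:
        """Build a prompt from the stored history, retrieved context, and the new question."""
        history = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)
        return (
            f"History:\n{history}\n\n"
            f"Context:\n{context}\n\n"
            f"User: {new_question}\nAssistant:"
        )

memory = ConversationMemory(max_turns=2)
memory.add("What is my PNR?", "Your PNR is on your ticket email.")
memory.add("Is my flight delayed?", "It is on time.")
memory.add("Which gate?", "Gate information opens two hours before departure.")
prompt = memory.as_prompt(
    "Can I change my seat?",
    context="Seat changes are free up to 24 hours before departure.",
)
```

With `max_turns=2`, the first exchange has already been dropped by the time the prompt is built; longer-lived systems often summarise old turns instead of discarding them.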

4. Graph RAG

Uses knowledge graphs instead of chunks

Nodes = entities

Edges = relationships

Best for:

  • Fraud detection
  • Legal systems
  • Research platforms
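Retrieval over a knowledge graph can be sketched as a bounded walk from the entity mentioned in the query. Everything here is hypothetical: the triples are made up in the spirit of the fraud-detection use case, and `retrieve_facts` is an illustrative helper, not the API of any graph library.

```python
from collections import deque

# Hypothetical knowledge graph: (subject, relation, object) triples.
TRIPLES = [
    ("Account A", "transferred_to", "Account B"),
    ("Account B", "transferred_to", "Account C"),
    ("Account C", "owned_by", "Person X"),
    ("Account D", "owned_by", "Person Y"),
]

def build_graph(triples):
    """Index edges by their source node: node -> list of (relation, neighbour)."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, []).append((relation, obj))
    return graph

def retrieve_facts(graph, start, max_hops=2):
    """Breadth-first walk from `start`, collecting edges whose source is fewer than max_hops away."""
    facts, seen, queue = [], {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand nodes at the hop limit
        for relation, neighbour in graph.get(node, []):
            facts.append(f"{node} {relation} {neighbour}")
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return facts

graph = build_graph(TRIPLES)
facts = retrieve_facts(graph, "Account A", max_hops=2)
```

The collected facts would then be placed in the prompt as context, the same way text chunks are in standard RAG.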

5. Agentic RAG

Uses multiple steps and reasoning agents

Features:

  • Multi-query decomposition
  • Tool usage (web, APIs)
  • Autonomous decision making

Best for:

  • Complex analytical tasks
  • Financial comparisons
  • Research workflows
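The features above can be sketched as a loop that splits a query, picks a tool for each part, and gathers evidence. This is a toy under loud assumptions: in a real agent an LLM would do the decomposition and tool choice, while here `decompose` and `choose_tool` are simple rule-based stand-ins, and both tools are hypothetical stubs that return placeholder strings.

```python
from typing import Callable

def search_documents(question: str) -> str:
    """Hypothetical tool: would query the vector database."""
    return f"[docs] results for: {question}"

def call_finance_api(question: str) -> str:
    """Hypothetical tool: would call an external API."""
    return f"[api] results for: {question}"

TOOLS: dict[str, Callable[[str], str]] = {
    "docs": search_documents,
    "api": call_finance_api,
}

def decompose(query: str) -> list[str]:
    """Stand-in for LLM query decomposition: split on ' and '."""
    return [part.strip() for part in query.split(" and ") if part.strip()]

def choose_tool(sub_query: str) -> str:
    """Stand-in for the agent's decision: price questions go to the API, the rest to documents."""
    return "api" if "price" in sub_query.lower() else "docs"

def run_agent(query: str) -> list[str]:
    """Decompose the query, pick a tool per sub-query, and gather the evidence."""
    evidence = []
    for sub_query in decompose(query):
        tool_name = choose_tool(sub_query)
        evidence.append(TOOLS[tool_name](sub_query))
    return evidence

evidence = run_agent("current share price of Company A and summary of its annual report")
```

The gathered evidence would be passed to the LLM as context for the final answer.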

6. Multimodal RAG

Works with multiple data types:

  • Text
  • Images
  • Audio
  • Video

Use cases:

  • Medical imaging (X-rays)
  • Surveillance systems
  • Audio transcription systems

7. Self-RAG (Reflective RAG)

  • Generates draft answer
  • Critically evaluates it
  • Improves response iteratively

Best for:

  • Research systems
  • High-accuracy applications
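The draft, evaluate, improve cycle described above can be sketched as a small loop. This follows the description on this page only and is not a reference implementation of any published method: `fake_generate` and `fake_critique` are hypothetical stand-ins for LLM calls, and `max_rounds` is an arbitrary safety limit.

```python
from typing import Callable

def self_rag(
    question: str,
    generate: Callable[[str, str], str],
    critique: Callable[[str, str], str],
    max_rounds: int = 3,
) -> str:
    """Draft an answer, ask a critic for feedback, and revise until the critic has none."""
    feedback = ""
    answer = generate(question, feedback)
    for _ in range(max_rounds):
        feedback = critique(question, answer)
        if not feedback:  # empty feedback means the critic accepts the answer
            break
        answer = generate(question, feedback)
    return answer

# Hypothetical stand-ins for LLM calls, so the loop can run without a model.
def fake_generate(question: str, feedback: str) -> str:
    return "Paris (source: geography.txt)" if "cite" in feedback else "Paris"

def fake_critique(question: str, answer: str) -> str:
    return "" if "source:" in answer else "Please cite a source."

final_answer = self_rag("What is the capital of France?", fake_generate, fake_critique)
```

Passing `generate` and `critique` in as functions keeps the loop independent of any particular model or provider.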

Summary

RAG is a powerful AI architecture that bridges the gap between:

  • Static knowledge (LLMs)
  • Dynamic real-world data (databases, APIs, documents)

It enables:

  • Smarter AI systems
  • Context-aware responses
  • Enterprise-ready AI applications

Final Thought

RAG is not a single technology—it is a design pattern that combines retrieval systems and generative AI to create intelligent, real-world-ready applications.
