Home
- Prompt Engineering
- RAG (Retrieval Augmented Generation)
- Embeddings + Vector DB
- Function Calling / Tools
Prompt engineering: writing a well-structured input (prompt) to get the correct output from an LLM.
❌ Bad Prompt: Tell me about milk
✅ Good Prompt: You are a shop assistant.
Extract product name and quantity: Input: "2 milk and 1 bread"
Output JSON:
👉 Output becomes structured:
{ "milk": 2, "bread": 1 }
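A minimal sketch of the good prompt above, in Python. The function names `build_prompt` and `parse_order` are hypothetical, and the model reply is hard-coded instead of calling a real LLM.

```python
import json


def build_prompt(user_input: str) -> str:
    """Build a structured prompt: a role, a task, the input, and the output format."""
    return (
        "You are a shop assistant.\n"
        "Extract product name and quantity.\n"
        f'Input: "{user_input}"\n'
        "Output JSON:"
    )


def parse_order(llm_output: str) -> dict[str, int]:
    """Parse the JSON reply and check that every quantity is an integer."""
    data = json.loads(llm_output)
    if not all(isinstance(qty, int) for qty in data.values()):
        raise ValueError("quantities must be integers")
    return data


prompt = build_prompt("2 milk and 1 bread")
# Hard-coded reply in the shape shown above; no model is called here.
order = parse_order('{ "milk": 2, "bread": 1 }')
print(order)
```

Because the prompt asks for JSON, the reply can be validated in code before the application uses it.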
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language models (LLMs) by allowing them to retrieve relevant external information before generating a response.
Instead of relying only on pre-trained knowledge, RAG enables models to access up-to-date, domain-specific, and private data sources, making responses more accurate and context-aware.
- Customer support chatbots
- Healthcare report summarization
- Legal and compliance systems
- Financial analysis tools
- Enterprise knowledge search systems
Traditional LLMs like GPT-style models generate responses based only on training data. This creates limitations:
- Knowledge cutoff (no real-time updates)
- Hallucinations (false or made-up answers)
- No access to private company data
- Lack of personalization/context
RAG addresses these limitations by:
- Fetching real-time relevant data
- Grounding answers in actual documents
- Reducing hallucinations
- Keeping knowledge updated without retraining
Imagine two students preparing for an exam:
Student 1 (LLM without RAG):
- Reads books once
- Answers from memory only
- Cannot verify facts
Student 2 (LLM with RAG):
- Reads books
- Can open books during exam
- Verifies answers in real-time
Student 2 performs better because they can retrieve information when needed.
- Reduces Hallucination
  Responses are more factual because they are grounded in retrieved documents. An LLM on its own can be overconfident and give a wrong answer that cannot be verified. With RAG, the answer can be checked against the retrieved data.
- Keeps Knowledge Updated
  Works with real-time and dynamic data sources. An LLM only knows what existed up to its training cutoff date, while RAG lets it use current data.
- Cost Efficient
  Avoids expensive retraining or fine-tuning. New data is made available through retrieval, without training the model again.
- Data Privacy
  Sensitive enterprise data stays within controlled systems. The model does not see the whole dataset at once; each query fetches only the data relevant to it. A large company that does not want to give a model full access to its data can control that access through RAG.
- Context Awareness
  Personalized responses using user-specific data.
  Example: an airline chatbot knows your booking details (PNR, flight time, delay status).
The indexing phase prepares data for retrieval.
Steps:
- Data Collection
  - PDFs
  - Web pages
  - Databases
  - Excel files
  - APIs
- Chunking
  - Large documents are split into smaller pieces (chunks)
  - Types of chunking:
    - Fixed-size chunking
    - Hierarchical chunking
    - Semantic chunking
- Embedding Generation
  - Text is converted into numerical vectors using embedding models.
  - Popular embedding tools:
    - OpenAI Embeddings
    - Google Gemini Embeddings
    - Sentence Transformers (Hugging Face)
- Vector Database Storage
  - Embeddings are stored in vector databases such as:
    - Pinecone
    - ChromaDB
    - FAISS
    - Elasticsearch
Vector databases enable semantic search (meaning-based search), not just keyword matching.
The query phase handles user queries.
Steps:
- User Query Input
  User asks a question.
- Query Embedding
  Query is converted into vector form.
- Similarity Search
  System finds the most relevant document chunks from the vector database.
- Context Creation
  Retrieved data is used as context.
- Augmentation
  Prompt is enriched with:
  - User query
  - Retrieved context
- Generation
  LLM generates final response using grounded context.
- Retrieval → Fetch relevant data
- Augmented → Add context to prompt
- Generation → Produce final AI response
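The three stages can be sketched end to end in plain Python. This is a toy under stated assumptions: `embed` uses word counts in place of a real embedding model, `generate` is a placeholder that calls no LLM, and the sample chunks are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Retrieval: rank stored chunks by similarity to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Augmentation: combine the retrieved context and the user query in one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Generation: placeholder for an LLM call (hypothetical, no model is invoked)."""
    return f"[LLM response to a {len(prompt)}-character prompt]"


# Invented sample chunks standing in for an indexed knowledge base.
chunks = [
    "To reset your password, open Settings and choose Reset Password.",
    "Refunds are processed within five business days.",
    "Flight delay status is shown on the booking page.",
]
query = "How do I reset my password?"
context = retrieve(query, chunks, top_k=1)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```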
LangChain = LEGO toolkit for AI apps
- What it does
- Connects LLMs
- Connects vector databases
- Handles prompts
- Creates AI agents
- Builds chatbots
- Features
- RAG pipelines
- Memory
- Tools
- Agents
- Chains
- Best For
- Production AI apps
- Advanced workflows
- AI agents
Example
from langchain.chains import RetrievalQA
LlamaIndex is focused mainly on connecting data to LLMs.
- What it does
- Reads PDFs
- Reads databases
- Creates indexes
- Retrieves data efficiently
- Best For
- RAG projects
- Document Q&A
- Fast data ingestion
Example
from llama_index.core import VectorStoreIndex  # llama-index 0.10+; older releases used "from llama_index import VectorStoreIndex"
Haystack is an enterprise-focused RAG framework.
- What it does
- Search pipelines
- Question answering
- Document retrieval
- Best For
- Enterprise search
- Large-scale systems
Easy Analogy
Haystack = Enterprise Google Search for AI
Embeddings convert text into numbers/vectors.
AI understands meaning using vectors.
Easy Example
"I love cricket"
↓
[0.23, 0.91, 0.44, ...]
Similar sentences have similar vectors.
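"Similar vectors" is usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors invented for illustration; real embedding models return hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings: two cricket sentences and one unrelated sentence.
love_cricket = [0.23, 0.91, 0.44]
enjoy_cricket = [0.25, 0.88, 0.40]
tax_deadline = [0.90, 0.05, 0.10]

print(cosine_similarity(love_cricket, enjoy_cricket))  # high: similar meaning
print(cosine_similarity(love_cricket, tax_deadline))   # lower: different meaning
```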
1. OpenAI text-embedding models
Popular cloud embedding models from OpenAI.
- Examples
- text-embedding-3-small
- text-embedding-3-large
- Features
- High quality
- Accurate semantic search
- API-based
- Best For
- Production apps
- High accuracy RAG
- Drawback
- Paid API
2. Gemini embeddings
Embedding models from Google.
- Features
- Good multilingual support
- Integrated with Gemini ecosystem
- Best For
- Google Cloud users
3. Sentence Transformers
Open-source embedding models.
Built using Transformers.
- Popular Models
- all-MiniLM-L6-v2
- bge-small
- mpnet
- Features
- Free
- Runs locally
- Fast
- Best For
- Learning RAG
- Local AI apps
- Offline systems
Easy Analogy
Sentence Transformers = Free local embedding engine
A vector database stores embeddings/vectors and is used for similarity search.
Easy Analogy
Normal Database:
Search by exact keyword
Vector Database:
Search by meaning
Example:
Question:
"How to reset password?"
It can also find:
"Forgot password steps"
because meanings are similar.
Pinecone
Cloud vector database.
- Features
- Fully managed
- Fast similarity search
- Scalable
- Best For
- Production systems
- Enterprise AI
- Drawback
- Paid service
Easy Analogy
Pinecone = Cloud storage for AI memory
FAISS
Created by Meta.
- Features
- Very fast
- Local vector search
- Open source
- Best For
- Learning
- Local applications
- High-performance search
- Drawback
- No built-in cloud/database features
Easy Analogy
FAISS = Fast local vector engine
ChromaDB
Simple vector database for beginners.
- Features
- Easy setup
- Python-friendly
- Lightweight
- Best For
- Small projects
- Beginners
- Prototypes
Easy Analogy
ChromaDB = SQLite for vectors
Elasticsearch
Traditional search engine with vector search support.
- Features
- Keyword search
- Vector search
- Hybrid search
- Best For
- Enterprise search systems
- Large-scale applications
Easy Analogy
Elasticsearch = Google-like search engine with AI support
Fixed Size Chunking
- Splits text every fixed number of tokens
- Simple but may break context
Hierarchical Chunking
- Based on paragraphs, sentences, sections
- More structured and production-friendly
Semantic Chunking
- Splits based on meaning/topic changes
- High quality but computationally expensive
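A rough sketch of the first two strategies. Assumptions: whitespace-separated words stand in for tokens, the `overlap` parameter is an addition that is commonly used to soften the broken-context problem, and `sentence_chunks` is only a simple structure-aware splitter in the spirit of hierarchical chunking. Semantic chunking needs an embedding model and is not shown.

```python
import re


def fixed_size_chunks(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split into chunks of `size` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # the last words are already covered
            break
    return chunks


def sentence_chunks(text: str, max_sentences: int = 2) -> list[str]:
    """Group whole sentences so that no sentence is cut in half."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


print(fixed_size_chunks("one two three four five six seven eight nine ten", size=4, overlap=1))
print(sentence_chunks("RAG retrieves data. It adds context. Then it generates."))
```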
The baseline Retrieval-Augmented Generation setup.
A user query is embedded, relevant documents are retrieved from a vector
database, and the LLM generates an answer grounded in that context.
Example
User: “What are the side effects of ibuprofen?”
System:
- Retrieves medical documents
- Feeds them to the LLM
- Generates a grounded answer
Usage
- FAQ bots
- Knowledge base assistants
- Documentation search
- Customer support
Best when:
- Knowledge is static or slow-changing
- No multi-step reasoning required
Combines neural methods (embeddings/LLMs) with symbolic or rule-based systems (knowledge graphs, logic rules).
It enables:
- structured reasoning
- better interpretability
- improved factual consistency
Example
User: “Who is the CEO of the company that owns Instagram?”
System:
- Uses knowledge graph to resolve relationships
- Uses LLM for explanation
- Produces accurate, structured answer
Usage
- Enterprise knowledge systems
- Compliance / regulated domains
- Complex relational queries
Best when:
- Structured + unstructured data both matter
- Logical reasoning is required
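The Instagram example can be sketched with a tiny knowledge graph. The triple layout and the relation names `owned_by` and `ceo` are hypothetical choices for this illustration; a real system would query a graph database and pass the resolved path to an LLM for the explanation.

```python
# Knowledge graph stored as (subject, relation) -> object.
GRAPH = {
    ("Instagram", "owned_by"): "Meta",
    ("Meta", "ceo"): "Mark Zuckerberg",
}


def follow(entity: str, relations: list[str]) -> list[str]:
    """Walk the graph along `relations` and return every entity visited."""
    path = [entity]
    for relation in relations:
        entity = GRAPH[(entity, relation)]
        path.append(entity)
    return path


path = follow("Instagram", ["owned_by", "ceo"])
# The symbolic path becomes grounded context that an LLM could turn into prose.
context = " -> ".join(path)
print(context)
```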
Extends RAG by incorporating past interactions or stored user context into retrieval.
- The system remembers:
- conversation history
- user preferences
- past queries
- Memory can be:
- short-term (chat history)
- long-term (stored embeddings or profiles)
- Example
- User:
- Q1: “I’m a vegetarian.”
- Q2: “Suggest high-protein foods.”
- System:
- Retrieves nutrition docs
- Also recalls user preference (vegetarian)
- Filters answer accordingly
- Usage
- Personal assistants
- Chatbots with continuity
- Recommendation systems
- Learning companions
- Best when:
- Personalization matters
- Conversations span multiple turns
- Context evolves over time
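A minimal sketch of the vegetarian example. The `Memory` class, the `diets` tags, and the keyword-based `remember` function are all hypothetical simplifications. Similarity search is left out so the memory filter stays visible.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Long-term user facts kept across turns (hypothetical structure)."""
    preferences: set[str] = field(default_factory=set)


# Invented nutrition snippets, each tagged with the diets it is compatible with.
DOCS = [
    {"text": "Lentils are high in protein.", "diets": {"vegetarian", "vegan"}},
    {"text": "Chicken breast is high in protein.", "diets": set()},
    {"text": "Paneer is high in protein.", "diets": {"vegetarian"}},
]


def remember(memory: Memory, message: str) -> None:
    """Very rough preference extraction; a real system might use an LLM here."""
    if "vegetarian" in message.lower():
        memory.preferences.add("vegetarian")


def filter_by_memory(docs: list[dict], memory: Memory) -> list[str]:
    """Keep only documents compatible with every remembered preference."""
    return [d["text"] for d in docs if memory.preferences <= d["diets"]]


memory = Memory()
remember(memory, "I'm a vegetarian.")          # Q1 updates the memory
results = filter_by_memory(DOCS, memory)       # Q2 retrieval is filtered by it
print(results)
```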
Uses knowledge graphs instead of chunks.
- Nodes = entities
- Edges = relationships
Best for:
- Fraud detection
- Legal systems
- Research platforms
RAG + autonomous decision-making. The system behaves like an agent that can decide:
- what to retrieve
- whether to retrieve
- which tools to call
- how many steps to take
It’s iterative and dynamic rather than being a single pass.
Example
User: “Compare Tesla and BYD revenue growth over the last 3 years.”
- Agent:
- Decides to fetch financial data
- Calls APIs / retrieves reports
- May refine query multiple times
- Aggregates + reasons
- Produces final comparison
- Usage
- Research assistants
- Financial analysis
- Complex Q&A requiring multiple sources
- Autonomous workflows
- Best when:
- Multi-step reasoning and query decomposition are needed
- External tools/APIs are involved
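The decide, act, observe loop can be sketched as follows. Everything here is a stand-in: `next_action` replaces the LLM's decision with a fixed rule, `fetch_revenue` is a hypothetical tool, and the company names and figures are invented placeholders, not real financial data.

```python
def fetch_revenue(company: str) -> list[float]:
    """Hypothetical tool. The figures are invented placeholders, not real data."""
    placeholder = {"CompanyA": [100.0, 120.0, 150.0], "CompanyB": [50.0, 80.0, 120.0]}
    return placeholder[company]


def next_action(companies: list[str], data: dict[str, list[float]]) -> tuple[str, str]:
    """Stand-in for the LLM's decision: fetch whatever is missing, else finish."""
    for company in companies:
        if company not in data:
            return ("fetch_revenue", company)
    return ("finish", "")


def run_agent(companies: list[str], max_steps: int = 5) -> dict[str, float]:
    """Loop: decide, act, observe. Stop when done or when the step budget runs out."""
    data: dict[str, list[float]] = {}
    for _ in range(max_steps):
        action, argument = next_action(companies, data)
        if action == "finish":
            break
        data[argument] = fetch_revenue(argument)
    # Aggregate: percentage growth from the first year to the last year.
    return {c: round((v[-1] - v[0]) / v[0] * 100, 1) for c, v in data.items()}


growth = run_agent(["CompanyA", "CompanyB"])
print(growth)
```

The `max_steps` limit is a design choice in this sketch: it keeps an agent from looping forever if it never decides to finish.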
Works with multiple data types:
- Text
- Images
- Audio
- Video
Use cases:
- Medical imaging (X-rays)
- Surveillance systems
- Audio transcription systems
The model evaluates its own output and retrieval quality.
- It can:
- decide whether retrieval is needed
- critique retrieved documents
- refine or re-retrieve before answering
- Adds a self-feedback loop inside RAG.
Example
- User: “Explain quantum entanglement simply.”
- System:
- Retrieves docs
- Generates answer
- Checks: “Is this clear? Is it sufficient?”
- If not → retrieves better sources → regenerates
- Usage
- High-accuracy QA systems
- Research assistants
- Domains needing hallucination control
- Scientific / legal applications
- Best when:
- Answer quality matters more than speed
- You want built-in validation
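A toy version of the self-feedback loop. The retriever, the generator, and the word-count critique are all placeholders: in a real system the model itself would grade the answer and the retrieved documents.

```python
def retrieve(query: str, attempt: int) -> list[str]:
    """Hypothetical retriever: the second attempt returns a better source."""
    sources = [
        ["Entanglement is a quantum phenomenon."],
        ["Entangled particles share one state, so measuring one tells you about the other."],
    ]
    return sources[min(attempt, len(sources) - 1)]


def generate(query: str, docs: list[str]) -> str:
    """Placeholder for the LLM call: here it just echoes the retrieved text."""
    return " ".join(docs)


def is_sufficient(answer: str, min_words: int = 8) -> bool:
    """Stand-in critique: a real system would ask the model to grade its own answer."""
    return len(answer.split()) >= min_words


def self_rag(query: str, max_attempts: int = 3) -> tuple[str, int]:
    """Generate, critique, and re-retrieve until the answer passes or attempts run out."""
    answer = ""
    for attempt in range(max_attempts):
        answer = generate(query, retrieve(query, attempt))
        if is_sufficient(answer):
            return answer, attempt + 1
    return answer, max_attempts


answer, attempts = self_rag("Explain quantum entanglement simply.")
print(attempts, answer)
```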
The system adapts retrieval strategy based on query complexity.
Not every query is treated the same:
- simple => no retrieval or light retrieval
- complex => multi-step retrieval
Example
- User:
- “Capital of France?” => direct answer (no retrieval)
- “Impact of inflation on emerging markets?” => deep retrieval
- Usage
- Cost-optimized systems
- Scalable chatbots
- Mixed workloads (simple + complex queries)
- Best when:
- You want efficiency + intelligence
- Handling large query volumes
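A minimal routing sketch. The complexity heuristic (word count plus a few marker words) and the strategy names are assumptions made for illustration; production systems often use a small trained classifier instead.

```python
def classify(query: str) -> str:
    """Crude complexity heuristic: long queries or analytical words count as complex."""
    words = query.split()
    complex_markers = {"impact", "compare", "why", "explain", "analyze"}
    if len(words) > 4 or any(w.lower().strip("?.,") in complex_markers for w in words):
        return "complex"
    return "simple"


def route(query: str) -> str:
    """Pick a retrieval strategy from the query's estimated complexity."""
    strategies = {"simple": "no_retrieval", "complex": "multi_step_retrieval"}
    return strategies[classify(query)]


print(route("Capital of France?"))
print(route("Impact of inflation on emerging markets?"))
```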
Focuses on fixing bad retrievals.
- The system:
- detects irrelevant or low-quality documents
- filters or re-ranks them
- may re-query before answering
It improves retrieval robustness, not just generation.
- Example
- User: “Explain CRISPR gene editing.”
- System:
- Retrieves mixed-quality docs
- Detects irrelevant ones
- Filters + re-retrieves
- Produces accurate answer
- Usage
- Noisy or unstructured data sources
- Enterprise search systems
- Systems with inconsistent indexing
- Best when:
- Retrieval quality is unreliable
- Data sources are messy
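The detect, filter, and re-query steps can be sketched like this. The word-overlap `relevance` score and the `threshold` value are assumptions; a real system would typically use a trained relevance grader or a re-ranking model.

```python
def relevance(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().replace(".", " ").split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def corrective_filter(
    query: str, docs: list[str], threshold: float = 0.5, min_docs: int = 1
) -> tuple[list[str], bool]:
    """Keep documents above the threshold and flag a re-query when too few survive."""
    kept = [d for d in docs if relevance(query, d) >= threshold]
    needs_requery = len(kept) < min_docs
    return kept, needs_requery


# Mixed-quality retrieval result: one relevant document, one irrelevant one.
docs = [
    "CRISPR gene editing uses the Cas9 enzyme to cut DNA.",
    "Our cafeteria menu changes weekly.",
]
kept, needs_requery = corrective_filter("CRISPR gene editing", docs)
print(kept, needs_requery)
```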
Uses attention mechanisms to prioritize the most relevant parts of retrieved documents.
Instead of treating all retrieved chunks equally, the model:
- assigns weights to different passages
- focuses on high-signal content during generation
Example
User: “Causes of climate change?”
- System:
- Retrieves multiple documents
- Uses attention to emphasize key sections (e.g., greenhouse gases)
- Generates answer based on most relevant snippets
- Usage
- Long-document QA
- Summarization systems
- Legal / research analysis
- Best when:
- Retrieved context is large or noisy
- Fine-grained relevance matters
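One simple way to assign weights to passages is a softmax over relevance scores. This is a simplified stand-in for a learned attention mechanism, and the passages and scores below are invented for illustration.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_weighted(passages: list[str], scores: list[float], keep: int = 2) -> list[tuple[str, float]]:
    """Pair each passage with its weight and keep the highest-weighted ones."""
    weighted = sorted(zip(passages, softmax(scores)), key=lambda p: p[1], reverse=True)
    return weighted[:keep]


passages = ["greenhouse gases", "weather trivia", "unrelated text"]
scores = [2.0, 0.5, 0.1]  # made-up relevance scores
weights = softmax(scores)
top = top_weighted(passages, scores, keep=1)
print(weights)
print(top)
```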
Optimizes RAG pipelines under cost constraints (tokens, API calls, latency).
- The system dynamically:
- limits retrieval depth
- chooses cheaper models when possible
- balances cost vs quality
Example
User: “Summarize this document.”
- System:
- Checks budget
- Uses fewer retrieved chunks or smaller model
- Produces acceptable answer within cost limits
- Usage
- Production systems at scale
- SaaS AI products
- High-volume query environments
- Best when:
- Cost efficiency is critical
- Trade-offs between quality and expense are needed
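A sketch of budget-aware retrieval. Assumptions: one token per whitespace-separated word as a rough estimate, chunks already ranked by relevance, and hypothetical model names `small-model` and `large-model`.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about one token per whitespace-separated word."""
    return len(text.split())


def select_within_budget(ranked_chunks: list[str], token_budget: int) -> list[str]:
    """Take chunks in ranked order until adding another would exceed the budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected


def choose_model(token_budget: int, threshold: int = 1000) -> str:
    """Use the cheaper (hypothetical) model when the budget is tight."""
    return "small-model" if token_budget < threshold else "large-model"


ranked = ["a b c", "d e", "f g h i"]
print(select_within_budget(ranked, token_budget=5))
print(choose_model(500))
```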
Focuses on making RAG outputs explainable.
- The system:
- shows retrieved sources
- provides reasoning traces
- justifies answers with evidence
Example
User: “Why was this loan rejected?”
- System:
- Retrieves policy documents
- Generates answer
- Explains decision with cited rules and reasoning
- Usage
- Finance / healthcare
- Legal systems
- Auditable AI applications
- Best when:
- Transparency is mandatory
- Decisions must be
RAG is a powerful AI architecture that bridges the gap between:
- Static knowledge (LLMs)
- Dynamic real-world data (databases, APIs, documents)
It enables:
- Smarter AI systems
- Context-aware responses
- Enterprise-ready AI applications
Final Thought
RAG is not a single technology; it is a design pattern that combines retrieval systems and generative AI to create intelligent, real-world-ready applications.