Skip to content

achintmehta/langchain

Repository files navigation

LangChain & LangGraph — Learning Guide

A hands-on tutorial repo covering the core concepts of building LLM-powered applications with LangChain and LangGraph. Each concept is backed by a working Python program you can run locally against a llama-server or LM Studio instance.


Table of Contents

  1. Setup & Getting Started
  2. Core LangChain Concepts
  3. Chunking Strategies
  4. Embeddings & Vector Databases
  5. Indexing Strategies
  6. RAG Strategies
  7. Query Transformation
  8. Cognitive Architectures with LangGraph
  9. Agent Patterns
  10. Guardrails
  11. Topics to Explore Next

1. Setup & Getting Started

Prerequisites

  • Python 3.11+
  • A local LLM server. The examples in this repo use either:
    • llama-server (from llama.cpp) — default port 8080
    • LM Studio — default port 1234

Install dependencies

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install langchain langchain-openai langchain-community langchain-huggingface langgraph langchain-postgres sentence-transformers

Start your local LLM server

llama-server:

llama-server -m /path/to/your/model.gguf --port 8080 --host 0.0.0.0

LM Studio: load a model and click "Start Server" (port 1234 by default).


2. Core LangChain Concepts

OpenAI vs ChatOpenAI

LangChain exposes two client classes for talking to LLMs:

Class API endpoint Input Output Use when
ChatOpenAI /v1/chat/completions List of Message objects AIMessage Modern chat models (recommended)
OpenAI /v1/completions Plain string Plain string Legacy completion models

See openai_vs_chatopenai_example.py for a side-by-side comparison.

ChatOpenAI example:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed", model="llama")

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="What is the capital of France?")
]
response = llm.invoke(messages)
print(response.content)

The simplest working example is in simple_example.py.

LCEL — LangChain Expression Language

LCEL lets you compose chains using the pipe operator |. A prompt template piped into an LLM is all you need for most single-step tasks.

from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant"),
    ("human", "What is the capital of {country}?")
])

chain = template | llm
response = chain.invoke({"country": "France"})

See the lcel_example() function inside llama_server_client.py.

Streaming, Batch & Invoke

LangChain supports three invocation patterns on any runnable:

# Single response
response = llm.invoke([HumanMessage(content="Hello")])

# Multiple prompts in parallel
responses = llm.batch([[HumanMessage(content="What is 2+2?")],
                       [HumanMessage(content="Name a planet.")]])

# Token-by-token streaming
for chunk in llm.stream([HumanMessage(content="Tell me a story.")]):
    print(chunk.content, end="", flush=True)

All three patterns are demonstrated in llama_server_client.py.


3. Chunking Strategies

Before you can do retrieval, you need to break documents into chunks. The right strategy depends on your document type and query patterns.

Strategy One-line summary Typical use case
Fixed-size window Equal-sized chunks by tokens/chars, often with overlap Logs, transcripts, messy text
Natural-boundary Group sentences/paragraphs up to a size target Articles, policies, how-to guides
Structure-aware Use headings, HTML tags, PDF sections as chunk boundaries API docs, manuals, textbooks
Semantic / embedding-based Split where topic shifts using embeddings Dense legal/research documents
Sliding window with overlap Overlapping chunks so boundary content appears in multiple chunks Any RAG system needing high recall
Metadata / layout-based Chunk per logical unit (FAQ pair, table row, UI block) FAQs, catalogs, product listings
Agentic / dynamic Agent chooses strategy per document type Heterogeneous corpora
Hierarchical / multi-level Multiple granularities: sentence-level + section-level Systems balancing precision & context
Time- / event-based Windows by time or session/trace ID Logs, observability, telemetry

Using LangChain's text splitters

chunking/chunking_data.py shows loading documents from text files, PDFs, and the web, then splitting with RecursiveCharacterTextSplitter. There are also language-aware splitters for Python and Markdown:

from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Generic splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(TextLoader("./file.txt").load())

# Language-aware splitter for code
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=60, chunk_overlap=0
)

Choosing chunk size and overlap

A few rules of thumb:

  • Chunk size should fit comfortably in your model's context window alongside the query and system prompt. 400–600 tokens is a common starting point.
  • Overlap (typically 15–25%) reduces information loss at chunk boundaries. More overlap means more embeddings and storage cost.
  • For multi-step procedures or docs with cross-references, prefer structure-aware chunking so related content stays together.

4. Embeddings & Vector Databases

Once you have chunks, you embed them and store them in a vector database so you can retrieve the most relevant ones at query time.

Embedding chunks

chunking/embedding.py shows how to generate embeddings with a local HuggingFace model:

from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings = embeddings_model.embed_documents([chunk.page_content for chunk in splitted_docs])

Each chunk ends up as a high-dimensional float vector. The metadata (source, chunk index, section title, page number) stored alongside it is just as important as the embedding itself — it's what lets you filter, reconstruct context windows, and route queries to the right index.

Storing in pgvector

chunking/vectorDb.py stores embeddings in PostgreSQL with the pgvector extension:

from langchain_postgres.vectorstores import PGVector

# Start the database first:
# docker run --name pgvector-container -e POSTGRES_USER=langchain \
#   -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain \
#   -p 6024:5432 -d pgvector/pgvector:pg16

connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
db = PGVector.from_documents(splitted_docs, embeddings_model, connection=connection)

results = db.similarity_search("your query here", k=3)

Use a unique collection_name (e.g., with uuid.uuid4()) per run to avoid accumulating duplicate entries.


5. Indexing Strategies

An indexing strategy is how you organize vectors so that similarity search is fast and accurate at scale, rather than scanning every vector linearly.

1. Flat / baseline — one embedding per chunk, brute-force k-NN. Simple and accurate; use for small-to-medium corpora or early prototypes.

2. Parent–child (sub-chunk) — index small child chunks for precise matching, each pointing to a larger parent chunk for context. High-precision matching without starving the LLM of surrounding context.

3. Hierarchical (multi-level) — maintain a coarse index (section/page) and a fine index (sentence/window). Stage 1 finds the right region; Stage 2 retrieves precise passages within it.

4. Hybrid lexical–vector — run both BM25 (keyword) and vector search, then combine results via reciprocal rank fusion. Lexical is great for exact terms, error codes, API names; vectors handle semantic paraphrases. Hybrid wins on real-world "messy" queries.

5. Metadata-centric — every record carries rich metadata (product, version, tenant, language, date). Queries always include pre-filters. Essential for multi-tenant systems and time-sensitive domains.

6. Multi-representation — store multiple embeddings per chunk from different models (general, domain-specific, task-specific). Choose at query time which representation to search.

7. Hypothetical-question (HyDE-adjacent) — for each chunk, generate synthetic questions it could answer and index those. Narrows the semantic gap between user queries and document style.

8. Graph-enhanced — build a knowledge graph where nodes are chunks/entities and edges are relations. Use vector search to find an initial neighborhood, then expand along graph edges for multi-hop reasoning.

ANN algorithm choices: HNSW (the default in most vector DBs, great balance of speed/recall) for < 100M vectors; IVF+PQ (clustered + compressed) for billion-scale corpora with memory constraints.


6. RAG Strategies

RAG (Retrieval-Augmented Generation) connects a retrieval system to an LLM so it can answer questions grounded in your documents.

Naive RAG — embed query → vector search → top-k chunks + query → LLM → answer. The baseline. Simple but brittle if retrieval misses key chunks.

RAG Fusion — generate multiple query variants, retrieve for each, merge results with reciprocal rank fusion, then feed to the LLM. Better coverage for ambiguous or multi-faceted questions.

RAPTOR — builds a recursive tree of LLM-generated summaries over your corpus. Leaf nodes are original chunks; internal nodes are summaries of clusters. Retrieval can start at high-level summaries and drill down to leaves. Best for long, structured corpora.

GraphRAG — extract entities and relations from text to build a knowledge graph. Retrieve by traversing the graph guided by the query. Excellent for multi-hop reasoning across documents.

Agentic RAG — wraps the RAG pipeline in an agent loop. The agent can try different retrieval strategies, decompose complex questions into sub-questions, fall back to web search, or use SQL/APIs alongside document retrieval.

Self-RAG / CRAG — adds a self-evaluation loop. A smaller model or the LLM itself scores retrieved documents for relevance. If scores are low, it re-retrieves with a different query or falls back to another source.

Hybrid & late-interaction (ColBERT, HyDE) — ColBERT stores token-level embeddings and uses MaxSim scoring for high-precision matches. HyDE generates a hypothetical document from the query and retrieves real documents similar to that hypothetical.

Iterative / Modular RAG — treat the pipeline as pluggable modules: query rewriting, routing, hybrid search, reranking, compression, generation. Mix and match per use case.


7. Query Transformation

The quality of what you retrieve is only as good as the query you send. Query transformation modifies the user's input before retrieval to compensate for ambiguity, incomplete context, or vocabulary mismatch.

Rewrite-Retrieve-Read — ask the LLM to produce a better search query from the user's raw input before doing retrieval:

rewrite_prompt = ChatPromptTemplate.from_template(
    "Provide a better search query to answer the given question. Question: {x} Answer:"
)

Multi-Query Retrieval — generate several different versions of the query (different angles, phrasings), retrieve for each in parallel, and merge the results. A single query can miss relevant content that a paraphrase would catch.

perspectives_prompt = ChatPromptTemplate.from_template(
    """You are an AI assistant. Generate five different versions of the user question 
    to retrieve relevant documents from a vector database. Question: {question}"""
)

RAG-Fusion — extends multi-query retrieval with a final reranking step using reciprocal rank fusion (RRF), which combines the rankings from multiple query variants to surface the most consistently relevant documents.

HyDE (Hypothetical Document Embeddings) — instead of embedding the question, ask the LLM to write a hypothetical answer document, then embed that document and use it to retrieve similar real documents. The intuition: a hypothetical answer is closer in embedding space to relevant documents than the bare question.

Logical Routing — give the LLM knowledge of your available data sources (e.g., Python docs vs. JS docs, SQL database vs. document store) and let it classify the query into the appropriate route using structured output:

from pydantic import BaseModel
from typing import Literal

class RouteQuery(BaseModel):
    """Route a user query to the most relevant datasource."""
    datasource: Literal["python_docs", "js_docs"]

structured_llm = ChatOpenAI(model="gpt-3.5-turbo").with_structured_output(RouteQuery)

8. Cognitive Architectures with LangGraph

LangGraph models your application as a state machine: nodes transform state, edges define flow, and conditional edges let the LLM decide what happens next.

The spectrum from "deterministic code" to "fully autonomous agent":

Level Description
Code Regular software, no LLM
LLM call Single LLM call as part of a larger application (translation, summarization)
Chain Multiple LLM calls in a fixed sequence (e.g., natural language → SQL → answer)
Router LLM picks the next step from predefined options
Agent LLM plans, acts, observes, and decides whether to keep going
Multi-agent Multiple specialized agents coordinated by a supervisor or graph

As you move right, you gain agency but lose predictability. Techniques like structured output, streaming, and human-in-the-loop help you push the reliability frontier outward.

Multi-agent topologies:

  • Network — any agent can call any other agent
  • Supervisor — all agents report to a central supervisor that routes between them
  • Hierarchical — supervisor of supervisors; useful for complex, decomposable tasks
  • Custom workflow — agents only communicate with specific neighbors; some flow is deterministic

Streaming modes in LangGraph

for chunk in graph.stream(input, stream_mode="updates"):   # default: each node's output
    print(chunk)

for chunk in graph.stream(input, stream_mode="values"):    # full state after each step
    print(chunk)

for chunk in graph.stream(input, stream_mode="debug"):     # checkpoint + task events
    print(chunk)

9. Agent Patterns

All agent examples live in the agents/ directory and target LM Studio on http://localhost:1234/v1.

Plan-Do Loop (ReAct)

agents/plan_do_loop.py

The simplest autonomous agent pattern: the LLM decides which tool to call (Plan), the graph executes it (Do), and the output goes back to the LLM. It loops until the LLM produces a final answer with no tool call.

START → model → [tool call?] → tools → model → ... → END

Key LangGraph primitives used:

  • ToolNode — prebuilt node that executes whatever tool call the LLM requested
  • tools_condition — prebuilt conditional edge that routes to tools if the LLM made a tool call, otherwise to END
builder.add_edge(START, "model")
builder.add_conditional_edges("model", tools_condition)
builder.add_edge("tools", "model")
graph = builder.compile()

Plan-Do Loop graph

Reflection Agent

agents/reflect.py

A generate → reflect loop where one LLM call acts as the "writer" and another acts as the "critic". The trick is role inversion: the essay (an AIMessage) is presented to the critic as a HumanMessage so the critic treats it as a student submission.

START → generate → [max iterations?] → reflect → generate → ...

The loop terminates after 3 iterations by checking len(state["messages"]) > 6.

Supervisor Agent

agents/supervisor_agent.py

A multi-agent system with a central supervisor that routes to worker agents. The supervisor uses structured output to return a decision (researcher, coder, or FINISH), which is read from state to route the conditional edge.

START → supervisor → researcher → supervisor → coder → supervisor → END

The routing function is simply lambda state: state["next"].

Supervisor graph

Subgraphs

LangGraph supports composing graphs inside other graphs. There are two approaches depending on whether the parent and child share state.

Shared state (agents/subgraph_with_shared_state.py) — SubgraphState inherits from ParentState. You can add the compiled subgraph directly as a node:

parent_builder.add_node("my_subgraph", subgraph)  # subgraph is a compiled graph

Non-shared state (agents/subgraph_with_non_shared_state.py) — the parent and subgraph have completely different state schemas. You wrap the subgraph invocation in a regular Python function that manually maps between the two schemas:

def call_subgraph_node(state: ParentState) -> dict:
    subgraph_input = {"task_name": state["parent_job_id"], ...}   # map in
    result = subgraph.invoke(subgraph_input)
    return {"extracted_data": result["final_output"]}             # map out

Dynamic Tool Selection

agents/tool_selection.py

When you have many tools, sending all of them in every prompt is wasteful and can confuse the model. This pattern uses a vector store to retrieve only the most relevant tools for each query, then binds only those tools to the LLM.

START → select_tools (RAG) → model (with selected tools only) → tools → model → END
tools_retriever = InMemoryVectorStore.from_documents(
    documents=[Document(page_content=t.description, metadata={"name": t.name}) for t in tools],
    embedding=embeddings,
).as_retriever(search_kwargs={"k": 2})

Tool selection graph

Human-in-the-Loop

agents/human_in_the_loop.py

LangGraph's checkpointing system makes it possible to pause execution, inspect or modify state, and resume — all from the same thread.

Setup: attach a checkpointer

from langgraph.checkpoint.memory import MemorySaver
graph = builder.compile(checkpointer=MemorySaver(), interrupt_before=["tools"])

Execution pauses before the tools node. You can then:

# Inspect what's pending
state = graph.get_state(config)
pending_calls = state.values["messages"][-1].tool_calls

# Option 1: Approve — resume with None input
graph.astream(None, config)

# Option 2: Modify state — inject a fake tool result and skip execution
from langchain_core.messages import ToolMessage
graph.update_state(config,
    {"messages": [ToolMessage(content="Custom result", tool_call_id=..., name=...)]},
    as_node="tools"
)
graph.astream(None, config)

State history and time travel — every checkpoint is stored with a unique ID. You can retrieve the full execution history and re-run from any past checkpoint by putting an old checkpoint_id into your config.

for snapshot in graph.get_state_history(config):
    print(snapshot.config["configurable"]["checkpoint_id"])
    print(snapshot.next)

Human-in-the-loop graph


10. Guardrails

guardrail_check.py shows a scenario where a harmful prompt is passed in the conversation history. Production systems need an explicit guardrail layer — either a separate classifier model, a prompt-based filter, or a third-party service — that runs before or after the LLM call to detect and block unsafe inputs and outputs.

Common guardrail approaches:

  • Input guardrails — classify the user's message before sending it to the LLM (topic filters, PII detection, toxicity classifiers)
  • Output guardrails — validate the LLM's response before returning it to the user (hallucination detection, schema validation)
  • LangGraph integration — add a guardrail node that returns early with a safe response if a violation is detected, instead of routing to the main agent

11. Topics to Explore Next

Based on what's covered here, these are natural next steps:

Memory — LangGraph's checkpointer gives you turn-level memory within a thread, but long-term memory across sessions requires a separate memory store. Look at langgraph-memory and summarization-based memory to avoid ever-growing context windows.

Evaluation & observability — build a benchmark of representative queries with expected answers and measure retrieval relevance (MRR, NDCG), answer correctness, and hallucination rate. LangSmith integrates with LangChain/LangGraph to trace every LLM call and tool execution.

Reranking — after retrieval, a cross-encoder reranker (e.g., Cohere Rerank, a local cross-encoder model) can significantly improve the precision of the top-k documents before they're sent to the LLM.

Structured output validation — use Pydantic models with with_structured_output() everywhere you need reliable schemas. Add retry logic (LangChain's with_retry) to handle occasional formatting failures.

Parallelism in LangGraph — nodes that have no dependency on each other can be fanned out and run in parallel using Send events. Useful for multi-query retrieval or sub-task decomposition.

Tool calling patterns — explore the difference between bind_tools (LLM decides when to call) and force_tool_call (always call a specific tool), and how to handle tool errors gracefully with retries or fallbacks.

Prompt caching — for long system prompts or large RAG context that stays the same across many calls, enable prompt caching (Anthropic, OpenAI) to cut latency and cost significantly.

Production deployment — LangGraph Cloud (or self-hosted LangGraph Server) adds persistent storage, queue-based execution, streaming APIs, and built-in human-in-the-loop endpoints so you don't have to wire that up yourself.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages