LangChain & LangGraph — Learning Guide

A hands-on tutorial repo covering the core concepts of building LLM-powered applications with LangChain and LangGraph. Each concept is backed by a working Python program you can run locally against a llama-server or LM Studio instance.

1. Setup & Getting Started

Prerequisites

Python 3.11+
A local LLM server. The examples in this repo use either:
- llama-server (from llama.cpp) — default port 8080
- LM Studio — default port 1234

Install dependencies

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install langchain langchain-openai langchain-community langchain-huggingface langgraph langchain-postgres sentence-transformers pypdf beautifulsoup4

Start your local LLM server

llama-server:

llama-server -m /path/to/your/model.gguf --port 8080 --host 0.0.0.0

LM Studio: load a model and click "Start Server" (port 1234 by default).

2. Core LangChain Concepts

OpenAI vs ChatOpenAI

LangChain exposes two client classes for talking to LLMs:

Class	API endpoint	Input	Output	Use when
`ChatOpenAI`	`/v1/chat/completions`	List of `Message` objects	`AIMessage`	Modern chat models (recommended)
`OpenAI`	`/v1/completions`	Plain string	Plain string	Legacy completion models

See openai_vs_chatopenai_example.py for a side-by-side comparison.

ChatOpenAI example:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed", model="llama")

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="What is the capital of France?")
]
response = llm.invoke(messages)
print(response.content)

The simplest working example is in simple_example.py.

LCEL — LangChain Expression Language

LCEL lets you compose chains using the pipe operator |. A prompt template piped into an LLM is all you need for most single-step tasks.

from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant"),
    ("human", "What is the capital of {country}?")
])

chain = template | llm
response = chain.invoke({"country": "France"})

See the lcel_example() function inside llama_server_client.py.
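To see what the pipe operator is doing, here is a toy sketch in plain Python. It is not LangChain's implementation: the Step class, fill_prompt, and fake_llm are hypothetical stand-ins that only mimic the idea that `a | b` builds a new runnable which feeds a's output into b.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # `a | b` returns a new step that passes a's output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for a prompt template and a model.
fill_prompt = Step(lambda d: f"What is the capital of {d['country']}?")
fake_llm = Step(lambda prompt: f"[model reply to: {prompt}]")

toy_chain = fill_prompt | fake_llm
toy_result = toy_chain.invoke({"country": "France"})
print(toy_result)
```

The real runnables add batching, streaming, and async support on top of this composition idea, which is why the three invocation patterns work on any chain.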

Streaming, Batch & Invoke

LangChain supports three invocation patterns on any runnable:

# Single response
response = llm.invoke([HumanMessage(content="Hello")])

# Multiple prompts in parallel
responses = llm.batch([[HumanMessage(content="What is 2+2?")],
                       [HumanMessage(content="Name a planet.")]])

# Token-by-token streaming
for chunk in llm.stream([HumanMessage(content="Tell me a story.")]):
    print(chunk.content, end="", flush=True)

All three patterns are demonstrated in llama_server_client.py.

3. Chunking Strategies

Before you can do retrieval, you need to break documents into chunks. The right strategy depends on your document type and query patterns.

Strategy	One-line summary	Typical use case
Fixed-size window	Equal-sized chunks by tokens/chars, often with overlap	Logs, transcripts, messy text
Natural-boundary	Group sentences/paragraphs up to a size target	Articles, policies, how-to guides
Structure-aware	Use headings, HTML tags, PDF sections as chunk boundaries	API docs, manuals, textbooks
Semantic / embedding-based	Split where topic shifts using embeddings	Dense legal/research documents
Sliding window with overlap	Overlapping chunks so boundary content appears in multiple chunks	Any RAG system needing high recall
Metadata / layout-based	Chunk per logical unit (FAQ pair, table row, UI block)	FAQs, catalogs, product listings
Agentic / dynamic	Agent chooses strategy per document type	Heterogeneous corpora
Hierarchical / multi-level	Multiple granularities: sentence-level + section-level	Systems balancing precision & context
Time- / event-based	Windows by time or session/trace ID	Logs, observability, telemetry

Using LangChain's text splitters

chunking/chunking_data.py shows loading documents from text files, PDFs, and the web, then splitting with RecursiveCharacterTextSplitter. There are also language-aware splitters for Python and Markdown:

from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Generic splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(TextLoader("./file.txt").load())

# Language-aware splitter for code
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=60, chunk_overlap=0
)

Choosing chunk size and overlap

A few rules of thumb:

Chunk size should let the top-k retrieved chunks fit comfortably in your model's context window alongside the query and system prompt. 400–600 tokens is a common starting point.
Overlap (typically 15–25%) reduces information loss at chunk boundaries. More overlap means more embeddings and storage cost.
For multi-step procedures or docs with cross-references, prefer structure-aware chunking so related content stays together.
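The storage cost of overlap is easy to check with a small sketch. This is a plain-Python illustration, not the splitter used in this repo: chunk_with_overlap is a hypothetical helper, and the "tokens" are just placeholder strings.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = [f"t{i}" for i in range(2000)]  # placeholder tokens
no_overlap = chunk_with_overlap(tokens, chunk_size=500, overlap=0)
with_overlap = chunk_with_overlap(tokens, chunk_size=500, overlap=100)
print(len(no_overlap), len(with_overlap))  # 4 5
```

With 2,000 tokens and 500-token chunks, a 20% overlap turns 4 chunks into 5, so you pay for roughly 25% more embeddings in exchange for boundary content appearing in two chunks.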

4. Embeddings & Vector Databases

Once you have chunks, you embed them and store them in a vector database so you can retrieve the most relevant ones at query time.

Embedding chunks

chunking/embedding.py shows how to generate embeddings with a local HuggingFace model:

from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings = embeddings_model.embed_documents([chunk.page_content for chunk in splitted_docs])

Each chunk ends up as a high-dimensional float vector. The metadata (source, chunk index, section title, page number) stored alongside it is just as important as the embedding itself — it's what lets you filter, reconstruct context windows, and route queries to the right index.
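A rough sketch of why the metadata matters: the snippet below stores a tiny vector plus metadata per chunk, filters on metadata first, and only then ranks by cosine similarity. The three-dimensional vectors, the records list, and the search helper are made up for illustration; real embeddings from all-MiniLM-L6-v2 have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical records: one toy vector and its metadata per chunk.
records = [
    {"vector": [0.9, 0.1, 0.0], "metadata": {"source": "guide.pdf", "page": 1}},
    {"vector": [0.8, 0.2, 0.1], "metadata": {"source": "faq.md", "page": 1}},
    {"vector": [0.0, 0.1, 0.9], "metadata": {"source": "guide.pdf", "page": 7}},
]


def search(query_vector, records, k=2, where=None):
    """Brute-force search: apply the metadata filter, then rank by cosine."""
    where = where or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(
        candidates, key=lambda r: cosine(query_vector, r["vector"]), reverse=True
    )
    return ranked[:k]


hits = search([1.0, 0.0, 0.0], records, k=2, where={"source": "guide.pdf"})
print([h["metadata"]["page"] for h in hits])  # [1, 7]
```

Without the filter, the faq.md chunk would rank second; with it, the search never leaves guide.pdf.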

Storing in pgvector

chunking/vectorDb.py stores embeddings in PostgreSQL with the pgvector extension:

from langchain_postgres.vectorstores import PGVector

# Start the database first:
# docker run --name pgvector-container -e POSTGRES_USER=langchain \
#   -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain \
#   -p 6024:5432 -d pgvector/pgvector:pg16

connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
db = PGVector.from_documents(splitted_docs, embeddings_model, connection=connection)

results = db.similarity_search("your query here", k=3)

Use a unique collection_name (e.g., with uuid.uuid4()) per run to avoid accumulating duplicate entries.
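One way to generate such a name is sketched below. The new_collection_name helper and the "chunks" prefix are hypothetical; the commented-out call assumes that PGVector.from_documents accepts a collection_name keyword, as the tip above implies.

```python
import uuid


def new_collection_name(prefix="chunks"):
    """Return a collection name that is unique for this run."""
    return f"{prefix}_{uuid.uuid4().hex}"


collection_name = new_collection_name()
print(collection_name)

# Assumed usage with the PGVector call shown earlier (not executed here):
# db = PGVector.from_documents(
#     splitted_docs, embeddings_model,
#     connection=connection, collection_name=collection_name,
# )
```

Keep in mind that every run then leaves a new collection behind, so this suits experiments better than long-lived deployments.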

5. Indexing Strategies

An indexing strategy is how you organize vectors so that similarity search is fast and accurate at scale, rather than scanning every vector linearly.

1. Flat / baseline — one embedding per chunk, brute-force k-NN. Simple and accurate; use for small-to-medium corpora or early prototypes.

2. Parent–child (sub-chunk) — index small child chunks for precise matching, each pointing to a larger parent chunk for context. High-precision matching without starving the LLM of surrounding context.

3. Hierarchical (multi-level) — maintain a coarse index (section/page) and a fine index (sentence/window). Stage 1 finds the right region; Stage 2 retrieves precise passages within it.

4. Hybrid lexical–vector — run both BM25 (keyword) and vector search, then combine results via reciprocal rank fusion. Lexical is great for exact terms, error codes, API names; vectors handle semantic paraphrases. Hybrid wins on real-world "messy" queries.

5. Metadata-centric — every record carries rich metadata (product, version, tenant, language, date). Queries always include pre-filters. Essential for multi-tenant systems and time-sensitive domains.

6. Multi-representation — store multiple embeddings per chunk from different models (general, domain-specific, task-specific). Choose at query time which representation to search.

7. Hypothetical-question (HyDE-adjacent) — for each chunk, generate synthetic questions it could answer and index those. Narrows the semantic gap between user queries and document style.

8. Graph-enhanced — build a knowledge graph where nodes are chunks/entities and edges are relations. Use vector search to find an initial neighborhood, then expand along graph edges for multi-hop reasoning.
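As an illustration of strategy 2 (parent–child), the sketch below matches a query against small child chunks but returns the larger parent for context. It uses plain word overlap as a stand-in for embedding similarity, and the parents dictionary and helper names are invented for the example.

```python
# Hypothetical parent chunks keyed by ID.
parents = {
    "p1": "Install the CLI with pip. Then run the init command to create a config file.",
    "p2": "Error E42 means the config file is missing. Re-run init to regenerate it.",
}

# Index small child chunks (here: sentences), each pointing back to its parent.
children = [
    {"text": sentence.strip(), "parent_id": parent_id}
    for parent_id, text in parents.items()
    for sentence in text.split(".")
    if sentence.strip()
]


def overlap_score(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parent(query):
    """Match on the best child, but hand back the whole parent chunk."""
    best_child = max(children, key=lambda c: overlap_score(query, c["text"]))
    return best_child["text"], parents[best_child["parent_id"]]


matched_child, context = retrieve_parent("what does error E42 mean")
print(matched_child)
print(context)
```

The match happens on one precise sentence, while the LLM would receive the full parent, including the fix described in the neighboring sentence.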

ANN algorithm choices: HNSW (the default in most vector DBs, great balance of speed/recall) for < 100M vectors; IVF+PQ (clustered + compressed) for billion-scale corpora with memory constraints.

6. RAG Strategies

RAG (Retrieval-Augmented Generation) connects a retrieval system to an LLM so it can answer questions grounded in your documents.

Naive RAG — embed query → vector search → top-k chunks + query → LLM → answer. The baseline. Simple but brittle if retrieval misses key chunks.

RAG Fusion — generate multiple query variants, retrieve for each, merge results with reciprocal rank fusion, then feed to the LLM. Better coverage for ambiguous or multi-faceted questions.

RAPTOR — builds a recursive tree of LLM-generated summaries over your corpus. Leaf nodes are original chunks; internal nodes are summaries of clusters. Retrieval can start at high-level summaries and drill down to leaves. Best for long, structured corpora.

GraphRAG — extract entities and relations from text to build a knowledge graph. Retrieve by traversing the graph guided by the query. Excellent for multi-hop reasoning across documents.

Agentic RAG — wraps the RAG pipeline in an agent loop. The agent can try different retrieval strategies, decompose complex questions into sub-questions, fall back to web search, or use SQL/APIs alongside document retrieval.

Self-RAG / CRAG — adds a self-evaluation loop. A smaller model or the LLM itself scores retrieved documents for relevance. If scores are low, it re-retrieves with a different query or falls back to another source.

Hybrid & late-interaction (ColBERT, HyDE) — ColBERT stores token-level embeddings and uses MaxSim scoring for high-precision matches. HyDE generates a hypothetical document from the query and retrieves real documents similar to that hypothetical.

Iterative / Modular RAG — treat the pipeline as pluggable modules: query rewriting, routing, hybrid search, reranking, compression, generation. Mix and match per use case.
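The modular idea can be sketched as a list of functions that each take and return a state dictionary. Everything here is a stand-in: the modules use keyword matching instead of real rewriting, retrieval, or generation, and the names are hypothetical.

```python
def rewrite_query(state):
    # Stand-in for an LLM rewrite: normalise case and whitespace.
    return {**state, "query": " ".join(state["query"].lower().split())}


def retrieve(state):
    # Stand-in for vector search: keep documents sharing a word with the query.
    words = set(state["query"].split())
    hits = [doc for doc in state["corpus"] if words & set(doc.lower().split())]
    return {**state, "docs": hits}


def rerank(state):
    # Stand-in for a reranker: sort by word overlap and keep the top two.
    words = set(state["query"].split())
    ranked = sorted(
        state["docs"],
        key=lambda d: len(words & set(d.lower().split())),
        reverse=True,
    )
    return {**state, "docs": ranked[:2]}


def generate(state):
    # Stand-in for the LLM call: report how much context would be used.
    return {**state, "answer": f"Answer based on {len(state['docs'])} document(s)"}


def run_pipeline(modules, state):
    """Apply each module in order; swapping the list changes the pipeline."""
    for module in modules:
        state = module(state)
    return state


corpus = [
    "pgvector stores embeddings in postgres",
    "hnsw is an ann index",
    "bananas are yellow",
]
full = run_pipeline(
    [rewrite_query, retrieve, rerank, generate],
    {"query": "  HNSW index  in Postgres ", "corpus": corpus},
)
minimal = run_pipeline([retrieve, generate], {"query": "hnsw", "corpus": corpus})
print(full["answer"], "|", minimal["answer"])
```

Because every module has the same signature, dropping the reranker or adding a compression step is a one-line change to the list.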

7. Query Transformation

The quality of what you retrieve is only as good as the query you send. Query transformation modifies the user's input before retrieval to compensate for ambiguity, incomplete context, or vocabulary mismatch.

Rewrite-Retrieve-Read — ask the LLM to produce a better search query from the user's raw input before doing retrieval:

rewrite_prompt = ChatPromptTemplate.from_template(
    "Provide a better search query to answer the given question. Question: {x} Answer:"
)

Multi-Query Retrieval — generate several different versions of the query (different angles, phrasings), retrieve for each in parallel, and merge the results. A single query can miss relevant content that a paraphrase would catch.

perspectives_prompt = ChatPromptTemplate.from_template(
    """You are an AI assistant. Generate five different versions of the user question 
    to retrieve relevant documents from a vector database. Question: {question}"""
)

RAG-Fusion — extends multi-query retrieval with a final reranking step using reciprocal rank fusion (RRF), which combines the rankings from multiple query variants to surface the most consistently relevant documents.
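A minimal sketch of reciprocal rank fusion, assuming each query variant has already produced a ranked list of document IDs. The document IDs are made up, and k=60 is the smoothing constant commonly used with RRF rather than a value taken from this repo.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# One ranked list per query variant (hypothetical results).
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)  # ['doc_b', 'doc_a', 'doc_c', 'doc_d']
```

doc_b wins because it ranks first or second in every list, even though doc_a tops one of them; that consistency is what RRF rewards.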

HyDE (Hypothetical Document Embeddings) — instead of embedding the question, ask the LLM to write a hypothetical answer document, then embed that document and use it to retrieve similar real documents. The intuition: a hypothetical answer is closer in embedding space to relevant documents than the bare question.

Logical Routing — give the LLM knowledge of your available data sources (e.g., Python docs vs. JS docs, SQL database vs. document store) and let it classify the query into the appropriate route using structured output:

from typing import Literal

from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class RouteQuery(BaseModel):
    """Route a user query to the most relevant datasource."""
    datasource: Literal["python_docs", "js_docs"]

structured_llm = ChatOpenAI(model="gpt-3.5-turbo").with_structured_output(RouteQuery)

8. Cognitive Architectures with LangGraph

LangGraph models your application as a state machine: nodes transform state, edges define flow, and conditional edges let the LLM decide what happens next.
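To make the state-machine idea concrete, here is a toy runner in plain Python. It deliberately does not use the LangGraph API: the nodes, edges, and conditional_edges dictionaries and the run function are hypothetical, and a keyword check stands in for the LLM's routing decision.

```python
END = "__end__"


def classify(state):
    # Stand-in for an LLM decision: route on whether the question has digits.
    route = "math" if any(ch.isdigit() for ch in state["question"]) else "chat"
    return {"route": route}


def math_node(state):
    return {"answer": "routed to the calculator"}


def chat_node(state):
    return {"answer": "routed to small talk"}


nodes = {"classify": classify, "math": math_node, "chat": chat_node}
edges = {"math": END, "chat": END}                      # fixed edges
conditional_edges = {"classify": lambda s: s["route"]}  # decided at run time


def run(state, start="classify"):
    """Walk the graph: each node returns a partial update merged into state."""
    current, visited = start, []
    while current != END:
        visited.append(current)
        state = {**state, **nodes[current](state)}
        if current in conditional_edges:
            current = conditional_edges[current](state)
        else:
            current = edges[current]
    return state, visited


final_state, path = run({"question": "What is 2+2?"})
print(path, final_state["answer"])
```

The same three ingredients (nodes that return state updates, fixed edges, and a function that picks the next node) are what the agent patterns later in this guide are built from.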

The spectrum from "deterministic code" to "fully autonomous agent":

Level	Description
Code	Regular software, no LLM
LLM call	Single LLM call as part of a larger application (translation, summarization)
Chain	Multiple LLM calls in a fixed sequence (e.g., natural language → SQL → answer)
Router	LLM picks the next step from predefined options
Agent	LLM plans, acts, observes, and decides whether to keep going
Multi-agent	Multiple specialized agents coordinated by a supervisor or graph

As you move down the table, you gain agency but lose predictability. Techniques like structured output, streaming, and human-in-the-loop help you push the reliability frontier outward.

Multi-agent topologies:

Network — any agent can call any other agent
Supervisor — all agents report to a central supervisor that routes between them
Hierarchical — supervisor of supervisors; useful for complex, decomposable tasks
Custom workflow — agents only communicate with specific neighbors; some flow is deterministic

Streaming modes in LangGraph

for chunk in graph.stream(input, stream_mode="updates"):   # default: each node's output
    print(chunk)

for chunk in graph.stream(input, stream_mode="values"):    # full state after each step
    print(chunk)

for chunk in graph.stream(input, stream_mode="debug"):     # checkpoint + task events
    print(chunk)

9. Agent Patterns

All agent examples live in the agents/ directory and target LM Studio on http://localhost:1234/v1.

Plan-Do Loop (ReAct)

agents/plan_do_loop.py

The simplest autonomous agent pattern: the LLM decides which tool to call (Plan), the graph executes it (Do), and the output goes back to the LLM. It loops until the LLM produces a final answer with no tool call.

START → model → [tool call?] → tools → model → ... → END

Key LangGraph primitives used:

ToolNode — prebuilt node that executes whatever tool call the LLM requested
tools_condition — prebuilt conditional edge that routes to tools if the LLM made a tool call, otherwise to END

builder.add_edge(START, "model")
builder.add_conditional_edges("model", tools_condition)
builder.add_edge("tools", "model")
graph = builder.compile()

Reflection Agent

agents/reflect.py

A generate → reflect loop where one LLM call acts as the "writer" and another acts as the "critic". The trick is role inversion: the essay (an AIMessage) is presented to the critic as a HumanMessage so the critic treats it as a student submission.

START → generate → [max iterations?] → reflect → generate → ...

The loop terminates after 3 iterations by checking len(state["messages"]) > 6.
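The arithmetic behind that check can be traced with a small simulation. It assumes the conversation starts with a single user message, that each node appends exactly one message, and that the check runs after generate; the simulate function is a hypothetical stand-in for the real graph.

```python
END = "END"


def should_continue(state):
    # Same condition as described above: stop once there are more than 6 messages.
    return END if len(state["messages"]) > 6 else "reflect"


def simulate(initial_messages=1):
    """Count how often each node runs before the loop terminates."""
    state = {"messages": ["user request"] * initial_messages}
    counts = {"generate": 0, "reflect": 0}
    while True:
        state["messages"].append("draft")     # generate appends one message
        counts["generate"] += 1
        if should_continue(state) == END:
            break
        state["messages"].append("critique")  # reflect appends one message
        counts["reflect"] += 1
    return counts, len(state["messages"])


counts, total = simulate()
print(counts, total)  # {'generate': 4, 'reflect': 3} 8
```

Under these assumptions the writer produces four drafts and the critic responds three times, so "3 iterations" refers to three rounds of reflection. Seeding the state with more than one message ends the loop earlier.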

Supervisor Agent

agents/supervisor_agent.py

A multi-agent system with a central supervisor that routes to worker agents. The supervisor uses structured output to return a decision (researcher, coder, or FINISH), which is read from state to route the conditional edge.

START → supervisor → researcher → supervisor → coder → supervisor → END

The routing function is simply lambda state: state["next"].

Subgraphs

LangGraph supports composing graphs inside other graphs. There are two approaches depending on whether the parent and child share state.

Shared state (agents/subgraph_with_shared_state.py) — SubgraphState inherits from ParentState. You can add the compiled subgraph directly as a node:

parent_builder.add_node("my_subgraph", subgraph)  # subgraph is a compiled graph

Non-shared state (agents/subgraph_with_non_shared_state.py) — the parent and subgraph have completely different state schemas. You wrap the subgraph invocation in a regular Python function that manually maps between the two schemas:

def call_subgraph_node(state: ParentState) -> dict:
    subgraph_input = {"task_name": state["parent_job_id"], ...}   # map in
    result = subgraph.invoke(subgraph_input)
    return {"extracted_data": result["final_output"]}             # map out

Dynamic Tool Selection

agents/tool_selection.py

When you have many tools, sending all of them in every prompt is wasteful and can confuse the model. This pattern uses a vector store to retrieve only the most relevant tools for each query, then binds only those tools to the LLM.

START → select_tools (RAG) → model (with selected tools only) → tools → model → END

tools_retriever = InMemoryVectorStore.from_documents(
    documents=[Document(page_content=t.description, metadata={"name": t.name}) for t in tools],
    embedding=embeddings,
).as_retriever(search_kwargs={"k": 2})
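The selection step itself can be illustrated without a vector store. The sketch below ranks made-up tool descriptions against the query using bag-of-words cosine similarity, a crude stand-in for the embeddings used in the real example; the tool names and descriptions are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tool registry: name -> description.
tools = {
    "calculator": "evaluate arithmetic expressions and do math with numbers",
    "web_search": "search the web for recent news and facts",
    "weather": "get the current weather forecast for a city",
    "sql_query": "run a sql query against the sales database",
}


def bag_of_words(text):
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_tools(query, k=2):
    """Return the names of the k tools whose descriptions best match the query."""
    q = bag_of_words(query)
    ranked = sorted(
        tools, key=lambda name: cosine(q, bag_of_words(tools[name])), reverse=True
    )
    return ranked[:k]


selected = select_tools("what is the weather forecast for paris")
print(selected)  # ['weather', 'web_search']
```

Only the selected tools would then be bound to the model for this turn. Note that word overlap lets stopwords like "the" and "for" influence the ranking, which is one reason the real example uses embeddings.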

Human-in-the-Loop

agents/human_in_the_loop.py

LangGraph's checkpointing system makes it possible to pause execution, inspect or modify state, and resume — all from the same thread.

Setup: attach a checkpointer

from langgraph.checkpoint.memory import MemorySaver
graph = builder.compile(checkpointer=MemorySaver(), interrupt_before=["tools"])

Execution pauses before the tools node. You can then:

# Inspect what's pending
state = graph.get_state(config)
pending_calls = state.values["messages"][-1].tool_calls

# Option 1: Approve, then resume by streaming with None as the input
for chunk in graph.stream(None, config):
    print(chunk)

# Option 2: Modify state, injecting a fake tool result so the real tool never runs
from langchain_core.messages import ToolMessage
graph.update_state(config,
    {"messages": [ToolMessage(content="Custom result", tool_call_id=..., name=...)]},
    as_node="tools"
)
for chunk in graph.stream(None, config):
    print(chunk)

State history and time travel — every checkpoint is stored with a unique ID. You can retrieve the full execution history and re-run from any past checkpoint by putting an old checkpoint_id into your config.

for snapshot in graph.get_state_history(config):
    print(snapshot.config["configurable"]["checkpoint_id"])
    print(snapshot.next)

10. Guardrails

guardrail_check.py shows a scenario where a harmful prompt is passed in the conversation history. Production systems need an explicit guardrail layer — either a separate classifier model, a prompt-based filter, or a third-party service — that runs before or after the LLM call to detect and block unsafe inputs and outputs.

Common guardrail approaches:

Input guardrails — classify the user's message before sending it to the LLM (topic filters, PII detection, toxicity classifiers)
Output guardrails — validate the LLM's response before returning it to the user (hallucination detection, schema validation)
LangGraph integration — add a guardrail node that returns early with a safe response if a violation is detected, instead of routing to the main agent
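A minimal sketch of an input guardrail with an early return, under simplifying assumptions: a regex stands in for PII detection, a hard-coded deny-list stands in for a topic classifier, and echo_llm is a placeholder for the real model call. None of these names come from guardrail_check.py.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKED_TOPICS = ("steal a password", "disable the safety filter")  # hypothetical


def input_guardrail(message):
    """Return (allowed, text): a refusal if blocked, else the sanitized message."""
    lowered = message.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return False, "Sorry, I can't help with that request."
    # Redact PII instead of rejecting the whole message.
    return True, EMAIL.sub("[REDACTED_EMAIL]", message)


def handle(message, call_llm):
    allowed, text = input_guardrail(message)
    if not allowed:
        return text  # return early; the LLM is never called
    return call_llm(text)


def echo_llm(prompt):
    # Stand-in for a real model call.
    return f"LLM saw: {prompt}"


safe = handle("Email me at jane@example.com please", echo_llm)
blocked = handle("How do I steal a password?", echo_llm)
print(safe)
print(blocked)
```

In a graph, input_guardrail would be its own node, with a conditional edge that goes to END on a violation and to the main agent otherwise.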

11. Topics to Explore Next

Based on what's covered here, these are natural next steps:

Memory — LangGraph's checkpointer gives you turn-level memory within a thread, but long-term memory across sessions requires a separate memory store. Look at langgraph-memory and summarization-based memory to avoid ever-growing context windows.

Evaluation & observability — build a benchmark of representative queries with expected answers and measure retrieval relevance (MRR, NDCG), answer correctness, and hallucination rate. LangSmith integrates with LangChain/LangGraph to trace every LLM call and tool execution.
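Both retrieval metrics are short enough to compute by hand. The sketch below assumes you have, per query, a ranked list of retrieved document IDs plus a set of relevant IDs (for MRR) or graded relevance scores (for NDCG); the sample data is invented.

```python
import math


def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant hit per query (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def ndcg_at_k(ranked, gains, k):
    """NDCG@k with graded gains, where DCG = sum(gain / log2(rank + 1))."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


ranked_lists = [["d3", "d1", "d2"], ["d5", "d4", "d6"]]
relevant_sets = [{"d1"}, {"d5"}]
mrr_score = mrr(ranked_lists, relevant_sets)  # (1/2 + 1/1) / 2 = 0.75
ndcg_score = ndcg_at_k(["d3", "d1", "d2"], {"d1": 3, "d2": 1}, k=3)
print(round(mrr_score, 3), round(ndcg_score, 3))
```

MRR only cares where the first relevant document lands, while NDCG also rewards putting the more relevant documents higher, so the two can disagree on which retriever is better.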

Reranking — after retrieval, a cross-encoder reranker (e.g., Cohere Rerank, a local cross-encoder model) can significantly improve the precision of the top-k documents before they're sent to the LLM.

Structured output validation — use Pydantic models with with_structured_output() everywhere you need reliable schemas. Add retry logic (LangChain's with_retry) to handle occasional formatting failures.

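Parallelism in LangGraph: nodes that have no dependency on each other run in parallel when a node fans out to them with multiple outgoing edges. When the number of branches is only known at run time, use the Send API to dispatch them dynamically. Useful for multi-query retrieval or sub-task decomposition.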

Tool calling patterns: explore the difference between bind_tools(tools) (the LLM decides whether to call a tool) and bind_tools(tools, tool_choice=...) (forces a call to a specific tool), and how to handle tool errors gracefully with retries or fallbacks.

Prompt caching — for long system prompts or large RAG context that stays the same across many calls, enable prompt caching (Anthropic, OpenAI) to cut latency and cost significantly.

Production deployment — LangGraph Cloud (or self-hosted LangGraph Server) adds persistent storage, queue-based execution, streaming APIs, and built-in human-in-the-loop endpoints so you don't have to wire that up yourself.
