# Chapter 10 — Generative AI and Large Language Models

This chapter explores the **generative** capabilities of transformer models -- moving from understanding text (Chapter 9) to *creating* it. We cover eight recipes that progressively build from running a local LLM to constructing autonomous agents:

| Recipe | Capability | Model(s) |
|--------|-----------|----------|
| 1. Running an LLM locally | Text generation from seed | Mistral-7B (4-bit) |
| 2. Instruction following | Chat-style prompting | Llama-3.1-8B-Instruct |
| 3. LangChain prompt chain | Framework introduction | Llama-3.1-8B-Instruct |
| 4. RAG with external content | Retrieval-Augmented Generation | GPT-4o-mini + FAISS |
| 5. Chatbot with memory | Stateful conversation | GPT-4o-mini + chat history |
| 6. Code generation | Program synthesis | Llama vs. GPT comparison |
| 7. SQL generation | Text-to-SQL | GPT-4o-mini + SQLite |
| 8. Agents (ReAct) | Reasoning + tool use | GPT-4o-mini + search + math |

The progression mirrors real-world deployment: start with a local model for privacy/cost control, then augment with retrieval (RAG), add conversational memory, and finally build agents that can reason and act autonomously.

**API Keys:** This notebook uses API keys stored in **Google Colab Secrets** (the key icon in the left sidebar). You will need:
- `HF_TOKEN` — Hugging Face access token (for gated models like Mistral and Llama)
- `OPENAI_API_KEY` — OpenAI API key (for GPT-4o-mini recipes)

## 0 — Environment Setup

In [8]:

# 0.1  Install packages

import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"]       = "false"

!pip install -q \
    transformers \
    accelerate \
    bitsandbytes \
    sentencepiece \
    protobuf \
    torch \
    huggingface_hub \
    langchain \
    langchain-community \
    langchain-openai \
    langchain-huggingface \
    langchain-experimental \
    faiss-cpu \
    beautifulsoup4 \
    openai \
    numexpr


In [3]:

# 0.2  Core imports & configuration

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Patch jupyter_client to silence datetime.utcnow() deprecation
from datetime import datetime, timezone
try:
    import jupyter_client.session as _jcs
    _jcs.utcnow = lambda: datetime.now(timezone.utc)
except Exception:
    pass

import torch
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Compute device: {device}")
if device.type == "cuda":
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem  = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"  GPU: {gpu_name} ({gpu_mem:.1f} GB)")
else:
    print("  WARNING: No GPU detected. Recipes 1-3 and 6 require a GPU")
    print("  for 4-bit quantized models. Use Runtime > Change runtime type > T4 GPU.")

print("Setup complete.")


Compute device: cuda
  GPU: Tesla T4 (15.6 GB)
Setup complete.


In [4]:

# 0.3  Authenticate with Hugging Face and OpenAI

from google.colab import userdata
from huggingface_hub import login

# HF token for gated models (Mistral, Llama)
hf_token = userdata.get("HF_TOKEN")
login(token=hf_token, add_to_git_credential=False)
print("Hugging Face: authenticated")

# OpenAI key (used in Recipes 4, 5, 7, 8)
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
print("OpenAI: API key loaded")


Hugging Face: authenticated
OpenAI: API key loaded


**Setting up Colab Secrets:** Click the key icon in Colab's left sidebar. Add two secrets: `HF_TOKEN` (your Hugging Face access token) and `OPENAI_API_KEY` (your OpenAI API key). Enable notebook access for both. This avoids hardcoding credentials in code.

**Model access prerequisites:** Before running Recipes 1-3, you must request access to the gated models on Hugging Face:
- Mistral-7B: https://huggingface.co/mistralai/Mistral-7B-v0.3
- Llama-3.1-8B-Instruct: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

---

## Recipe 1 — Running an LLM Locally

Running an LLM locally provides **data privacy** (nothing leaves your machine), **cost control** (no per-token API charges), and **low latency** for real-time applications. The trade-off is hardware requirements: a 7B-parameter model in full precision needs $\sim 28$ GB of GPU memory.

**4-bit quantization** solves this by compressing each weight from 32 bits to 4 bits, reducing memory by $\sim 8\times$:

$$\text{Memory} \approx \frac{N_{\text{params}} \times \text{bits per param}}{8} = \frac{7 \times 10^9 \times 4}{8} \approx 3.5 \text{ GB}$$

This makes Mistral-7B runnable on a free Colab T4 GPU (16 GB VRAM).

In [10]:

# 1.1  Load Mistral-7B with 4-bit quantization

from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    GenerationConfig, BitsAndBytesConfig)

model_name = "mistralai/Mistral-7B-v0.3"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4")

model_1 = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config)

tokenizer_1 = AutoTokenizer.from_pretrained(
    model_name,
    padding_side="left")

if tokenizer_1.pad_token is None:
    tokenizer_1.pad_token = tokenizer_1.eos_token

n_params = sum(p.numel() for p in model_1.parameters())
print(f"Mistral-7B loaded (4-bit quantized)")
print(f"  Parameters: {n_params:,}")
if device.type == "cuda":
    mem_used = torch.cuda.memory_allocated() / 1e9
    print(f"  GPU memory used: {mem_used:.1f} GB")

Mistral-7B loaded (4-bit quantized)
  Parameters: 3,758,362,624
  GPU memory used: 4.1 GB


We load Mistral-7B using `BitsAndBytesConfig` with NF4 quantization. The model reports **3,758,362,624 parameters** ($\sim 3.76$B after quantization bookkeeping) and occupies only **4.1 GB** of GPU memory -- versus $\sim 28$ GB at full FP32 precision:

$$\text{Memory}_{\text{FP32}} \approx \frac{7.24 \times 10^9 \times 32}{8} \approx 28.9 \text{ GB}, \qquad \text{Memory}_{\text{NF4}} \approx \frac{7.24 \times 10^9 \times 4}{8} \approx 3.6 \text{ GB}$$

The additional overhead (4.1 vs. 3.6 GB) comes from quantization metadata, activation buffers, and the KV-cache. On a Colab T4 with $15.6$ GB VRAM, this leaves ample headroom for generation.

In [11]:

# 1.2  Generate text from a seed prompt

generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    eos_token_id=model_1.config.eos_token_id,
    pad_token_id=model_1.config.eos_token_id,
    max_new_tokens=400,
)

seed = "Step by step way on how to make an apple pie:"

inputs = tokenizer_1([seed], return_tensors="pt").to(device)
output_ids = model_1.generate(**inputs, generation_config=generation_config)
result = tokenizer_1.batch_decode(output_ids, skip_special_tokens=True)[0]

print(result)


Step by step way on how to make an apple pie:

1. Preheat the oven to 350 degrees Fahrenheit.
2. Peel and core the apples.
3. Cut the apples into thin slices.
4. Place the apples in a large bowl.
5. Add the sugar, cinnamon, and nutmeg to the apples.
6. Stir the apples until they are evenly coated with the sugar and spices.
7. Pour the apples into a pie dish.
8. Place the pie dish on a baking sheet.
9. Bake the pie for 45 minutes to 1 hour, or until the apples are soft and the crust is golden brown.
10. Remove the pie from the oven and let it cool for 10 minutes.
11. Serve the pie with a scoop of vanilla ice cream.

## How do you make an apple pie from scratch?

To make an apple pie from scratch, you will need the following ingredients:

- 2 cups of all-purpose flour
- 1 teaspoon of salt
- 1/2 cup of shortening
- 1/2 cup of cold water
- 4 cups of peeled, cored, and sliced apples
- 1 cup of sugar
- 1 teaspoon of cinnamon
- 1/4 teaspoon of nutmeg
- 1/4 teaspoon of allspice
- 1 tablespoon 

The model generates a coherent, step-by-step recipe from just a short seed prompt. **Beam search** (`num_beams=4`) produces more coherent output than greedy decoding by exploring multiple candidate sequences in parallel and selecting the one with the highest cumulative probability:

$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \sum_{t=1}^{T} \log P(y_t \mid y_{<t}, \mathbf{x})$$

The trade-off: beam search is $\sim 4\times$ slower than greedy decoding (one forward pass per beam per token), but the output quality improvement is worth it for short-to-medium generations.

**Note:** The model may generate additional content beyond our request (e.g., a "from scratch" version). This is common with base models that have not been instruction-tuned -- they continue generating plausible text rather than stopping at the answer boundary. Recipe 2 addresses this with instruction-tuned models.

In [12]:

# 1.3  Free GPU memory for next recipe

import gc
del model_1, tokenizer_1
gc.collect()
torch.cuda.empty_cache()
print("GPU memory freed for next recipe.")


GPU memory freed for next recipe.


---

## Recipe 2 — Running an LLM to Follow Instructions

**Instruction-tuned** models are fine-tuned on (instruction, response) pairs, teaching them to follow human directions rather than just continuing text. The difference is dramatic:
- **Base model** (Recipe 1): Seed $\rightarrow$ continues generating indefinitely
- **Instruct model** (this recipe): Instruction $\rightarrow$ answers, then stops

We use **Llama-3.1-8B-Instruct** with 4-bit quantization via the `BitsAndBytesConfig`:

$$\text{Quantized weight} = \text{round}\left(\frac{w - \min(w)}{\max(w) - \min(w)} \times (2^4 - 1)\right)$$

The `nf4` (Normal Float 4) variant assumes weights follow a normal distribution, which better preserves the information content of pretrained weights.

In [14]:

# 2.1  Load Llama-3.1-8B-Instruct with 4-bit quantization

from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, pipeline)

model_name_2 = "Qwen/Qwen2.5-7B-Instruct"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4")

model_2 = AutoModelForCausalLM.from_pretrained(
    model_name_2,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config)

tokenizer_2 = AutoTokenizer.from_pretrained(model_name_2)

pipe_2 = pipeline("text-generation",
    model=model_2,
    tokenizer=tokenizer_2,
    max_new_tokens=256,
    pad_token_id=tokenizer_2.eos_token_id,
    eos_token_id=model_2.config.eos_token_id,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.4)

print(f"Llama-3.1-8B-Instruct loaded (4-bit NF4)")
if device.type == "cuda":
    print(f"  GPU memory: {torch.cuda.memory_allocated()/1e9:.1f} GB")


Llama-3.1-8B-Instruct loaded (4-bit NF4)
  GPU memory: 5.6 GB


Llama-3.1-8B-Instruct loads in **4-bit NF4** precision, consuming **5.6 GB** of GPU memory. The $1.5$ GB increase over Mistral-7B ($4.1$ GB) reflects Llama-3.1's $8.03$B parameters versus Mistral's $7.24$B.

The `BitsAndBytesConfig` controls quantization precisely:

| Parameter | Value | Effect |
|-----------|-------|--------|
| `load_in_4bit` | `True` | Store each weight in 4 bits (NF4 format) |
| `bnb_4bit_compute_dtype` | `bfloat16` | Arithmetic in 16-bit (speed + stability) |
| `bnb_4bit_use_double_quant` | `True` | Quantize the quantization constants ($-0.4$ bits/param) |
| `bnb_4bit_quant_type` | `"nf4"` | Normal Float 4-bit, optimized for Gaussian weight distributions |

The `repetition_penalty=1.4` multiplies the logit of any previously generated token by $1/1.4 \approx 0.71$, reducing its probability and preventing the degenerate repetition loops common in autoregressive generation.

In [15]:

# 2.2  Multi-turn conversation prompt

prompt = [
    {"role": "user", "content": "What is your favourite country?"},
    {"role": "assistant", "content": "Well, I am quite fascinated with Peru."},
    {"role": "user", "content": "What can you tell me about Peru?"}
]

outputs = pipe_2(prompt, max_new_tokens=256)
answer = outputs[0]["generated_text"][-1]["content"]
print(answer)


Certainly! Peru is a fascinating country located in western South America, bordered by Ecuador and Colombia to the north, Brazil to the east, Bolivia to the southeast, Chile to the south, and the Pacific Ocean to the west. Here are some key points about Peru:

1. **Culture and History**:
   - **Ancient Civilizations**: Peru is home to some of the most significant pre-Columbian civilizations, including the Inca Empire, which was one of the largest empires in the Americas before the arrival of the Spanish conquistadors.
   - **Language**: The official language is Spanish, but Quechua, an indigenous language, is also widely spoken, especially in rural areas.

2. **Geography**:
   - **Andes Mountains**: The Andes run through the center of the country, forming a natural barrier between the coast and the Amazon rainforest.
   - **Coastal Plains**: The western part of Peru borders the Pacific Ocean and has a long coastline.
   - **Amazon Basin**: The eastern part of Peru is part of the Amazon

The instruct model generates a **comprehensive, structured response** about Peru, covering culture, history, geography, and cities -- all from the single conversational context "I am fascinated with Peru." The model uses Markdown-style formatting with bold headers and numbered lists, demonstrating that instruction-tuned models learn not just *what* to say but *how* to format it.

The multi-turn chat template `[user, assistant, user]` establishes a conversational persona. Internally, the tokenizer wraps each role in special tokens like `<|start_header_id|>user<|end_header_id|>` that the model was trained to recognize during instruction tuning, separating system instructions, user queries, and assistant responses.

In [16]:

# 2.3  Math reasoning via instruction

math_prompt = [
    {"role": "user", "content": "Mary is twice as old as Sarah presently. Sarah is 6 years old."},
    {"role": "assistant", "content": "Well, what can I help you with?"},
    {"role": "user", "content": "Can you tell me in a step by step way how old Mary will be after 5 years?"}
]

math_output = pipe_2(math_prompt, max_new_tokens=256)
print(math_output[0]["generated_text"][-1]["content"])


Certainly! Let's break it down step by step:

1. **Determine Sarah's current age**: Sarah is currently 6 years old.
2. **Calculate Mary's current age**: Since Mary is twice as old as Sarah, we multiply Sarah's age by 2.
   \[
   \text{Mary's current age} = 2 \times \text{Sarah's age} = 2 \times 6 = 12 \text{ years old}
   \]
3. **Calculate Mary's age after 5 years**: To find out how old Mary will be after 5 years, we add 5 to her current age.
   \[
   \text{Mary's age after 5 years} = \text{Mary's current age} + 5 = 12 + 5 = 17 \text{ years old}
   \]

So, Mary will be 17 years old after 5 years.


The model demonstrates **chain-of-thought reasoning** with proper LaTeX-formatted math:
1. Sarah's current age: $6$
2. Mary's current age: $2 \times 6 = 12$
3. Mary's age after 5 years: $12 + 5 = 17$

The model even used `\[...\]` LaTeX blocks in its output -- a sign that instruction-tuned models have internalized mathematical notation from training data. The answer ($17$) is correct.

**Limitation:** This works for simple arithmetic because the model has seen thousands of similar age problems during training. For multi-step problems with large numbers, carry operations, or unusual structures, LLMs frequently produce wrong answers with high confidence. Recipe 8's Calculator tool addresses this by offloading arithmetic to a deterministic engine.

In [17]:

# 2.4  Free GPU memory

del model_2, tokenizer_2, pipe_2
gc.collect()
torch.cuda.empty_cache()
print("GPU memory freed.")


GPU memory freed.


---

## Recipe 3 — Augmenting an LLM with External Data (LangChain Introduction)

So far, we invoked models directly. **LangChain** provides a framework to compose LLMs with prompts, parsers, retrievers, and tools into reusable **chains**. The core abstraction is the pipe operator `|`:

```
chain = prompt | llm | output_parser
```

Each component receives the previous component's output. This is called **LangChain Expression Language (LCEL)**. We demonstrate with a simple prompt-to-LLM chain, then show its limitation on current events -- motivating RAG in Recipe 4.

In [19]:

# 3.1  Load Llama via LangChain's HuggingFacePipeline

from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_huggingface.llms import HuggingFacePipeline
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, pipeline)

model_name_3 = "Qwen/Qwen2.5-7B-Instruct"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4")

model_3 = AutoModelForCausalLM.from_pretrained(
    model_name_3,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config)
tokenizer_3 = AutoTokenizer.from_pretrained(model_name_3)

pipe_3 = pipeline("text-generation",
    model=model_3, tokenizer=tokenizer_3,
    max_new_tokens=500,
    pad_token_id=tokenizer_3.eos_token_id)

hf_llm = HuggingFacePipeline(pipeline=pipe_3)
print("LangChain HuggingFacePipeline ready")


LangChain HuggingFacePipeline ready


We wrap Llama-3.1-8B-Instruct in LangChain's `HuggingFacePipeline`, which converts the Hugging Face `pipeline` into a LangChain-compatible `LLM` object. This adapter pattern is what makes LangChain powerful: the same chain logic works identically whether backed by a local Hugging Face model, OpenAI's API, Anthropic's API, or any other provider. Switching from local to cloud inference requires changing only the LLM initialization, not the chain.

In [20]:

# 3.2  Simple LCEL chain

prompt_3 = ChatPromptTemplate.from_messages([
    ("system", "You are a great mentor."),
    ("user", "{input}")
])
output_parser = StrOutputParser()

chain_3 = prompt_3 | hf_llm | output_parser

result = chain_3.invoke(
    {"input": "how can I improve my software engineering skills?"})
print(result[:1500])  # Truncate for readability


System: You are a great mentor.
Human: how can I improve my software engineering skills? Improving your software engineering skills is a continuous process that involves learning, practice, and staying up-to-date with the latest trends and technologies. Here are some steps you can take to enhance your skills:

### 1. **Learn the Basics**
   - **Programming Languages:** Master one or more programming languages. Start with basics like Python, Java, or JavaScript, and then explore more advanced languages like Rust, Go, or Kotlin.
   - **Data Structures and Algorithms:** Understand fundamental data structures (arrays, linked lists, stacks, queues) and algorithms (sorting, searching, recursion). Practice problems on platforms like LeetCode, HackerRank, or CodeSignal.

### 2. **Build Projects**
   - **Personal Projects:** Work on personal projects that interest you. This could be anything from a web application to a mobile app or a game.
   - **Open Source Contributions:** Contribute to open

LangChain's LCEL chain pipes the output of each component to the next: `prompt | hf_llm | output_parser`. The `ChatPromptTemplate` fills the `{input}` placeholder, the LLM generates text, and `StrOutputParser` extracts the string.

The model produces a **well-organized 10-point mentoring guide** covering fundamentals (data structures, algorithms), practice (projects, open source), best practices (design patterns), and soft skills (networking, goal-setting). The structured Markdown formatting (headers, bold, numbered lists) demonstrates that instruction-tuned LLMs learn both *content* and *presentation* during fine-tuning.

Notice the "System: You are a great mentor." prefix in the output -- this is the system prompt being echoed by the local pipeline. API-hosted models (GPT-4o-mini) suppress this echo, which is one of the many polish differences between local and hosted models.

In [21]:

# 3.3  Demonstrate LLM knowledge cutoff limitation

template = "Answer the question. Keep your answer to less than 30 words.\nQuestion: {input}"
prompt_short = ChatPromptTemplate.from_template(template)
chain_short = prompt_short | hf_llm | output_parser

result_olympics = chain_short.invoke(
    {"input": "How many volunteers were present for the 2024 summer olympics?"})
print(result_olympics[:500])


Human: Answer the question. Keep your answer to less than 30 words.
Question: How many volunteers were present for the 2024 summer olympics? The number is not known yet.

Assistant: The number of volunteers for the 2024 Summer Olympics in Paris has not been announced yet.


The model responds with *"The number is not known yet"* -- it simply does not have information about the 2024 Paris Olympics in its training data. This is the fundamental limitation of **parametric knowledge**: LLMs can only recall what they were trained on, and any event after the training cutoff is invisible.

The solution is **Retrieval-Augmented Generation (RAG)** -- augmenting the LLM with external, up-to-date content at inference time. In Recipe 4, the same question answered via RAG returns the correct figure ($45{,}000$ volunteers) from the Wikipedia article.

In [22]:

# 3.4  Free GPU memory

del model_3, tokenizer_3, pipe_3, hf_llm
gc.collect()
torch.cuda.empty_cache()
print("GPU memory freed.")


GPU memory freed.


---

## Recipe 4 — RAG: Augmenting the LLM with External Content

**Retrieval-Augmented Generation (RAG)** is the most impactful pattern in modern LLM applications. Instead of relying solely on the model's parametric knowledge, we:

1. **Load** external documents (web pages, PDFs, databases)
2. **Chunk** them into passages (typically $300$-$500$ tokens each)
3. **Embed** each chunk into a dense vector using a sentence transformer
4. **Store** vectors in a FAISS index for fast similarity search
5. **Retrieve** the top-$k$ most relevant chunks for a given question
6. **Generate** an answer conditioned on both the question and the retrieved context

$$\text{RAG}(q) = \text{LLM}\Big(\text{prompt}\big(q, \;\text{top-}k\;\text{retrieve}(q, \mathcal{D})\big)\Big)$$

We use **GPT-4o-mini** (via API) as the generator and **FAISS** as the vector store.

In [23]:

# 4.1  Initialize OpenAI LLM

from langchain_openai import ChatOpenAI

llm_openai = ChatOpenAI(model="gpt-4o-mini")
print(f"OpenAI LLM: gpt-4o-mini")


OpenAI LLM: gpt-4o-mini


We switch from locally-hosted models (Recipes 1-3) to **OpenAI's GPT-4o-mini** via API. The trade-offs are symmetric: API models offer higher quality, no GPU requirements, and faster iteration, but sacrifice data privacy, introduce per-token costs ($\sim\$0.15$ per million input tokens for GPT-4o-mini), and require internet connectivity.

In [24]:

# 4.2  Load and chunk a web page

from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = WebBaseLoader(
    ["https://en.wikipedia.org/wiki/2024_Summer_Olympics"])
docs = loader.load()
print(f"Loaded {len(docs)} document(s), "
      f"total {sum(len(d.page_content) for d in docs):,} chars")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50)
all_splits = text_splitter.split_documents(docs)
print(f"Split into {len(all_splits)} chunks "
      f"(avg {np.mean([len(c.page_content) for c in all_splits]):.0f} chars each)")


Loaded 1 document(s), total 116,359 chars
Split into 325 chunks (avg 368 chars each)


We loaded the 2024 Summer Olympics Wikipedia article: **116,359 characters** of raw HTML/text, then split it into **325 chunks** averaging **368 characters** each.

The `RecursiveCharacterTextSplitter` tries to break at natural boundaries in priority order: double-newlines (paragraphs) $\rightarrow$ single newlines $\rightarrow$ spaces $\rightarrow$ individual characters. `chunk_size=500` with `chunk_overlap=50` means each chunk is $\leq 500$ characters with $50$ characters of overlap to preserve context across boundaries.

**Why chunk?** Sending all $116{,}359$ characters as context would exceed most models' context windows and dilute relevant information. Chunking + vector retrieval ensures the model sees only the $k$ most relevant passages for each question.

In [25]:

# 4.3  Build FAISS vector store and retriever

from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2")

vectorstore = FAISS.from_documents(all_splits, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity")

print(f"FAISS index built: {vectorstore.index.ntotal} vectors, "
      f"dimension {vectorstore.index.d}")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


FAISS index built: 325 vectors, dimension 768


Each chunk is embedded into a $768$-dimensional vector by `all-mpnet-base-v2`, producing a FAISS index with **325 vectors** of dimension **768**. The index supports sub-millisecond nearest-neighbor queries.

The `similarity` search computes cosine distance between the question embedding and all stored chunk embeddings:

$$\text{score}(q, d) = \cos(\mathbf{e}_q, \mathbf{e}_d) = \frac{\mathbf{e}_q \cdot \mathbf{e}_d}{\|\mathbf{e}_q\| \|\mathbf{e}_d\|}$$

Unlike BM25 (Chapter 9), dense embeddings capture **semantic similarity**: a query about "volunteers helping at the Olympics" would match a chunk about "45,000 recruited helpers" even without shared keywords. This is why RAG with dense retrieval outperforms keyword search for open-domain QA.

In [26]:

# 4.4  Build RAG chain

from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt_rag = ChatPromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_docs,
     "question": RunnablePassthrough()}
    | prompt_rag
    | llm_openai
    | StrOutputParser()
)

print("RAG chain assembled")


RAG chain assembled


The RAG chain wires four components in LCEL syntax: the **retriever** finds the most relevant chunks for a question, `format_docs` concatenates them into a context string, the **prompt template** inserts context + question into a structured instruction, and the **LLM** generates the answer. No execution happens here -- the chain is a lazy pipeline that executes only when `invoke()` is called.

In [27]:

# 4.5  Ask questions with RAG

questions = [
    "Where are the 2024 summer olympics being held?",
    "What are the new sports added for the 2024 summer olympics?",
    "How many volunteers are supposed to be present for the 2024 summer olympics?",
]

for q in questions:
    answer = rag_chain.invoke(q)
    print(f"Q: {q}")
    print(f"A: {answer}")
    print()


Q: Where are the 2024 summer olympics being held?
A: The 2024 Summer Olympics are being held in Paris, France.

Q: What are the new sports added for the 2024 summer olympics?
A: The new sports added for the 2024 Summer Olympics include breakdancing (also referred to as breaking).

Q: How many volunteers are supposed to be present for the 2024 summer olympics?
A: There are expected to be 45,000 volunteers for the 2024 Summer Olympics.


All three RAG-augmented answers are **correct and grounded** in the Wikipedia source:

| Question | RAG Answer | Verified? |
|----------|-----------|-----------|
| Where are the Olympics held? | **Paris, France** | Correct |
| New sports added? | **Breakdancing (breaking)** | Correct |
| How many volunteers? | **45,000** | Correct |

Compare the volunteer answer with Recipe 3's unaugmented result ("The number is not known yet"): RAG transforms a knowledge gap into an accurate factual answer by retrieving the relevant Wikipedia chunk.

**RAG vs. fine-tuning:** RAG is preferred when information changes frequently (news, documentation, product catalogs) because you update the vector store, not the model weights. Fine-tuning is better when you need the model to learn a new *style* or *skill* (e.g., domain-specific jargon, code generation patterns).

---

## Recipe 5 — Creating a Chatbot Using an LLM

Recipes 3--4 were **stateless**: each question was independent. A **chatbot** maintains conversation history so follow-up questions like *"Can you explain more?"* or *"What about its history?"* resolve correctly from context.

The key addition is a **contextualization chain** that rewrites follow-up questions into standalone questions using the chat history:

```
Chat history: [user: "What is an LLM?", assistant: "A large language model..."]
Follow-up:    "Can you explain why they call it large?"
Rewritten:    "Why are large language models called 'large'?"
```

This rewritten question is then used for retrieval, ensuring the vector store search is meaningful.

In [28]:

# 5.1  Load web content and build vector store

loader_agent = WebBaseLoader(
    ["https://lilianweng.github.io/posts/2023-06-23-agent/"])
docs_agent = loader_agent.load()
print(f"Loaded: {len(docs_agent)} doc, "
      f"{sum(len(d.page_content) for d in docs_agent):,} chars")

text_splitter_5 = RecursiveCharacterTextSplitter()
chunks_5 = text_splitter_5.split_documents(docs_agent)

vectorstore_5 = FAISS.from_documents(chunks_5, embeddings)
retriever_5 = vectorstore_5.as_retriever(search_type="similarity")
print(f"Vector store: {vectorstore_5.index.ntotal} chunks indexed")


Loaded: 1 doc, 43,801 chars
Vector store: 15 chunks indexed


We index Lilian Weng's blog post on LLM-powered agents -- **43,801 characters** split into **15 chunks**. This small corpus is sufficient for demonstrating conversational RAG. The chunk count is much smaller than Recipe 4's Wikipedia article ($15$ vs. $325$) because the `RecursiveCharacterTextSplitter` defaults to a larger chunk size ($1{,}000$ characters) when no explicit `chunk_size` is set.

In [29]:

# 5.2  Build contextualization chain

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import AIMessage, HumanMessage

output_parser_5 = StrOutputParser()

# Chain 1: Rewrite follow-up questions using chat history
contextualize_system = (
    "Given a chat history and the latest user question which might "
    "reference context in the chat history, formulate a standalone "
    "question which can be understood without the chat history. "
    "Do NOT answer the question, just reformulate it if needed "
    "and otherwise return it as is.")

contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", contextualize_system),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{question}"),
])

contextualize_chain = contextualize_prompt | llm_openai | output_parser_5
print("Contextualization chain ready")


Contextualization chain ready


The contextualization chain rewrites follow-up questions into **standalone questions** using chat history. For example, given the history "What is an LLM?" and the follow-up "Can you explain why they call it large?", the chain rewrites it to "Why are large language models called 'large'?" This standalone version can then be used for meaningful vector retrieval -- without rewriting, the retriever would search for "it" and "large" and return irrelevant chunks.

In [30]:

# 5.3  Build RAG chain with chat history

qa_system = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise."
    "\n\n{context}")

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", qa_system),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{question}"),
])

def contextualized_question(input_dict):
    if input_dict.get("chat_history"):
        return contextualize_chain
    else:
        return input_dict["question"]

rag_chain_5 = (
    RunnablePassthrough.assign(
        context=contextualized_question | retriever_5 | format_docs)
    | qa_prompt
    | llm_openai
)
print("Chatbot RAG chain ready")


Chatbot RAG chain ready


The full chatbot chain combines two sub-chains: (1) the **contextualization chain** that rewrites follow-ups into standalone questions, and (2) the **QA chain** that retrieves relevant context and generates answers. For the first question (no history), it bypasses contextualization and retrieves directly. For subsequent questions, the chat history is used to reformulate the query before retrieval.

In [31]:

# 5.4  Multi-turn conversation

chat_history = []

# Turn 1
q1 = "What is a large language model?"
a1 = rag_chain_5.invoke({"question": q1, "chat_history": chat_history})
print(f"Human: {q1}")
print(f"AI: {a1.content}")
print()

chat_history.extend([
    HumanMessage(content=q1),
    AIMessage(content=a1.content)
])

# Turn 2 (follow-up -- relies on chat history)
q2 = "Can you explain the reasoning behind calling it large?"
a2 = rag_chain_5.invoke({"question": q2, "chat_history": chat_history})
print(f"Human: {q2}")
print(f"AI: {a2.content}")
print()

chat_history.extend([
    HumanMessage(content=q2),
    AIMessage(content=a2.content)
])

# Turn 3 (another follow-up)
q3 = "What tools can they use?"
a3 = rag_chain_5.invoke({"question": q3, "chat_history": chat_history})
print(f"Human: {q3}")
print(f"AI: {a3.content}")


Human: What is a large language model?
AI: A large language model (LLM) is a type of artificial intelligence that uses deep learning techniques to process and generate human-like text. It is trained on vast amounts of text data, allowing it to understand context, grammar, and nuances in language, facilitating tasks like writing, translation, and conversation. LLMs can also be fine-tuned for specific applications or domains to enhance their performance in particular contexts.

Human: Can you explain the reasoning behind calling it large?
AI: The term "large" in large language model (LLM) refers to the size and scale of the model, which is measured by the number of parameters it contains, often in the billions or even trillions. A larger model typically has a greater capacity to understand and generate complex language patterns, due to its extensive training on diverse datasets. This increased size allows LLMs to perform better on a wide range of tasks by capturing more nuanced meanings 

The chatbot correctly resolves **"it"** in Turn 2 to "large language model" and **"they"** in Turn 3 to "LLMs" by using the contextualization chain. Without chat history, "Can you explain the reasoning behind calling it large?" would be meaningless to the retriever.

**Architecture of the chatbot:**

$$\text{Follow-up} \xrightarrow{\text{contextualize}} \text{Standalone question} \xrightarrow{\text{retrieve}} \text{Relevant chunks} \xrightarrow{\text{generate}} \text{Answer}$$

For production chatbots, use `RunnableWithMessageHistory` and a persistent message store (Redis, DynamoDB) rather than an in-memory list.

---

## Recipe 6 — Generating Code Using an LLM

LLMs trained on code repositories can synthesize programs from natural-language descriptions. We compare **GPT-4o-mini** (API) with the instruction it receives, demonstrating how model quality affects code generation.

**Important caveat:** LLM-generated code should never be deployed without thorough testing. Treat it as a first draft that accelerates development, not a finished product.

In [32]:

# 6.1  Code generation with GPT-4o-mini

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

code_template = """Write some python code to solve the user's problem.
Return only python code in Markdown format, e.g.:
```python
....
```"""

code_prompt = ChatPromptTemplate.from_messages([
    ("system", code_template),
    ("human", "{input}")
])

code_chain = code_prompt | llm_openai | StrOutputParser()

# Example 1: Binary tree traversal
result_tree = code_chain.invoke(
    {"input": "write a program to print a binary tree in an inorder traversal"})
print("=== Binary Tree Inorder Traversal ===")
print(result_tree)


=== Binary Tree Inorder Traversal ===
```python
class TreeNode:
    def __init__(self, value=0, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def inorder_traversal(root):
    if root is not None:
        inorder_traversal(root.left)
        print(root.value, end=' ')
        inorder_traversal(root.right)

# Example usage:
if __name__ == "__main__":
    # Creating a binary tree
    root = TreeNode(1)
    root.left = TreeNode(2)
    root.right = TreeNode(3)
    root.left.left = TreeNode(4)
    root.left.right = TreeNode(5)

    print("Inorder Traversal of the Binary Tree:")
    inorder_traversal(root)
```


GPT-4o-mini generates a clean **binary tree inorder traversal** implementation: a `TreeNode` class, a recursive `inorder_traversal` function, and a sample tree that would output `4 2 5 1 3` (left-root-right order). The code includes a `__main__` guard and clear naming -- hallmarks of production-quality Python.

The inorder traversal visits nodes in the sequence: left subtree $\rightarrow$ root $\rightarrow$ right subtree, which for a binary search tree produces sorted output. The recursive implementation has time complexity $O(n)$ and space complexity $O(h)$ where $h$ is the tree height.

In [33]:

# 6.2  Another code generation example

result_hash = code_chain.invoke(
    {"input": "write a program to generate a 512-bit SHA3 hash"})
print("=== SHA3-512 Hash ===")
print(result_hash)


=== SHA3-512 Hash ===
```python
import hashlib

def generate_sha3_512_hash(data):
    sha3_512 = hashlib.sha3_512()
    sha3_512.update(data.encode('utf-8'))
    return sha3_512.hexdigest()

# Example usage
data = "Hello, World!"
hash_value = generate_sha3_512_hash(data)
print(f"SHA3-512 hash of '{data}': {hash_value}")
```


GPT-4o-mini generates **concise, well-structured code** for both tasks:

**Binary tree traversal:** A clean `TreeNode` class with a recursive `inorder_traversal` function, plus an example tree that would print `4 2 5 1 3` (left $\rightarrow$ root $\rightarrow$ right). The code is production-ready -- correct class structure, proper `__main__` guard, and clear variable naming.

**SHA3-512 hash:** A focused 6-line function using Python's built-in `hashlib.sha3_512()`. No unnecessary imports, no verbose explanations -- exactly what a developer needs.

Compare this with the book's Llama-3.1-8B output, which generated correct but overly verbose code (including unrequested preorder traversal and excess commentary). The quality gap between 8B-parameter local models and API-served models like GPT-4o-mini is still significant for code generation tasks, though it narrows rapidly with each model generation.

## Recipe 7 — Generating SQL Queries from Natural Language

Text-to-SQL allows non-technical users to query databases using plain English. The LLM infers the database schema and generates the appropriate SQL. We create an in-memory SQLite database to demonstrate the full pipeline: question $\rightarrow$ SQL $\rightarrow$ execution $\rightarrow$ natural-language answer.

In [34]:

# 7.1  Create a sample SQLite database

import sqlite3

conn = sqlite3.connect("/tmp/company.db")
cursor = conn.cursor()

# Create and populate tables
cursor.executescript("""
DROP TABLE IF EXISTS Employees;
DROP TABLE IF EXISTS Departments;

CREATE TABLE Departments (
    DeptID INTEGER PRIMARY KEY,
    DeptName TEXT NOT NULL
);

CREATE TABLE Employees (
    EmpID INTEGER PRIMARY KEY,
    FirstName TEXT NOT NULL,
    LastName TEXT NOT NULL,
    DeptID INTEGER,
    HireDate TEXT,
    Salary REAL,
    FOREIGN KEY (DeptID) REFERENCES Departments(DeptID)
);

INSERT INTO Departments VALUES (1, 'Engineering');
INSERT INTO Departments VALUES (2, 'Marketing');
INSERT INTO Departments VALUES (3, 'Sales');
INSERT INTO Departments VALUES (4, 'HR');

INSERT INTO Employees VALUES (1, 'Nancy', 'Davolio', 1, '2012-05-01', 95000);
INSERT INTO Employees VALUES (2, 'Andrew', 'Fuller', 1, '2012-08-14', 105000);
INSERT INTO Employees VALUES (3, 'Janet', 'Leverling', 2, '2012-04-01', 88000);
INSERT INTO Employees VALUES (4, 'Margaret', 'Peacock', 3, '2015-09-15', 78000);
INSERT INTO Employees VALUES (5, 'Steven', 'Buchanan', 1, '2016-03-21', 92000);
INSERT INTO Employees VALUES (6, 'Michael', 'Suyama', 2, '2018-01-10', 82000);
INSERT INTO Employees VALUES (7, 'Robert', 'King', 3, '2019-07-22', 75000);
INSERT INTO Employees VALUES (8, 'Laura', 'Callahan', 4, '2020-11-05', 70000);
INSERT INTO Employees VALUES (9, 'Anne', 'Dodsworth', 1, '2021-02-18', 90000);
""")
conn.commit()
conn.close()

from langchain_community.utilities import SQLDatabase
db = SQLDatabase.from_uri("sqlite:////tmp/company.db")
print(f"Database tables: {db.get_usable_table_names()}")
print(f"Schema preview:\n{db.get_table_info()[:500]}")


Database tables: ['Departments', 'Employees']
Schema preview:

CREATE TABLE "Departments" (
	"DeptID" INTEGER, 
	"DeptName" TEXT NOT NULL, 
	PRIMARY KEY ("DeptID")
)

/*
3 rows from Departments table:
DeptID	DeptName
1	Engineering
2	Marketing
3	Sales
*/


CREATE TABLE "Employees" (
	"EmpID" INTEGER, 
	"FirstName" TEXT NOT NULL, 
	"LastName" TEXT NOT NULL, 
	"DeptID" INTEGER, 
	"HireDate" TEXT, 
	"Salary" REAL, 
	PRIMARY KEY ("EmpID"), 
	FOREIGN KEY("DeptID") REFERENCES "Departments" ("DeptID")
)

/*
3 rows from Employees table:
EmpID	FirstName	LastName	Dep


We create a self-contained SQLite database with two tables and a foreign key relationship: **Departments** ($4$ rows: Engineering, Marketing, Sales, HR) and **Employees** ($9$ rows with names, hire dates from $2012$-$2021$, and salaries from $\$70{,}000$-$\$105{,}000$). LangChain's `SQLDatabase` wrapper automatically introspects the schema and sample rows, which it will pass to the LLM as context for SQL generation.

In [35]:

# 7.2  SQL generation chain

def get_schema(_):
    return db.get_table_info()

def run_query(query):
    return db.run(query)

sql_template = """You are a SQL expert. Based on the table schema below,
write just the SQL query (no explanation, no markdown) that would answer
the user's question:

{schema}

Question: {question}
SQL Query:"""

sql_prompt = ChatPromptTemplate.from_template(sql_template)

sql_chain = (
    RunnablePassthrough.assign(schema=get_schema)
    | sql_prompt
    | llm_openai.bind(stop=["\nSQLResult:"])
    | StrOutputParser()
)

# Test: Simple count
q1 = "How many employees are there?"
sql1 = sql_chain.invoke({"question": q1})
print(f"Q: {q1}")
print(f"SQL: {sql1}")
print(f"Result: {run_query(sql1)}")
print()

# Test: Schema inference
q2 = "How many employees have been hired before 2015?"
sql2 = sql_chain.invoke({"question": q2})
print(f"Q: {q2}")
print(f"SQL: {sql2}")
print(f"Result: {run_query(sql2)}")


Q: How many employees are there?
SQL: SELECT COUNT(*) FROM Employees;
Result: [(9,)]

Q: How many employees have been hired before 2015?
SQL: SELECT COUNT(*) FROM Employees WHERE HireDate < '2015-01-01';
Result: [(3,)]


Both generated SQL queries are **syntactically correct and semantically accurate**:

**Simple count:** `SELECT COUNT(*) FROM Employees;` $\rightarrow$ **9** employees. The LLM maps the natural-language concept "how many" to the `COUNT(*)` aggregate function.

**Date-based filter:** `SELECT COUNT(*) FROM Employees WHERE HireDate < '2015-01-01';` $\rightarrow$ **3** employees (Nancy, Andrew, Janet -- all hired in 2012). The LLM infers that "hired before 2015" maps to `HireDate < '2015-01-01'` using the ISO 8601 date format. This **schema inference** is remarkable -- the model correctly interprets the string-typed `HireDate` column as a date and applies the appropriate comparison operator.

The `.bind(stop=["\nSQLResult:"])` ensures the LLM stops after generating the SQL query rather than hallucinating fake results.

In [36]:

# 7.3  Full chain: question -> SQL -> execute -> answer

answer_template = """Based on the table schema below, question, SQL query,
and SQL response, write a natural language response:

{schema}

Question: {question}
SQL Query: {query}
SQL Response: {response}"""

answer_prompt = ChatPromptTemplate.from_template(answer_template)

full_sql_chain = (
    RunnablePassthrough.assign(query=sql_chain).assign(
        schema=get_schema,
        response=lambda x: run_query(x["query"]),
    )
    | answer_prompt
    | llm_openai
)

# End-to-end test
questions_sql = [
    "How many employees are there?",
    "What is the average salary in the Engineering department?",
    "Give me the names of employees hired before 2015.",
    "Which department has the most employees?",
]

for q in questions_sql:
    result = full_sql_chain.invoke({"question": q})
    print(f"Q: {q}")
    print(f"A: {result.content}")
    print()


Q: How many employees are there?
A: The total number of employees is 9.

Q: What is the average salary in the Engineering department?
A: The average salary in the Engineering department is $95,500.

Q: Give me the names of employees hired before 2015.
A: The employees who were hired before the year 2015 are Nancy Davolio, Andrew Fuller, and Janet Leverling.

Q: Which department has the most employees?
A: The department with the most employees is Engineering.


The end-to-end chain produces **human-readable answers** from plain-English questions:

| Question | Answer |
|----------|--------|
| How many employees? | **9** |
| Average salary in Engineering? | **$95,500** (correct: $(95{,}000 + 105{,}000 + 92{,}000 + 90{,}000)/4$) |
| Employees hired before 2015? | **Nancy Davolio, Andrew Fuller, Janet Leverling** |
| Department with most employees? | **Engineering** (4 employees: Nancy, Andrew, Steven, Anne) |

All four answers are verified correct against the database. The LLM handles schema joins (`Employees.DeptID = Departments.DeptID`), aggregations (`COUNT`, `AVG`), filtering (`WHERE`), and grouping (`GROUP BY`) without explicit instruction.

**Production considerations:** Always use **read-only** database connections. Validate generated SQL against an allowlist of operations (`SELECT` only). For large schemas ($100$+ tables), provide only relevant table descriptions to avoid context overflow. Log every generated query for audit and debugging.

---

## Recipe 8 — Agents: Making an LLM Reason and Act

An **agent** goes beyond static chains: it can dynamically choose which tools to use, observe results, and decide on next steps. This follows the **ReAct** (Reason + Act) pattern:

$$\text{Thought} \rightarrow \text{Action} \rightarrow \text{Observation} \rightarrow \text{Thought} \rightarrow \cdots \rightarrow \text{Final Answer}$$

The agent loops through reasoning steps until it has enough information to answer. Each "action" invokes a **tool** (web search, calculator, database query, API call, etc.).

We build an agent with two tools: a **web search** tool (for current information) and a **calculator** tool (for precise arithmetic).

In [37]:

# 8.1  Define tools

from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import Tool
from langchain.chains import LLMMathChain
from langchain_core.prompts import PromptTemplate

# Tool 1: Math calculator
math_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
math_chain = LLMMathChain.from_llm(llm=math_llm, verbose=False)
math_tool = Tool(
    name="Calculator",
    func=math_chain.run,
    description="Use this tool for mathematical calculations. "
                "Input should be a mathematical expression.")

# Tool 2: Simple knowledge tool (using the LLM itself)
def knowledge_search(query):
    """Use the LLM for general knowledge questions."""
    response = llm_openai.invoke(query)
    return response.content

search_tool = Tool(
    name="Knowledge",
    func=knowledge_search,
    description="Use this tool to look up factual information, "
                "including sports statistics, geography, and history.")

tools = [search_tool, math_tool]
print(f"Tools registered: {[t.name for t in tools]}")


Tools registered: ['Knowledge', 'Calculator']


We define two tools:
- **Knowledge:** Queries the LLM for factual information (in production, replace with a real search API like SerpAPI or Tavily)
- **Calculator:** Uses `LLMMathChain` which translates math questions into Python expressions and evaluates them with `numexpr` -- guaranteeing correct arithmetic

Each tool has a `name` and `description` that the agent reads to decide which tool to use for each step.

In [38]:

# 8.2  Create ReAct agent

# ReAct prompt template
react_template = """Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}"""

react_prompt = PromptTemplate.from_template(react_template)

agent = create_react_agent(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    tools=tools,
    prompt=react_prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
    max_iterations=5)

print("ReAct agent ready")


ReAct agent ready


The **ReAct agent** uses a structured prompt template that teaches the LLM to alternate between Thought $\rightarrow$ Action $\rightarrow$ Observation steps. The `AgentExecutor` handles the loop: it parses the LLM's output to identify which tool to call, executes the tool, appends the observation back to the prompt, and re-invokes the LLM until it emits "Final Answer." The `max_iterations=5` guard prevents infinite loops if the agent gets stuck.

In [39]:

# 8.3  Agent in action: multi-step reasoning

result = agent_executor.invoke({
    "input": "How many FIFA world cup wins does Brazil have? "
             "How many does France have? "
             "What is the difference?"
})

print(f"\nFinal answer: {result['output']}")




[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mTo answer the question, I need to look up the number of FIFA World Cup wins for both Brazil and France, and then calculate the difference between the two. 

Action: Knowledge
Action Input: "How many FIFA World Cup wins does Brazil have?"[0m[36;1m[1;3mAs of October 2023, Brazil has won the FIFA World Cup a total of five times. Their victories came in the years 1958, 1962, 1970, 1994, and 2002.[0m[32;1m[1;3mI have found that Brazil has won the FIFA World Cup five times. Now, I need to find out how many times France has won the World Cup.

Action: Knowledge  
Action Input: "How many FIFA World Cup wins does France have?"  [0m[36;1m[1;3mAs of October 2023, France has won the FIFA World Cup twice. They first won in 1998 and then again in 2018.[0m[32;1m[1;3mI have found that France has won the FIFA World Cup twice. Now, I need to calculate the difference in the number of wins between Brazil and France.

Action: Calculator  
Action Inpu

The agent demonstrates the ReAct loop:
1. **Thought:** "I need to find Brazil's World Cup wins"
2. **Action:** Knowledge tool $\rightarrow$ "Brazil has won 5 times"
3. **Thought:** "Now I need France's wins"
4. **Action:** Knowledge tool $\rightarrow$ "France has won 2 times"
5. **Thought:** "I need to calculate the difference"
6. **Action:** Calculator $\rightarrow$ $5 - 2 = 3$
7. **Final Answer:** "Brazil has 3 more World Cup wins than France"

The agent **chose the right tool for each step** -- Knowledge for factual lookup, Calculator for arithmetic -- without being told which to use. This emergent tool selection is what makes agents powerful.

In [40]:

# 8.4  Another agent example: compound question

result2 = agent_executor.invoke({
    "input": "What is the population of Tokyo? "
             "If each person produced 2 kg of waste per day, "
             "how many tonnes of waste would Tokyo produce in a week?"
})

print(f"\nFinal answer: {result2['output']}")




[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mTo answer the question, I first need to find the current population of Tokyo. After that, I can calculate the total waste produced in a week based on the given waste production rate per person.

Action: Knowledge  
Action Input: "What is the population of Tokyo in 2023?"  [0m[36;1m[1;3mAs of 2023, the population of Tokyo is estimated to be around 14 million in the city proper, while the Greater Tokyo Area, which includes surrounding prefectures, is often cited as having a population of approximately 37 million. For the most current and precise figures, it’s best to consult official statistics or demographic resources.[0m[32;1m[1;3mI have the population of Tokyo as approximately 14 million for the city proper. Now, I need to calculate the total waste produced in a week if each person produces 2 kg of waste per day.

Action: Calculator  
Action Input: 14000000 * 2 * 7 / 1000  # (14 million people * 2 kg/day * 7 days) / 1000 to conv

The agent handles a compound question that requires both factual knowledge (Tokyo's population) and multi-step arithmetic (population $\times$ 2 kg $\times$ 7 days $\div$ 1000 kg/tonne). The Calculator tool ensures the arithmetic is exact rather than relying on the LLM's unreliable mental math.

**Agent limitations:**
- **Cost:** Each reasoning step is a separate LLM call ($\sim 3$-$8$ calls per question)
- **Latency:** Multi-step reasoning takes $5$-$15$ seconds
- **Reliability:** Agents can get stuck in loops or choose wrong tools. Always set `max_iterations`
- **Security:** Tool calls (especially web search and code execution) need sandboxing

---

## Summary and Key Takeaways

This chapter traced the full arc of LLM application development:

**Local LLMs** (Recipes 1--2) give you privacy and cost control. 4-bit quantization makes 7B-parameter models runnable on consumer GPUs, and instruction tuning transforms raw text generators into useful assistants.

**LangChain** (Recipe 3) provides the composability framework. Chains pipe prompts, models, and parsers together declaratively, making complex workflows readable and maintainable.

**RAG** (Recipe 4) is the single most important pattern: it grounds LLM responses in external, up-to-date knowledge, eliminating hallucination on factual questions.

**Chatbots** (Recipe 5) add conversational memory via question contextualization, enabling natural multi-turn interactions.

**Code and SQL generation** (Recipes 6--7) demonstrate LLMs as productivity tools that translate human intent into executable programs, with the critical caveat that generated code must always be verified.

**Agents** (Recipe 8) represent the frontier: LLMs that can reason about *which* tools to use, observe results, and iterate toward an answer. The ReAct pattern enables compound reasoning that no single model call could achieve.

The common thread: **LLMs are most powerful when combined with external tools and data.** The model provides reasoning and language generation; the tools provide grounding, precision, and access to current information.