# Lab: Chunking, Embeddings, and RAG with LlamaIndex + Ollama (Local Only)

In this lab, we will:

- Take a **long-ish text** and **chunk** it into smaller pieces.
- Use an **embedding model** to represent those chunks as vectors.
- Build a simple **RAG (Retrieval-Augmented Generation)** pipeline:
  - Retrieve the most relevant chunks using embeddings.
  - Let a local LLM (via **Ollama**) generate an answer based on those chunks.

Everything runs **locally** using:

- **Ollama** for the LLM and (optionally) the embedding model.
- **LlamaIndex** for chunking, indexing, and querying.

By the end, you should be able to:

- Explain what **chunking** is and why it helps with context windows.
- Explain what an **embedding** is and how it’s used for retrieval.
- Understand how RAG adds **external knowledge** to an LLM.

Before you run this notebook, make sure the following prerequisites are met.

Install Ollama (on each machine):

1. Download it from https://ollama.com
2. Launch Ollama (double-click the app) and ask it a question to force it to download the required packages.
3. After installation, run the following in a terminal or command window:

    ollama pull llama3
    ollama pull nomic-embed-text

Then create a Conda environment and install the Python packages:

    conda create -n rag_ollama python=3.11 -y
    conda activate rag_ollama

    pip install \
      llama-index \
      llama-index-llms-ollama \
      llama-index-embeddings-ollama


In [None]:
%pip install llama-index-core llama-index-llms-ollama llama-index-embeddings-ollama


In [None]:
# Cell 2: Define a sample "document" and a user question

# This is our "long" document. In real scenarios, this might be
# several pages of a PDF or a long article.
doc_text = """
Retrieval-Augmented Generation (RAG) is a technique that combines large language models (LLMs)
with an external knowledge source. Instead of relying only on what was seen during training,
the model can look up relevant information at query time. This helps reduce hallucinations
and allows the model to stay up to date with new information.

A typical RAG pipeline works in two main steps. First, a retrieval component searches a
knowledge base to find the most relevant pieces of information, often called documents or chunks.
Second, the language model reads both the user question and the retrieved chunks, and then
generates an answer that is grounded in those sources.

To make retrieval efficient and accurate, we usually convert text into embeddings, which are
vector representations of meaning. Texts with similar meaning end up with similar embeddings.
We then store these embeddings in a vector database or index. At query time, we embed the user
question and look for the most similar vectors. Those corresponding chunks are fed into the LLM
as context.

Chunking is necessary because long documents cannot fit entirely into the model's context window.
By breaking documents into smaller chunks, we can efficiently search and select only the most
relevant parts. Chunk size and overlap are design choices: too small and we lose context; too large
and we may hit context limits or retrieve irrelevant material.
"""

# A user question we want the system to answer.
user_query = "Why do we need chunking in a RAG system, and how is it related to the context window?"

print("📄 Document preview (first 500 characters):\n")
print(doc_text[:500], "...\n")

print("❓ User query:")
print(user_query)


In [None]:
# Cell 3: Chunk the document into smaller pieces

# Simple manual chunking just for teaching:
# We'll break the document into chunks of about N characters.
# In real systems, we often chunk by tokens or sentences.
CHUNK_SIZE = 400
CHUNK_OVERLAP = 50

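def simple_char_chunk(text, chunk_size=400, overlap=50):
    """
    Very simple character-based chunking.
    Not production-ready, but good for teaching.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        # Otherwise `start` would never advance and the loop would not end.
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = start + chunk_size
        chunk = text[start:end].strip()
        if chunk:  # skip whitespace-only chunks
            chunks.append(chunk)
        if end >= text_length:
            break  # the last window reached the end of the text
        start = end - overlap  # step with overlap
    return chunks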

chunks = simple_char_chunk(doc_text, CHUNK_SIZE, CHUNK_OVERLAP)

print(f"🔹 Number of chunks created: {len(chunks)}\n")
for i, ch in enumerate(chunks, start=1):
    print(f"--- Chunk {i} ---")
    print(ch[:250], "...\n")  # show only first 250 chars for brevity
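
The comment in Cell 3 notes that real systems often chunk by tokens or sentences rather than raw characters. Below is a minimal sketch of sentence-based chunking, assuming a naive regex sentence splitter. The names `sentence_chunk`, `max_chars`, and `overlap_sentences` are hypothetical and chosen for illustration; they are not part of LlamaIndex.

```python
import re

def sentence_chunk(text, max_chars=400, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    Hypothetical helper: the regex split is naive and will mis-handle
    abbreviations such as "e.g." or "Dr.". A chunk can exceed max_chars
    when a single sentence (plus carried-over overlap) is longer than that.
    """
    # Collapse whitespace, then split after ., !, or ? followed by a space.
    normalized = " ".join(text.split())
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]

    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last few sentences forward as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = (
    "RAG combines retrieval with generation. Chunking splits long text. "
    "Overlap keeps context. Embeddings capture meaning."
)
for i, chunk in enumerate(sentence_chunk(sample, max_chars=80), start=1):
    print(i, chunk)
```

Unlike the character-based version, no chunk here starts or ends mid-sentence, which tends to make retrieved chunks easier for the LLM to read. The overlap is counted in sentences instead of characters.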


In [None]:
# Cell 4: Configure LlamaIndex to use Ollama locally (no OpenAI)

from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Make sure the Ollama server is running in the background (ollama serve).
# Also ensure you have pulled the necessary models:
#   ollama pull llama3
#   ollama pull nomic-embed-text

# Set the LLM to use Ollama's "llama3" model
Settings.llm = Ollama(model="llama3")

# Set the embedding model to use Ollama's "nomic-embed-text" model
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

print("✅ LlamaIndex is now configured to use Ollama for both LLM and embeddings (locally).")
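
Cell 4 assumes the Ollama server is running and both models have been pulled. A quick way to check this before building the index is sketched below. It assumes Ollama is listening on its default local address (`http://localhost:11434`) and uses the server's `/api/tags` endpoint, which lists locally available models. The helper names `list_ollama_models` and `missing_models` are hypothetical, and matching on the name before the `:` ignores tag differences such as `llama3:8b` versus `llama3:latest`.

```python
import json
import urllib.error
import urllib.request

def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return names of locally pulled models, or [] if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return []
    return [m.get("name", "") for m in payload.get("models", [])]

def missing_models(required, available):
    """Compare on the part before ':' so 'llama3' matches 'llama3:latest'."""
    have = {name.split(":")[0] for name in available}
    return [name for name in required if name.split(":")[0] not in have]

available = list_ollama_models()
missing = missing_models(["llama3", "nomic-embed-text"], available)
if not available:
    print("Could not reach Ollama at the default address; is `ollama serve` running?")
elif missing:
    print("Pull these models first:", ", ".join(missing))
else:
    print("Ollama is reachable and both models are available.")
```

If Ollama runs on a different host or port on your machine, pass that address as `base_url`.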


In [None]:
# Cell 5: Build a vector index from the chunks and run a RAG-style query

from llama_index.core import Document, VectorStoreIndex

# Wrap each chunk as a LlamaIndex Document.
# Note: Document expects keyword arguments (text=...).
documents = [Document(text=chunk) for chunk in chunks]

# Build a vector index from the documents.
# Under the hood:
# - Each chunk is embedded via OllamaEmbedding.
# - The embeddings are stored in a vector index.
index = VectorStoreIndex.from_documents(documents)

# Create a query engine.
query_engine = index.as_query_engine()

# Ask our user query.
response = query_engine.query(user_query)

print("❓ Query:")
print(user_query)
print("\n🧠 RAG-style answer (LlamaIndex + Ollama):\n")
print(response)
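
To make the two steps hidden behind `query_engine.query` more concrete, here is a rough sketch of top-k retrieval by cosine similarity followed by prompt assembly. It rests on loud assumptions: bag-of-words count vectors stand in for real embeddings (this is not what `nomic-embed-text` produces), and the names `toy_embed`, `cosine`, `retrieve`, and `build_prompt`, as well as the prompt wording, are hypothetical and not LlamaIndex internals.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in 'embedding': lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunk_texts, top_k=2):
    """Return the top_k (score, chunk) pairs, most similar first."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(c)), c) for c in chunk_texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

def build_prompt(query, retrieved):
    """Place the retrieved chunks ahead of the question as context."""
    context = "\n---\n".join(chunk for _, chunk in retrieved)
    return f"Context:\n{context}\n\nAnswer using only the context.\nQuestion: {query}"

toy_chunks = [
    "Chunking splits long documents so they fit in the context window.",
    "Embeddings are vector representations of meaning.",
    "Ollama runs language models locally.",
]
question = "Why does chunking help with the context window?"
top = retrieve(question, toy_chunks)
for score, chunk in top:
    print(f"{score:.2f}  {chunk}")
print(build_prompt(question, top))
```

A real embedding model would also score paraphrases highly even when they share no words with the query, which word counts cannot do. The overall shape (embed, rank, keep the top few, prepend as context) is the part worth taking away.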


## Student Exercises

### Exercise 1 – Change the Document

1. Go back to **Cell 2** and replace `doc_text` with your own text, for example:
   - A section from a textbook,
   - A class reading,
   - A web article (copied as plain text).
2. Keep the rest of the notebook the same.
3. Rerun **Cell 2 → 3 → 4 → 5**.

**Questions:**
- How many chunks are created now?
- Does the answer from the RAG pipeline correctly reflect the content of your new document?


### Exercise 2 – Experiment with Chunk Size

1. In **Cell 3**, change `CHUNK_SIZE` and `CHUNK_OVERLAP`, for example:
   - Small chunks: `CHUNK_SIZE = 200`, `CHUNK_OVERLAP = 20`
   - Larger chunks: `CHUNK_SIZE = 800`, `CHUNK_OVERLAP = 100`
2. Rerun **Cell 3 → 5**.

**Questions:**
- Do smaller chunks make the answer more specific or more fragmented?
- Do larger chunks risk including irrelevant parts of the text?
- How might this relate to the **context window** of the LLM?
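
For a rough sense of how `CHUNK_SIZE` and `CHUNK_OVERLAP` trade off, you can estimate the number of chunks before rerunning anything. The sketch below assumes a sliding window where each chunk starts `chunk_size - overlap` characters after the previous one and chunking stops once a window reaches the end of the text. `estimate_chunk_count` is a hypothetical helper, and the count a real chunker produces may differ by one in edge cases.

```python
import math

def estimate_chunk_count(text_length, chunk_size, overlap):
    """Estimate how many sliding windows cover text_length characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    step = chunk_size - overlap
    # One full first window, then one more window per step of remaining text.
    return 1 + math.ceil((text_length - chunk_size) / step)

# Compare the settings suggested in this exercise on a 1200-character text.
for size, overlap in [(200, 20), (400, 50), (800, 100)]:
    print(size, overlap, estimate_chunk_count(1200, size, overlap))
```

More chunks means more embedding calls and finer-grained retrieval; fewer, larger chunks means each retrieved chunk uses more of the context window.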


### Exercise 3 – Ask Different Questions

1. In **Cell 2**, keep the same `doc_text` but change `user_query` to questions like:
   - "What is an embedding and why is it useful?"
   - "How does RAG reduce hallucinations in language models?"
   - "What is the role of chunking in this system?"
2. Rerun **Cell 2 → 5**.

**Questions:**
- Does the model’s answer stay grounded in the text?
- Can you find a question where the model starts to guess beyond the given text?
