# 01 - Setup and Basics

This notebook covers the fundamental setup for building RAG systems:

1. **Environment Setup** - API keys and configuration
2. **Document Loading** - Loading documents from web sources  
3. **Text Splitting** - Breaking documents into chunks
4. **Strategy Comparison** - Comparing different chunking approaches

**Prerequisites:** None (start here!)

**Duration:** ~10 minutes

**Outputs:**
- Loaded documents from LangChain documentation
- Document chunks produced with the default chunking strategy (`chunk_size=1000`, `chunk_overlap=200`)

## 1. Setup and Configuration

First, we'll set up our environment and verify that everything is configured correctly.

In [1]:
# Add parent directory to path to import shared module
import sys
sys.path.append('../..')

# Import shared utilities
from shared.config import verify_api_key, get_project_info, SECTION_WIDTH
from shared.utils import print_section_header

print_section_header("Environment Setup")

# Verify API key
api_key_ok = verify_api_key()

if not api_key_ok:
    raise ValueError("OpenAI API key not configured. See README.md for setup instructions.")

# Show configuration
print("\nProject Configuration:")
print("-" * SECTION_WIDTH)
info = get_project_info()
for key, value in info.items():
    print(f"{key:.<30} {value}")

print("\n✅ Environment setup complete!")


ENVIRONMENT SETUP

✓ OpenAI API Key: LOADED
  Preview: sk-proj...vIQA

Project Configuration:
--------------------------------------------------------------------------------
environment................... dev
debug_mode.................... False
log_level..................... INFO
project_root.................. /Users/gianlucamazza/Workspace/notebooks/llm_rag/notebooks/fundamentals/../..
vector_store_dir.............. /Users/gianlucamazza/Workspace/notebooks/llm_rag/notebooks/fundamentals/../../data/vector_stores
cache_dir..................... /Users/gianlucamazza/Workspace/notebooks/llm_rag/notebooks/fundamentals/../../data/cache
openai_api_key_loaded......... True
huggingface_api_key_loaded.... False
langsmith_api_key_loaded...... True
default_model................. gpt-4o-mini
default_temperature........... 0.0
openai_embedding_model........ text-embedding-3-small
hf_embedding_model............ sentence-transformers/all-MiniLM-L6-v2
chunk_size.................... 1000
chunk_over
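
The shared `verify_api_key` helper is not shown in this notebook. As a rough sketch of the kind of check it might perform, assuming the key lives in the `OPENAI_API_KEY` environment variable and that the preview keeps the first 7 and last 4 characters (as the output above suggests), something like the following would work. The names `mask_key` and `check_api_key` are hypothetical and are not part of the shared module.

```python
import os


def mask_key(key: str, head: int = 7, tail: int = 4) -> str:
    """Return a short preview of a secret with the middle hidden."""
    # Keys too short to mask safely are hidden entirely.
    if len(key) <= head + tail:
        return "*" * len(key)
    return f"{key[:head]}...{key[-tail:]}"


def check_api_key(env_var: str = "OPENAI_API_KEY") -> bool:
    """Report whether env_var is set, printing only a masked preview."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        print(f"{env_var}: MISSING")
        return False
    print(f"{env_var}: LOADED")
    print(f"  Preview: {mask_key(key)}")
    return True
```

Printing only a masked preview keeps the full key out of saved notebook output, which matters when notebooks are committed to version control.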

## 2. Document Loading

We'll load documentation from LangChain's website using the `WebBaseLoader`. 

The shared module provides a convenient function that:
- Loads documents from multiple URLs
- Adds custom metadata (source_type, process_date, domain)
- Returns Document objects ready for processing
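
To make the metadata step concrete, here is a minimal sketch that runs without network access. It uses a small `Doc` dataclass as a stand-in for a LangChain `Document`, and it assumes that `domain` means the host name of the source URL; the shared module may define it differently. `Doc` and `add_custom_metadata` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from urllib.parse import urlparse


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def add_custom_metadata(docs, source_type="web_documentation", today=None):
    """Attach source_type, process_date, and domain to each document in place."""
    stamp = (today or date.today()).isoformat()
    for doc in docs:
        source = doc.metadata.get("source", "")
        doc.metadata["source_type"] = source_type
        doc.metadata["process_date"] = stamp
        # Assumption: "domain" is the URL host; fall back when no URL is present.
        doc.metadata["domain"] = urlparse(source).netloc or "unknown"
    return docs


docs_demo = add_custom_metadata(
    [Doc("example text", {"source": "https://python.langchain.com/docs/use_cases/chatbots/"})],
    today=date(2025, 11, 12),
)
print(docs_demo[0].metadata)
```

Metadata added at load time travels with every chunk derived from the document, so it can later be used to filter or attribute retrieved results.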

In [2]:
from shared.loaders import load_langchain_docs
from shared.utils import print_results

print_section_header("Document Loading")

# Load documents using shared utility
# This loads from DEFAULT_LANGCHAIN_URLS defined in shared/config.py
docs = load_langchain_docs(verbose=True)

# Show sample document
if docs:
    print("\n" + "-" * SECTION_WIDTH)
    print("Sample Document:")
    print("-" * SECTION_WIDTH)
    sample_doc = docs[0]
    print(f"Source: {sample_doc.metadata.get('source', 'N/A')}")
    print(f"Title: {sample_doc.metadata.get('title', 'N/A')}")
    print(f"Source Type: {sample_doc.metadata.get('source_type', 'N/A')}")
    print(f"Process Date: {sample_doc.metadata.get('process_date', 'N/A')}")
    print(f"\nContent Preview (first 300 chars):\n{sample_doc.page_content[:300]}...")

print(f"\n✅ Successfully loaded {len(docs)} documents")


DOCUMENT LOADING

Loading 4 documents from web...
  - https://python.langchain.com/docs/use_cases/question_answering/
  - https://python.langchain.com/docs/modules/data_connection/retrievers/
  - https://python.langchain.com/docs/modules/model_io/llms/
  - https://python.langchain.com/docs/use_cases/chatbots/
✓ Loaded 4 documents
✓ Added custom metadata to all documents

--------------------------------------------------------------------------------
Sample Document:
--------------------------------------------------------------------------------
Source: https://python.langchain.com/docs/use_cases/question_answering/
Title: Build a RAG agent with LangChain - Docs by LangChain
Source Type: web_documentation
Process Date: 2025-11-12

Content Preview (first 300 chars):
Build a RAG agent with LangChain - Docs by LangChainSkip to main contentWe've raised a $125M Series B to build the platform for agent engineering. Read more.Docs by LangChain home pageLangChain + LangGraphSearch...⌘K

## 3. Text Splitting

Large documents must be split into smaller chunks for effective retrieval. Key parameters:

- **chunk_size**: Maximum number of characters per chunk
- **chunk_overlap**: Number of characters shared between consecutive chunks

### Why Overlap Matters

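Overlap reduces the context lost at chunk boundaries. When a sentence is split across two chunks, the shared characters mean the text around the split point appears in both, so at least one chunk is more likely to carry the sentence's meaning.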
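
The effect of overlap is easiest to see with a naive fixed-width splitter. The sketch below is an illustration only: it cuts purely by character position, whereas `split_documents` may respect separators such as paragraphs and sentences. `split_with_overlap` is a hypothetical helper, not a function from the shared module.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "The retriever returns chunks. Each chunk is embedded separately."
for piece in split_with_overlap(text, chunk_size=30, chunk_overlap=10):
    print(repr(piece))
```

Each printed piece begins with the last 10 characters of the previous one, so text near a cut point appears in two chunks instead of being severed.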

### Default Strategy

We use:
- `chunk_size=1000` (good balance of context and precision)
- `chunk_overlap=200` (preserves context at boundaries)

In [3]:
from shared.loaders import split_documents
from shared.config import DEFAULT_CHUNK_SIZE, DEFAULT_CHUNK_OVERLAP

print_section_header("Text Splitting")

# Split documents using default settings
chunks = split_documents(
    docs,
    chunk_size=DEFAULT_CHUNK_SIZE,
    chunk_overlap=DEFAULT_CHUNK_OVERLAP,
    verbose=True
)

print(f"\n✅ Created {len(chunks)} chunks")
print(f"   Using chunk_size={DEFAULT_CHUNK_SIZE}, overlap={DEFAULT_CHUNK_OVERLAP}")


TEXT SPLITTING

Splitting documents...
  - Chunk size: 1000
  - Chunk overlap: 200
✓ Created 120 chunks

  Sample chunk:
    - Length: 839 chars
    - Source: https://python.langchain.com/docs/use_cases/question_answering/
    - Preview: Build a RAG agent with LangChain - Docs by LangChainSkip to main contentWe've raised a $125M Series B to build the platform for agent engineering. Rea...

✅ Created 120 chunks
   Using chunk_size=1000, overlap=200


## 4. Strategy Comparison

Let's compare different splitting strategies to understand the trade-offs:

- **Large chunks (1000/200)**: More context, fewer chunks
- **Small chunks (500/100)**: More precision, more chunks
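
For a rough sense of how the two settings scale, the chunk count of a naive fixed-width splitter can be estimated as `ceil((n_chars - overlap) / (chunk_size - overlap))`. This is a back-of-the-envelope sketch under that assumption; a separator-aware splitter usually produces shorter chunks and therefore somewhat more of them. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(n_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many fixed-width windows cover n_chars characters."""
    if n_chars <= 0:
        return 0
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # new characters contributed per chunk
    return max(1, math.ceil((n_chars - chunk_overlap) / step))


for size, overlap in [(1000, 200), (500, 100)]:
    count = estimate_chunk_count(100_000, size, overlap)
    print(f"{size}/{overlap}: about {count} chunks per 100,000 characters")
```

Halving both parameters halves the step size, so the estimated count roughly doubles, along with the number of embeddings to compute and store.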

In [4]:
from shared.loaders import compare_splitting_strategies

print_section_header("Splitting Strategy Comparison")

# Compare two strategies
strategies = [
    (1000, 200),  # Default strategy - good balance
    (500, 100),   # Smaller chunks - more precise
]

results = compare_splitting_strategies(docs, strategies, verbose=True)

print("\n💡 Recommendation:")
print("   - Use 1000/200 for context-heavy questions")
print("   - Use 500/100 for precise information retrieval")
print("   - We'll use 1000/200 (default) for this tutorial")


SPLITTING STRATEGY COMPARISON


=== Splitting Strategy Comparison ===

Strategy        Chunk Size      Overlap         Chunks    
------------------------------------------------------------
1000/200        1000            200             120       
500/100         500             100             256       

💡 Larger chunks = more context, fewer chunks
💡 Smaller chunks = more precise, more chunks

💡 Recommendation:
   - Use 1000/200 for context-heavy questions
   - Use 500/100 for precise information retrieval
   - We'll use 1000/200 (default) for this tutorial


## 5. Inspect Chunks

Let's examine a few chunks to understand the structure:

In [5]:
print_section_header("Sample Chunks")

# Show first 3 chunks
for i, chunk in enumerate(chunks[:3], 1):
    print(f"\nChunk {i}:")
    print("-" * SECTION_WIDTH)
    print(f"Source: {chunk.metadata.get('source', 'N/A')}")
    print(f"Length: {len(chunk.page_content)} characters")
    print(f"Content: {chunk.page_content[:200]}...")

print("\n📊 Summary:")
print(f"   Total chunks: {len(chunks)}")
print(f"   Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
print(f"   Shortest chunk: {min(len(c.page_content) for c in chunks)} chars")
print(f"   Longest chunk: {max(len(c.page_content) for c in chunks)} chars")


SAMPLE CHUNKS


Chunk 1:
--------------------------------------------------------------------------------
Source: https://python.langchain.com/docs/use_cases/question_answering/
Length: 839 characters
Content: Build a RAG agent with LangChain - Docs by LangChainSkip to main contentWe've raised a $125M Series B to build the platform for agent engineering. Read more.Docs by LangChain home pageLangChain + Lang...

Chunk 2:
--------------------------------------------------------------------------------
Source: https://python.langchain.com/docs/use_cases/question_answering/
Length: 395 characters
Content: One of the most powerful applications enabled by LLMs is sophisticated question-answering (Q&A) chatbots. These are applications that can answer questions about specific source information. These appl...

Chunk 3:
--------------------------------------------------------------------------------
Source: https://python.langchain.com/docs/use_cases/question_answering/
Length: 558 characters


## Summary

In this notebook, we:

✅ Set up the environment and verified API keys  
✅ Loaded documents from LangChain documentation  
✅ Split documents into chunks using the default strategy (1000/200)  
✅ Compared different splitting strategies  

### Key Takeaways

- **chunk_size** sets the maximum number of characters, and therefore the amount of context, in each chunk
- **chunk_overlap** reduces information loss at chunk boundaries
- Larger chunks = more context, smaller chunks = more precision
- The default strategy (1000/200) balances context against precision

### Next Steps

Continue to **[02_embeddings_comparison.ipynb](02_embeddings_comparison.ipynb)** to:
- Create embeddings with OpenAI and HuggingFace
- Build FAISS vector stores
- Compare performance and quality

---

**💾 Important:** The `chunks` variable created here will be used in the next notebook. Keep this kernel running or re-run this notebook before proceeding.
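
If keeping the kernel alive is inconvenient, one optional alternative is to write the chunk text and metadata to disk and reload them later. The sketch below is not part of the tutorial's workflow. It assumes every metadata value is JSON-serializable, and the file name `chunks.json` and the helpers `save_chunks` and `load_chunks` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_chunks(records, path):
    """Write a list of {'page_content', 'metadata'} dicts to a JSON file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


def load_chunks(path):
    """Read the records back as plain dicts."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


records = [
    {
        "page_content": "example chunk",
        "metadata": {"source": "https://python.langchain.com/docs/use_cases/chatbots/"},
    }
]

# A temporary directory keeps the demonstration from touching project files.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chunks(records, Path(tmp) / "chunks.json")
    restored = load_chunks(saved)

print(restored == records)
```

In the notebook, the records could be built with `[{"page_content": c.page_content, "metadata": c.metadata} for c in chunks]`. The loaded dicts would then need to be turned back into Document objects before use.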