# Document Loading for RAG Systems

## Overview

This notebook demonstrates **document loading strategies** for building RAG (Retrieval Augmented Generation) applications. Loading is the first critical step in the RAG pipeline:

```
Load → Split → Embed → Store → Retrieve → Generate
```

## Why Document Loading Matters

1. **Data Diversity**: RAG systems need to ingest multiple formats (PDFs, videos, wikis, APIs)
2. **Metadata Preservation**: Source tracking enables citation and audit trails
3. **Batch Processing**: Efficient loading reduces pipeline latency
4. **Content Extraction**: Proper parsing ensures high-quality text for embedding

## What We'll Cover

1. **PDF Loading** - Extracting text and metadata from PDF documents
2. **YouTube Audio** - Transcribing video content with OpenAI Whisper
3. **Generic Loaders** - Flexible patterns for custom data sources
4. **Web Content** - Loading pages over HTTP with `WebBaseLoader`
5. **Metadata Management** - Tracking provenance for retrieved content

## Environment Setup

Load environment variables and set the OpenAI API key. The key is required for the OpenAI Whisper API (audio transcription). `tiktoken` is imported for token counting and does not need the API key.

In [18]:
import os
import openai
import tiktoken  # OpenAI's tokenizer for token counting
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file (must contain OPENAI_API_KEY)
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']
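
Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier check follows; the helper name `require_env` is hypothetical and is not part of `openai` or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in the shell."
        )
    return value
```

With this in place, the assignment above could be written as `openai.api_key = require_env("OPENAI_API_KEY")`.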

---

## PDF Document Loading

### PyPDFLoader

LangChain's `PyPDFLoader` provides:
- **Page-by-page parsing** - Each page becomes a separate Document object
- **Metadata extraction** - PDF properties (creator, creation date, page numbers)
- **Text extraction** - Uses pypdf library under the hood

### Use Cases
- Legal documents (contracts, legislation)
- Research papers and technical documentation
- Reports and policy documents

### Example: Indian Data Protection Act (DPDPA)

In [19]:
# Load PDF - Digital Personal Data Protection Act (India, 2023)
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./99-DPDPA.pdf")
pages = loader.load()  # Returns list of Document objects (one per page)

### Verify Document Count

Check how many pages were loaded from the PDF.

In [20]:
# Count total pages in PDF
len(pages)

21

**Result**: 21 pages extracted from the PDF.

---

## Inspecting Document Structure

Each `Document` object contains:
- `page_content`: Extracted text from the page
- `metadata`: Dictionary with source info, page numbers, PDF properties

In [21]:
# Select first page for inspection
page = pages[0]

### Preview Page Content

Display first 500 characters to verify text extraction quality.

In [22]:
# Print first 500 characters of page content
print(page.page_content[0:500])

THE DIGITAL PERSONAL DATA PROTECTION ACT, 2023
(NO. 22 OF 2023)
[11th August, 2023.]
An Act to provide for the processing of digital personal data in a manner that
recognises both the right of individuals to protect their personal data and the
need to process such personal data for lawful purposes and for matters
connected therewith or incidental thereto.
BE it enacted by Parliament in the Seventy-fourth Year of the Republic of India as
follows:––
CHAPTER I
PRELIMINARY
1. (1) This Act may be cal


**Observation**: Clean text extraction from Act title, date, and preamble.

---

## Metadata Inspection

Metadata enables:
1. **Citation** - Reference source page in generated responses
2. **Filtering** - Retrieve only from specific documents/pages
3. **Audit Trail** - Track which sources influenced LLM output

In [23]:
# Inspect metadata dictionary
page.metadata

{'producer': 'iTextSharp™ 5.5.13.1 ©2000-2019 iText Group NV (AGPL-version)',
 'creator': 'PyPDF',
 'creationdate': '2023-08-12T02:13:03+05:30',
 'moddate': '2023-08-12T02:14:35+05:30',
 'source': './99-DPDPA.pdf',
 'total_pages': 21,
 'page': 0,
 'page_label': '1'}

**Metadata Fields**:
- `source`: File path
- `page`: Zero-indexed page number (0 = first page)
- `page_label`: Human-readable page number from PDF
- `total_pages`: Total document length
- `producer`, `creator`: PDF generation tools
- `creationdate`, `moddate`: Document timestamps

This rich metadata is critical for production RAG systems requiring provenance tracking.
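
As a rough illustration of citation and filtering, the sketch below works on any objects that expose a `.metadata` dict shaped like the output above. The helper names `format_citation` and `filter_by_pages` are hypothetical, and the `SimpleNamespace` objects are stand-ins for loaded pages rather than real `Document` instances.

```python
from pathlib import Path
from types import SimpleNamespace


def format_citation(metadata: dict) -> str:
    """Build a short human-readable citation from loader metadata."""
    name = Path(metadata.get("source", "unknown")).name
    # Prefer the PDF's own page label; fall back to the zero-indexed page + 1.
    label = metadata.get("page_label")
    if label is None and "page" in metadata:
        label = str(metadata["page"] + 1)
    return f"{name}, p. {label}" if label is not None else name


def filter_by_pages(docs, source: str, first: int, last: int) -> list:
    """Keep documents from `source` whose zero-indexed page is in [first, last]."""
    return [
        d
        for d in docs
        if d.metadata.get("source") == source
        and first <= d.metadata.get("page", -1) <= last
    ]


# Stand-in pages with the same metadata keys as the PyPDFLoader output above.
demo_pages = [
    SimpleNamespace(
        page_content=f"text of page {i}",
        metadata={"source": "./99-DPDPA.pdf", "page": i, "page_label": str(i + 1)},
    )
    for i in range(21)
]

subset = filter_by_pages(demo_pages, "./99-DPDPA.pdf", 0, 2)
print([format_citation(d.metadata) for d in subset])
# ['99-DPDPA.pdf, p. 1', '99-DPDPA.pdf, p. 2', '99-DPDPA.pdf, p. 3']
```

Because the functions only read `.metadata`, the same calls should work on the `pages` list loaded earlier.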

---

## YouTube Audio Loading & Transcription

### Use Case: Video Content as Knowledge Base

Many organizations have knowledge in video format:
- Training videos and tutorials
- Conference talks and lectures
- Product demos and webinars

**Challenge**: Video content is unsearchable unless transcribed.

**Solution**: LangChain's `YoutubeAudioLoader` + `OpenAIWhisperParser` pipeline.

### Architecture

```
YouTube URL → Download Audio (yt-dlp) → Transcribe (Whisper API) → Document
```

In [24]:
# Import YouTube audio loading components
from langchain_community.document_loaders.generic import GenericLoader, FileSystemBlobLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

### Component Breakdown

1. **YoutubeAudioLoader**: Downloads audio using `yt-dlp`
2. **OpenAIWhisperParser**: Transcribes audio via Whisper API
3. **GenericLoader**: Orchestrates blob loading → parsing pipeline
4. **FileSystemBlobLoader**: Alternative for pre-downloaded audio files

### Example: Stanford CS229 Lecture

Loading Andrew Ng's Machine Learning lecture (Autumn 2018).

In [25]:
# # Stanford CS229 Lecture 1 - Andrew Ng
# url = "https://www.youtube.com/watch?v=Rj5M_c5mgk8"
# save_dir = "docs/youtube/"

# # GenericLoader orchestrates: Blob Loader → Parser
# loader = GenericLoader(
#     YoutubeAudioLoader([url], save_dir),  # Download audio via yt-dlp
#     # FileSystemBlobLoader(save_dir, glob="*.m4a"),  # Alternative: load pre-downloaded files
#     OpenAIWhisperParser()  # Transcribe audio via OpenAI Whisper API
# )

# # This will: 1) Download audio, 2) Send to Whisper API, 3) Return transcribed text
# docs = loader.load()

### Process Breakdown

**Step 1**: `YoutubeAudioLoader` downloads audio
- Uses `yt-dlp` (must be installed: see [README](../README.md))
- Requires `ffmpeg` for audio extraction
- Saves to `docs/youtube/` directory

**Step 2**: `OpenAIWhisperParser` transcribes
- Sends audio to OpenAI Whisper API
- Splits long audio into chunks so that each request stays under the API's 25 MB upload limit
- Returns the transcribed text; timestamped output depends on the response format you request

**Step 3**: Returns Document objects
- `page_content`: Transcription text for one audio chunk
- `metadata`: Source of the audio (the downloaded file) and the chunk index; video title and duration are not added automatically

**Note**: This process can take several minutes for long videos and consumes OpenAI API credits.
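
To make the chunking step concrete, here is a minimal sketch of how a long recording can be divided into fixed-length time windows before each piece is sent for transcription. The 20-minute window and the function name `chunk_windows` are assumptions chosen for illustration; check the parser's source for the chunk length it actually uses.

```python
def chunk_windows(duration_s: float, window_s: float = 20 * 60) -> list[tuple[float, float]]:
    """Split [0, duration_s] into consecutive (start, end) windows in seconds."""
    if duration_s < 0 or window_s <= 0:
        raise ValueError("duration_s must be >= 0 and window_s must be > 0")
    duration_s = float(duration_s)
    windows = []
    start = 0.0
    while start < duration_s:
        # The last window is shortened so it never runs past the end of the audio.
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


# A hypothetical 50-minute recording yields two full windows and one short one.
print(chunk_windows(50 * 60))
# [(0.0, 1200.0), (1200.0, 2400.0), (2400.0, 3000.0)]
```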

---

## Inspect Transcription Output

In [26]:
# Preview first 500 characters of transcription
# docs[0].page_content[0:500]

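**Expected result**: The cells above are commented out to avoid an accidental download and API call. When they are uncommented and run, `docs[0].page_content` holds the transcription of the lecture audio, including speaker introductions and lecture content.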

### Production Considerations

1. **Caching**: Audio files are saved locally, so rerunning skips the download (fast). Transcription still calls the API on every run unless you also store the resulting documents
2. **Cost**: Whisper API charges ~$0.006/minute of audio
3. **Accuracy**: Whisper handles technical terminology well (CS/ML lectures)
4. **Batch Processing**: Use `FileSystemBlobLoader` for bulk transcription of downloaded files
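
The cost figure above turns into a simple back-of-the-envelope estimate. The sketch below assumes the rate quoted in this section; the constant and function names are hypothetical, and current pricing should be confirmed before relying on the number.

```python
WHISPER_USD_PER_MINUTE = 0.006  # rate quoted in this notebook; confirm current pricing


def estimate_transcription_cost(
    duration_s: float, usd_per_minute: float = WHISPER_USD_PER_MINUTE
) -> float:
    """Estimate the transcription cost in USD for audio of the given length."""
    return round(duration_s / 60 * usd_per_minute, 4)


# A hypothetical 80-minute recording.
print(estimate_transcription_cost(80 * 60))  # 0.48
```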

---

## Web Content Loading

### WebBaseLoader

Load content directly from URLs - useful for:
- Documentation pages
- Blog posts and articles
- GitHub README files
- Wiki pages

**Example**: Loading Basecamp's engineering handbook from GitHub.

In [27]:
# Load web content via HTTP request
from langchain_community.document_loaders import WebBaseLoader

# Example: Basecamp's programmer title ladder (GitHub markdown)
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")

USER_AGENT environment variable not set, consider setting it to identify your requests.


### How WebBaseLoader Works

1. **HTTP Request**: Fetches HTML content via requests library
2. **HTML Parsing**: Uses BeautifulSoup to extract text
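3. **Text Extraction**: Takes the text of the parsed page. Navigation menus and other page chrome are kept unless you restrict parsing, for example through the loader's `bs_kwargs`
4. **Document Creation**: Returns the extracted text with the URL stored in the metadata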
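
The parse-and-extract steps can be sketched with the standard library alone. This is a stand-in built on `html.parser`, not the loader's BeautifulSoup-based implementation, and the names `TextExtractor` and `html_to_text` are hypothetical. Note that in this sketch the navigation text is kept; only `<script>` and `<style>` contents are dropped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><nav>Sign in</nav><p>Hello <b>world</b></p>"
    "<script>var x=1;</script></body></html>"
)
print(html_to_text(sample))
# Sign in
# Hello
# world
```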

**Limitations**:
- JavaScript-rendered content requires Selenium (see `SeleniumURLLoader`)
- Rate limiting may block frequent requests
- Some sites require authentication

---

## Load and Inspect Web Content

In [28]:
# Fetch and parse the web page
docs = loader.load()

# Preview extracted text content
docs[0].page_content[0:500]

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nhandbook/titles-for-programmers.md at master · basecamp/handbook · GitHub\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNavigation Menu\n\nToggle navigation\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n            Sign in\n          \n\n\n \n\n\nAppearance settings\n\n\n\n\n\n\n\n\n\n\n\nPlatformAI CODE CREATIONGitHub CopilotWrite better code with AIGitHub SparkBuild and deploy intelligent appsGitHub ModelsManage and compare '

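**Result**: The first 500 characters are not clean markdown. They consist of long runs of blank lines and GitHub page chrome ("Skip to content", "Navigation Menu", "Sign in"), because `WebBaseLoader` fetched the rendered GitHub HTML page rather than the markdown file. Loading the raw file URL, or restricting parsing to the main content element, is likely to give cleaner text.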
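
The output above begins with runs of newlines and GitHub navigation text. Two small helpers can reduce that noise: one rewrites a `github.com/.../blob/...` URL to its `raw.githubusercontent.com` form, the other collapses blank-line runs in extracted text. Both function names are hypothetical, and the URL rewrite assumes a branch or tag name without slashes.

```python
import re
from urllib.parse import urlparse


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com '/blob/' URL to its raw.githubusercontent.com form."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected path shape: owner/repo/blob/ref/path/to/file
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        return url  # not a blob URL; leave it unchanged
    owner, repo, _, ref, *path = parts
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(path)}"


def collapse_whitespace(text: str) -> str:
    """Strip trailing spaces and reduce runs of blank lines to a single blank line."""
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(github_blob_to_raw(
    "https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md"
))
# https://raw.githubusercontent.com/basecamp/handbook/master/titles-for-programmers.md
```

The rewritten URL can be passed to `WebBaseLoader` in place of the original, and `collapse_whitespace(docs[0].page_content)` can be applied to whatever text comes back.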

---

## Summary: Document Loading Best Practices

### Loader Selection Guide

| Content Type | Loader | Key Benefits |
|-------------|--------|--------------|
| **PDFs** | `PyPDFLoader` | Page-level metadata, reliable extraction |
| **Videos/Audio** | `YoutubeAudioLoader` + `OpenAIWhisperParser` | High-quality transcription, handles technical content |
| **Web Pages** | `WebBaseLoader` | Simple HTTP-based loading, good for static sites |
| **Notion** | `NotionDirectoryLoader` | Loads pages from an exported Notion workspace directory |
| **GitHub** | `GitHubIssuesLoader`, `GithubFileLoader` | Issue tracking, repository files and documentation |
| **APIs** | Custom loaders (subclass `BaseLoader`) | Real-time data integration |

### Production Checklist

1. **Error Handling**: Wrap loaders in `try`/`except` to handle missing files and network failures
2. **Metadata Enrichment**: Add custom fields (department, category, timestamp)
3. **Caching**: Store loaded documents to avoid redundant API calls
4. **Batch Processing**: Process large document sets asynchronously
5. **Content Validation**: Check for empty documents, extraction failures
6. **Cost Tracking**: Monitor API usage (Whisper, web scraping limits)
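
A minimal sketch of items 1, 2, and 5 combined: it wraps any loader callable, drops near-empty documents, and adds custom metadata. The function name `load_safely`, the `min_chars` threshold, and the set of caught exceptions are assumptions for illustration; the `SimpleNamespace` objects stand in for real `Document` instances.

```python
import logging
from datetime import datetime, timezone
from types import SimpleNamespace

logger = logging.getLogger("loading")


def load_safely(load_fn, extra_metadata=None, min_chars=20):
    """Run a loader callable, then validate and enrich whatever it returns."""
    try:
        docs = list(load_fn())
    except (OSError, ValueError) as exc:
        # Assumed failure modes: missing file, network error, invalid path.
        # Widen this tuple if your loader raises other exception types.
        logger.warning("load failed: %s", exc)
        return []

    loaded_at = datetime.now(timezone.utc).isoformat()
    kept = []
    for doc in docs:
        # Content validation: skip pages with little or no extracted text.
        if len(doc.page_content.strip()) < min_chars:
            logger.warning("skipping near-empty document: %s", doc.metadata)
            continue
        # Metadata enrichment: add a load timestamp plus caller-supplied fields.
        doc.metadata.update({"loaded_at": loaded_at, **(extra_metadata or {})})
        kept.append(doc)
    return kept


def fake_loader():
    """Stand-in loader returning one useful page and one blank page."""
    return [
        SimpleNamespace(
            page_content="An Act to provide for the processing of digital personal data",
            metadata={"page": 0},
        ),
        SimpleNamespace(page_content="   ", metadata={"page": 1}),
    ]


checked = load_safely(fake_loader, {"category": "legal"})
print(len(checked), checked[0].metadata["category"])  # 1 legal
```

With a real loader, the call would look like `load_safely(PyPDFLoader("./99-DPDPA.pdf").load, {"category": "legal"})`.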

### Next Steps

Once documents are loaded:
1. **Split** into chunks (see `2-DocumentSplitting.ipynb`)
2. **Embed** using text embeddings (OpenAI, Cohere, etc.)
3. **Store** in vector database (Pinecone, Weaviate, Chroma)
4. **Retrieve** relevant chunks for LLM context
5. **Generate** responses with citations