# 📄 Document Loaders in RAG

This notebook explores **document loaders**, the ingestion step of **Retrieval-Augmented Generation (RAG)** pipelines.

Before embeddings and vector search, we need a way to **ingest raw data (PDFs, text, HTML, CSV, etc.) into structured document objects** that can be chunked and embedded.

---

## 🎯 Objectives
- Understand the role of **document loaders** in RAG.  
- Explore different **file types** that can be loaded.  
- Learn how to use **LangChain’s community loaders** for handling diverse data sources.  
- Prepare documents for **chunking + embedding** in later steps.  

---

## 📚 Topics Covered

### 1. Introduction to Document Loaders
- Why we need them in RAG pipelines  
- Structure of a `Document` object (text + metadata)  

### 2. PDF Loading Example
- Using `PyPDFLoader` to load research papers and textbooks  
- Inspecting how pages are stored as `Document` objects  

### 3. Other Loader Options
- **TextLoader** → plain `.txt` files  
- **CSVLoader** → tabular data  
- **UnstructuredFileLoader** → general-purpose loader for HTML, Word, and PowerPoint files  
- **WebBaseLoader** → load directly from websites  

### 4. Preparing for Chunking
- Why we can’t embed large documents directly  
- Loader → Chunker → Embedder → Retriever flow  
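
Embedding models accept only a limited amount of input text, and a single vector for a whole document blurs its separate topics together, so the text is split before it is embedded. The flow above can be sketched with toy stand-ins. Everything here is an assumption made for illustration: the `chunk`, `embed`, `cosine`, and `retrieve` helpers are hypothetical, the "loader" is a hard-coded string, and the bag-of-words "embedding" only mimics what a real embedding model does.

```python
import math
from collections import Counter


def chunk(text):
    """Chunker: one chunk per sentence (a deliberately naive rule)."""
    return [s.strip() for s in text.split(". ") if s.strip()]


def embed(text):
    """Toy embedder: word counts stand in for a learned vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Retriever: rank chunks by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Loader step (stand-in): raw text that a real loader would read from a file.
raw_text = (
    "Loaders read files into documents. "
    "Chunkers split documents into small pieces. "
    "Embedders turn each piece into a vector. "
    "Retrievers find the pieces closest to a query."
)

chunks = chunk(raw_text)
top = retrieve("split documents into pieces", chunks)
print(top[0])
```

The query shares the most words with the sentence about chunkers, so that chunk ranks first. A real pipeline swaps each helper for a library component but keeps the same order of steps.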

---


### PDF Loader

In [20]:
# Load a PDF into a list of Document objects (one per page)
from langchain_community.document_loaders import PyPDFLoader

file_path = "resume.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

2


In [24]:
print(docs[0].page_content[:500])  # First 500 characters of page 1
print(docs[1].page_content[:500])  # First 500 characters of page 2

CLEAVESTONE ADUNGO 
cleavestone94@gmail.com   |   +254703457427   |   Nairobi Kenya       
GitHub: Link  |   Linkedln: Link | Portfolio: Link 
Summary      
Data Scientist transitioning from finance with 2+ years of hands-on ML, NLP, and data analytics experience. 
Skilled in Python, SQL, statistical modeling, LLM fine-tuning, RAG systems, and MLOps pipelines. Diverse 
industry background (education, logistics, finance, data annotation) with a proven track record of turning raw 
data into action
Education and Training      
Moi University – Eldoret, Kenya 
Bachelor of Science in Mathematics | 2012 – 2017 
Second Class Honors, Upper Division 
Relevant coursework: Probability & Statistics, Linear Algebra, Real Analysis, Numerical Methods, 
Mathematical Modeling, Calculus 
Projects      
#HRJ#784bc4ba-1ee5-4196-8c02-14eda4a9e2f9# 
Customer Churn Prediction – End-to-End MLOps Pipeline 
Built a production-ready churn prediction system for a banking dataset using LightGBM, achieving an ROC A
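
Each page comes back as its own object, which is why `len(docs)` is 2 for this file. As a rough mental model, a loaded page pairs a `page_content` string with a `metadata` dict. The `PageDoc` class below is a hypothetical stand-in written for illustration, not LangChain's actual `Document` class, and the sample text and metadata values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Illustrative stand-in for a loaded page: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical pages, shaped like a per-page PDF load: one object per page.
pages = [
    PageDoc("Summary: data scientist ...", {"source": "resume.pdf", "page": 0}),
    PageDoc("Projects: churn prediction ...", {"source": "resume.pdf", "page": 1}),
]

for doc in pages:
    # Metadata travels with the text, so a retrieved chunk can point back to its origin.
    print(doc.metadata["page"], doc.page_content[:20])
```

Keeping the source and page number next to the text is what later lets a RAG answer cite where a passage came from.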

### CSV Loader

In [27]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    file_path='iris.csv',
    csv_args={
        'delimiter': ',',
        'quotechar': '"',
        # Explicit fieldnames make the reader treat the file's header line as data (row 0)
        'fieldnames': ['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species'],
    },
)

In [30]:
docs = loader.load()
print(docs[1].page_content[:100])  # docs[0] holds the header line, so docs[1] is the first data row
print(docs[1].metadata)

Id: 1
SepalLengthCm: 5.1
SepalWidthCm: 3.5
PetalLengthCm: 1.4
PetalWidthCm: 0.2
Species: Iris-setosa
{'source': 'iris.csv', 'row': 1}
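
The output above shows the pattern: each CSV row becomes one document whose text is a list of `column: value` lines, with the row index stored in the metadata. The sketch below reproduces that mapping with the standard library's `csv.DictReader`. It is an approximation for illustration, not LangChain's implementation; the `rows_to_docs` helper and the three-column sample are hypothetical. It also shows a side effect of passing `fieldnames`: the header line is no longer consumed as a header and becomes row 0, which would explain why the first flower sits at `docs[1]` rather than `docs[0]`.

```python
import csv
import io

# Hypothetical sample in the same layout as iris.csv, trimmed to three columns.
sample = io.StringIO(
    "Id,SepalLengthCm,Species\n"
    "1,5.1,Iris-setosa\n"
    "2,4.9,Iris-setosa\n"
)


def rows_to_docs(handle, source, **csv_args):
    """Turn each CSV row into 'column: value' text plus source/row metadata."""
    docs = []
    for i, row in enumerate(csv.DictReader(handle, **csv_args)):
        text = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": text, "metadata": {"source": source, "row": i}})
    return docs


# Without fieldnames, the first line is read as the header and skipped.
inferred = rows_to_docs(sample, "sample.csv")

# With explicit fieldnames, the header line is returned as an ordinary row.
sample.seek(0)
explicit = rows_to_docs(sample, "sample.csv", fieldnames=["Id", "SepalLengthCm", "Species"])

print(len(inferred), len(explicit))
print(explicit[0]["page_content"])
```

If the file already has a header line, leaving `fieldnames` out avoids the extra header document.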


### TextLoader

In [31]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Step 1: Load the text file
loader = TextLoader("RAG_COURSE.txt")  # replace with your file
documents = loader.load()

print("Loaded documents:", len(documents))
print("Sample:", documents[0].page_content[:300])

# Step 2: Split into chunks (important for RAG)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=100  # characters shared between neighbouring chunks
)
docs = text_splitter.split_documents(documents)

print("Number of chunks:", len(docs))
print("First chunk:\n", docs[0].page_content)


Loaded documents: 1
Sample: # Complete RAG Course Outline: Beginner to Advanced

## Module 1: Foundations and Prerequisites (Week 1-2)

### 1.1 Introduction to RAG
- What is Retrieval-Augmented Generation?
- RAG vs Traditional Language Models
- Key components of RAG systems
- Use cases and applications
- Industry examples and 
Number of chunks: 34
First chunk:
 # Complete RAG Course Outline: Beginner to Advanced

## Module 1: Foundations and Prerequisites (Week 1-2)

### 1.1 Introduction to RAG
- What is Retrieval-Augmented Generation?
- RAG vs Traditional Language Models
- Key components of RAG systems
- Use cases and applications
- Industry examples and success stories
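
The first chunk above stops at a line boundary, well short of 500 characters, because the splitter prefers to break on natural separators (paragraph breaks, then line breaks, then spaces) instead of cutting at a fixed offset. A simplified sketch of that idea follows. It is an assumption-laden illustration, not LangChain's implementation: the `recursive_split` helper is hypothetical, it ignores `chunk_overlap`, and the sample text is made up.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "# Course Outline\n\n"
    "## Module 1\n"
    "- What is RAG?\n"
    "- Key components\n\n"
    "## Module 2\n"
    "- Loaders\n"
    "- Chunking"
)

pieces = recursive_split(sample, chunk_size=40)
for piece in pieces:
    print(repr(piece))
```

With a 40-character limit, the second section is too long to keep whole, so it is split again on single newlines, while the shorter third section stays in one piece.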


### WebBaseLoader

In [36]:
from langchain_community.document_loaders import WebBaseLoader

# Load a webpage directly
url = "https://en.wikipedia.org/wiki/Kenya"
loader = WebBaseLoader(url)

docs = loader.load()

# Inspect the first document
print(docs[0].page_content[:500])   # Preview first 500 characters
print(docs[0].metadata)






Kenya - Wikipedia

Jump to content

Main menu

Main menu
move to sidebar
hide

		Navigation

Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us

		Contribute

HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages

Search

Search

Appearance

Donate

Create account

Log in

Personal tools

Donate Create account Log in

		Pages f
{'source': 'https://en.wikipedia.org/wiki/Kenya', 'title': 'Kenya - Wikipedia', 'language': 'en'}
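
Text scraped from a web page tends to carry layout whitespace and navigation labels, as the preview above shows. A light cleanup pass before chunking keeps that whitespace from eating into the chunk budget. The sketch below is one possible approach using only the standard library; `tidy_whitespace` is a hypothetical helper and `raw` is a shortened, made-up excerpt in the style of the loaded page.

```python
import re


def tidy_whitespace(text):
    """Collapse blank-line runs and stray tabs/spaces left over from HTML layout."""
    text = re.sub(r"[ \t]+", " ", text)   # runs of spaces/tabs -> single space
    text = re.sub(r" ?\n ?", "\n", text)  # drop spaces hugging a newline
    text = re.sub(r"\n{2,}", "\n", text)  # blank-line runs -> one newline
    return text.strip()


raw = "\n\n\n\nKenya - Wikipedia\n\n\n\n\nJump to content\n\n\n\t\tNavigation\n\t\n\n\nMain menu\n"
print(tidy_whitespace(raw))
```

This only normalizes whitespace. Navigation labels such as "Main menu" remain in the text, so removing them would need filtering at the HTML level before the text is extracted.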


In [None]:
docs  # display the full list of loaded Document objects

[Document(metadata={'source': 'https://en.wikipedia.org/wiki/Kenya', 'title': 'Kenya - Wikipedia', 'language': 'en'}, page_content='\n\n\n\nKenya - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact us\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload fileSpecial pages\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAppearance\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDonate\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\nDonate Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1\nEtymology\n\n\n\n\n