# Lesson 3.1: Document Loaders

---

To build practical Large Language Model (LLM) applications, especially Question Answering (Q&A) systems or knowledge-intensive chatbots, LLMs need access to information not present in their initial training data. This data can be scattered across various formats: text files, PDFs, web pages, databases, etc. **Document Loaders** in LangChain are tools that help you load this data into a format that LangChain can understand and process.

## 1. Introduction to Document Loaders

### 1.1. What are Document Loaders?

**Document Loaders** are classes in LangChain that load data from external sources and convert it into **`Document`** objects that LangChain can work with. This is the first and most crucial step in bringing your own data into LLM applications.

* **Relationship:** Document Loaders are the foundation for **Retrieval-Augmented Generation (RAG)** systems, where LLMs need to retrieve information from a specific data store to generate accurate and contextual answers.
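To make the loader idea concrete, here is a minimal sketch of the contract a loader fulfils: read a source, then return a list of document objects that record where the text came from. The `Document` dataclass and the `SimpleTextLoader` class below are hypothetical stand-ins written for illustration, not LangChain's own implementation.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader: reads one text file and wraps it in a Document."""

    def __init__(self, file_path: str, encoding: str = "utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> list:
        text = Path(self.file_path).read_text(encoding=self.encoding)
        # Record the origin so later steps can trace the text back to its file.
        return [Document(page_content=text, metadata={"source": self.file_path})]


# Usage: write a temporary file, then load it back as a Document.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = str(Path(tmp_dir) / "example.txt")
    Path(path).write_text("Hello from a text file.", encoding="utf-8")
    docs = SimpleTextLoader(path).load()

print(docs[0].page_content)
print(docs[0].metadata)
```

The real loaders follow the same shape: construct the loader with a source, call `load()`, and receive a list of documents.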

### 1.2. The `Document` Object in LangChain

When data is loaded by a Document Loader, it is converted into one or more `Document` objects. Each `Document` object has two main attributes:

* **`page_content`:** A string containing the actual text content of the document (or a part of it).
* **`metadata`:** A dictionary containing additional information about the document, such as its source, page number, author, creation date, etc. Metadata is very useful for tracking the origin of information or filtering documents.
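The two attributes can be seen in a short sketch that builds a few `Document` objects by hand and filters them by metadata. It assumes the `langchain-core` package is installed; if it is not, a small stand-in dataclass with the same two fields is used instead. The file names and text are made up for illustration.

```python
try:
    from langchain_core.documents import Document
except ImportError:
    # Fallback stand-in (an assumption for illustration) with the same two fields.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


# Hand-built documents; the sources and page numbers are invented examples.
docs = [
    Document(page_content="Refund policy: 30 days.",
             metadata={"source": "policy.pdf", "page": 0}),
    Document(page_content="Shipping takes 3 to 5 days.",
             metadata={"source": "faq.txt"}),
    Document(page_content="Warranty lasts one year.",
             metadata={"source": "policy.pdf", "page": 1}),
]

# Metadata makes it easy to keep only the documents from one source.
policy_docs = [d for d in docs if d.metadata.get("source") == "policy.pdf"]
for d in policy_docs:
    print(d.metadata.get("page"), d.page_content)
```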



### 1.3. Why are Document Loaders Important?

* **Diverse Data Integration:** Allows you to work with data from almost any source without writing complex parsing code from scratch.
* **Format Standardization:** Converts all data types into a unified `Document` format, making subsequent processing steps (like text splitting, embedding generation) easier.
* **RAG Support:** An essential step for building RAG systems, enabling LLMs to access and utilize external knowledge.


---

## 2. Common Document Loaders

LangChain provides a rich library of Document Loaders for various data formats and sources. Here are some of the most common types:

To run the examples below, ensure your virtual environment is activated.

In [None]:
# Install the libraries if not already installed
# (the loaders are imported from the langchain-community package):
# pip install langchain langchain-community
# Specific packages for each loader will be listed below

### 2.1. `TextLoader`: Loading `.txt` Files

* **Concept:** `TextLoader` is the most basic loader, used to load content from plain text files (`.txt`).
* **When to Use:** When you have simple text data stored in `.txt` files.

In [None]:
# Install if not already installed:
# pip install langchain langchain-community

from langchain_community.document_loaders import TextLoader
import os

# Create a sample text file
file_path_txt = "sample.txt"
with open(file_path_txt, "w", encoding="utf-8") as f:
    f.write("This is a sample text document.\n")
    f.write("It contains information about Document Loaders in LangChain.\n")
    f.write("This is the third line of the text file.")

# Initialize TextLoader and load documents
loader = TextLoader(file_path_txt, encoding="utf-8")
documents_txt = loader.load()

print(f"--- Content from TextLoader ({file_path_txt}) ---")
for doc in documents_txt:
    print(f"Content: {doc.page_content[:50]}...") # Print first 50 characters
    print(f"Metadata: {doc.metadata}")
    print("-" * 20)

# Clean up sample file
os.remove(file_path_txt)

### 2.2. `PyPDFLoader`: Loading PDF Files

* **Concept:** `PyPDFLoader` uses the `pypdf` library to extract text from PDF files. It returns one `Document` per page, with the page number stored in `metadata`.
* **When to Use:** When you need to process information from PDF documents.
* **Requirement:** Requires the `pypdf` library: `pip install pypdf`.

In [None]:
# Install if not already installed:
# pip install pypdf langchain langchain-community

from langchain_community.document_loaders import PyPDFLoader
import os

# PyPDFLoader needs a real PDF. Place a valid "sample.pdf" in this directory
# to see actual page content.
pdf_file_path = "sample.pdf"

# If no PDF is present, write a plain-text placeholder so the cell still runs.
# The placeholder is not a valid PDF, so the loader is expected to raise an
# error, which the except block below reports.
created_placeholder = not os.path.exists(pdf_file_path)
if created_placeholder:
    with open(pdf_file_path, "w", encoding="utf-8") as f:
        f.write("This is just a dummy PDF file, not a real PDF.\n")
        f.write("Please replace with a valid PDF file for testing.")

try:
    # Initialize PyPDFLoader and load documents
    loader = PyPDFLoader(pdf_file_path)
    documents_pdf = loader.load()

    print(f"--- Content from PyPDFLoader ({pdf_file_path}) ---")
    if documents_pdf:
        for i, doc in enumerate(documents_pdf):
            print(f"Page {i+1} - Content: {doc.page_content[:50]}...")
            print(f"Page {i+1} - Metadata: {doc.metadata}")
            print("-" * 20)
    else:
        print("No documents loaded from PDF (file might be empty or corrupted).")

except Exception as e:
    print(f"An error occurred while loading PDF: {e}")
    print("Please ensure you have 'pypdf' installed and 'sample.pdf' is a valid PDF file.")

# Clean up only the placeholder; a PDF you supplied yourself is left in place
if created_placeholder:
    os.remove(pdf_file_path)

### 2.3. `WebBaseLoader`: Loading Content from Web Pages

* **Concept:** `WebBaseLoader` loads text content from one or more URLs. It fetches each page and uses the `BeautifulSoup4` library to parse the HTML and extract its text.
* **When to Use:** When you need to retrieve information from public web pages, blog posts, or online documentation.
* **Requirement:** Requires the `BeautifulSoup4` library: `pip install beautifulsoup4`.

In [None]:
# Install if not already installed:
# pip install beautifulsoup4 langchain langchain-community

from langchain_community.document_loaders import WebBaseLoader

# Initialize WebBaseLoader with a URL
# Use a public and stable URL for illustration
url = "https://www.langchain.com/blog/langchain-expression-language"
loader = WebBaseLoader(url)

try:
    # Load documents from URL
    documents_web = loader.load()

    print(f"--- Content from WebBaseLoader ({url}) ---")
    if documents_web:
        for doc in documents_web:
            print(f"Content (partial): {doc.page_content[:200]}...")
            print(f"Metadata: {doc.metadata}")
            print("-" * 20)
    else:
        print("No documents loaded from URL.")

except Exception as e:
    print(f"An error occurred while loading from web: {e}")
    print("Please check internet connection and valid URL.")
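The HTML-to-text step can be illustrated offline. The sketch below uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup (it is not how `WebBaseLoader` is implemented) to show the general idea: walk the markup, keep the visible text, and skip script and style contents. The `TextExtractor` class and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a script or style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


# Invented sample page; no network access is needed.
html = """
<html><head><title>Sample Page</title><style>p { color: red; }</style></head>
<body><h1>Document Loaders</h1><p>Loaders turn sources into documents.</p>
<script>console.log("ignored");</script></body></html>
"""

parser = TextExtractor()
parser.feed(html)
page_text = "\n".join(parser.chunks)
print(page_text)
```

The resulting string plays the role of `page_content`, while the URL would typically go into `metadata` as the source.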

### 2.4. `CSVLoader`: Loading Data from CSV Files

* **Concept:** `CSVLoader` loads data from CSV files. Each row becomes a separate `Document` object whose `page_content` lists that row's columns and values. In recent `langchain-community` releases, optional arguments such as `source_column`, `metadata_columns`, and `content_columns` control which columns supply the source, the metadata, and the content.
* **When to Use:** When you have structured tabular data in CSV files and want to bring it into LangChain.
* **Requirement:** No special dependencies beyond `langchain-community` for the basic `CSVLoader`. However, if you use `UnstructuredCSVLoader`, you would need `pip install unstructured`.

In [None]:
# Install if not already installed:
# pip install langchain langchain-community

from langchain_community.document_loaders import CSVLoader
import os

# Create a sample CSV file
csv_file_path = "sample.csv"
with open(csv_file_path, "w", encoding="utf-8") as f:
    f.write("ID,Product Name,Description,Price\n")
    f.write("1,Gaming Laptop,Powerful laptop for gamers,1500\n")
    f.write("2,Mechanical Keyboard,Keyboard with RGB lighting and mechanical switches,120\n")
    f.write("3,Wireless Mouse,Lightweight and precise mouse,50\n")

# Initialize CSVLoader and load documents
# Optional arguments such as content_columns or metadata_columns control
# which columns go into page_content and which go into metadata
loader = CSVLoader(csv_file_path, encoding="utf-8")
documents_csv = loader.load()

print(f"--- Content from CSVLoader ({csv_file_path}) ---")
for doc in documents_csv:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
    print("-" * 20)

# Clean up sample file
os.remove(csv_file_path)
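To see what "one document per row" means without installing anything, the following standard-library sketch approximates the output shape: each row is rendered as `column: value` lines, and the metadata records the source file and row index. This is an approximation for illustration, not the `CSVLoader` source code, so check the output of your installed version for the exact format.

```python
import csv
import io

# Inline CSV text standing in for a file on disk.
csv_text = (
    "ID,Product Name,Description,Price\n"
    "1,Gaming Laptop,Powerful laptop for gamers,1500\n"
    "2,Wireless Mouse,Lightweight and precise mouse,50\n"
)

rows_as_documents = []
for row_index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    # One "column: value" line per column, joined into a single string.
    page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
    # "sample.csv" is a placeholder source name for this sketch.
    metadata = {"source": "sample.csv", "row": row_index}
    rows_as_documents.append({"page_content": page_content, "metadata": metadata})

print(rows_as_documents[0]["page_content"])
print(rows_as_documents[0]["metadata"])
```

Keeping the column names inside `page_content` gives each row enough context to be understood on its own once it is embedded or retrieved.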

### 2.5. Other Loaders (Optional): Google Drive, Notion, YouTube, etc.

LangChain supports a wide array of Document Loaders for other diverse data sources, including:

* **Cloud Storage:** `GoogleDriveLoader`, `S3DirectoryLoader` (Amazon S3), `AzureBlobStorageContainerLoader`.
* **Productivity Tools:** `NotionDirectoryLoader`, `ConfluenceLoader`.
* **Multimedia:** `YoutubeLoader` (extracts subtitles), `AssemblyAIAudioTranscriptLoader`.
* **Databases:** `PostgresLoader`, `MongodbLoader`, `SQLDatabaseLoader`.
* **Other File Formats:** `UnstructuredFileLoader` (for complex formats like DOCX, PPTX), `EverNoteLoader`.

Using these loaders is similar to the examples above, but they often require additional Python libraries specific to each data source (e.g., `google-api-python-client` for Google Drive, `youtube-transcript-api` for `YoutubeLoader`).

You can find a complete list and detailed instructions for each loader in the official LangChain documentation: [https://python.langchain.com/docs/modules/data_connection/document_loaders/](https://python.langchain.com/docs/modules/data_connection/document_loaders/)


---

## 3. Practical Example: Loading Data from Various Formats

To summarize, let's consider a scenario where we need to load data from multiple different sources and process them.

In [None]:
# Install all necessary libraries for this example
# pip install langchain langchain-community pypdf beautifulsoup4 reportlab

import os
from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader, CSVLoader

# --- Prepare sample files ---
# TXT File
txt_content = "This is important information from a text note.\nIt talks about the benefits of learning programming."
with open("note.txt", "w", encoding="utf-8") as f:
    f.write(txt_content)

# CSV File
csv_content = "Product,Price,Quantity\nLaptop,1200,50\nPhone,800,120\nHeadphones,50,300"
with open("products.csv", "w", encoding="utf-8") as f:
    f.write(csv_content)

# PDF file: generated with reportlab when it is installed; otherwise a
# plain-text placeholder is written, which PyPDFLoader will reject in step 4
pdf_test_path = "document.pdf"
try:
    # Try to create a simple PDF using reportlab if available
    from reportlab.pdfgen import canvas
    c = canvas.Canvas(pdf_test_path)
    c.drawString(100, 750, "This is content from a sample PDF file.")
    c.drawString(100, 730, "It is generated for illustration purposes.")
    c.save()
except ImportError:
    with open(pdf_test_path, "w") as f:
        f.write("This is a dummy PDF file. Please replace with a real PDF.\n")
    print("Could not create real PDF. Using dummy file.")


# --- Load data using different loaders ---

all_documents = []

# 1. Load from TXT
print("Loading from note.txt...")
txt_loader = TextLoader("note.txt", encoding="utf-8")
all_documents.extend(txt_loader.load())

# 2. Load from CSV
print("Loading from products.csv...")
csv_loader = CSVLoader("products.csv", encoding="utf-8")
all_documents.extend(csv_loader.load())

# 3. Load from Web (using a public URL)
print("Loading from a web page...")
web_loader = WebBaseLoader("https://www.langchain.com/blog")
try:
    all_documents.extend(web_loader.load())
except Exception as e:
    print(f"Error loading from web: {e}. Skipping web load.")

# 4. Load from PDF
print("Loading from document.pdf...")
try:
    pdf_loader = PyPDFLoader(pdf_test_path)
    all_documents.extend(pdf_loader.load())
except Exception as e:
    print(f"Error loading from PDF: {e}. Skipping PDF load.")


print("\n--- All loaded documents ---")
for i, doc in enumerate(all_documents):
    print(f"Document {i+1}:")
    print(f"  Source: {doc.metadata.get('source', 'Unknown')}")
    print(f"  Content (partial): {doc.page_content[:150]}...")
    print(f"  Metadata: {doc.metadata}")
    print("=" * 50)

# --- Clean up sample files ---
os.remove("note.txt")
os.remove("products.csv")
os.remove(pdf_test_path)
print("\nSample files removed.")

**Explanation:**
This example illustrates how you can use various Document Loaders to collect data from diverse sources into a unified list of `Document` objects. This sets the stage for subsequent processing steps like text splitting and embedding generation, which are fundamental to RAG applications.
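Once everything is in one list, a quick sanity check is to count how many documents each source contributed. The sketch below is one way to do that; `count_by_source` is a hypothetical helper, and `SimpleNamespace` objects stand in for loaded documents so the snippet runs on its own. It relies only on each item having a `metadata` dictionary.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_source(documents):
    """Return a Counter mapping each metadata 'source' to its document count."""
    return Counter(doc.metadata.get("source", "Unknown") for doc in documents)


# Stand-ins for loaded documents (three CSV rows, one text note, one without a source).
sample_documents = [
    SimpleNamespace(page_content="Note text", metadata={"source": "note.txt"}),
    SimpleNamespace(page_content="Product: Laptop", metadata={"source": "products.csv", "row": 0}),
    SimpleNamespace(page_content="Product: Phone", metadata={"source": "products.csv", "row": 1}),
    SimpleNamespace(page_content="Product: Headphones", metadata={"source": "products.csv", "row": 2}),
    SimpleNamespace(page_content="Untracked text", metadata={}),
]

source_counts = count_by_source(sample_documents)
for source, count in source_counts.items():
    print(f"{source}: {count} document(s)")
```

Passing a real list of loaded `Document` objects to `count_by_source` should work the same way, since it reads only the `metadata` attribute.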


---

## Lesson Summary

This lesson introduced **Document Loaders** in LangChain, essential tools for loading data from external sources into the **`Document`** format that LangChain can process. We learned about the structure of the `Document` object (consisting of `page_content` and `metadata`) and the importance of Document Loaders in integrating diverse data and building RAG systems. The lesson also delved into common Document Loader types such as **`TextLoader`**, **`PyPDFLoader`**, **`WebBaseLoader`**, and **`CSVLoader`**, along with specific practical examples for each. Finally, a comprehensive example illustrated how to load data from multiple formats into a unified processing flow, laying the groundwork for building more complex LLM applications.