# Document Loaders and Text Splitters in LangChain

In our previous lessons, we explored multimodal capabilities with images. Now, let's dive into handling different types of documents using LangChain. We'll learn about document loaders and how to effectively process document data.

## Types of Document Loaders

LangChain's document loaders fall into two broad categories:
1. **File Loaders**: For loading local files (CSV, PDF, TXT, etc.)
2. **Web Loaders**: For loading content from web sources

Let's start by importing the necessary libraries:

In [None]:
from langchain_community.document_loaders import CSVLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv
import os

# Load environment variables (such as OPENAI_API_KEY) from a local .env file
load_dotenv()

## Understanding Text Splitting

Before loading any documents, let's look at text splitting. Language models accept only a limited number of tokens per request, so large documents have to be broken into smaller chunks before they are passed to a model.

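LangChain provides several text splitters:
- RecursiveCharacterTextSplitter: tries a list of separators in order (paragraphs, then lines, then words, then single characters) until the chunks are small enough
- CharacterTextSplitter: splits on a single separator
- TokenTextSplitter: measures chunk size in tokens rather than characters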

We'll use RecursiveCharacterTextSplitter as it's the most versatile option.
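
To make the "recursive" idea concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores chunk overlap, and it only illustrates the strategy of trying coarse separators first and falling back to finer ones for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # This piece alone is too long: retry with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
demo_chunks = recursive_split(sample, chunk_size=30)
for chunk in demo_chunks:
    print(repr(chunk))
```

The short paragraphs survive intact, while the long one is broken at word boundaries instead of mid-word.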

In [None]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Maximum number of characters per chunk
    chunk_overlap=50,  # Characters shared between consecutive chunks
    length_function=len,
)

print("Text splitter initialized with chunk size of 500 and overlap of 50 characters")
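
To build intuition for how `chunk_size` and `chunk_overlap` interact, the following sketch uses a plain sliding window. The helper `fixed_size_chunks` is hypothetical and deliberately simpler than the real splitter, which cuts at separators, so its chunks are usually shorter than `chunk_size` and its overlap is approximate.

```python
def fixed_size_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 1200 characters of filler text, just to have something to cut
sample_text = "".join(str(i % 10) for i in range(1200))
window_chunks = fixed_size_chunks(sample_text)

for i, chunk in enumerate(window_chunks):
    print(f"chunk {i}: {len(chunk)} characters")

# The tail of one chunk is repeated at the head of the next
print(window_chunks[0][-50:] == window_chunks[1][:50])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.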

## File Loader Example: CSV

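Let's start by loading a CSV file. We'll use a sample customer dataset, `customers-100.csv`, which is expected in the working directory. `CSVLoader` creates one document per row.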

In [None]:
def load_and_process_csv():
    # Initialize the CSV loader
    loader = CSVLoader(
        file_path="customers-100.csv",
        csv_args={
            'delimiter': ',',
            'quotechar': '"',
        }
    )
    
    # Load the documents
    documents = loader.load()
    print(f"Loaded {len(documents)} documents")
    
    # Split documents into chunks
    splits = text_splitter.split_documents(documents)
    print(f"Created {len(splits)} splits")
    
    # Keep only the first 5 chunks to stay within token limits;
    # the model will not see any of the remaining rows
    limited_splits = splits[:5]
    
    return "\n\n".join([doc.page_content for doc in limited_splits])

context = load_and_process_csv()
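
It helps to know what each loaded document looks like. The sketch below approximates `CSVLoader`'s behavior with the standard library: one document per row, with the row rendered as `column: value` lines. The column names and values are made up for illustration and are not taken from `customers-100.csv`, and plain dictionaries stand in for LangChain's `Document` objects.

```python
import csv
import io

# Hypothetical sample data, not the real customer file
sample_csv = io.StringIO(
    "Index,First Name,Email\n"
    "1,Ada,ada@example.com\n"
    '2,Grace,"grace@example.com"\n'
)


def rows_to_documents(file_obj, source="sample.csv"):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    reader = csv.DictReader(file_obj, delimiter=",", quotechar='"')
    for row_number, row in enumerate(reader):
        # One "column: value" line per field
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


sample_docs = rows_to_documents(sample_csv)
print(sample_docs[0]["page_content"])
print(sample_docs[0]["metadata"])
```

Because each row is typically much shorter than 500 characters, the text splitter usually passes these documents through unchanged, which is why the number of splits tends to match the number of rows.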

Now that we have our processed CSV data, let's set up a simple QA chain to query it:

In [None]:
def setup_qa_chain():
    # Initialize OpenAI model
    llm = OpenAI(temperature=0)
    
    # Create prompt template
    prompt = PromptTemplate(
        template="""Based on the following customer data, please answer the question.
        
        Customer Data:
        {context}
        
        Question: {question}
        
        Answer: """,
        input_variables=["context", "question"]
    )
    
    return prompt | llm

# Set up the chain
qa_chain = setup_qa_chain()

# Test with a sample question (the answer must appear in the first 5 rows,
# because only those rows were kept in the context)
question = "What is Sheryl's email address?"
response = qa_chain.invoke({"context": context, "question": question})
print(f"Question: {question}")
print(f"Answer: {response}")
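
The expression `prompt | llm` composes two steps: fill in the template, then send the resulting text to the model. The sketch below mimics only that data flow with plain functions. `format_prompt`, `fake_llm`, and `pipe` are hypothetical stand-ins, no API call is made, and LangChain's actual runnable machinery does considerably more.

```python
TEMPLATE = """Based on the following customer data, please answer the question.

Customer Data:
{context}

Question: {question}

Answer: """


def format_prompt(inputs):
    """Step 1: substitute the input values into the template."""
    return TEMPLATE.format(**inputs)


def fake_llm(prompt_text):
    """Step 2 (stand-in): a real model would generate an answer here."""
    return f"[model would receive {len(prompt_text)} characters]"


def pipe(*steps):
    """Chain steps so that each one receives the previous step's output."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


sketch_chain = pipe(format_prompt, fake_llm)
sketch_inputs = {"context": "Name: Ada", "question": "Who is listed?"}

print(format_prompt(sketch_inputs))
print(sketch_chain(sketch_inputs))
```

Printing the formatted prompt before calling a real model is a cheap way to check exactly what text the model will see.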

## Web Loader Example

Now let's look at how we can load content from web pages. We'll use the WebBaseLoader to fetch and process web content.

In [None]:
def load_and_process_webpage(url):
    # Initialize web loader with a custom User-Agent header.
    # TLS certificate verification stays enabled (the default).
    loader = WebBaseLoader(
        url,
        header_template={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
    )
    
    # Load and process the webpage
    documents = loader.load()
    print(f"Loaded webpage with {len(documents)} documents")
    
    # Split the content
    splits = text_splitter.split_documents(documents)
    print(f"Created {len(splits)} splits")
    
    # Keep only the first 3 chunks to stay within token limits
    limited_splits = splits[:3]
    return "\n\n".join([doc.page_content for doc in limited_splits])

# Test with a sample URL
url = "https://en.wikipedia.org/wiki/LangChain"
web_context = load_and_process_webpage(url)

# Test with some questions
questions = [
    "What is LangChain?",
    "What are the main features of LangChain?"
]

for question in questions:
    response = qa_chain.invoke({"context": web_context, "question": question})
    print(f"\nQuestion: {question}")
    print(f"Answer: {response}")
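
`WebBaseLoader` relies on BeautifulSoup to turn the fetched HTML into plain text before the splitter sees it. As a rough stand-in for that step, the sketch below extracts text with only the standard library. The `TextExtractor` class and the decision to skip `script` and `style` content are choices of this sketch, not a description of the loader's exact behavior.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
page_text = html_to_text(sample_html)
print(page_text)
```

Real pages also contain navigation menus and footers, so the first few chunks of a page are not always the most informative ones.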

## Key Takeaways

1. Document loaders help us process different types of documents (files and web content)
2. Text splitting is crucial for handling large documents effectively
3. The same QA chain can be used with different types of document loaders
4. RecursiveCharacterTextSplitter is versatile for most use cases