# Document Loaders

Document loaders in LangChain read data from a source and return it as `Document` objects, each holding the extracted text (`page_content`) and a `metadata` dictionary. They turn unstructured data (e.g., PDFs, web pages, text files) into a uniform representation that can then be split, indexed, queried, or passed to an LLM (large language model).

## Common Document Loaders in LangChain
- TextLoader: Loads a plain text file and returns its contents as a single Document.
- PyPDFLoader: Loads a PDF file and extracts its text, one Document per page.
- CSVLoader: Reads a CSV file and returns one Document per row, with each row rendered as `column: value` lines.
- WebBaseLoader: Loads content from web pages via URLs, extracting text from HTML.
- Docx2txtLoader: Loads DOCX (Word) files and extracts the text from them.
- JSONLoader: Loads JSON files and turns the fields selected by a jq schema into Documents.
- NotebookLoader: Loads Jupyter notebooks and extracts the contents of their cells.
- BSHTMLLoader: Loads HTML files, extracting readable text while ignoring HTML tags.
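
Whatever the source format, the loaders used in this notebook return a list of records that pair a text payload (`page_content`) with a `metadata` dictionary, as the cell outputs further down show. The following is a minimal, standard-library sketch of that contract. `SimpleDocument` and `load_text_file` are hypothetical names made up for illustration; they are not LangChain classes or functions.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's output record (not LangChain's class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one text file and wrap its contents in a single document."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


def demo() -> list[SimpleDocument]:
    """Write a throwaway file, then load it back through the sketch loader."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "data.txt"
        sample.write_text("TESTING FILE", encoding="utf-8")
        return load_text_file(str(sample))


docs = demo()
print(docs[0].page_content)
```

Returning a list even for a single file keeps the calling code identical for loaders that produce many documents, such as one per CSV row or per PDF page.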

## Text Files

In [21]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("files/data.txt")
documents = loader.load()

print(documents[0].page_content)

TESTING FILE


- Handling directories

In [22]:
from langchain_community.document_loaders import DirectoryLoader

directory_loader = DirectoryLoader(
    "./files/",
    glob="**/*.txt",
    loader_cls=TextLoader
)
documents = directory_loader.load()

print(documents)

[Document(metadata={'source': 'files/data_2.txt.txt'}, page_content='TESTING FILE'), Document(metadata={'source': 'files/data.txt'}, page_content='TESTING FILE')]
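
The `glob="**/*.txt"` pattern selects `.txt` files in the given directory and in any of its subdirectories. As a rough illustration of which paths such a pattern picks up, here is a standard-library sketch using `pathlib`. The helper `find_text_files` and the file names are made up for the example, and `DirectoryLoader`'s own matching may differ in details (for instance in how it treats hidden files).

```python
import tempfile
from pathlib import Path


def find_text_files(root: str, pattern: str = "**/*.txt") -> list[str]:
    """Return files under root that match pattern, relative to root and sorted."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file()
    )


def demo() -> list[str]:
    """Build a small directory tree and list the files the pattern matches."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "nested").mkdir()
        (base / "data.txt").write_text("TESTING FILE")
        (base / "nested" / "notes.txt").write_text("TESTING FILE")
        (base / "data.csv").write_text("a,b\n1,2\n")  # ignored: not a .txt file
        return find_text_files(tmp)


matched = demo()
print(matched)
```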


## CSV Files

In [23]:
from langchain_community.document_loaders import CSVLoader

csv_loader = CSVLoader("files/data.csv")
documents = csv_loader.load()

documents

[Document(metadata={'source': 'files/data.csv', 'row': 0}, page_content='Years: 2011\nHistorical Data: 11.1%\nTarget: '),
 Document(metadata={'source': 'files/data.csv', 'row': 1}, page_content='Years: 2012\nHistorical Data: 11.5%\nTarget: '),
 Document(metadata={'source': 'files/data.csv', 'row': 2}, page_content='Years: 2013\nHistorical Data: 11.0%\nTarget: '),
 Document(metadata={'source': 'files/data.csv', 'row': 3}, page_content='Years: 2014\nHistorical Data: 11.2%\nTarget: '),
 Document(metadata={'source': 'files/data.csv', 'row': 4}, page_content='Years: 2015\nHistorical Data: \nTarget: '),
 Document(metadata={'source': 'files/data.csv', 'row': 5}, page_content='Years: 2016\nHistorical Data: \nTarget: '),
 Document(metadata={'source': 'files/data.csv', 'row': 6}, page_content='Years: 2017\nHistorical Data: \nTarget: '),
 Document(metadata={'source': 'files/data.csv', 'row': 7}, page_content='Years: 2018\nHistorical Data: \nTarget: '),
 Document(metadata={'source': 'files/data.cs
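
The output above suggests the shape of the transformation: one document per row, `column: value` lines as the content, and the source path plus row index as metadata. A minimal standard-library sketch of that mapping follows. `rows_to_documents` is a hypothetical helper, plain dictionaries stand in for `Document` objects, and the sample rows are modeled on the output above rather than read from the actual file, which is not shown here.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field; empty cells leave the value blank.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


# Sample rows modeled on the output above (assumed, not the real file contents).
sample = "Years,Historical Data,Target\n2011,11.1%,\n2015,,\n"
docs = rows_to_documents(sample, source="files/data.csv")
print(docs[0]["page_content"])
```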

## PDFs

In [24]:
from langchain_community.document_loaders import PyPDFLoader

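loader = PyPDFLoader("files/data.pdf")
# load_and_split() loads the PDF page by page and then runs a text splitter over the result,
# so a long page can yield several chunks; use loader.load() for exactly one Document per page.
pages = loader.load_and_split()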

pages[:2]

[Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2022-06-24T00:24:36+00:00', 'author': '', 'keywords': '', 'moddate': '2022-06-24T00:24:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'files/data.pdf', 'total_pages': 34, 'page': 0, 'page_label': '1'}, page_content='FlashAttention: Fast and Memory-Eﬃcient Exact Attention\nwith IO-Awareness\nTri Daoy, Daniel Y. Fuy, Stefano Ermony, Atri Rudraz, and Christopher Réy\nyDepartment of Computer Science, Stanford University\nzDepartment of Computer Science and Engineering, University at Buﬀalo, SUNY\n{trid,danfu}@cs.stanford.edu, ermon@stanford.edu, atri@buffalo.edu,\nchrismre@cs.stanford.edu\nJune 24, 2022\nAbstract\nTransformers are slow and memory-hungry on long sequences, since the time and memory complexity\nof self-attention are quadratic in sequence length. A

In [42]:
# Using UnstructuredPDFLoader (requires the unstructured package)
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("files/data.pdf")
documents = loader.load()

## Web Pages


In [43]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://example.com")
documents = loader.load()

# Loading multiple URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]
multi_loader = WebBaseLoader(urls)
documents = multi_loader.load()

documents

USER_AGENT environment variable not set, consider setting it to identify your requests.


[Document(metadata={'source': 'https://example.com/page1', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
 Document(metadata={'source': 'https://example.com/page2', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')]
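
The `page_content` above is what remains of the page once the markup is stripped away. To make that step concrete, here is a small sketch that extracts visible text with Python's built-in `html.parser`. It is not how `WebBaseLoader` is implemented (that loader relies on BeautifulSoup), and unlike the output above it trims whitespace. `TextExtractor` and `html_to_text` are hypothetical names, and the sample page is a shortened stand-in.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text nodes of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Shortened stand-in for a fetched page (assumed markup, for illustration only).
page = (
    "<html><head><title>Example Domain</title>"
    "<style>body { margin: 0; }</style></head>"
    "<body><h1>Example Domain</h1><p>More information...</p></body></html>"
)
text = html_to_text(page)
print(text)
```

Note that the title text is kept, which is consistent with "Example Domain" appearing twice in the loader output above.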

## Databases

In [44]:
import sqlite3

import requests
from langchain_community.document_loaders import SQLDatabaseLoader
from langchain_community.utilities.sql_database import SQLDatabase
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

# db from https://github.com/lerocha/chinook-database

def get_engine_for_chinook_db():
    """Pull sql file, populate in-memory database, and create engine."""
    url = "https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early if the download did not succeed
    sql_script = response.text

    connection = sqlite3.connect(":memory:", check_same_thread=False)
    connection.executescript(sql_script)
    return create_engine(
        "sqlite://",
        creator=lambda: connection,
        poolclass=StaticPool,
        connect_args={"check_same_thread": False},
    )


engine = get_engine_for_chinook_db()

db = SQLDatabase(engine)

# Define your query string
query = "SELECT Name FROM Artist WHERE Name LIKE 'A%'"
loader = SQLDatabaseLoader(db=db, query=query)

documents = loader.load()
documents

[Document(metadata={}, page_content='Name: AC/DC'),
 Document(metadata={}, page_content='Name: Accept'),
 Document(metadata={}, page_content='Name: Aerosmith'),
 Document(metadata={}, page_content='Name: Alanis Morissette'),
 Document(metadata={}, page_content='Name: Alice In Chains'),
 Document(metadata={}, page_content='Name: Antônio Carlos Jobim'),
 Document(metadata={}, page_content='Name: Apocalyptica'),
 Document(metadata={}, page_content='Name: Audioslave'),
 Document(metadata={}, page_content='Name: Azymuth'),
 Document(metadata={}, page_content='Name: A Cor Do Som'),
 Document(metadata={}, page_content='Name: Aquaman'),
 Document(metadata={}, page_content="Name: Aerosmith & Sierra Leone's Refugee Allstars"),
 Document(metadata={}, page_content='Name: Avril Lavigne'),
 Document(metadata={}, page_content='Name: Aisha Duo'),
 Document(metadata={}, page_content='Name: Aaron Goldberg'),
 Document(metadata={}, page_content='Name: Alberto Turco & Nova Schola Gregoriana'),
 Document(m
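
Each row of the query result becomes one document whose content lists the selected columns as `column: value` lines. The same mapping can be sketched with nothing but the built-in `sqlite3` module. `query_to_documents` is a hypothetical helper, plain dictionaries stand in for `Document` objects, and the three-row `Artist` table is a tiny stand-in for the Chinook database, not its real contents.

```python
import sqlite3


def query_to_documents(connection: sqlite3.Connection, query: str) -> list[dict]:
    """Run a query and turn each result row into a document-like dict."""
    cursor = connection.execute(query)
    columns = [description[0] for description in cursor.description]
    documents = []
    for row in cursor:
        # One "column: value" line per selected column, as in the output above.
        content = "\n".join(f"{col}: {val}" for col, val in zip(columns, row))
        documents.append({"page_content": content, "metadata": {}})
    return documents


# Tiny stand-in table (assumed sample rows, not the Chinook data set).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
connection.executemany(
    "INSERT INTO Artist (Name) VALUES (?)",
    [("AC/DC",), ("Accept",), ("Example Band",)],
)
docs = query_to_documents(
    connection,
    "SELECT Name FROM Artist WHERE Name LIKE 'A%' ORDER BY ArtistId",
)
print([d["page_content"] for d in docs])
connection.close()
```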