# Document & Document Loader


## Overview

This tutorial covers the fundamental methods for loading Documents.

By completing this tutorial, you will learn how to load Documents and check their content and associated metadata.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Document](#document)
- [Document Loader](#document-loader)


## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "pypdf",
    ],
    verbose=False,
    upgrade=False,
)

## Document

`Document` is a class that stores a piece of text and its associated metadata.

- `page_content` (Required): Stores a piece of text as a string.
- `metadata` (Optional): Stores metadata related to `page_content` as a dictionary.

In [3]:
from langchain_core.documents import Document

document = Document(page_content="Hello, welcome to LangChain Open Tutorial!")

# Check the attributes using __dict__
document.__dict__

{'id': None,
 'metadata': {},
 'page_content': 'Hello, welcome to LangChain Open Tutorial!',
 'type': 'Document'}

The metadata is empty. Let's add some values.

In [4]:
# Add metadata
document.metadata["source"] = "./example-file.pdf"
document.metadata["page"] = 0

# Check metadata
document.metadata

{'source': './example-file.pdf', 'page': 0}
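
Because `metadata` is an ordinary dictionary, documents can be filtered or grouped by any key it holds. The sketch below uses a small stand-in dataclass, `SimpleDoc` (hypothetical, not part of LangChain), that mirrors only the two fields described above, so it runs without any installed packages. The same dictionary operations should apply to real `Document` objects.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in mirroring Document's two main fields."""

    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [
    SimpleDoc("Cover page", {"source": "./example-file.pdf", "page": 0}),
    SimpleDoc("Preface", {"source": "./example-file.pdf", "page": 1}),
    SimpleDoc("No metadata yet"),
]

# .get() avoids a KeyError for documents that lack the key
first_page_docs = [d for d in docs if d.metadata.get("page") == 0]
missing_source = [d for d in docs if "source" not in d.metadata]

print([d.page_content for d in first_page_docs])
print([d.page_content for d in missing_source])
```

Using `.get()` rather than indexing keeps the filter safe when some documents were created without metadata.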

## Document Loader

A Document Loader is a class that loads Documents from various sources.

Some examples of Document Loaders are listed below.

- `PyPDFLoader`: Loads PDF files
- `CSVLoader`: Loads CSV files
- `UnstructuredHTMLLoader`: Loads HTML files
- `JSONLoader`: Loads JSON files
- `TextLoader`: Loads text files
- `DirectoryLoader`: Loads documents from a directory
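
One way to picture how these loaders relate to file types is a lookup from file extension to loader name. The helper `pick_loader_name` below is hypothetical and only returns the class name as a string. It is not a LangChain API, and both the extension mapping and the rule that a path without an extension is a directory are assumptions made for illustration.

```python
from pathlib import Path

# Assumed mapping from file extension to the loader names listed above
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".html": "UnstructuredHTMLLoader",
    ".json": "JSONLoader",
    ".txt": "TextLoader",
}


def pick_loader_name(path: str) -> str:
    """Return the name of a loader suited to the path (hypothetical helper)."""
    p = Path(path)
    if p.suffix == "":
        # Assumption: a path without an extension is a directory
        return "DirectoryLoader"
    try:
        return LOADER_BY_EXTENSION[p.suffix.lower()]
    except KeyError:
        raise ValueError(f"No loader registered for {p.suffix!r}") from None


print(pick_loader_name("/content/01-document-loader-sample.pdf"))
print(pick_loader_name("./data"))
```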

Now, let's learn how to load Documents.

In [7]:
# Example file path
FILE_PATH = "/content/01-document-loader-sample.pdf"

In [16]:
from langchain_community.document_loaders import PyPDFLoader

# Set up the loader
loader = PyPDFLoader(FILE_PATH)

### load()

- Loads Documents and returns them as a `list[Document]`.

In [17]:
# Load Documents
docs = loader.load()

In [18]:
# Check the number of loaded Documents
len(docs)

48

In [19]:
# Check Documents
docs[0:2]

[Document(metadata={'producer': 'Microsoft® Word 2010', 'creator': 'Microsoft® Word 2010', 'creationdate': '2016-10-11T15:32:07-04:00', 'author': 'NITRD AI Task Force', 'keywords': 'Artificial Intelligence, AI, Machine Learning, ML, Deep Learning, DL, Neural Networks,', 'moddate': '2016-10-11T20:19:58-04:00', 'title': 'The National Artificial Intelligence Research and Development Strategic Plan', 'source': '/content/01-document-loader-sample.pdf', 'total_pages': 48, 'page': 0, 'page_label': '1'}, page_content='October 2016 \n \n \n \n \n \n \n \n \n \nTHE NATIONAL  \nARTIFICIAL INTELLIGENCE \nRESEARCH AND DEVELOPMENT \nSTRATEGIC PLAN  \nNational Science and Technology Council \n \nNetworking and Information Technology \nResearch and Development Subcommittee'),
 Document(metadata={'producer': 'Microsoft® Word 2010', 'creator': 'Microsoft® Word 2010', 'creationdate': '2016-10-11T15:32:07-04:00', 'author': 'NITRD AI Task Force', 'keywords': 'Artificial Intelligence, AI, Machine Learning, 

### aload()

- Asynchronously loads Documents and returns them as a `list[Document]`.

In [26]:
# Load Documents asynchronously
docs = await loader.aload()
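
The top-level `await` above works in a notebook because IPython supports it. In a plain Python script, the coroutine has to be driven by `asyncio.run()` instead. The sketch below shows that pattern with a hypothetical stand-in coroutine, `fake_aload`, in place of `loader.aload()`, and uses `asyncio.gather()` to await several loads concurrently.

```python
import asyncio


async def fake_aload(name: str, n_pages: int) -> list[str]:
    """Hypothetical stand-in for `loader.aload()`; returns page labels."""
    await asyncio.sleep(0.01)  # simulate waiting on I/O
    return [f"{name}-page-{i}" for i in range(n_pages)]


async def main() -> list[list[str]]:
    # gather() awaits both coroutines concurrently and keeps argument order
    return await asyncio.gather(fake_aload("a", 2), fake_aload("b", 3))


results = asyncio.run(main())
print([len(pages) for pages in results])
```

Inside a notebook, use `await main()` rather than `asyncio.run(main())`, since `asyncio.run()` cannot be called while an event loop is already running.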

In [21]:
len(docs)

48

In [22]:
docs[0:2]

[Document(metadata={'producer': 'Microsoft® Word 2010', 'creator': 'Microsoft® Word 2010', 'creationdate': '2016-10-11T15:32:07-04:00', 'author': 'NITRD AI Task Force', 'keywords': 'Artificial Intelligence, AI, Machine Learning, ML, Deep Learning, DL, Neural Networks,', 'moddate': '2016-10-11T20:19:58-04:00', 'title': 'The National Artificial Intelligence Research and Development Strategic Plan', 'source': '/content/01-document-loader-sample.pdf', 'total_pages': 48, 'page': 0, 'page_label': '1'}, page_content='October 2016 \n \n \n \n \n \n \n \n \n \nTHE NATIONAL  \nARTIFICIAL INTELLIGENCE \nRESEARCH AND DEVELOPMENT \nSTRATEGIC PLAN  \nNational Science and Technology Council \n \nNetworking and Information Technology \nResearch and Development Subcommittee'),
 Document(metadata={'producer': 'Microsoft® Word 2010', 'creator': 'Microsoft® Word 2010', 'creationdate': '2016-10-11T15:32:07-04:00', 'author': 'NITRD AI Task Force', 'keywords': 'Artificial Intelligence, AI, Machine Learning, 

### load_and_split()

- Loads Documents, splits them into chunks using a `TextSplitter`, and returns the chunks as a `list[Document]`. If `text_splitter` is omitted, a `RecursiveCharacterTextSplitter` with default settings is used.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Set up the TextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=0)

# Split Documents into chunks
docs = loader.load_and_split(text_splitter=text_splitter)

In [None]:
# Check the number of chunks
len(docs)

1441
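
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below splits a string into fixed-width windows. This is a deliberately naive stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators such as newlines and spaces). The function `split_fixed` is hypothetical.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))
```

With no overlap, each character lands in exactly one chunk. With an overlap of 2, neighbouring chunks share two characters. A small `chunk_size` such as 128 is why the 48 pages above yield 1441 chunks.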

In [None]:
# Check Documents
docs[0:10]

[Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 0}, page_content='October  2016 \n \n \n \n \n \n \n \n \n \nTHE NATIONAL  \nARTIFICIAL INTELLIGENCE \nRESEARCH AND DEVELOPMENT \nSTRATEGIC PLAN'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 0}, page_content='National Science and Technology Council  \n \nNetworking and Information Technology \nResearch and Development Subcommittee'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 1}, page_content='ii'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='iii About the National Science and Technology Council'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='The National Science and Technology Council (NSTC) is the principal means by which the Executive'),
 Document(metadata={'source': './data/01-document-loader-sample.pdf', 'page': 2}, page_content='Bran

### lazy_load()

- Loads Documents lazily, one at a time, and returns them as an `Iterator[Document]`.

In [None]:
loader.lazy_load()

<generator object PyPDFLoader.lazy_load at 0x000001902A0117B0>

The output shows that this method returns a `generator`, a special type of iterator that produces values on the fly without storing them all in memory at once.
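
The on-demand behavior can be seen with a plain Python generator. In this sketch, `lazy_pages` is a hypothetical stand-in for `lazy_load()`, and the `pages_read` list exists only to record when each page is actually produced.

```python
from typing import Iterator

pages_read: list[int] = []  # records which pages were actually produced


def lazy_pages(n_pages: int) -> Iterator[str]:
    """Hypothetical stand-in for lazy_load(): yields one page at a time."""
    for i in range(n_pages):
        pages_read.append(i)
        yield f"page {i}"


gen = lazy_pages(48)
print(pages_read)  # creating the generator reads nothing
first = next(gen)
print(first, pages_read)  # only the requested page has been read
```

Creating the generator does no work. Each page is produced only when the consumer asks for the next item.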

In [None]:
# Load Documents sequentially
docs = loader.lazy_load()
for doc in docs:
    print(doc.metadata)
    break  # Used to limit the output length

{'source': './data/01-document-loader-sample.pdf', 'page': 0}
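
When more than one item is needed, `itertools.islice` from the standard library is an alternative to `break`: it takes the first N items from any iterator without consuming the rest. The sketch uses a hypothetical generator, `fake_lazy_load`, as a stand-in for `loader.lazy_load()`, and the file name `sample.pdf` is a placeholder.

```python
from itertools import islice
from typing import Iterator


def fake_lazy_load(n_pages: int) -> Iterator[dict]:
    """Hypothetical stand-in yielding metadata dicts page by page."""
    for page in range(n_pages):
        yield {"source": "sample.pdf", "page": page}  # placeholder source


doc_iter = fake_lazy_load(48)
first_three = list(islice(doc_iter, 3))  # only three pages are produced
print([m["page"] for m in first_three])
```

The iterator keeps its position, so a later `next(doc_iter)` would continue from the fourth page rather than start over.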


### alazy_load()

- Asynchronously loads Documents lazily, one at a time, and returns them as an `AsyncIterator[Document]`.

In [None]:
loader.alazy_load()

<async_generator object BaseLoader.alazy_load at 0x000001902A00B140>

The output shows that this method returns an `async_generator`, a special type of asynchronous iterator that produces values on the fly without storing them all in memory at once.
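
An async generator is consumed with `async for`, which must itself run inside a coroutine. The sketch below shows the pattern for a plain script, with a hypothetical stand-in, `fake_alazy_load`, in place of `loader.alazy_load()`.

```python
import asyncio
from typing import AsyncIterator


async def fake_alazy_load(n_pages: int) -> AsyncIterator[int]:
    """Hypothetical stand-in for alazy_load(): yields page numbers."""
    for page in range(n_pages):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield page


async def first_pages(limit: int) -> list[int]:
    collected = []
    async for page in fake_alazy_load(48):
        collected.append(page)
        if len(collected) == limit:
            break  # stop early; remaining pages are never produced
    return collected


pages = asyncio.run(first_pages(2))
print(pages)
```

In a notebook, `await first_pages(2)` replaces the `asyncio.run()` call.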

In [27]:
# Load Documents asynchronously and sequentially
docs = loader.alazy_load()
async for doc in docs:
    print(doc.metadata)
    break  # Used to limit the output length

{'producer': 'Microsoft® Word 2010', 'creator': 'Microsoft® Word 2010', 'creationdate': '2016-10-11T15:32:07-04:00', 'author': 'NITRD AI Task Force', 'keywords': 'Artificial Intelligence, AI, Machine Learning, ML, Deep Learning, DL, Neural Networks,', 'moddate': '2016-10-11T20:19:58-04:00', 'title': 'The National Artificial Intelligence Research and Development Strategic Plan', 'source': '/content/01-document-loader-sample.pdf', 'total_pages': 48, 'page': 0, 'page_label': '1'}
