# Document & Document Loader

This tutorial covers the fundamental methods for loading Documents.

By completing this tutorial, you will learn how to load Documents and check their content and associated metadata.


- Required libraries

```bash
pip install langchain_community langchain_openai langchain_text_splitters python-dotenv pypdf
```

In [1]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

load_dotenv(override=True, dotenv_path="../.env")

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0)

## Document

A class that stores a piece of text and its associated metadata.

-   `page_content` (required): Stores the text as a string.
-   `metadata` (optional): Stores metadata related to `page_content` as a dictionary.
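
To make the required/optional distinction concrete, here is a minimal sketch that uses a plain dataclass. `SimpleDocument` is a hypothetical stand-in written for illustration only; it is not the LangChain `Document` class and leaves out everything except the two fields described above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for Document, for illustration only."""

    page_content: str  # required: the text itself
    # optional: every instance gets its own empty dict by default
    metadata: dict[str, Any] = field(default_factory=dict)


# Only page_content is given, so metadata starts out empty.
doc_a = SimpleDocument(page_content="Hello")

# Metadata can also be supplied at construction time.
doc_b = SimpleDocument(
    page_content="World",
    metadata={"source": "./example-file.pdf", "page": 0},
)

# Metadata is an ordinary dict and can be updated afterwards.
doc_a.metadata["page"] = 1
```

Leaving out `page_content` raises a `TypeError` in this sketch, which mirrors the idea that the text is mandatory while the metadata is not.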


In [1]:
from langchain_core.documents import Document

document = Document(page_content="Hello, welcome to LangChain Open Tutorial!")

In [2]:
document

Document(metadata={}, page_content='Hello, welcome to LangChain Open Tutorial!')

In [4]:
document.__dict__

{'id': None,
 'metadata': {},
 'page_content': 'Hello, welcome to LangChain Open Tutorial!',
 'type': 'Document'}

In [5]:
# Add metadata
document.metadata["source"] = "./example-file.pdf"
document.metadata["page"] = 0

# Check metadata
document.metadata

{'source': './example-file.pdf', 'page': 0}

## Document Loader

A Document Loader is a class that loads Documents from a particular kind of source, such as a file format or a directory.

Listed below are some examples of Document Loaders.

- `PyPDFLoader`: Loads PDF files
- `CSVLoader`: Loads CSV files
- `UnstructuredHTMLLoader`: Loads HTML files
- `JSONLoader`: Loads JSON files
- `TextLoader`: Loads text files
- `DirectoryLoader`: Loads documents from a directory
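
One way to read the list above is as a lookup from the kind of source to a loader. The sketch below only returns loader names as strings so that it runs without LangChain installed. The suffix mapping and the `pick_loader_name` helper are assumptions made for illustration and are not part of any library.

```python
from pathlib import Path

# Assumed suffix-to-loader mapping, for illustration only.
# The loader names are the ones listed above.
LOADER_NAME_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".html": "UnstructuredHTMLLoader",
    ".json": "JSONLoader",
    ".txt": "TextLoader",
}


def pick_loader_name(path: str) -> str:
    """Return the name of a loader that fits `path` (hypothetical helper)."""
    p = Path(path)
    if p.is_dir():
        # An existing directory is handled by the directory loader.
        return "DirectoryLoader"
    try:
        return LOADER_NAME_BY_SUFFIX[p.suffix.lower()]
    except KeyError:
        raise ValueError(f"No loader registered for suffix {p.suffix!r}") from None


pdf_loader_name = pick_loader_name("data/01-document-loader-sample.pdf")
```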

In [6]:
FILE_PATH = "data/01-document-loader-sample.pdf"

In [7]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(file_path=FILE_PATH)

### `load()`

- Loads Documents and returns them as a `list[Document]`.

In [8]:
docs = loader.load()
docs[:10]

[Document(metadata={'producer': 'Microsoft® Word 2010', 'creator': 'Microsoft® Word 2010', 'creationdate': '2016-10-11T15:32:07-04:00', 'author': 'NITRD AI Task Force', 'keywords': 'Artificial Intelligence, AI, Machine Learning, ML, Deep Learning, DL, Neural Networks,', 'moddate': '2016-10-11T20:19:58-04:00', 'title': 'The National Artificial Intelligence Research and Development Strategic Plan', 'source': 'data/01-document-loader-sample.pdf', 'total_pages': 48, 'page': 0, 'page_label': '1'}, page_content='October 2016 \n \n \n \n \n \n \n \n \n \nTHE NATIONAL  \nARTIFICIAL INTELLIGENCE \nRESEARCH AND DEVELOPMENT \nSTRATEGIC PLAN  \nNational Science and Technology Council \n \nNetworking and Information Technology \nResearch and Development Subcommittee'),
 Document(metadata={'producer': 'Microsoft® Word 2010', 'creator': 'Microsoft® Word 2010', 'creationdate': '2016-10-11T15:32:07-04:00', 'author': 'NITRD AI Task Force', 'keywords': 'Artificial Intelligence, AI, Machine Learning, ML, 

In [9]:
len(docs)

48

### `aload()`

- Asynchronously loads Documents and returns them as a `list[Document]`.

In [10]:
# Load Documents asynchronously
docs = await loader.aload()
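
A bare top-level `await` like the one above works in a notebook cell, but a plain Python script needs an entry point such as `asyncio.run`. The sketch below shows that pattern and how several loads can run concurrently. `fake_aload` is a hypothetical stand-in for `loader.aload()` so that the example runs without any files, and the paths are made up.

```python
import asyncio


async def fake_aload(path: str) -> list[str]:
    """Hypothetical stand-in for `loader.aload()`: pretends to read one file."""
    await asyncio.sleep(0.01)  # simulates waiting on I/O
    return [f"{path} :: page {i}" for i in range(2)]


async def load_all(paths: list[str]) -> list[list[str]]:
    # gather() runs the coroutines concurrently and keeps results in input order.
    return await asyncio.gather(*(fake_aload(p) for p in paths))


# In a script, asyncio.run() starts the event loop and runs the coroutine.
results = asyncio.run(load_all(["a.pdf", "b.pdf"]))
```

Inside a notebook, where an event loop is already running, `asyncio.run` raises a `RuntimeError`, so `results = await load_all([...])` would be used there instead.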

### `load_and_split()`

- Loads Documents, splits them into chunks with a `TextSplitter`, and returns the chunks as a `list[Document]`.

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=0)
docs = loader.load_and_split(text_splitter)

In [12]:
len(docs)

1430
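
The chunk count is much larger than the page count because each page is cut into pieces of at most `chunk_size` characters. The sketch below illustrates how `chunk_size` and `chunk_overlap` interact using a naive fixed-width split. `split_fixed` is a hypothetical helper and is deliberately simpler than `RecursiveCharacterTextSplitter`, which tries to break on separators such as paragraph and line breaks first, so its chunk boundaries and counts will differ.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Naive fixed-width splitter (hypothetical, for illustration only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new chunk starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


no_overlap = split_fixed("abcdefghij", chunk_size=4)
with_overlap = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2)
```

With no overlap, every character appears in exactly one chunk. With an overlap, consecutive chunks share their boundary characters, which produces more chunks for the same text.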

### `lazy_load()`

- Loads Documents lazily, one at a time, and returns them as an `Iterator[Document]`.

In [13]:
# Load Documents sequentially
docs = loader.lazy_load()
for doc in docs:
    print(doc.metadata)
    break  # Used to limit the output length

{'producer': 'Microsoft® Word 2010', 'creator': 'Microsoft® Word 2010', 'creationdate': '2016-10-11T15:32:07-04:00', 'author': 'NITRD AI Task Force', 'keywords': 'Artificial Intelligence, AI, Machine Learning, ML, Deep Learning, DL, Neural Networks,', 'moddate': '2016-10-11T20:19:58-04:00', 'title': 'The National Artificial Intelligence Research and Development Strategic Plan', 'source': 'data/01-document-loader-sample.pdf', 'total_pages': 48, 'page': 0, 'page_label': '1'}
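
The practical benefit of lazy loading is that stopping early avoids work for the items that are never requested. The sketch below demonstrates this with a plain Python generator. `lazy_pages` is a hypothetical stand-in for a lazy loader, and the `produced` list exists only to record which pages were actually generated.

```python
from itertools import islice
from typing import Iterator

produced: list[int] = []  # records which pages were actually generated


def lazy_pages(total_pages: int) -> Iterator[dict]:
    """Hypothetical stand-in for a lazy loader: yields one page at a time."""
    for page in range(total_pages):
        produced.append(page)
        yield {"page": page, "total_pages": total_pages}


# Take only the first two pages; the remaining ones are never generated.
first_two = list(islice(lazy_pages(48), 2))
```

After this runs, `produced` contains only pages 0 and 1, even though the generator was asked about a 48-page source. `islice` is an alternative to the `break` used above when a fixed number of items is wanted.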


### `alazy_load()`

- Asynchronously loads Documents one at a time and returns them as an `AsyncIterator[Document]`.

This method returns an `async_generator`, an asynchronous iterator that produces values on demand instead of holding them all in memory at once.

In [14]:
# Load Documents asynchronously and sequentially
docs = loader.alazy_load()
async for doc in docs:
    print(doc.metadata)
    break  # Used to limit the output length

{'producer': 'Microsoft® Word 2010', 'creator': 'Microsoft® Word 2010', 'creationdate': '2016-10-11T15:32:07-04:00', 'author': 'NITRD AI Task Force', 'keywords': 'Artificial Intelligence, AI, Machine Learning, ML, Deep Learning, DL, Neural Networks,', 'moddate': '2016-10-11T20:19:58-04:00', 'title': 'The National Artificial Intelligence Research and Development Strategic Plan', 'source': 'data/01-document-loader-sample.pdf', 'total_pages': 48, 'page': 0, 'page_label': '1'}
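
The same on-demand behavior can be reproduced with a hand-written async generator. In the sketch below, `alazy_pages` is a hypothetical stand-in for `alazy_load()`, and `produced_async` only tracks which pages were generated. The code is wrapped in `asyncio.run` so that it also works in a plain script.

```python
import asyncio
import inspect
from typing import AsyncIterator

produced_async: list[int] = []  # records which pages were actually generated


async def alazy_pages(total_pages: int) -> AsyncIterator[dict]:
    """Hypothetical stand-in for `alazy_load()`: an async generator."""
    for page in range(total_pages):
        await asyncio.sleep(0)  # yields control, as real I/O would
        produced_async.append(page)
        yield {"page": page}


async def first_page() -> dict:
    # Returning inside the loop stops iteration after the first item.
    async for meta in alazy_pages(48):
        return meta
    raise ValueError("no pages")


# Calling the function creates an async generator object without running it.
is_async_gen = inspect.isasyncgen(alazy_pages(1))

first = asyncio.run(first_page())
```

In a notebook, `first = await first_page()` would replace the `asyncio.run` call, since an event loop is already running there.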
