# **Document Loaders**
Use document loaders to load data from a source as `Document` objects.

Document loaders provide a `load` method that reads data from a configured source and returns it as documents. They can optionally implement `lazy_load` as well, which loads data into memory lazily.
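
To make the difference between eager and lazy loading concrete, here is a minimal, library-free sketch. The `Document` dataclass and the `LineLoader` class are hypothetical stand-ins written for this illustration, not LangChain's own classes; they only mirror the `load` / `lazy_load` pattern described above.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that produces one Document per line of text."""

    def __init__(self, text: str, source: str = "in-memory"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time; nothing is accumulated in memory.
        for i, line in enumerate(self.text.splitlines()):
            yield Document(page_content=line, metadata={"source": self.source, "line": i})

    def load(self) -> list[Document]:
        # Eager variant: materialize everything lazy_load would yield.
        return list(self.lazy_load())


loader = LineLoader("first line\nsecond line\nthird line")
docs = loader.load()                    # full list held in memory
first = next(iter(loader.lazy_load()))  # only the first document is built
```

The eager call is convenient for small sources, while the generator form is the safer choice when the source may not fit in memory.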

1. **Text Loaders**
```python
from langchain_community.document_loaders import TextLoader
loader = TextLoader("./index.md")
loader.load()
```

2. **CSV**
```python
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
data = loader.load()
```
3. **HTML**
```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
```
4. **Web Base Loader**
```python
# !pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")
data = loader.load()
```
5. **JSON**  
[Click Here](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) for detailed docs.  
Suppose we are interested in extracting the values under the `content` field within the `messages` key of the JSON data. This can easily be done through the JSONLoader as shown below.
```python
#!pip install jq
from langchain_community.document_loaders import JSONLoader
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
```
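
As a rough illustration of what the jq expression `.messages[].content` selects, the sketch below does the same extraction in plain Python. The sample JSON is hypothetical and only shaped like a chat export; it is not the contents of `facebook_chat.json`.

```python
import json

# Hypothetical sample shaped like a chat export; not the real facebook_chat.json.
raw = """
{
  "title": "example chat",
  "messages": [
    {"sender_name": "User 1", "content": "Hi!"},
    {"sender_name": "User 2", "content": "Hello, how are you?"}
  ]
}
"""

data = json.loads(raw)

# Plain-Python equivalent of `.messages[].content` for this well-formed sample:
# iterate over every element of `messages` and pick its `content` field.
contents = [message["content"] for message in data["messages"]]
```

Each extracted value would correspond to one loaded document.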

6. **PDF**  
[Click here](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) for detailed docs.  
Make sure to install: `pip install pypdf`

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()
```

7. **File Directory**  
[Click Here](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for detailed docs.
By default, `DirectoryLoader` uses `UnstructuredLoader` under the hood; the example below overrides this with `loader_cls=TextLoader`.  
Make sure to install: `pip install "unstructured[all-docs]"`  
This covers how to load all documents in a directory. We can use the `glob` parameter to control which files to load.

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.md", show_progress=True, loader_cls=TextLoader)
docs = loader.load()
```
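
To see which files a pattern such as `**/*.md` selects, the sketch below applies it with the standard-library `pathlib` module, assuming the `glob` argument follows `pathlib`-style pattern semantics. The directory layout is hypothetical and created in a temporary folder only for this illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical file layout, created only for this illustration.
    (root / "docs" / "guides").mkdir(parents=True)
    (root / "README.md").write_text("# readme")
    (root / "docs" / "intro.md").write_text("# intro")
    (root / "docs" / "guides" / "setup.md").write_text("# setup")
    (root / "docs" / "notes.txt").write_text("not markdown")

    # "**/*.md" matches Markdown files at any depth, including the root.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))
    # "*.md" matches only files directly inside the root directory.
    top_level = sorted(p.relative_to(root).as_posix() for p in root.glob("*.md"))
```

A recursive pattern is the usual choice for nested documentation trees, while a flat pattern like `*.md` keeps the load limited to a single folder.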

In [None]:
# !pip install "unstructured[all-docs]"
# !pip install jq
# !pip install pypdf
# !pip install pymupdf

In [None]:
!pip install "unstructured[all-docs]"

Collecting unstructured[all-docs]
  Downloading unstructured-0.16.0-py3-none-any.whl.metadata (24 kB)
Collecting filetype (from unstructured[all-docs])
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured[all-docs])
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured[all-docs])
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Collecting python-iso639 (from unstructured[all-docs])
  Downloading python_iso639-2024.4.27-py3-none-any.whl.metadata (13 kB)
Collecting langdetect (from unstructured[all-docs])
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Pre

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pycaret 3.3.2 requires pandas<2.2.0, but you have pandas 2.2.3 which is incompatible.
sktime 0.26.0 requires pandas<2.2.0,>=1.1, but you have pandas 2.2.3 which is incompatible.


In [None]:
!pip install jq

Collecting jq
  Downloading jq-1.8.0-cp312-cp312-win_amd64.whl.metadata (7.2 kB)
Downloading jq-1.8.0-cp312-cp312-win_amd64.whl (417 kB)
Installing collected packages: jq
Successfully installed jq-1.8.0


In [None]:
!pip install pypdf



In [None]:
!pip install pymupdf

Collecting pymupdf
  Downloading PyMuPDF-1.24.11-cp38-abi3-win_amd64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.11-cp38-abi3-win_amd64.whl (16.0 MB)

In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.md", show_progress=True, loader_cls=TextLoader)
docs = loader.load()

0it [00:00, ?it/s]


## **Web Base Loader**

In [None]:
# Web Base Loader
#!pip install beautifulsoup4

In [None]:
from langchain_community.document_loaders import WebBaseLoader

In [None]:
loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")

In [None]:
data = loader.load()

In [None]:
type(data)

list

In [None]:
len(data)

1

In [None]:
data[0]

Document(metadata={'source': 'https://docs.smith.langchain.com/user_guide', 'title': 'LangSmith User Guide | 🦜️🛠️ LangSmith', 'description': 'LangSmith is a platform for LLM application development, monitoring, and testing. In this guide, we’ll highlight the breadth of workflows LangSmith supports and how they fit into each stage of the application development lifecycle. We hope this will inform users how to best utilize this powerful platform or give them something to consider if they’re just starting their journey.', 'language': 'en'}, page_content="\n\n\n\n\nLangSmith User Guide | 🦜️🛠️ LangSmith\n\n\n\n\n\n\n\nSkip to main contentGo to API DocsSearchRegionUSEUGo to AppQuick StartUser GuideTracingEvaluationProduction Monitoring & AutomationsPrompt HubProxyPricingSelf-HostingCookbookThis is outdated documentation for 🦜️🛠️ LangSmith, which is no longer actively maintained.For up-to-date documentation, see the latest version.User GuideOn this p

In [None]:
print(data)

[Document(metadata={'source': 'https://docs.smith.langchain.com/user_guide', 'title': 'LangSmith User Guide | 🦜️🛠️ LangSmith', 'description': 'LangSmith is a platform for LLM application development, monitoring, and testing. In this guide, we’ll highlight the breadth of workflows LangSmith supports and how they fit into each stage of the application development lifecycle. We hope this will inform users how to best utilize this powerful platform or give them something to consider if they’re just starting their journey.', 'language': 'en'}, page_content="\n\n\n\n\nLangSmith User Guide | 🦜️🛠️ LangSmith\n\n\n\n\n\n\n\nSkip to main contentGo to API DocsSearchRegionUSEUGo to AppQuick StartUser GuideTracingEvaluationProduction Monitoring & AutomationsPrompt HubProxyPricingSelf-HostingCookbookThis is outdated documentation for 🦜️🛠️ LangSmith, which is no longer actively maintained.For up-to-date documentation, see the latest version.User GuideOn this 

## **Loading one .srt File**

In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/subtitles/Friends_2x01.srt")

data = loader.load()

In [None]:
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each datapoints:", type(data[0]))
print()
print("Metadata: ", data[0].metadata)
print()
print("Page Content:", data[0].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  1

Type of each datapoints: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': 'data/subtitles/Friends_2x01.srt'}

Page Content: 1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he 


## **Loading all .srt Files**

In [None]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 950.38it/s]


In [None]:
print("Type of Data Variable: ", type(data))

print("Number of Documents:", len(data))

Type of Data Variable:  <class 'list'>
Number of Documents: 10


## **Load .csv File**

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='./data/csv_data/movies_data.csv')

data = loader.load()

In [None]:
print("Type of loaded data:", type(data))

print("Number of datapoints:", len(data))

print("Type of each datapoints:", type(data[0]))

Type of loaded data: <class 'list'>
Number of datapoints: 436
Type of each datapoints: <class 'langchain_core.documents.base.Document'>


In [None]:
data[:5]

[Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 0}, page_content="movieId: 1\ntitle: Toy Story (1995)\ngenres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 1}, page_content="movieId: 2\ntitle: Jumanji (1995)\ngenres: ['Adventure', 'Children', 'Fantasy']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 2}, page_content="movieId: 3\ntitle: Grumpier Old Men (1995)\ngenres: ['Comedy', 'Romance']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 3}, page_content="movieId: 6\ntitle: Heat (1995)\ngenres: ['Action', 'Crime', 'Thriller']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 4}, page_content="movieId: 7\ntitle: Sabrina (1995)\ngenres: ['Comedy', 'Romance']")]

In [None]:
print(data[0].page_content)

movieId: 1
title: Toy Story (1995)
genres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
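
The output above suggests that each CSV row becomes one document whose text lists `column: value` pairs, with the source path and row index kept as metadata. The sketch below reproduces that shape with the standard-library `csv` module; it is an approximation for illustration, not the loader's actual implementation, and the two-row sample and the `sample.csv` source name are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample; the real movies_data.csv is not reproduced here.
sample = io.StringIO(
    "movieId,title\n"
    "1,Toy Story (1995)\n"
    "2,Jumanji (1995)\n"
)

documents = []
for row_number, row in enumerate(csv.DictReader(sample)):
    # Put each "column: value" pair on its own line, as in the output above.
    page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
    metadata = {"source": "sample.csv", "row": row_number}
    documents.append({"page_content": page_content, "metadata": metadata})
```

Plain dictionaries stand in for `Document` objects here so that the sketch runs without any third-party packages.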
