# **Document Loaders**
Use document loaders to load data from a source as `Document` objects.

Document loaders provide a `load` method that reads data from a configured source and returns it as a list of documents. They may also implement `lazy_load`, which yields documents one at a time instead of holding them all in memory at once.
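The difference between eager and lazy loading can be sketched without LangChain. The `Document` dataclass and `LineLoader` class below are hypothetical stand-ins, not the library's implementation; they only mirror the `load` and `lazy_load` method names.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that produces one Document per line of a text file."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:
        # Generator: each line is read only when the caller asks for it.
        with open(self.file_path, encoding="utf-8") as f:
            for line_number, line in enumerate(f):
                yield Document(
                    page_content=line.rstrip("\n"),
                    metadata={"source": self.file_path, "line": line_number},
                )

    def load(self) -> list[Document]:
        # Eager: materialize every Document in memory at once.
        return list(self.lazy_load())


# Write a small sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    sample_path = tmp.name

loader = LineLoader(sample_path)
all_docs = loader.load()      # list holding all three Documents

lazy = loader.lazy_load()
first_doc = next(lazy)        # only the first line has been read so far
lazy.close()                  # stop the generator and close the file

os.remove(sample_path)
```

With a large source, iterating over `lazy_load()` lets each document be processed and discarded before the next one is read.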

1. **Text Loaders**
```python
from langchain_community.document_loaders import TextLoader
loader = TextLoader("./index.md")
loader.load()
```

2. **CSV**
```python
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
data = loader.load()
```
3. **HTML**
```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
```
4. **Web Base Loader**
```python
# !pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")
data = loader.load()
```
5. **JSON**  
See the [JSONLoader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) for details.  
Suppose we are interested in extracting the values under the `content` field within the `messages` key of the JSON data. This can easily be done through the JSONLoader as shown below.
```python
#!pip install jq
from langchain_community.document_loaders import JSONLoader
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
```
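To make the `jq_schema` concrete, the sketch below reproduces what `.messages[].content` selects, using only the standard library. The sample JSON is invented for illustration and is not the content of `facebook_chat.json`.

```python
import json

# Invented sample shaped like a chat export; the real file's fields may differ.
raw = """
{
  "title": "Sample chat",
  "messages": [
    {"sender_name": "User 1", "content": "Hello"},
    {"sender_name": "User 2", "content": "Hi, how are you?"}
  ]
}
"""

data = json.loads(raw)

# Equivalent of the jq expression `.messages[].content`:
# iterate over the `messages` array and take each element's `content` value.
contents = [message["content"] for message in data["messages"]]

for text in contents:
    print(text)
```

Each selected value is expected to end up as the content of its own `Document` when the same expression is passed to `JSONLoader`.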

6. **PDF**  
See the [PDF loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) for details.  
Make sure to install the dependency first: `pip install pypdf`

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()
```

7. **File Directory**  
See the [file directory loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for details.  
By default, `DirectoryLoader` uses `UnstructuredFileLoader` under the hood; the example below passes `loader_cls=TextLoader` instead.  
To use the default loader, make sure to install: `pip install "unstructured[all-docs]"`  
`DirectoryLoader` loads all documents in a directory. Use the `glob` parameter to control which files are loaded.

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.md", show_progress=True, loader_cls=TextLoader)
docs = loader.load()
```
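As a rough illustration of how a `glob` pattern narrows the file set, the sketch below applies two patterns with `pathlib` on a temporary directory. The file names are made up, and this shows only the pattern matching, not `DirectoryLoader`'s internals.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)

    # Made-up layout: two Markdown files (one nested) and one text file.
    (root / "docs").mkdir()
    (root / "README.md").write_text("# readme", encoding="utf-8")
    (root / "docs" / "guide.md").write_text("# guide", encoding="utf-8")
    (root / "notes.txt").write_text("notes", encoding="utf-8")

    # "**/*.md" matches Markdown files in the root and in any subdirectory.
    recursive_md = sorted(
        p.relative_to(root).as_posix() for p in root.glob("**/*.md")
    )

    # "*.md" matches Markdown files in the root directory only.
    top_level_md = sorted(
        p.relative_to(root).as_posix() for p in root.glob("*.md")
    )

print(recursive_md)
print(top_level_md)
```

The recursive pattern picks up the nested `docs/guide.md`, while the plain `*.md` pattern stops at the top level; `notes.txt` is skipped by both.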

## **Loading one .srt File**

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/subtitles_data/Friends - [2x01] - The One with Ross's New Girlfriend.srt")

doc = loader.load()

In [2]:
doc

[Document(page_content='1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat\'s happened so far.\n\n2\n00:00:04,395 --> 00:00:07,179\nRoss was in love\nwith Rachel since forever.\n\n3\n00:00:07,423 --> 00:00:10,437\nEvery time he tried to tell her,\nsomething got in the way...\n\n4\n00:00:10,651 --> 00:00:12,529\n...Iike cats, Italian guys.\n\n5\n00:00:12,736 --> 00:00:15,922\nAnd finally, Chandler was,\nlike, "Forget about her."\n\n6\n00:00:16,166 --> 00:00:20,762\nWhen Ross was in China, Chandler\nlet it slip that Ross loved Rachel.\n\n7\n00:00:20,975 --> 00:00:22,818\nShe was, like, "Oh, my God!"\n\n8\n00:00:23,061 --> 00:00:25,845\nSo she went to the airport to meet him.\n\n9\n00:00:26,089 --> 00:00:29,710\nShe didn \'t know Ross was getting\noff the plane with another woman.\n\n10\n00:00:31,165 --> 00:00:33,651\nThat\'s pretty much everything\nyou need to know.\n\n11\n00:00:33,922 --> 00:00:36,097\nBut enough about us.\nHow have you been?\n\n12\n00:00:37,991 --> 00:00:40,12

In [3]:
# doc[0].page_content

## **Loading all .srt Files**

In [4]:
# !pip install libmagic

In [5]:
# !pip install python-magic-bin

In [6]:
# !pip install "unstructured[all-docs]"

In [7]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader('data/subtitles_data', glob="*.srt", show_progress=True, loader_cls=TextLoader)

docs = loader.load()

100%|████████████████████████████████████████████████████████████████████████████████| 23/23 [00:00<00:00, 2899.14it/s]


In [8]:
print("Number of Documents:", len(docs))

Number of Documents: 23


## **Loading a .csv File**

In [9]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='./data/csv_data/movies_data.csv')

data = loader.load()

In [10]:
print("Type of loaded data:", type(data))

print("Number of datapoints:", len(data))

print("Type of each datapoint:", type(data[0]))

Type of loaded data: <class 'list'>
Number of datapoints: 436
Type of each datapoint: <class 'langchain_core.documents.base.Document'>


In [11]:
data[:5]

[Document(page_content="movieId: 1\ntitle: Toy Story (1995)\ngenres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']", metadata={'source': './data/csv_data/movies_data.csv', 'row': 0}),
 Document(page_content="movieId: 2\ntitle: Jumanji (1995)\ngenres: ['Adventure', 'Children', 'Fantasy']", metadata={'source': './data/csv_data/movies_data.csv', 'row': 1}),
 Document(page_content="movieId: 3\ntitle: Grumpier Old Men (1995)\ngenres: ['Comedy', 'Romance']", metadata={'source': './data/csv_data/movies_data.csv', 'row': 2}),
 Document(page_content="movieId: 6\ntitle: Heat (1995)\ngenres: ['Action', 'Crime', 'Thriller']", metadata={'source': './data/csv_data/movies_data.csv', 'row': 3}),
 Document(page_content="movieId: 7\ntitle: Sabrina (1995)\ngenres: ['Comedy', 'Romance']", metadata={'source': './data/csv_data/movies_data.csv', 'row': 4})]

In [12]:
print(data[0].page_content)

movieId: 1
title: Toy Story (1995)
genres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
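
The output above suggests that each CSV row becomes one document whose `page_content` lists `column: value` pairs, with the source path and row index kept in `metadata`. A minimal standard-library sketch of that mapping follows; the `rows_to_documents` helper and the dict-based documents are hypothetical and are not `CSVLoader`'s actual code.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Hypothetical helper: turn each CSV row into a document-like dict."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, in header order.
        page_content = "\n".join(
            f"{column}: {value}" for column, value in row.items()
        )
        documents.append(
            {
                "page_content": page_content,
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


# Two rows modeled on the output above; the genres column is simplified.
sample_csv = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure\n"
    "2,Jumanji (1995),Adventure\n"
)

documents = rows_to_documents(sample_csv, source="movies_data.csv")
print(documents[0]["page_content"])
print(documents[1]["metadata"])
```

Keeping the row index in `metadata` makes it possible to trace a retrieved document back to its line in the original file.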
