# **Document Loaders**
Use document loaders to load data from a source as `Document` objects.

Document loaders provide a `load` method that reads data from a configured source and returns it as a list of documents. They may also implement `lazy_load`, which yields documents one at a time instead of reading everything into memory up front.
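To make the difference between eager and lazy loading concrete, here is a minimal standard-library sketch. The `Document` dataclass and `LineLoader` class below are hypothetical stand-ins written for this illustration; they are not LangChain classes and only mimic the general shape of the interface.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator


@dataclass
class Document:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Toy loader (not a LangChain class): one Document per line of a text file."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time; only the current line is held in memory.
        with open(self.file_path, encoding="utf-8") as f:
            for line_number, line in enumerate(f):
                yield Document(
                    page_content=line.rstrip("\n"),
                    metadata={"source": self.file_path, "line": line_number},
                )

    def load(self) -> list:
        # Eager variant: materialize every document into a list.
        return list(self.lazy_load())


# Demo on a throwaway file created only for this example.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    path.write_text("first line\nsecond line\nthird line\n", encoding="utf-8")

    loader = LineLoader(str(path))
    all_docs = loader.load()        # reads the whole file at once

    lazy_iter = loader.lazy_load()
    first_doc = next(lazy_iter)     # reads only as far as the first line
    lazy_iter.close()               # release the file handle early

print(len(all_docs))
print(first_doc.page_content)
```

Lazy loading is most useful when the source is large and documents are processed one by one, since nothing forces the full list to exist in memory at the same time.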

1. **Text Loaders**
```python
from langchain_community.document_loaders import TextLoader
loader = TextLoader("./index.md")
loader.load()
```

2. **CSV**
```python
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
data = loader.load()
```
3. **HTML**
```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
```
4. **Web Base Loader**
```python
# !pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")
data = loader.load()
```
5. **JSON**  
See the [JSON loader docs](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) for details.  
Suppose we want to extract the values of the `content` field from each entry under the `messages` key of the JSON data. `JSONLoader` does this with a `jq` schema, as shown below.
```python
#!pip install jq
from langchain_community.document_loaders import JSONLoader
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
```
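As a rough illustration of what the `jq` expression `.messages[].content` selects, the sketch below performs the same extraction with only the standard `json` module. The sample data is hypothetical (it is not the real `facebook_chat.json`), and this is not how `JSONLoader` is implemented; it only shows which values the expression picks out.

```python
import json

# Hypothetical sample shaped like a chat export; not the real facebook_chat.json.
raw = """
{
  "title": "Example chat",
  "messages": [
    {"sender_name": "User 1", "content": "Hi there"},
    {"sender_name": "User 2", "content": "Hello!"}
  ]
}
"""

data = json.loads(raw)

# Plain-Python equivalent of the jq expression `.messages[].content`:
# iterate over the `messages` array and take each element's `content` field.
contents = [message["content"] for message in data["messages"]]

print(contents)  # ['Hi there', 'Hello!']
```

Each selected value corresponds to one extracted item, so a file with two messages yields two results.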

6. **PDF**  
See the [PDF loader docs](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) for details.  
Install the dependency first: `pip install pypdf`

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()
```

7. **File Directory**  
See the [file directory loader docs](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for details.  
`DirectoryLoader` loads all documents in a directory. By default it uses `UnstructuredLoader` under the hood, which requires: `pip install "unstructured[all-docs]"`  
Use the `glob` parameter to control which files are loaded, and `loader_cls` to choose a different loader for each file, such as `TextLoader` in the example below.

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.md", show_progress=True, loader_cls=TextLoader)
docs = loader.load()
```
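To see which files a pattern such as `**/*.md` selects, the following sketch uses `pathlib` on a temporary directory. It assumes the loader's `glob` parameter follows pathlib-style pattern semantics; the directory layout and file names are hypothetical and exist only for this demo.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Hypothetical directory layout, created only for this demo.
    (root / "docs" / "guide").mkdir(parents=True)
    (root / "README.md").write_text("# readme", encoding="utf-8")
    (root / "docs" / "index.md").write_text("# index", encoding="utf-8")
    (root / "docs" / "guide" / "setup.md").write_text("# setup", encoding="utf-8")
    (root / "docs" / "notes.txt").write_text("notes", encoding="utf-8")

    # "**/*.md" descends into subdirectories; "*.md" stays at the top level.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))
    top_level = sorted(p.relative_to(root).as_posix() for p in root.glob("*.md"))

print(recursive)  # ['README.md', 'docs/guide/setup.md', 'docs/index.md']
print(top_level)  # ['README.md']
```

The `.txt` file is skipped by both patterns, which is the same filtering effect the `glob` argument is used for above.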

```python
# !pip install "unstructured[all-docs]"
# !pip install jq
# !pip install pypdf
# !pip install pymupdf
```

## **Loading one .srt File**

```python
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/subtitles/Friends_2x01.srt")

data = loader.load()
```

```python
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each datapoints:", type(data[0]))
print()
print("Metadata: ", data[0].metadata)
print()
print("Page Content:", data[0].page_content[:200])
```

Output:

```
Type of Data Variable:  <class 'list'>

Number of Documents:  1

Type of each datapoints: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': 'data/subtitles/Friends_2x01.srt'}

Page Content: 1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he 
```


## **Loading all .srt Files**

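```python
# TextLoader is imported here as well so the cell runs on its own.
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load every .srt file in the directory, one Document per file.
loader = DirectoryLoader('data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

data = loader.load()
```

Output:

```
100%|█████████████████████████████████████████| 10/10 [00:00<00:00, 5149.54it/s]
```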


```python
print("Type of Data Variable: ", type(data))

print("Number of Documents:", len(data))
```

Output:

```
Type of Data Variable:  <class 'list'>
Number of Documents: 10
```


## **Loading a .csv File**

```python
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='./data/csv_data/movies_data.csv')

data = loader.load()
```

```python
print("Type of loaded data:", type(data))

print("Number of datapoints:", len(data))

print("Type of each datapoints:", type(data[0]))
```

Output:

```
Type of loaded data: <class 'list'>
Number of datapoints: 436
Type of each datapoints: <class 'langchain_core.documents.base.Document'>
```


```python
data[:5]
```

Output:

```
[Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 0}, page_content="movieId: 1\ntitle: Toy Story (1995)\ngenres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 1}, page_content="movieId: 2\ntitle: Jumanji (1995)\ngenres: ['Adventure', 'Children', 'Fantasy']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 2}, page_content="movieId: 3\ntitle: Grumpier Old Men (1995)\ngenres: ['Comedy', 'Romance']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 3}, page_content="movieId: 6\ntitle: Heat (1995)\ngenres: ['Action', 'Crime', 'Thriller']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 4}, page_content="movieId: 7\ntitle: Sabrina (1995)\ngenres: ['Comedy', 'Romance']")]
```

```python
print(data[0].page_content)
```

Output:

```
movieId: 1
title: Toy Story (1995)
genres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
```
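
The output above suggests that each CSV row becomes one document whose text is a `column: value` line per field, with the row index kept in the metadata. The sketch below reproduces that shape with only the standard `csv` module. It is an illustration of the observed format, not `CSVLoader`'s actual implementation, and the two-row sample and the plain dictionaries standing in for documents are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample in the same shape as movies_data.csv.
raw_csv = io.StringIO("""movieId,title,genres
1,Toy Story (1995),"['Adventure', 'Animation']"
2,Jumanji (1995),"['Adventure', 'Children']"
""")

documents = []
for row_index, row in enumerate(csv.DictReader(raw_csv)):
    # One "column: value" line per field, joined with newlines.
    page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
    # Plain dicts stand in for Document objects in this sketch.
    metadata = {"source": "in-memory sample", "row": row_index}
    documents.append({"page_content": page_content, "metadata": metadata})

print(documents[0]["page_content"])
```

Keeping the row index in the metadata makes it possible to trace any document back to the row it came from.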
