# **Document Loaders**
Document loaders provide a **standard interface** for reading data from different sources (such as Slack, Notion, or Google Drive) into LangChain's **Document** format. This ensures that data can be handled consistently regardless of the source.

## **Document Loaders API**
Each document loader may define its own parameters, but they share a common API:
- **load()**: Loads all documents at once.
- **lazy_load()**: Streams documents lazily, useful for large datasets.

## **Example Usage**
```python
# Load all documents
documents = loader.load()

# For large datasets, lazily load documents
for document in loader.lazy_load():
    print(document)
```

Every document loader provides a `load()` method that returns the data from its configured source as a list of documents. Loaders may also implement `lazy_load()`, which yields documents one at a time instead of holding them all in memory. Commonly used loaders are listed below.
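
To make the relationship between the two methods concrete, here is a minimal, self-contained sketch of the pattern. `SimpleDocument` and `LineLoader` are hypothetical stand-ins written for illustration; they are not LangChain classes, and real loaders may implement `load()` differently.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader: turns each non-empty line of a string into a document."""

    def __init__(self, text: str, source: str = "in-memory"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield one document at a time so the caller never holds them all at once.
        for number, line in enumerate(self.text.splitlines()):
            if line.strip():
                yield SimpleDocument(line, {"source": self.source, "line": number})

    def load(self) -> list[SimpleDocument]:
        # Eager variant: materialise everything that lazy_load() yields.
        return list(self.lazy_load())


loader = LineLoader("first line\n\nsecond line")
documents = loader.load()          # all documents in a list
first = next(loader.lazy_load())   # only the first document is produced
```

Building `load()` on top of `lazy_load()` keeps the two methods consistent: the eager call returns exactly what the lazy one would stream.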

1. **Text Loaders**
```python
from langchain_community.document_loaders import TextLoader
loader = TextLoader("./index.md")
loader.load()
```

2. **CSV**
```python
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
data = loader.load()
```
3. **HTML**
```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
```
4. **Web Base Loader**
```python
# !pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")
data = loader.load()
```
5. **JSON**  
See the [JSONLoader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) for details.  
Suppose we want to extract the values under the `content` field of every entry in the `messages` key of the JSON data. `JSONLoader` does this with a `jq` schema, as shown below.
```python
#!pip install jq
from langchain_community.document_loaders import JSONLoader
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
```
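For readers unfamiliar with jq, the schema `.messages[].content` can be read as "for every element of the `messages` array, take its `content` field". The sketch below reproduces that selection with the standard `json` module. The sample payload is invented for illustration and is not the contents of `facebook_chat.json`.

```python
import json

# Invented sample payload, shaped like a chat export with a "messages" array.
raw = """
{
  "title": "Example chat",
  "messages": [
    {"sender_name": "User 1", "content": "Hello"},
    {"sender_name": "User 2", "content": "Hi there"}
  ]
}
"""

data = json.loads(raw)

# Plain-Python equivalent of the jq expression `.messages[].content`.
contents = [message["content"] for message in data["messages"]]

print(contents)
```

Each selected value corresponds to one item the schema would hand to the loader, so a two-message file yields two entries here.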

6. **PDF**  
See the [PDF loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) for details.  
Make sure to install: `pip install pypdf`

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()
```

7. **File Directory**  
See the [DirectoryLoader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for details.  
By default, `DirectoryLoader` uses `UnstructuredFileLoader` under the hood, which requires: `pip install "unstructured[all-docs]"`  
The example below loads all matching files in a directory. The `glob` parameter controls which files are loaded, and `loader_cls` overrides the default loader (here with `TextLoader`).

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.md", show_progress=True, loader_cls=TextLoader)
docs = loader.load()
```
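The `glob` pattern follows the usual path-matching rules: `**/*.md` matches Markdown files at any depth, while `*.md` matches only the top level. The sketch below checks this with `pathlib` in a temporary directory. The file names are made up, and the example illustrates pattern semantics rather than `DirectoryLoader` internals.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Made-up layout: two Markdown files at different depths and one text file.
    (root / "docs" / "guide").mkdir(parents=True)
    (root / "README.md").write_text("top level")
    (root / "docs" / "guide" / "intro.md").write_text("nested")
    (root / "notes.txt").write_text("not markdown")

    # "**/*.md" descends into subdirectories; "*.md" stays at the top level.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))
    top_level = sorted(p.relative_to(root).as_posix() for p in root.glob("*.md"))

print(recursive)
print(top_level)
```

Trying a pattern this way before handing it to a loader is a cheap check that it selects the files you expect.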

8. **YouTube Transcripts Loader**  
Many videos on YouTube include transcripts, which are textual representations of the audio content. These transcripts are useful for analysis, accessibility, content repurposing, and education.  
Make sure to install: `pip install youtube-transcript-api`

```python
# Import libraries
from langchain_community.document_loaders import YoutubeLoader

# Define the YouTube video URL and load the transcript
video_url = 'https://www.youtube.com/watch?v=9t1IkQndfTs'
loader = YoutubeLoader.from_youtube_url(video_url)
data = loader.load()

# Print the data
print(data[0].page_content)
```

In [1]:
# !pip install "unstructured[all-docs]"
# !pip install jq
# !pip install pypdf
# !pip install pymupdf

In [2]:
# !pip install langchain_community -U

## **Load .csv File**

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='./data/csv_data/movies_data.csv')

data = loader.load()


In [4]:
print("Type of loaded data:", type(data))

print("Number of datapoints:", len(data))

print("Type of each datapoint:", type(data[0]))

Type of loaded data: <class 'list'>
Number of datapoints: 436
Type of each datapoint: <class 'langchain_core.documents.base.Document'>


In [5]:
data[:5]

[Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 0}, page_content="movieId: 1\ntitle: Toy Story (1995)\ngenres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 1}, page_content="movieId: 2\ntitle: Jumanji (1995)\ngenres: ['Adventure', 'Children', 'Fantasy']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 2}, page_content="movieId: 3\ntitle: Grumpier Old Men (1995)\ngenres: ['Comedy', 'Romance']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 3}, page_content="movieId: 6\ntitle: Heat (1995)\ngenres: ['Action', 'Crime', 'Thriller']"),
 Document(metadata={'source': './data/csv_data/movies_data.csv', 'row': 4}, page_content="movieId: 7\ntitle: Sabrina (1995)\ngenres: ['Comedy', 'Romance']")]

In [6]:
print(data[0].page_content)

movieId: 1
title: Toy Story (1995)
genres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
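
The output above suggests how each row is rendered: one `column: value` line per field, with the file path and row index kept in `metadata`. A rough standard-library sketch of that mapping follows. `rows_to_documents` is a hypothetical helper and plain dicts stand in for `Document` objects, so this approximates the observed format rather than reproducing `CSVLoader`'s actual code.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Hypothetical helper: render each CSV row as 'column: value' lines."""
    documents = []
    for row_index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


# Two rows taken from the output shown above (the genres column is omitted for brevity).
sample = "movieId,title\n1,Toy Story (1995)\n2,Jumanji (1995)\n"
docs = rows_to_documents(sample, source="./data/csv_data/movies_data.csv")

print(docs[0]["page_content"])
```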


## **Loading Web Page**

In [7]:
# !pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader

page_url = "https://python.langchain.com/docs/integrations/providers/"

loader = WebBaseLoader(web_paths=[page_url])

data = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [8]:
print("Type of loaded data:", type(data))

print("Number of datapoints:", len(data))

print("Type of each datapoint:", type(data[0]))

Type of loaded data: <class 'list'>
Number of datapoints: 1
Type of each datapoint: <class 'langchain_core.documents.base.Document'>


In [9]:
print(data[0].metadata)

{'source': 'https://python.langchain.com/docs/integrations/providers/', 'title': 'Integration packages - Docs by LangChain', 'language': 'en'}


In [10]:
print(data[0].page_content[:500])

Integration packages - Docs by LangChainSkip to main content🚀 Share how you're building agents for a chance to win LangChain swag!Docs by LangChain home pageLangChain + LangGraphSearch...⌘KAsk AIGitHubTry LangSmithTry LangSmithSearch...NavigationIntegration packagesLangChainLangGraphDeep AgentsIntegrationsLearnReferenceContributePythonOverviewAll providersPopular ProvidersOpenAIAnthropic (Claude)GoogleAWS (Amazon)Hugging FaceMicrosoftOllamaGroqIntegrations by componentChat modelsTools and toolki


## **Parsing specific data from Web Page**
**Important Note**  
The example above is essentially a dump of all the text in the page's HTML, so it includes extraneous content such as headings and navigation bars.

If you know the page's HTML structure, you can pass BeautifulSoup options to keep only the elements you want, for example those with a particular CSS class. Below we keep only the elements that hold the provider names and join their text with ` | `:

In [11]:
import bs4

page_url = "https://docs.langchain.com/oss/python/integrations/providers/all_providers"

loader = WebBaseLoader(
    web_paths=[page_url],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(class_="not-prose font-semibold text-base text-gray-800 dark:text-white mt-4"),
    },
    bs_get_text_kwargs={"separator": " | ", "strip": True},
)

data = loader.load()

In [12]:
print("Type of loaded data:", type(data))

print("Number of datapoints:", len(data))

print("Type of each datapoint:", type(data[0]))

Type of loaded data: <class 'list'>
Number of datapoints: 1
Type of each datapoint: <class 'langchain_core.documents.base.Document'>


In [13]:
print(data[0].metadata)

{'source': 'https://docs.langchain.com/oss/python/integrations/providers/all_providers'}


In [14]:
print(data[0].page_content[:500])

Abso | Acreom | ActiveLoop DeepLake | Ads4GPTs | AgentQL | AI21 | AIM Tracking | AI/ML API | AI Network | Airbyte | Airtable | Alchemy | Aleph Alpha | Alibaba Cloud | AnalyticDB | Anchor Browser | Annoy | Anthropic | Anyscale | Apache Doris | Apache | Apify | Apple | ArangoDB | Arcee | ArcGIS | Argilla | Arize | Arthur Tracking | arXiv | Ascend | Ask News | AssemblyAI | AstraDB | Atlas | AwaDB | AWS | AZLyrics | Azure AI | BAAI | Bagel | BagelDB | Baichuan | Baidu | BananaDev | Baseten | Beam | 
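
To see what filtering by class and joining with a separator achieves, the sketch below mimics the effect with the standard library's `html.parser`: keep only the text inside elements that carry a given class, strip whitespace, and join the pieces with ` | `. `ClassTextExtractor`, the class name `provider-name`, and the sample HTML are all hypothetical. This is not how BeautifulSoup or `WebBaseLoader` is implemented.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Hypothetical helper: collect text found inside elements with a target class.

    Assumes well-formed markup in which every opened tag is explicitly closed.
    """

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.pieces: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.pieces.append(text)


# Made-up page: a navigation bar to ignore and two provider names to keep.
html = """
<nav class="menu">Docs | Search</nav>
<h2 class="provider-name">Abso</h2>
<p>Some description</p>
<h2 class="provider-name">Acreom</h2>
"""

parser = ClassTextExtractor("provider-name")
parser.feed(html)
result = " | ".join(parser.pieces)
print(result)
```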


## **Loading one .srt File**

In [15]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/subtitles/Friends_2x01.srt")

data = loader.load()

In [16]:
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each datapoint:", type(data[0]))
print()
print("Metadata: ", data[0].metadata)
print()
print("Page Content:", data[0].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  1

Type of each datapoint: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': 'data/subtitles/Friends_2x01.srt'}

Page Content: 1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he 


## **Loading all .srt Files**

In [17]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

100%|█████████████████████████████████████████| 10/10 [00:00<00:00, 2994.22it/s]


In [18]:
print("Type of Data Variable: ", type(data))

print("Number of Documents: ", len(data))

print("Type of each Document: ", type(data[0]))

Type of Data Variable:  <class 'list'>
Number of Documents:  10
Type of each Document:  <class 'langchain_core.documents.base.Document'>


## **Loading YouTube Transcripts**

In [3]:
# ! pip install youtube-transcript-api

In [14]:
# Import libraries
from langchain_community.document_loaders import YoutubeLoader

# Define the path to the YouTube video and load the transcript
yt_url = 'https://www.youtube.com/watch?v=h0bfKnBdjbs'
loader = YoutubeLoader.from_youtube_url(yt_url)
data = loader.load()

# Print the data
print(data[0].page_content)

I'm excited to announce
the release of our latest LangChain Academy course,
LangSmith Essentials. In this quickstart course,
you'll learn to observe, evaluate, and deploy an AI agent
in less than 30 minutes. Testing applications is an essential part
of the development lifecycle, but LLM systems are non-deterministic,
meaning we can't predict exactly what output a
given input will produce. When you add multi-turn interactions
and tool-calling agents into the mix, the process becomes
even more complex and less straightforward than
traditional software testing. And that's where LangSmith comes in. LangSmith is a comprehensive
platform for agent engineering that helps AI teams use live production data
for continuous testing and improvement. In this course, you'll learn how to trace
your agent step by step to understand its behavior and fix issues
that slow it down or hurt quality. When building agents,
it's easy to get stuck guessing why performance changes
or which version actually works 

In [15]:
data

[Document(metadata={'source': 'h0bfKnBdjbs'}, page_content="I'm excited to announce\nthe release of our latest LangChain Academy course,\nLangSmith Essentials. In this quickstart course,\nyou'll learn to observe, evaluate, and deploy an AI agent\nin less than 30 minutes. Testing applications is an essential part\nof the development lifecycle, but LLM systems are non-deterministic,\nmeaning we can't predict exactly what output a\ngiven input will produce. When you add multi-turn interactions\nand tool-calling agents into the mix, the process becomes\neven more complex and less straightforward than\ntraditional software testing. And that's where LangSmith comes in. LangSmith is a comprehensive\nplatform for agent engineering that helps AI teams use live production data\nfor continuous testing and improvement. In this course, you'll learn how to trace\nyour agent step by step to understand its behavior and fix issues\nthat slow it down or hurt quality. When building agents,\nit's easy to 

In [32]:
# ! pip install yt_dlp

In [31]:
from yt_dlp import YoutubeDL

In [34]:
ydl_opts = {"quiet": True, "no_warnings": True, "skip_download": True}

with YoutubeDL(ydl_opts) as ydl:
    yt = ydl.extract_info("https://www.youtube.com/watch?v=h0bfKnBdjbs", download=False)
    print(yt)

{'id': 'h0bfKnBdjbs', 'title': 'LangChain Academy New Course: LangSmith Essentials', 'formats': [{'format_id': 'sb3', 'format_note': 'storyboard', 'ext': 'mhtml', 'protocol': 'mhtml', 'acodec': 'none', 'vcodec': 'none', 'url': 'https://i.ytimg.com/sb/h0bfKnBdjbs/storyboard3_L0/default.jpg?sqp=-oaymwENSDfyq4qpAwVwAcABBqLzl_8DBgj_y9TIBg==&sigh=rs$AOn4CLA8hbVj0IXugjLAovrecynr-RS32g', 'width': 48, 'height': 27, 'fps': 0.9523809523809523, 'rows': 10, 'columns': 10, 'fragments': [{'url': 'https://i.ytimg.com/sb/h0bfKnBdjbs/storyboard3_L0/default.jpg?sqp=-oaymwENSDfyq4qpAwVwAcABBqLzl_8DBgj_y9TIBg==&sigh=rs$AOn4CLA8hbVj0IXugjLAovrecynr-RS32g', 'duration': 105.0}], 'audio_ext': 'none', 'video_ext': 'none', 'vbr': 0, 'abr': 0, 'tbr': None, 'resolution': '48x27', 'aspect_ratio': 1.78, 'filesize_approx': None, 'http_headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,a

In [21]:
# ! pip install langchain-yt-dlp

In [23]:
from langchain_yt_dlp.youtube_loader import YoutubeLoaderDL

yt_url = 'https://www.youtube.com/watch?v=h0bfKnBdjbs'
video_id = "h0bfKnBdjbs"

# Basic transcript loading
loader = YoutubeLoaderDL(
    video_id=video_id, 
    add_video_info=True
)

ModuleNotFoundError: No module named 'langchain.document_loaders'

In [30]:
from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document
from langchain_community.document_loaders.youtube import _parse_video_id
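
The `source` value `h0bfKnBdjbs` in the metadata shown earlier is the video ID taken from the URL. As a rough illustration of that step, the sketch below extracts the ID with `urllib.parse`. `extract_video_id` is a hypothetical helper covering only two common URL shapes; it is not the library's private `_parse_video_id` function.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: pull the video ID out of a YouTube URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()

    # Short links: https://youtu.be/<id>
    if host.endswith("youtu.be"):
        return parsed.path.lstrip("/") or None

    # Standard links: https://www.youtube.com/watch?v=<id>
    if host.endswith("youtube.com") and parsed.path == "/watch":
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    # Anything else is treated as unrecognised in this sketch.
    return None


video_id = extract_video_id("https://www.youtube.com/watch?v=h0bfKnBdjbs")
print(video_id)
```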

## **Load YouTube Audio**

In [6]:
! pip install yt_dlp

Collecting yt_dlp
  Downloading yt_dlp-2025.12.8-py3-none-any.whl.metadata (180 kB)
Downloading yt_dlp-2025.12.8-py3-none-any.whl (3.3 MB)
Installing collected packages: yt_dlp
Successfully installed yt_dlp-2025.12.8
