# Document Loading

## Retrieval augmented generation
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
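
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is a toy and not how LangChain implements retrieval: the word-overlap scoring, the `tokenize`, `retrieve`, and `build_prompt` helpers, and the sample documents are all assumptions made for illustration. A real pipeline would typically use embeddings and a vector store instead.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_tokens & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question, as a RAG prompt would."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical mini-corpus standing in for loaded documents
documents = [
    "The first lecture covers the logistics of the class.",
    "Homework is submitted through the course website.",
]

question = "What does the first lecture cover?"
top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The prompt built this way is what would be sent to the LLM, so the model answers from the retrieved text rather than from its training data alone.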

In [2]:
# ! pip install langchain

## PDFs

In [6]:
# ! pip install pypdf 

The loader returns one Document per page of the PDF.

A Document contains **text (page_content)** and **metadata**.
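
As a rough mental model of that structure, the sketch below uses a small dataclass with the same two fields. `SimpleDocument` is a hypothetical stand-in written for this example, not LangChain's own class, and the sample text and file name are made up.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Illustrative stand-in: text plus a free-form metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# One stand-in object per page, mirroring how a PDF loader returns a list of pages
pages = [
    SimpleDocument(
        page_content=f"Text of page {i}",
        metadata={"source": "example.pdf", "page": i},
    )
    for i in range(3)
]

first = pages[0]
print(first.page_content[:500])  # slicing works because page_content is a plain string
print(first.metadata)
```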

In [10]:
# Import the PyPDFLoader class from the langchain.document_loaders module
from langchain.document_loaders import PyPDFLoader

# Create an instance of PyPDFLoader with the path to the PDF file
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")

# Load the pages of the PDF document into a variable
pages = loader.load()

# Print the type of the loaded pages (a list of Document objects)
print(type(pages))

# Print the number of pages loaded from the PDF document
print(len(pages))

# Access the first page of the loaded PDF
page = pages[0] 

# Print the first 500 characters of the content from the first page
print(page.page_content[0:500])

# Print the metadata associated with the first page
page.metadata

<class 'list'>
22
MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

## YouTube

In [12]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [14]:
! pip install yt_dlp
! pip install pydub


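**Note**: The next cell downloads the audio and transcribes it, which can take several minutes to complete. The audio post-processing step requires `ffmpeg` and `ffprobe` to be installed and available on the system PATH.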

In [15]:
# Define the URL of the YouTube video to be processed
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"

# Specify the directory where the audio will be saved
save_dir = "docs/youtube/"

# Initialize a GenericLoader with a YoutubeAudioLoader for the specified URL and save directory,
# and an OpenAIWhisperParser for processing the audio
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)

# Download the audio from the YouTube video and transcribe it into documents
docs = loader.load()

# Display the first 500 characters of the page content from the first document
docs[0].page_content[0:500]

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading web creator player API JSON
[youtube] jGwO_UgTS7I: Downloading player e38bb6de
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.71MiB


ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location


DownloadError: ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location
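
The error above means the download succeeded but the post-processing step could not find `ffmpeg` and `ffprobe`. A quick pre-flight check before running the loader can catch this early. The sketch below uses only the standard library; `missing_tools` is a hypothetical helper name introduced for this example.

```python
import shutil

def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("ffmpeg and ffprobe are available.")
```

If either tool is reported missing, install FFmpeg for your platform and make sure its executables are on PATH before re-running the cell.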

## URLs

In [18]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [19]:
docs = loader.load()

In [20]:
print(docs[0].page_content[:500])

handbook/37signals-is-you.md at master · basecamp/handbook · GitHub

Skip to content

Toggle navigation

Sign up

Product

Actions
        Automate any workflow

Packages
        Host and manage packages

Security
        Find and fix vulnerabilities

Cod
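
Raw text from a web page usually carries navigation labels and a lot of whitespace, as the output above shows, so some post-processing is often needed before the text is useful. One possible clean-up step is sketched below; `collapse_whitespace` is a hypothetical helper and the sample string is made up to resemble the page text.

```python
import re

def collapse_whitespace(text):
    """Strip each line, drop empty lines, and squeeze runs of spaces or tabs."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Hypothetical sample resembling the raw page text
raw = "\n\n\nSkip to content\n\n\n\n   Toggle navigation  \n\n        Sign up\n"
cleaned = collapse_whitespace(raw)
print(cleaned)
```

The same function could be applied to `docs[0].page_content` after loading. It only removes whitespace; stripping navigation text would need additional, site-specific rules.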


## Notion

Follow the steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) to load an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):

- Duplicate the page into your own Notion space and export as Markdown / CSV.
- Unzip it and save it as a folder that contains the markdown file for the Notion page.
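
The unzip step can also be done from Python. The sketch below is one way to do it with the standard library; `extract_export` is a hypothetical helper, `notion_export.zip` is a placeholder file name, and the default target folder is assumed to match the path passed to the loader.

```python
import zipfile
from pathlib import Path

def extract_export(zip_path, target_dir="docs/Notion_DB"):
    """Unzip a Notion export into target_dir and return the Markdown files found."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Search recursively, since exports may place pages in subfolders
    return sorted(target.rglob("*.md"))

# Example call; "notion_export.zip" is a hypothetical file name
# markdown_files = extract_export("notion_export.zip")
```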

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [None]:
print(docs[0].page_content[0:200])

In [None]:
docs[0].metadata