# Document Loading

Note to students.
During periods of high load the notebook may become unresponsive. A cell may appear to run, and the completion number in brackets [#] at the left of the cell may update, even though the cell has not actually executed. This is most obvious with print statements that produce no output. If this happens, restart the kernel using the command under the Kernel tab.

## Retrieval augmented generation
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
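The retrieve-then-prompt idea can be sketched in a few lines of plain Python. This is only a toy illustration: the corpus, the word-overlap scoring rule, and the prompt template are all assumptions made up for this example, and a real pipeline would use the loaders shown later together with an embedding-based retriever and an actual LLM call.

```python
from typing import List

# Toy retrieve-then-prompt sketch. The corpus, scoring rule, and prompt
# template are illustrative assumptions, not part of any library.


def score(query: str, text: str) -> int:
    """Count how many distinct query words also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Return the k corpus entries that share the most words with the query."""
    ranked = sorted(corpus, key=lambda text: score(query, text), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Place the retrieved text ahead of the question in a single prompt."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


corpus = [
    "Supervised learning maps inputs to labeled outputs.",
    "The handbook describes vacation policy for employees.",
]
question = "what is supervised learning"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)
```

The prompt string is what would be sent to the LLM, so the model answers with the retrieved text in front of it rather than from its training data alone.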

In [None]:
#! pip install langchain

In [2]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

## PDFs
Let's load a PDF of lecture notes from Andrew Ng's famous CS229 course! Text extracted from a PDF is sometimes run together or split unexpectedly, as the table and plot labels in the printed page content show.

In [4]:
#%pip install pypdf


In [13]:
# There are 80 different types of document loaders in langchain.document_loaders
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://cs229.stanford.edu/notes2021fall/cs229-notes1.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
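To make that shape concrete, here is a minimal stand-in written with a dataclass. `SimpleDocument` is a hypothetical class for illustration only, not LangChain's own `Document`, and the example values are made up; it simply mirrors the two fields named above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the two fields described above."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


# Example values are invented for illustration.
doc = SimpleDocument(
    page_content="CS229 Lecture Notes",
    metadata={"source": "example.pdf", "page": 0},
)

print(doc.page_content[:5])   # slice the text like any string
print(doc.metadata["page"])   # look up metadata like any dict
```

The cells that follow access the loaded pages in the same way: slicing `page_content` and reading keys from `metadata`.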

In [14]:
len(pages)

28

In [15]:
page = pages[0]

In [16]:
print(page.page_content[0:500])

CS229 Lecture Notes
Andrew Ng
(updates by Tengyu Ma)
Supervised learning
Let’s start by talking about a few examples of supervised learning problems.
Suppose we have a dataset giving the living areas and prices of 47 houses
from Portland, Oregon:
Living area (feet2)Price (1000$s)
2104 400
1600 330
2400 369
1416 232
3000 540
......
We can plot this data:
500 1000 1500 2000 2500 3000 3500 4000 4500 500001002003004005006007008009001000housing prices
square feetprice (in $1000)
1


In [17]:
page.metadata

{'source': 'C:\\Users\\alexm\\AppData\\Local\\Temp\\tmpt1gknetr\\tmp.pdf',
 'page': 0}

## YouTube

In [18]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [21]:
#%pip install yt_dlp
#%pip install pydub


In [None]:
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"
# Download the video's audio into save_dir, then transcribe it with OpenAI Whisper.
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

In [None]:
docs[0].page_content[0:500]

## URLs

In [22]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [23]:
docs = loader.load()

In [26]:
docs

[Document(page_content='\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nhandbook/37signals-is-you.md at master · basecamp/handbook · GitHub\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\nToggle navigation\n\n\n\n\n\n\n\n\n\n\n            Sign\xa0up\n          \n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n        Product\n        \n\n\n\n\n\n\n\n\n\n\n\n\nActions\n        Automate any workflow\n      \n\n\n\n\n\n\n\nPackages\n        Host and manage packages\n      \n\n\n\n\n\n\n\nSecurity\n        Find and fix vulnerabilities\n      \n\n\n\n\n\n\n\nCodespaces\n        Instant dev environments\n      \n\n\n\n\n\n\n\nCopilot\n        Write better code with AI\n      \n\n\n\n\n\n\n\nCode review\n        Manage code changes\n      \n\n\n\n\n\n\n\nIssues\n        Plan and track work\n      \n\n\n\n\n\n\n\nDiscussions\n     

In [None]:
print(docs[0].page_content[:1000])
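The raw `page_content` of a web page carries long runs of newlines, non-breaking spaces, and navigation text, so some post-processing is usually needed before using it. A minimal sketch, assuming a plain string like `docs[0].page_content`; `collapse_whitespace` is a hypothetical helper written for this example, not a LangChain function, and the sample string is a shortened imitation of the output above.

```python
def collapse_whitespace(text: str) -> str:
    """Remove blank lines and trim surrounding whitespace on each line."""
    # Replace non-breaking spaces (\xa0) with ordinary spaces.
    text = text.replace("\xa0", " ")
    # Strip leading and trailing whitespace on every line.
    lines = [line.strip() for line in text.split("\n")]
    # Keep only the lines that still contain text.
    return "\n".join(line for line in lines if line)


raw = "\n\n\nSkip to content\n\n\n\n            Sign\xa0up\n          \n\n"
print(collapse_whitespace(raw))
```

This only tidies whitespace; it does not separate the page's main text from navigation and menu items, which would need HTML-aware filtering.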

## Notion
Follow the steps [here](https://python.langchain.com/docs/integrations/document_loaders/notion) for an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):

Duplicate the page into your own Notion space and export it as Markdown / CSV.
Unzip the export and save it as a folder that contains the Markdown file for the Notion page.
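Before loading, it can help to confirm that the unzipped folder really contains Markdown files. The sketch below uses a hypothetical helper, `list_markdown_files`, and demonstrates it on a throwaway temporary folder with made-up file names so it runs anywhere; it is a sanity check, not part of the loader.

```python
import tempfile
from pathlib import Path
from typing import List


def list_markdown_files(root: str) -> List[str]:
    """Return the sorted relative paths of all .md files under root."""
    base = Path(root)
    return sorted(str(p.relative_to(base)) for p in base.rglob("*.md"))


# Demonstrate on a temporary folder; the file names here are invented.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Handbook.md").write_text("# Handbook", encoding="utf-8")
    (Path(tmp) / "assets").mkdir()
    (Path(tmp) / "assets" / "table.csv").write_text("a,b", encoding="utf-8")
    found = list_markdown_files(tmp)

print(found)
```

For this notebook's layout you would call `list_markdown_files("docs/Notion_DB")`; an empty list suggests the export was not unzipped into that folder.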

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [None]:
print(docs[0].page_content[0:200])

In [None]:
docs[0].metadata