# Document Loading

## Note to students.
During periods of high load, the notebook may become unresponsive. A cell may appear to execute and update the completion number in brackets [#] at its left without actually running. This is most obvious with print statements that produce no output. If this happens, restart the kernel using the command under the Kernel tab.

## Retrieval augmented generation
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
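
To make the retrieve-then-ask idea concrete, here is a minimal sketch that uses plain word overlap instead of embeddings. The helper names (`tokenize`, `retrieve`, `build_prompt`) and the two sample documents are made up for illustration; a real RAG pipeline would typically use an embedding model and a vector store, and would send the prompt to an LLM.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most word tokens with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_tokens & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question, as a RAG prompt would."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Illustrative documents (assumptions, not part of the course data).
documents = [
    "CS229 is a machine learning class taught at Stanford.",
    "ffmpeg is a tool for converting audio and video files.",
]
question = "Which class covers machine learning?"
top_docs = retrieve(question, documents)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The retrieval step picks the context and the prompt step hands it to the model, so the model answers from our documents rather than from memory alone.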

In [2]:
# ! pip install langchain

In [3]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv()) # read local .env file

# openai.api_key  = os.environ['OPENAI_API_KEY']

## PDFs
Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly.

In [6]:
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
# ! pip install pypdf 

In [5]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a Document.

A Document contains text (page_content) and metadata.
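
As a rough mental model, a Document can be pictured as a small record with two fields. The `SimpleDocument` class below is a hypothetical stand-in written for illustration, not LangChain's actual class; the field names mirror the `page_content` and `metadata` attributes used in this notebook.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = SimpleDocument(
    page_content="MachineLearning-Lecture01",
    metadata={"source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf", "page": 0},
)
print(doc.page_content[:7])   # slice the text like any string
print(doc.metadata["page"])   # metadata is an ordinary dict
```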

In [6]:
len(pages)

22

In [7]:
page = pages[0]

In [8]:
print(page.page_content[0:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


In [9]:
page.metadata

{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

In [13]:
ldr = PyPDFLoader("docs/Python-Institute-PE1-PE2-PCAP-ALL-Quizzes-and-Tests-dated-2023.05.22-FROM-ExamPoster-and-ExamTask-updated-v2024.03.09.pdf")
pgs = ldr.load()

In [23]:
print(len(pgs))
pg = pgs[0]
print(pg)
print(pg.page_content[0:500])
print(pg.metadata)

169
page_content='Python Institute\nhttps://examposter.com/python-institute/[5/22/2023 3:33:12 PM]Python Institute\nHome  » Python Institute\nLast Updated on April 25, 2022 by Admin\nPython Institute Certification Exam Preparation with Questions and Answers plus\nExplanations\nPE1\nPE2\nPCAP : Certified Associate in Python Programming\nCopyright 2023 - ExamPosterExamPoster\n1' metadata={'source': 'docs/Python-Institute-PE1-PE2-PCAP-ALL-Quizzes-and-Tests-dated-2023.05.22-FROM-ExamPoster-and-ExamTask-updated-v2024.03.09.pdf', 'page': 0}
Python Institute
https://examposter.com/python-institute/[5/22/2023 3:33:12 PM]Python Institute
Home  » Python Institute
Last Updated on April 25, 2022 by Admin
Python Institute Certification Exam Preparation with Questions and Answers plus
Explanations
PE1
PE2
PCAP : Certified Associate in Python Programming
Copyright 2023 - ExamPosterExamPoster
1
{'source': 'docs/Python-Institute-PE1-PE2-PCAP-ALL-Quizzes-and-Tests-dated-2023.05.22-FROM-ExamPoster-and-Ex

In [17]:
# Example: reuse your existing OpenAI setup
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
  model="TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
  messages=[
    {"role": "system", "content": "Always answer in rhymes."},
    {"role": "user", "content": "Introduce yourself."}
  ],
  temperature=0.7,
)

print(completion.choices[0].message)

ChatCompletionMessage(content="Hello there, I'm your friendly AI, here to help and guide you all the way!", role='assistant', function_call=None, tool_calls=None)


In [54]:
import ollama
print(ollama.embeddings(model='nomic-embed-text', prompt='The sky is blue because of rayleigh scattering'))
for split in splits:
    #print(split.page_content[:80])
    embedd = ollama.embeddings(model='nomic-embed-text', prompt=split.page_content)
    print(type(embedd), len(embedd), embedd['embedding'][:8] )

{'embedding': [0.2722543776035309, 0.4934713542461395, -2.4129905700683594, -0.47824615240097046, 0.6915696263313293, 1.4351977109909058, 0.06205877289175987, 0.4073667824268341, 0.20422813296318054, -1.0250886678695679, 0.792400598526001, 0.8639875054359436, 0.7337700724601746, 0.8805103898048401, 0.3235052824020386, 0.36504435539245605, 0.1402503401041031, -0.4418219029903412, -0.16611653566360474, -0.25743478536605835, -1.8509089946746826, -0.40723416209220886, 0.018559740856289864, -0.7916025519371033, 0.9549084901809692, 1.3176765441894531, -0.5174331068992615, -0.04875214770436287, -0.353937566280365, -0.3222334384918213, 1.070691704750061, -0.8693562746047974, -0.5364569425582886, -0.8162270784378052, 0.6912280321121216, -0.7703320980072021, 0.5024803876876831, -0.03490164503455162, -0.2792035639286041, -0.05019675940275192, 0.18278557062149048, -0.4195391833782196, 0.3449172377586365, -0.17463165521621704, 0.2637757956981659, -0.7249665856361389, 0.46155643463134766, 1.51180803

KeyboardInterrupt: 

In [45]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(base_url="http://localhost:1234/v1", api_key="lm-studio")

# embed_documents expects a list of texts, not a single string
e2 = embedding.embed_documents(["Python Institute Certification Exam Preparation with Questions and Answers plus"], 1)
e1 = embedding.embed_query("Python Institute Certification Exam Preparation with Questions and Answers plus")

TypeError: 'NoneType' object is not iterable

In [48]:
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model='text-embedding-ada-002',
                              deployment='your deployment name',
                              openai_api_base="http://localhost:1234/v1",
                              openai_api_key="lm-studio",
                              chunk_size=1)
embeddings.embed_query("Python Institute Certification Exam Preparation with Questions and Answers plus")

TypeError: 'NoneType' object is not iterable

In [35]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

splits = text_splitter.split_documents(ldr.load())
print(len(splits))

343
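
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraph breaks and newlines), so the chunks it produces will differ. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which helps keep a sentence that straddles a boundary retrievable from either side.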


In [42]:
import numpy as np

e = embedding.embed_query("Python Institute Certification Exam Preparation with Questions and Answers plus")

for split in splits:
    print(split.page_content)
    embedd = embedding.embed_query(split.page_content)
    print(embedd)

TypeError: 'NoneType' object is not iterable

In [21]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown
from langchain.indexes import VectorstoreIndexCreator

# Create a retrieval QA model
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch, llm=client
).from_loaders([ldr])

query = "Please list all test questions \
in a table in markdown and summarize each one."

response = index.query(query, llm=client)
display(Markdown(response))


ValidationError: 1 validation error for OpenAIEmbeddings
__root__
  Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass `openai_api_key` as a named parameter. (type=value_error)

## YouTube

In [24]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [15]:
# ! pip install yt_dlp
# ! pip install pydub

**Note**: This can take several minutes to complete.

In [26]:
url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url],save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.71MiB


ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location


DownloadError: ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location
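
The error above says the post-processing step could not find `ffmpeg` and `ffprobe`. One way to check for them before starting a long download is sketched below with the standard library; `missing_tools` is a hypothetical helper, and it only checks the system `PATH`, not a custom `--ffmpeg-location`.

```python
import shutil

def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names from tools that cannot be found on the system PATH."""
    return [name for name in tools if shutil.which(name) is None]

missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("ffmpeg and ffprobe are available")
```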

In [17]:
docs[0].page_content[0:500]

"Welcome to CS229 Machine Learning. Uh, some of you know that this is a class that's taught at Stanford for a long time. And this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got- built many of their products and services and startups that I'm sure, many of you or probably all of you are using, uh, uh, today. Um, so what I want to do today was spend s"

## URLs

In [18]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [19]:
docs = loader.load()

In [20]:
print(docs[0].page_content[:500])

handbook/37signals-is-you.md at master · basecamp/handbook · GitHub

Skip to content

Toggle navigation

            Sign up

        Product

Actions
        Automate any workflow

Packages
        Host and manage packages

Security
        Find and fix vulnerabilities

Cod
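
The loaded page text above consists largely of navigation labels separated by whitespace, so some post-processing is usually needed before the content is useful. A minimal cleanup sketch follows; `collapse_blank_lines` is a hypothetical helper and the `raw` string is a shortened imitation of the output. It only removes empty lines and surrounding spaces and does not strip the navigation text itself.

```python
def collapse_blank_lines(text):
    """Strip each line and drop empty ones, keeping the remaining lines in order."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

raw = "\n\n\nSkip to content\n\n\n\nToggle navigation\n\n\n            Sign up\n          \n"
cleaned = collapse_blank_lines(raw)
print(cleaned)
```

The same function could be applied to `docs[0].page_content` after loading.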


## Notion

Follow the steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) for an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):

- Duplicate the page into your own Notion space and export as Markdown / CSV.
- Unzip it and save it as a folder that contains the markdown file for the Notion page.
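
The unzip step can also be done from Python. The sketch below uses only the standard library; `extract_export` is a hypothetical helper and `notion_export.zip` is a placeholder file name, while `docs/Notion_DB` is the folder that the loader in the next cell reads from.

```python
import zipfile
from pathlib import Path

def extract_export(zip_path, target_dir):
    """Unpack a zip archive into target_dir and return the Markdown files found there."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return sorted(target.rglob("*.md"))

# Placeholder file name (an assumption); replace it with your own export.
export_zip = Path("notion_export.zip")
if export_zip.exists():
    for md_file in extract_export(export_zip, "docs/Notion_DB"):
        print(md_file)
```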

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [None]:
print(docs[0].page_content[0:200])

In [None]:
docs[0].metadata