## Load Document $\rightarrow$ Split Document $\rightarrow$ **Storage** $\rightarrow$ Retrieval $\rightarrow$ Output

![steps](image.png)

In [1]:
import os
import openai

from dotenv import load_dotenv
load_dotenv()  # read variables such as OPENAI_API_KEY from a local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

### Embeddings

In [2]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [3]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [4]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [6]:
print(embedding1)
print(len(embedding1))

[-0.02746324164862178, -0.0053625327819535945, -0.025757842797248312, -0.033072111589731175, -0.027286385347265962, 0.02251126893594929, -0.010352404271333701, -0.008192232206662783, 0.0024823030519515003, -0.019858424415612003, 0.0007054509761090247, 0.02925706958199577, -0.0054035887423155736, 0.000664395073954708, 0.00030258991802989795, 0.014173761577700438, 0.02998975970723485, -0.0013674774056233383, 0.004070850717854286, -0.003998213607722557, -0.011691458060087631, 0.0069668710227412545, 0.013062093929933925, -0.046892160564719974, -0.0023859796138311538, 0.004674056732053473, 0.016851869155208303, -0.0002712058303040619, -0.02577047525715621, -0.016220240571877646, 0.026654756763935306, 0.002896020304886968, -0.015601243517132279, -0.0245703803900344, 0.00597205502610673, -0.014969614002479013, 0.009714458526726484, -0.011445123229238364, -0.0045793123514216135, -0.010977717556033018, -0.01700346053674832, 0.011072461936664876, 0.006821596336816491, -0.02210702463096083, -0.00

In [7]:
import numpy as np

In [8]:
np.dot(embedding1, embedding2)

0.9631675619330513

In [9]:
np.dot(embedding2, embedding3)

0.7596682675219103

In [10]:
np.dot(embedding1, embedding3)

0.7710630976675918
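
The dot products above act as similarity scores: the two sentences about dogs score about 0.96, while either of them compared with the weather sentence scores about 0.76 to 0.77. A dot product is only a fair similarity measure when the vectors have comparable lengths; cosine similarity makes that explicit by dividing by the norms. The sketch below is a minimal illustration on made-up toy vectors (`v1`, `v2`, `v3` and the helper name `cosine_similarity` are hypothetical, not real embeddings or library functions). For unit-length vectors it gives the same value as `np.dot`.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors, assumed for illustration only.
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1, twice the length
v3 = [0.0, 0.0, 3.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1.0 despite different lengths
print(cosine_similarity(v1, v3))  # no shared direction
```

Applied to the notebook, `cosine_similarity(embedding1, embedding2)` should be close to the `np.dot` value if the embeddings are (approximately) unit length.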

### Load, Split and Store

#### Load Doc

In [3]:
from langchain.document_loaders import PyPDFLoader

In [15]:
loader = PyPDFLoader('docs/test.pdf')
pages = loader.load()
len(pages)

86

In [17]:
print(pages[0].page_content)

The Rise and Potential of Large Language Model
Based Agents: A Survey
Zhiheng Xi∗†, Wenxiang Chen∗, Xin Guo∗, Wei He∗, Yiwen Ding∗, Boyang Hong∗,
Ming Zhang∗, Junzhe Wang∗, Senjie Jin∗, Enyu Zhou∗,
Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Qin Liu, Yuhao Zhou,
Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin,
Shihan Dou‡, Rongxiang Weng‡, Wensen Cheng‡,
Qi Zhang†, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang and Tao Gui†
Fudan NLP Group,‡miHoYo Inc
Abstract
For a long time, humanity has pursued artificial intelligence (AI) equivalent to or
surpassing the human level, with AI agents considered a promising vehicle for
this pursuit. AI agents are artificial entities that sense their environment, make
decisions, and take actions. Many efforts have been made to develop intelligent AI
agents since the mid-20th century. However, these efforts have mainly focused on
advancement in algorithms or training strategies to enhance specific capabilities
or perform

#### Split Doc

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)

In [20]:
splits = text_splitter.split_documents(pages)
len(splits)

290
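
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators such as paragraph and line breaks before falling back to smaller units); the function `split_with_overlap` is a hypothetical fixed-window stand-in that only shows how consecutive chunks share text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.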

#### Vectorstore

In [5]:
from langchain.vectorstores import Chroma

In [22]:
persist_directory = 'docs/chroma/'

In [23]:
vectordb = Chroma.from_documents(
    documents=splits, embedding=embedding, persist_directory=persist_directory
)

In [24]:
print(vectordb._collection.count())

290


### Similarity Search

In [25]:
question = "how long has humanity pursued artificial intelligence"

In [26]:
docs = vectordb.similarity_search(question, k=3)

In [27]:
len(docs)

3

In [31]:
print(docs[2].page_content)

actions [ 5]. This idea transitioned into computer science, intending to enable computers to understand
users’ interests and autonomously perform actions on their behalf [ 6;7;8]. As AI advanced, the term
“agent” found its place in AI research to depict entities showcasing intelligent behavior and possessing
qualities like autonomy, reactivity, pro-activeness, and social ability [ 4;9]. Since then, the exploration
and technical advancement of agents have become focal points within the AI community [ 1;10]. AI
agents are now acknowledged as a pivotal stride towards achieving Artificial General Intelligence
(AGI)2, as they encompass the potential for a wide range of intelligent activities [4; 11; 12].
From the mid-20th century, significant strides were made in developing smart AI agents, as research
delved deep into their design and advancement [ 13;14;15;16;17;18]. However, these efforts have
predominantly focused on enhancing specific capabilities, such as symbolic reasoning, or master

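### Limitation 1

Similarity search does not know about document structure. Asking for content "from page 1" does not restrict the results to that page: the whole question is embedded as one vector, and the chunks returned below come from other pages.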

In [32]:
question = "from page 1, how long has humanity pursued artificial intelligence"

In [33]:
docs = vectordb.similarity_search(question, k=3)

In [34]:
for doc in docs:
    print(doc.metadata)

{'page': 3, 'source': 'docs/test.pdf'}
{'page': 76, 'source': 'docs/test.pdf'}
{'page': 48, 'source': 'docs/test.pdf'}
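
One way to respect structure is to use the metadata that each chunk already carries. The sketch below filters retrieved chunks by their `page` value after the search; `Doc` is a hypothetical stand-in for the retrieved document objects, `filter_by_page` is an assumed helper name, and the chunk texts are invented. Note that the `page` values stored by the loader are zero-based, so "page 1" corresponds to `page == 0`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_page(docs, page_index):
    """Keep only documents whose metadata 'page' equals page_index (zero-based)."""
    return [d for d in docs if d.metadata.get("page") == page_index]


# Invented example results, for illustration only.
retrieved = [
    Doc("chunk from a later page", {"page": 3, "source": "docs/test.pdf"}),
    Doc("chunk from the first page", {"page": 0, "source": "docs/test.pdf"}),
    Doc("chunk from near the end", {"page": 76, "source": "docs/test.pdf"}),
]

first_page = filter_by_page(retrieved, page_index=0)
for doc in first_page:
    print(doc.metadata)
```

Filtering after retrieval can leave fewer than `k` results, or none at all. Restricting the search itself is usually preferable; LangChain's Chroma wrapper accepts a `filter` argument on `similarity_search` for this, but check the version you have installed.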


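### Limitation 2

Results are not unique. If the same content is indexed more than once, similarity search can return identical chunks. To show this, the first lecture PDF is deliberately loaded twice below.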

In [1]:
from langchain.document_loaders import PyPDFLoader 

In [2]:
import glob 

path = 'docs/cs229_lectures/*.pdf'
files = glob.glob(path)

files = sorted(files)

print(files)

['docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'docs/cs229_lectures/MachineLearning-Lecture03.pdf']


In [3]:
files.append(files[0])  # deliberately add Lecture01 a second time

files = sorted(files)
print(files)

['docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'docs/cs229_lectures/MachineLearning-Lecture03.pdf']


In [4]:
pages = []

for file in files:
    loader = PyPDFLoader(file)
    pages.extend(loader.load())
    
len(pages) 

78

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [6]:
splits = text_splitter.split_documents(pages)
len(splits)

209

In [8]:
splits[0].metadata

{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

In [9]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

In [10]:
persist_directory = "docs/chroma/"

vectordb = Chroma.from_documents(
    documents=splits, embedding=embedding, persist_directory=persist_directory
)

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-UODwUjyYPzLPSUgqp1SkNJpG on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..


In [11]:
vectordb._collection.count() 

209

In [12]:
question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k=5)


In [13]:
print(docs[0].page_content)

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to  learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your  own home computer or something if you 
don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will work for just about 
everythin

In [14]:
print(docs[1].page_content)

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to  learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your  own home computer or something if you 
don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will work for just about 
everythin
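
Because Lecture01 was indexed twice, identical chunks compete for the top `k` slots and crowd out other relevant text. A simple mitigation is to drop exact duplicates from the results. The sketch below does this; `Doc` is a hypothetical stand-in for the retrieved document objects, `dedupe_by_content` is an assumed helper name, and the chunk texts are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_content(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Invented results: the first two chunks are identical copies.
results = [
    Doc("homeworks will be done in MATLAB or Octave", {"page": 8}),
    Doc("homeworks will be done in MATLAB or Octave", {"page": 8}),
    Doc("Octave can be downloaded for free", {"page": 9}),
]

unique_results = dedupe_by_content(results)
print(len(results), "->", len(unique_results))
```

This only removes exact copies after retrieval. Avoiding duplicate files before indexing, or asking for more than `k` results and then deduplicating, keeps the final list full.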