## Reading documents

In [1]:
import os

from langchain_community.document_loaders import PyPDFLoader

In [2]:
DATA_FOLDER = 'data'
FILE_NAME = 'numpy-user.pdf'
DATA_PATH = os.path.join(DATA_FOLDER, FILE_NAME)

loader = PyPDFLoader(DATA_PATH)
document = loader.load()

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [5]:
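text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # maximum number of characters per chunk
    chunk_overlap=100,     # characters shared between consecutive chunks
    add_start_index=True,  # store each chunk's character offset in its metadata
)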

chunks = text_splitter.split_documents(document)

len(document), len(chunks)

(170, 860)
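
The effect of `chunk_size`, `chunk_overlap`, and `add_start_index` can be pictured with a plain sliding window. This is only a simplified sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, so its real chunk boundaries differ. The helper name `sliding_window_chunks` and the sample text are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters.

    Simplified stand-in for a text splitter; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 120  # 1200 characters of filler text
pieces = sliding_window_chunks(sample)

for piece in pieces:
    print(piece["start_index"], len(piece["text"]))
```

Each window starts 400 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. That overlap is what keeps a sentence cut at a boundary retrievable from at least one chunk.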

In [6]:
# Inspect a single chunk; a separate name avoids overwriting the loaded `document` list.
chunk = chunks[42]

print(chunk.page_content)
print(chunk.metadata)

The function zeros creates an array full of zeros, the function ones creates an array full of ones, and the function
empty creates an array whose initial content is random and depends on the state of the memory. By default, the dtype
of the created array is float64 .
>>>np.zeros( (3,4) )
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
>>>np.ones( (2,3,4), dtype=np.int16 ) # dtype can also be specified
array([[[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[ 1, 1, 1, 1]],
[[ 1, 1, 1, 1],
{'source': 'data/numpy-user.pdf', 'page': 14, 'start_index': 447}


## Embeddings

In [7]:
from langchain_chroma import Chroma

from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

In [8]:
embedding_function = SentenceTransformerEmbeddings(model_name="multi-qa-MiniLM-L6-cos-v1")



In [9]:
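# Embed every chunk and persist the resulting vector store to disk.
db = Chroma.from_documents(chunks, embedding_function, persist_directory="chroma_db")
# On later runs, reload the persisted store instead of re-embedding the chunks:
# db = Chroma(persist_directory="chroma_db", embedding_function=embedding_function)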

In [10]:
query = "What is an array?"

docs = db.similarity_search(query)
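
Conceptually, a similarity search embeds the query and returns the chunks whose vectors lie closest to it. The sketch below illustrates that idea with cosine similarity over hand-written toy vectors. It is not how Chroma works internally (the store uses its own index and distance settings), and the names `cosine_similarity`, `top_k`, and the vectors themselves are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.1, 0.0]

ranked = top_k(toy_query, toy_vectors, k=2)
print(ranked)
```

Here `chunk_a` ranks first because it points in almost the same direction as the query, followed by `chunk_b`.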

In [14]:
# Print every retrieved chunk (similarity_search returns 4 results by default).
for doc in docs:
    print(doc.page_content)

NumPy User Guide, Release 1.18.4
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.00.00.20.40.60.8
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.00.00.20.40.60.8
2.8. Tricks and Tips 31
Python. Indeed, the NumPy idiom is even simpler! This last example illustrates two of NumPy’s features which are
the basis of much of its power: vectorization and broadcasting.
1.1.1 Why is NumPy Fast?
Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place,
of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among
which are:
NumPy User Guide, Release 1.18.4
So note that x[0,2] = x[0][2] though the second case is more inefﬁcient as a new temporary array is created
after the ﬁrst index that is subsequently indexed by 2.
Note to those used to IDL or Fortran memory order as it relates to indexing. NumPy uses C-order indexing. That
means that the last index usually represents the most rapidly changing memory locati

In [15]:
docs = db.similarity_search_with_relevance_scores(query, k=4)
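
`similarity_search_with_relevance_scores` returns `(document, score)` pairs, which makes it straightforward to discard weak matches before using them. The sketch below shows one way to do that on plain tuples. The helper `filter_by_score`, the `min_score` cutoff of 0.4, and the toy results are all hypothetical; a sensible threshold depends on the embedding model and the data.

```python
def filter_by_score(scored_docs, min_score=0.4):
    """Keep (doc, score) pairs at or above min_score, best match first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for (Document, score) pairs; strings replace Document objects here.
toy_results = [
    ("chunk about indexing", 0.31),
    ("chunk about vectorization", 0.42),
    ("chunk from a figure page", 0.44),
]

strong = filter_by_score(toy_results, min_score=0.4)
print(strong)
```

A score cutoff does not judge content quality: a chunk made mostly of figure residue can still score above the threshold, so inspecting the retrieved text remains worthwhile.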

In [16]:
docs

[(Document(metadata={'page': 34, 'source': 'data/numpy-user.pdf', 'start_index': 0}, page_content='NumPy User Guide, Release 1.18.4\n0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.00.00.20.40.60.8\n0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.00.00.20.40.60.8\n2.8. Tricks and Tips 31'),
  0.44000843726782),
 (Document(metadata={'page': 7, 'source': 'data/numpy-user.pdf', 'start_index': 798}, page_content='Python. Indeed, the NumPy idiom is even simpler! This last example illustrates two of NumPy’s features which are\nthe basis of much of its power: vectorization and broadcasting.\n1.1.1 Why is NumPy Fast?\nVectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place,\nof course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among\nwhich are:'),
  0.4229862102380887),
 (Document(metadata={'page': 51, 'source': 'data/numpy-user.pdf', 'start_index': 0}, page_content='NumPy User Guide, Release 1.18.4\nSo