## Reading documents

In [2]:
import os

from langchain_community.document_loaders import PyPDFLoader

In [3]:
DATA_FOLDER = 'data'
FILE_NAME = 'numpy-user.pdf'
DATA_PATH = os.path.join(DATA_FOLDER, FILE_NAME)

loader = PyPDFLoader(DATA_PATH)
document = loader.load()

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [5]:
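text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # maximum characters per chunk
    chunk_overlap=100,     # characters shared between neighbouring chunks
    add_start_index=True,  # store each chunk's character offset in metadata["start_index"]
)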

chunks = text_splitter.split_documents(document)

len(document), len(chunks)

(170, 860)
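
The 170 loaded pages become 860 chunks because each page is cut into overlapping windows. The sketch below illustrates only the overlap and `start_index` idea with a plain sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to cut on separators such as paragraph and line breaks), and `split_with_overlap` is a hypothetical helper written for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration; returns (start_index, chunk) pairs.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    start = 0
    while start < len(text):
        pieces.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for start_index, piece in pieces:
    print(start_index, piece)
```

With these toy settings, each window starts 7 characters after the previous one, so neighbouring chunks share 3 characters, in the same way that the notebook's chunks share up to 100.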

In [6]:
# Inspect a single chunk; a separate name avoids overwriting the list of pages in `document`.
chunk = chunks[42]

print(chunk.page_content)
print(chunk.metadata)

The function zeros creates an array full of zeros, the function ones creates an array full of ones, and the function
empty creates an array whose initial content is random and depends on the state of the memory. By default, the dtype
of the created array is float64 .
>>>np.zeros( (3,4) )
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
>>>np.ones( (2,3,4), dtype=np.int16 ) # dtype can also be specified
array([[[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[ 1, 1, 1, 1]],
[[ 1, 1, 1, 1],
{'source': 'data\\numpy-user.pdf', 'page': 14, 'start_index': 447}


## Embeddings

In [1]:
from langchain_chroma import Chroma

from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

In [2]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")



In [3]:
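# First run only: embed the chunks and persist the index to disk.
# db = Chroma.from_documents(chunks, embedding_function, persist_directory="./chroma_db")
# Later runs: reopen the persisted index instead of re-embedding.
db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)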

In [4]:
query = "Why is numpy Fast?"

docs = db.similarity_search(query)



In [5]:
for doc in docs:
    print(doc.page_content)

CHAPTER
ONE
SETTING UP
1.1 What is NumPy?
NumPy is the fundamental package for scientiﬁc computing in Python. It is a Python library that provides a multidi-
mensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for
fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier
transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Python. Indeed, the NumPy idiom is even simpler! This last example illustrates two of NumPy’s features which are
the basis of much of its power: vectorization and broadcasting.
1.1.1 Why is NumPy Fast?
Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place,
of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among
which are:
could be used to pass data between these applications.
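
Conceptually, the search above embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea on made-up three-dimensional vectors. The vectors, the labels, and the helper names are all hypothetical, and cosine similarity is an assumption chosen for readability; the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in indexed_vectors
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up embeddings; a real model such as all-MiniLM-L6-v2 produces far longer vectors.
toy_index = [
    ("Why is NumPy fast?", [0.9, 0.1, 0.0]),
    ("Reading arrays from disk", [0.1, 0.9, 0.1]),
    ("Using the NumPy C-API", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

results = top_k(toy_query, toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```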

In [12]:
# Same query, but each hit is returned as a (Document, score) pair.
docs = db.similarity_search_with_relevance_scores(query, k=4)
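
Since `similarity_search_with_relevance_scores` returns `(Document, score)` pairs, weak matches can be dropped before they reach a prompt. The sketch below is one possible way to do that, not part of the original notebook: `FakeDocument` is a hypothetical stand-in for LangChain's `Document`, the page contents are placeholders, the two scores are copied from this notebook's output, and the `0.333` cut-off is an arbitrary value chosen for the demonstration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_score(scored_docs, min_score):
    """Keep only the (document, score) pairs whose score reaches min_score."""
    return [(doc, score) for doc, score in scored_docs if score >= min_score]


# Placeholder contents; the scores are the first two shown in the notebook output.
scored_docs = [
    (FakeDocument("chunk from page 41", {"page": 41}), 0.3334464949821496),
    (FakeDocument("chunk from page 123", {"page": 123}), 0.332782682118103),
]

kept = filter_by_score(scored_docs, min_score=0.333)  # arbitrary threshold
for doc, score in kept:
    print(doc.metadata["page"], round(score, 3))
```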

In [13]:
docs

[(Document(metadata={'page': 41, 'source': 'data\\numpy-user.pdf', 'start_index': 913}, page_content='senting variation in that dimension. An example illustrates much better than a verbal description:\n>>>np.indices((3,3))\narray([[[0, 0, 0], [1, 1, 1], [2, 2, 2]], [[0, 1, 2], [0, 1, 2], [0, 1, 2]]])\nThis is particularly useful for evaluating functions of multiple dimensions on a regular grid.\n3.2.4 Reading Arrays From Disk\nThis is presumably the most common case of large array creation. The details, of course, depend greatly on the format'),
  0.3334464949821496),
 (Document(metadata={'page': 123, 'source': 'data\\numpy-user.pdf', 'start_index': 2955}, page_content='these ﬂags set. This will set the underlying base array writable without causing the\ncontents to be copied back into the original array.\nOther useful ﬂags that can be OR’d as additional requirements are:\n120 Chapter 7. Using NumPy C-API'),
  0.332782682118103),
 (Document(metadata={'page': 122, 'source': 'data\\numpy