# Lab 1 - Overview of embeddings-based retrieval

Welcome! Here are a few notes about the Chroma course notebooks.
 - A number of warnings pop up when running the notebooks. These are normal and can be ignored.
 - Some operations, such as calling an LLM or any operation that uses generated data, return unpredictable results, so your notebook outputs may differ from the video.
  
Enjoy the course!

In [1]:
from helper_utils import word_wrap

In [2]:
from pypdf import PdfReader

reader = PdfReader("microsoft_annual_report_2022.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

print(word_wrap(pdf_texts[0]))

Microsoft Corporation Annual Report 2022
Form 10-K
(NASDAQ:MSFT)
Published: July 28th, 2022
PDF generated by

stocklight.com


You can view the PDF in your browser [here](./microsoft_annual_report_2022.pdf) if you would like.

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter


In [4]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(word_wrap(character_split_texts[10]))
print(f"\nTotal chunks: {len(character_split_texts)}")

result,” and similar expressions. Forward-looking statements are based
on current expectations and assumptions that are subject to risks
and
uncertainties that may cause actual results to diﬀer materially. We
describe risks and uncertainties that could cause actual results and
events to diﬀer
materially in “Risk Factors,” “Management’s Discussion
and Analysis of Financial Condition and Results of Operations,” and
“Quantitative and Qualitative
Disclosures about Market Risk” (Part II,
Item 7A of this Form 10-K). Readers are cautioned not to place undue
reliance on forward-looking statements,
which speak only as of the date
they are made. We undertake no obligation to update or revise publicly
any forward-looking statements, whether because
of new information,
future events, or otherwise.
PART I
ITEM 1. BUSINESS
GENERAL
Embracing
Our Future
Microsoft is a technology company whose mission is to
empower every person and every organization on the planet to achieve
more. We strive to create
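
The cell above passes an ordered list of separators and a `chunk_size` of 1000 characters. The idea behind that kind of splitting can be sketched in plain Python: try the coarsest separator first, merge neighbouring pieces while they still fit, and fall back to finer separators only for pieces that are too long. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so try the next, finer one
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph.\n\n"
    "Second paragraph is a bit longer. It has two sentences.\n\n"
    "Third."
)
chunks = recursive_split(sample, ["\n\n", "\n", ". ", " ", ""], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

With a small `chunk_size` such as 30, the short paragraphs stay whole while the long one is broken at sentence and then word boundaries, which mirrors why the real chunk above ends mid-sentence at "We strive to create".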



In [5]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(word_wrap(token_split_texts[10]))
print(f"\nTotal chunks: {len(token_split_texts)}")

result, ” and similar expressions. forward - looking statements are
based on current expectations and assumptions that are subject to risks
and uncertainties that may cause actual results to [UNK] materially. we
describe risks and uncertainties that could cause actual results and
events to [UNK] materially in “ risk factors, ” “ management ’ s
discussion and analysis of financial condition and results of
operations, ” and “ quantitative and qualitative disclosures about
market risk ” ( part ii, item 7a of this form 10 - k ). readers are
cautioned not to place undue reliance on forward - looking statements,
which speak only as of the date they are made. we undertake no
obligation to update or revise publicly any forward - looking
statements, whether because of new information, future events, or
otherwise. part i item 1. business general embracing our future
microsoft is a technology company whose mission is to empower every
person and every organization on the planet to achieve more. we
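
Comparing the two outputs, the word "diﬀer" (stored in the PDF text with the single-character ligature `ﬀ`) comes back as `[UNK]` after token splitting, which suggests the tokenizer has no entry for it. One possible remedy, not used in this notebook, is to normalize the text before splitting. The sketch below uses Unicode NFKC normalization from the standard library; `normalize_ligatures` is a hypothetical helper name, and NFKC also rewrites other compatibility characters (for example full-width letters), so whether that is acceptable for a given corpus is an assumption to check.

```python
import unicodedata


def normalize_ligatures(text: str) -> str:
    """Apply NFKC normalization, which expands compatibility ligatures such as U+FB00."""
    return unicodedata.normalize("NFKC", text)


# "\ufb00" is the ff ligature that appears in the extracted PDF text
sample = "cause actual results to di\ufb00er materially"
cleaned = normalize_ligatures(sample)
print(cleaned)  # cause actual results to differ materially
```

If this were applied, it would go right after text extraction, for example over each entry of `pdf_texts` before the character splitter runs.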

In [6]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))

[[-0.0096299322322011, 0.05235861614346504, 0.025509063154459, 0.04331742227077484, 0.017067700624465942, 0.04469361901283264, 0.04379584267735481, 0.02210894413292408, 0.051840271800756454, 0.0018128578085452318, -0.032639868557453156, 0.022467266768217087, 0.014086359180510044, -0.06495849788188934, -0.016644615679979324, -0.02735571376979351, -0.002238996559754014, -0.002325415378436446, -0.06282889097929001, 0.06006384268403053, -0.002002537250518799, 0.03988948464393616, -0.042229827493429184, 0.06987124681472778, 0.01914088986814022, -0.04514952749013901, -0.0006419369019567966, 0.0007112495950423181, -0.030000118538737297, 0.009689174592494965, -0.07661660015583038, 0.03926936537027359, 0.02613859437406063, 0.03812098130583763, 0.04546708986163139, 0.022274013608694077, -0.08642200380563736, 0.018159357830882072, -0.0045646438375115395, -0.024512972682714462, -0.046027328819036484, -0.09869948774576187, 0.011289143934845924, 0.004516446962952614, 0.020046401768922806, -0.0339147
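
The embedding function returns one list of floats per input text, so texts can be compared by comparing their vectors. A common choice is cosine similarity, sketched here with the standard library only. The three toy vectors are made up for illustration and are far shorter than a real embedding; which distance a particular Chroma collection uses depends on its configuration, which this notebook does not show.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical, not real embeddings)
v_query = [1.0, 0.0, 1.0]
v_same_direction = [2.0, 0.0, 2.0]
v_orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(v_query, v_same_direction))  # close to 1: same direction
print(cosine_similarity(v_query, v_orthogonal))      # 0.0: no overlap
```

Cosine similarity ignores vector length and looks only at direction, which is why the second vector scores as essentially identical to the first despite being twice as long.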

In [7]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

444

In [8]:
query = "What was the total revenue?"

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(word_wrap(document))
    print('\n')

include legal, including settlements and ﬁnes, information technology,
human resources, ﬁnance, excise taxes, ﬁeld selling, shared facilities
services, and customer service and support. each allocation is measured
differently based on the specific facts and circumstances of the costs
being allocated. segment revenue and operating income were as follows
during the periods presented : ( in millions ) year ended june 30, 2022
2021 2020 revenue productivity and business processes $ 63, 364 $ 53,
915 $ 46, 398 intelligent cloud 75, 251 60, 080 48, 366 more personal
computing 59, 655 54, 093 48, 251 total $ 198, 270 $ 168, 088 $ 143,
015 operating income


part ii item 8 item 8. financial statements and supplementary data
income statements ( in millions, except per share amounts ) year ended
june 30, 2022 2021 2020 revenue : product $ 72, 732 $ 71, 074 $ 68, 041
service and other 125, 538 97, 014 74, 974 total revenue 198, 270 168,
088 143, 015 cost of revenue : product 19, 064 18, 219 16, 0
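
Conceptually, the query above embeds the question and returns the `n_results` stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force scan using squared Euclidean distance. It is only an illustration: `top_k` and the toy data are hypothetical, and a vector store would normally use an index rather than scanning every vector.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, docs, k):
    """Return the k documents whose vectors are nearest to query_vec."""
    scored = ((squared_l2(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs))
    return [doc for _, doc in heapq.nsmallest(k, scored)]


# Toy 2-D "embeddings" (made up for illustration)
docs = ["revenue table", "risk factors", "mission statement"]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
query_vec = [1.0, 0.0]

nearest = top_k(query_vec, doc_vecs, docs, k=2)
print(nearest)  # ['revenue table', 'mission statement']
```

The results come back ordered from nearest to farthest, which matches how the retrieved chunks above are printed most relevant first.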

In [9]:
import os
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

# Pass the key explicitly; this raises KeyError early if OPENAI_API_KEY is missing
openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

In [10]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [11]:
output = rag(query=query, retrieved_documents=retrieved_documents)

print(word_wrap(output))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
The total revenue for the year ended June 30, 2022, was $198,270
million.
