# Lab 1 - Overview of embeddings-based retrieval

Welcome! Here are a few notes about the Chroma course notebooks.
 - A number of warnings appear when running the notebooks. These are normal and can be ignored.
 - Some operations, such as calling an LLM or any operation that uses generated data, return unpredictable results, so your notebook outputs may differ from the video.
  
Enjoy the course!

In [3]:
from helper_utils import word_wrap

In [4]:
from pypdf import PdfReader

reader = PdfReader("microsoft_annual_report_2022.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

print(word_wrap(pdf_texts[0], 250))

1 Dear shareholders, colleagues, customers, and partners:  
We are living through a period of historic economic, societal, and geopolitical change. The world in 2022 looks nothing like 
the world in 2019. As I write this, inflation is at a 40 -year
high, supply chains are stretched, and the war in Ukraine is 
ongoing. At the same time, we are entering a technological era with the potential to power awesome advancements 
across every sector of our economy and society. As the world’s largest
software company, this places us at a historic 
intersection of opportunity and responsibility to the world around us.  
Our mission to empower every person and every organization on the planet to achieve more has never been more 
urgent or more
necessary. For all the uncertainty in the world, one thing is clear: People and organizations in every 
industry are increasingly looking to digital technology to overcome today’s challenges and emerge stronger. And no 
company is better positioned
to help th

You can view the PDF in your browser [here](./microsoft_annual_report_2022.pdf) if you would like.

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter


In [6]:
# `separators` is a hierarchy of splitting strings, tried in order. The splitter falls back to a finer separator only for pieces that are still longer than `chunk_size` characters.
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(word_wrap(character_split_texts[10]))
print(f"\nTotal chunks: {len(character_split_texts)}")

increased, due in large part to significant global datacenter
expansions and the growth in Xbox sales and usage. Despite 
these
increases, we remain dedicated to achieving a net -zero future. We
recognize that progress won’t always be linear, 
and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time.  
On the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate 
over 1.3  million cubic meters of volumetric benefits in nine
water basins around the world. Progress toward our zero waste

commitment included diverting more than 15,200 metric tons of solid
waste otherwise headed to landfills and incinerators, 
as well as
launching new Circular Centers to increase reuse and reduce e -waste at
our datacenters.  
We contracted to protect over 17,000 acres of land
(50% more than the land we use to operate), thus achieving our

Total chunks: 347
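
To make the separator hierarchy concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and details such as overlap handling and separator retention are left out.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` into pieces of at most `chunk_size` characters.

    Simplified illustration (assumption): separators are tried in order,
    and a finer separator is used only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta.\n\nGamma delta epsilon zeta.\n\nEta"
chunks = recursive_split(sample, ["\n\n", "\n", ". ", " ", ""], chunk_size=12)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long, so the sketch works down the separator list until splitting on spaces produces pieces that fit.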


In [7]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(word_wrap(token_split_texts[10]))
print(f"\nTotal chunks: {len(token_split_texts)}")

increased, due in large part to significant global datacenter
expansions and the growth in xbox sales and usage. despite these
increases, we remain dedicated to achieving a net - zero future. we
recognize that progress won ’ t always be linear, and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time. on the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate over 1. 3 million cubic meters of volumetric benefits in nine
water basins around the world. progress toward our zero waste
commitment included diverting more than 15, 200 metric tons of solid
waste otherwise headed to landfills and incinerators, as well as
launching new circular centers to increase reuse and reduce e - waste
at our datacenters. we contracted to protect over 17, 000 acres of land
( 50 % more than the land we use to operate ), thus achieving our

Total chunks: 349
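
The chunk count rises from 347 to 349, which suggests that a couple of the character-based chunks exceeded the 256-token window and were split again. The sketch below illustrates the windowing idea only. It assumes a whitespace tokenizer as a stand-in for the model's real tokenizer, and `split_by_token_window` is a hypothetical name, not a LangChain function.

```python
def split_by_token_window(text, tokens_per_chunk, chunk_overlap=0):
    """Cut `text` into windows of at most `tokens_per_chunk` tokens.

    Assumption: tokens are whitespace-separated words. The real splitter
    uses the embedding model's own tokenizer, so counts will differ.
    """
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")

    tokens = text.split()
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


windows = split_by_token_window("one two three four five six seven", tokens_per_chunk=3)
print(windows)
```

Text that already fits in one window comes back as a single chunk, which is why most of the character-based chunks pass through unchanged in count.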


In [8]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
output = embedding_function([token_split_texts[10]])
print(output)
print(f"len = {len(output[0])}")

[[0.04256264865398407, 0.033211804926395416, 0.030340073630213737, -0.034866590052843094, 0.06841649860143661, -0.08090909570455551, -0.015474398620426655, -0.0014509257161989808, -0.01674444042146206, 0.06770771741867065, -0.050541363656520844, -0.049195390194654465, 0.05139993876218796, 0.09192727506160736, -0.07177841663360596, 0.03951968625187874, -0.0128335477784276, -0.024947471916675568, -0.04622863233089447, -0.02435753308236599, 0.033949654549360275, 0.025502434000372887, 0.027317125350236893, -0.00412623630836606, -0.036338359117507935, 0.003690894693136215, -0.027430463582277298, 0.004796718247234821, -0.028896236792206764, -0.018870722502470016, 0.03666628897190094, 0.02569587342441082, 0.03131282329559326, -0.0639343410730362, 0.05394404008984566, 0.08225350826978683, -0.04175683856010437, -0.006995777599513531, -0.023486042395234108, -0.030747951939702034, -0.0029791633132845163, -0.07790937274694443, 0.009353100322186947, 0.00316286226734519, -0.02225702442228794, -0.018

In [9]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

349
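
With one embedding stored per chunk, a query can be answered by embedding the query text and ranking the stored vectors by closeness. The following brute-force sketch shows that ranking step on toy two-dimensional vectors. It is an illustration under stated assumptions: it uses cosine similarity and an exhaustive scan, whereas Chroma uses its own index and a configurable distance metric, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors_by_id, k):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors_by_id,
        key=lambda doc_id: cosine_similarity(query_vector, vectors_by_id[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for chunk embeddings (real ones have hundreds of dimensions).
toy_vectors = {"0": [1.0, 0.0], "1": [0.7, 0.7], "2": [0.0, 1.0]}
nearest = top_k([0.9, 0.1], toy_vectors, k=2)
print(nearest)
```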

In [23]:
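# `query` must be defined before this cell runs. The output below corresponds to this question.
# Alternative to try: query = "What was the total revenue?"
query = "How many people has Microsoft helped in unserved rural communities since 2017?"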

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(word_wrap(document))
    print('\n')

addressing the world ’ s most pressing issues. this year, we provided $
3. 2 billion in donated and discounted technology to 302, 000
nonprofits serving over 1. 2 billion people globally. and earlier this
month, we announced that microsoft will double the number of nonprofits
we reach worldwide over the next five years. protect fundamental rights
we unequivocally support the fundamental rights of people, from
defending democracy, to protecting human rights, to addressing racial
injustice and inequity. and, as people ’ s access to education,
healthcare, jobs, and other critical services becomes increasingly
dependent on technology, it ’ s clear that access to broadband and
accessible technology is also fundamental to building a more equitable
future. since 2017, we ’ ve helped more than 50 million people in
unserved rural communities globally gain access to affordable


7 220, 000 people who work at microsoft. essential to this is our
commitment to continually exercise our growth mindse

In [11]:
import os
import openai
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

In [19]:
def rag(query, retrieved_documents, model="gpt-4"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            # Adjacent string literals are concatenated, so the first one needs a trailing space.
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
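
Before spending an API call, it can help to look at exactly what the model will receive. The sketch below rebuilds the same message structure without calling OpenAI. `build_messages` is a hypothetical helper, and the system prompt and document strings are placeholders, not text from the report.

```python
def build_messages(query, retrieved_documents, system_prompt):
    """Assemble chat messages in the same shape the `rag` function uses."""
    # Retrieved chunks are joined with blank lines into one context string.
    information = "\n\n".join(retrieved_documents)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"},
    ]


preview = build_messages(
    query="What was the total revenue?",
    retrieved_documents=["first chunk", "second chunk"],  # placeholder chunks
    system_prompt="Answer using only the information provided.",  # placeholder prompt
)
for message in preview:
    print(f"[{message['role']}]\n{message['content']}\n")
```

Printing the preview makes it easy to confirm which chunks actually reached the prompt.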

In [22]:
query = "How many people has Microsoft helped in unserved rural communities since 2017?"
output = rag(query=query, retrieved_documents=retrieved_documents)

print(word_wrap(output,n_chars=100))

The information provided does not contain data on how many people Microsoft has helped in unserved
rural communities since 2017.
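
The cell numbers suggest this answer was generated while `retrieved_documents` still held the results of an earlier query, which may be why the model could not find the figure. One way to avoid that kind of mismatch is to retrieve and generate in a single call. The sketch below is a hypothetical wrapper (`answer_question` is not part of the notebook) that takes the retrieval and generation steps as arguments, so it can be tried here with stand-in functions.

```python
def answer_question(query, retrieve, generate, n_results=5):
    """Retrieve documents for `query`, then generate an answer from them.

    `retrieve(query, n_results)` returns a list of strings and
    `generate(query, documents)` returns the answer text.
    """
    documents = retrieve(query, n_results)
    return generate(query, documents)


# Stand-ins so the sketch runs without Chroma or an API key (assumptions).
def fake_retrieve(query, n_results):
    corpus = ["chunk A", "chunk B", "chunk C"]
    return corpus[:n_results]


def fake_generate(query, documents):
    return f"{len(documents)} documents used for: {query}"


answer = answer_question("example question", fake_retrieve, fake_generate, n_results=2)
print(answer)

# In the notebook, the real steps could be passed instead, for example:
#   retrieve = lambda q, n: chroma_collection.query(query_texts=[q], n_results=n)['documents'][0]
#   generate = rag
```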
