# Lab 1 - Overview of embeddings-based retrieval

This notebook first shows how a document can be prepared and added to Chroma DB.  
It then defines a RAG method that uses an LLM (ChatGPT) to answer a question based on the results of querying the DB.

In [1]:
from pypdf import PdfReader

reader = PdfReader("microsoft_annual_report_2022.pdf")

pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

In [2]:
print(pdf_texts[0])

1 Dear shareholders, colleagues, customers, and partners:  
We are living through a period of historic economic, societal, and geopolitical change. The world in 2022 looks nothing like 
the world in 2019. As I write this, inflation is at a 40 -year high, supply chains are stretched, and the war in Ukraine is 
ongoing. At the same time, we are entering a technological era with the potential to power awesome advancements 
across every sector of our economy and society. As the world’s largest software company, this places us at a historic 
intersection of opportunity and responsibility to the world around us.  
Our mission to empower every person and every organization on the planet to achieve more has never been more 
urgent or more necessary. For all the uncertainty in the world, one thing is clear: People and organizations in every 
industry are increasingly looking to digital technology to overcome today’s challenges and emerge stronger. And no 
company is better positioned to help th

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [5]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(f"\nTotal chunks: {len(character_split_texts)}")


Total chunks: 347
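To make the splitting step more concrete, here is a minimal pure-Python sketch of the idea behind recursive character splitting: try the coarsest separator first, and fall back to finer separators only for pieces that are still too long. This is a simplified illustration, not LangChain's implementation. `recursive_split` is a hypothetical helper, and unlike the real splitter it drops the separators and does not merge small neighboring pieces back together up to `chunk_size`.

```python
def recursive_split(text, separators, chunk_size):
    """Split text so every piece is at most chunk_size characters.

    Hypothetical helper for illustration only; not LangChain's algorithm.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: cut into fixed-size windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        # Pieces that are still too long are split with the finer separators.
        pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces


sample = "First paragraph.\n\nSecond paragraph is a bit longer. It has two sentences."
chunks = recursive_split(sample, ["\n\n", "\n", ". ", " ", ""], chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits within the limit and is kept whole, while the second one is too long and gets split again at the sentence boundary.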


In [7]:
#character_split_texts

In [8]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)
   
print(token_split_texts[10])
print(f"\nTotal chunks: {len(token_split_texts)}")

increased, due in large part to significant global datacenter expansions and the growth in xbox sales and usage. despite these increases, we remain dedicated to achieving a net - zero future. we recognize that progress won ’ t always be linear, and the rate at which we can implement emissions reductions is dependent on many factors that can fluctuate over time. on the path to becoming water positive, we invested in 21 water replenishment projects that are expected to generate over 1. 3 million cubic meters of volumetric benefits in nine water basins around the world. progress toward our zero waste commitment included diverting more than 15, 200 metric tons of solid waste otherwise headed to landfills and incinerators, as well as launching new circular centers to increase reuse and reduce e - waste at our datacenters. we contracted to protect over 17, 000 acres of land ( 50 % more than the land we use to operate ), thus achieving our

Total chunks: 349
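The second pass re-splits the character chunks so that each one fits within a fixed token budget (256 in the cell above). Any chunk that exceeds the budget is cut into several pieces, which would explain why the count grows slightly (347 to 349). The windowing idea can be sketched as follows. Note the assumption: whitespace-separated words stand in for the model's subword tokens, so the counts will not match the real tokenizer. `split_by_tokens` is a hypothetical helper.

```python
def split_by_tokens(text, tokens_per_chunk, chunk_overlap=0):
    """Cut text into windows of at most tokens_per_chunk tokens.

    Assumption: whitespace-separated words stand in for the model's
    subword tokens, so counts will differ from the real tokenizer.
    """
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be >= 0 and < tokens_per_chunk")
    tokens = text.split()
    step = tokens_per_chunk - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(" ".join(tokens[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(tokens):
            # The current window already reaches the end of the text.
            break
    return windows


words = " ".join(f"w{i}" for i in range(10))
print(split_by_tokens(words, tokens_per_chunk=4))
print(split_by_tokens(words, tokens_per_chunk=4, chunk_overlap=1))
```

With `chunk_overlap=0`, as in the notebook, consecutive windows share no tokens. A positive overlap repeats the tail of one window at the start of the next.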


In [15]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

my_embedding_function = SentenceTransformerEmbeddingFunction()

In [17]:
print(my_embedding_function([token_split_texts[10]]))

[[0.042562685906887054, 0.03321181237697601, 0.03034008853137493, -0.034866560250520706, 0.0684165209531784, -0.08090909570455551, -0.015474434942007065, -0.0014508796157315373, -0.016744473949074745, 0.06770768761634827, -0.05054137110710144, -0.04919534549117088, 0.05139994993805885, 0.09192728251218796, -0.07177843898534775, 0.03951972723007202, -0.012833558954298496, -0.02494749426841736, -0.04622863978147507, -0.02435753308236599, 0.03394967317581177, 0.02550244703888893, 0.027317114174365997, -0.004126247484236956, -0.036338403820991516, 0.0036908735055476427, -0.02743045799434185, 0.004796713590621948, -0.02889619767665863, -0.01887074112892151, 0.036666300147771835, 0.025695864111185074, 0.03131284937262535, -0.0639343336224556, 0.053944047540426254, 0.08225346356630325, -0.04175683110952377, -0.0069957817904651165, -0.023486008867621422, -0.03074798732995987, -0.0029791586566716433, -0.07790941745042801, 0.009353111498057842, 0.0031628564465790987, -0.02225707285106182, -0.018
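Each chunk is now represented as a list of floats. These vectors are useful because texts with similar meaning tend to end up close to each other. One common way to measure that closeness is cosine similarity, sketched below on toy vectors. The three vectors are invented for illustration and are not output from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, invented for illustration; real embeddings
# from the notebook have many more dimensions.
revenue = [0.9, 0.1, 0.0]
sales = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(revenue, sales))    # close to 1: similar direction
print(cosine_similarity(revenue, weather))  # close to 0: nearly orthogonal
```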

In [None]:
chroma_client = chromadb.Client()

# Use this line if the collection already exists
#chroma_client.delete_collection('MicrosoftAnnualReport')

chroma_collection = chroma_client.create_collection(
    "MicrosoftAnnualReport", embedding_function=my_embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

# The .add method will embed the token_split_texts using the embedding_function specified above

chroma_collection.add(ids=ids, documents=token_split_texts)

chroma_collection.count()

349

In [31]:
my_query = "What was the worst thing that happened?"  # alternative: "What was the total revenue?"

results = chroma_collection.query(query_texts=[my_query], n_results=5)

# Under the hood, .query() embeds the query with the same embedding function used when adding the documents.
# Chroma then searches for the stored documents most similar to the query and returns the closest ones (5 here).

retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(document)
    print('\n')

vigorously, adverse outcomes that we estimate could reach approximately $ 600 million in aggregate beyond recorded amounts are reasonably possible. were unfavorable final outcomes to occur, there exists the possibility of a material adverse impact in our consolidated financial statements for the period in which the effects become reasonably estimable.


occurs shortly before the products are released to production. the amortization of these costs is included in cost of revenue over the estimated life of the products. legal and other contingencies the outcomes of legal proceedings and claims brought against us are subject to significant uncertainty. an estimated loss from a loss contingency such as a legal proceeding or claim is accrued by a charge to income if it is probable that an asset has been impaired or a liability has been incurred and the amount of the loss can be reasonably estimated. in determining whether a loss should be accrued we evaluate, among other factors, the degree 
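Conceptually, retrieval compares the query vector against every stored vector and keeps the closest ones. The sketch below shows that idea as a brute-force search. It rests on several assumptions: the 2-dimensional vectors are invented, squared Euclidean distance is chosen for simplicity, and `top_k` is a hypothetical helper. A vector database typically relies on an index instead of scanning every vector, so this is not how Chroma is implemented.

```python
def top_k(query_vector, stored, k):
    """Return the k (doc_id, distance) pairs closest to query_vector.

    stored maps a document id to its embedding. Squared Euclidean
    distance is an assumption made for this sketch.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scored = [(doc_id, squared_distance(query_vector, vec))
              for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-dimensional embeddings, invented for illustration.
toy_store = {
    "legal risk": [0.9, 0.1],
    "water projects": [0.1, 0.9],
    "litigation": [0.8, 0.2],
}
nearest = top_k([1.0, 0.0], toy_store, k=2)
print(nearest)
```

The two entries pointing in roughly the same direction as the query are returned, and the unrelated one is left out.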

In [37]:
import os
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv

# Load environment variables
_ = load_dotenv(find_dotenv())  # Read local .env file

# Initialize OpenAI client
openai_client = OpenAI()

In [38]:
def rag(query, retrieved_documents, model="gpt-4-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful expert financial research assistant. "
                "Your users are asking questions about information contained in an annual report. "
                "You will be shown the user's question and relevant excerpts from the report. "
                "Answer the user's question **using only this information**. "
                "If the information is insufficient, say so."
            ),
        },
        {
            "role": "user",
            "content": f"Question: {query}\n\nRelevant Information:\n{information}",
        },
    ]
  
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # Lower temperature for more deterministic responses
        max_tokens=1024,  # Adjust based on needs
    )

    return response.choices[0].message.content
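Before sending anything to the API, it can help to inspect the exact prompt the model will see. The sketch below isolates the message-building step so it can be run without an API key. `build_messages` is a hypothetical helper, and the `[n]` labels on the excerpts are an addition for this sketch that `rag()` above does not use.

```python
def build_messages(query, retrieved_documents, system_prompt):
    """Assemble chat messages in the same shape rag() sends to the model.

    Hypothetical helper; the [n] labels are an addition for this sketch.
    """
    numbered = [f"[{i}] {doc}" for i, doc in enumerate(retrieved_documents, start=1)]
    information = "\n\n".join(numbered)
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"Question: {query}\n\nRelevant Information:\n{information}",
        },
    ]


messages = build_messages(
    query="What was the total revenue?",
    retrieved_documents=["excerpt one", "excerpt two"],
    system_prompt="Answer using only the provided excerpts.",
)
print(messages[1]["content"])
```

Labeling the excerpts is a design choice: it makes it easier to ask the model which excerpt supports its answer, at the cost of a few extra tokens.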

In [None]:
output = rag(query=my_query, retrieved_documents=retrieved_documents)

print(output)

Based on the information provided, the worst thing that happened appears to be the potential for adverse outcomes from legal proceedings and claims, which could reach approximately $600 million in aggregate beyond recorded amounts. This situation is described as having the possibility of a material adverse impact on the consolidated financial statements if unfavorable outcomes occur and the effects become reasonably estimable. This represents a significant financial risk and uncertainty for the company.
