# Chroma Playground

This playground connects to a Chroma vector database built from five documents:
- [COCACOLA_2021_10K.pdf](../../data/financebench/COCACOLA_2021_10K.pdf)
- [PFIZER_2021_10K.pdf](../../data/financebench/PFIZER_2021_10K.pdf)
- [VERIZON_2022_10K.pdf](../../data/financebench/VERIZON_2022_10K.pdf)
- [PEPSICO_2021_10K.pdf](../../data/financebench/PEPSICO_2022_10K.pdf)
- [NETFLIX_2017_10K.pdf](../../data/financebench/NETFLIX_2017_10K.pdf)

In [1]:
import os
import sys

import pandas as pd
from dotenv import load_dotenv
from langchain.chat_models import AzureChatOpenAI
from langchain.vectorstores import Chroma

# Load environment variables from a local .env file
load_dotenv()

# Make the project's src/ directory importable
sys.path.append(os.path.abspath('../../src'))
from azure_openai_conn import OpenAIembeddings, llm

In [2]:
embeddings = OpenAIembeddings()
query = "What is the Coca Cola Balance Sheet?"
vectordb = Chroma(persist_directory="db_chroma", embedding_function=embeddings)
docs = vectordb.similarity_search(query)
print(docs[0].page_content)

THE COCA-COLA COMPANY AND SUBSIDIARIES
CONSOLIDATED BALANCE SHEETS
(In millions except par value)
December 31, 2021 2020
ASSETS
Current Assets   
Cash and cash equivalents $ 9,684 $ 6,795 
Short-term investments 1,242 1,771 
Total Cash, Cash Equivalents and Short-Term Investments 10,926 8,566 
Marketable securities 1,699 2,348 
Trade accounts receivable, less allowances of $516 and $526, respectively 3,512 3,144 
Inventories 3,414 3,266 
Prepaid expenses and other current assets 2,994 1,916 
Total Current Assets 22,545 19,240 
Equity method investments 17,598 19,273 
Other investments 818 812 
Other noncurrent assets 6,731 6,184 
Deferred income tax assets 2,129 2,460 
Property, plant and equipment — net 9,920 10,777 
Trademarks with indefinite lives 14,465 10,395 
Goodwill 19,363 17,506 
Other intangible assets 785 649 
Total Assets $ 94,354 $ 87,296 
LIABILITIES AND EQUITY
Current Liabilities   
Accounts payable and accrued expenses $ 14,619 $ 11,145 
Loans and notes payable 3,307 2,

In [3]:
docs[0].metadata

{'page': 63,
 'source': '../../data/financebench/COCACOLA_2021_10K.pdf',
 'start_index': 0}

### Addressing Diversity: Maximum marginal relevance


How can we enforce diversity in the search results and avoid near-duplicate chunks?

`Maximum marginal relevance` (MMR) strives to achieve both relevance to the query *and diversity* among the results.
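
To make the trade-off concrete, here is a minimal sketch of a greedy MMR selection loop in plain NumPy. It is an illustration under assumptions, not LangChain's or Chroma's implementation: the helper names `cosine` and `mmr_select`, the two-dimensional toy vectors, and the exact scoring form are chosen here for clarity.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance and diversity.

    lambda_mult=1.0 means pure relevance; lower values penalize documents
    that resemble the ones already selected.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy data (made up): documents 0 and 1 are near-duplicates, document 2 differs.
query_vec = np.array([1.0, 0.0])
doc_vecs = np.array([[1.0, 0.10], [1.0, 0.12], [0.6, -0.8]])

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # keeps 0, then prefers 2 over 1
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # pure relevance: 0, then 1
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection falls back to plain similarity ranking, which on this toy data returns the two near-duplicates.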

In [4]:
question = "what these companies say about market conditions?"
docs_ss = vectordb.similarity_search(question,k=3)

In [20]:
docs_ss[0].page_content

'markets, or achieve the return on capital we expect from our investments in these markets.\nChanges in economic conditions can adversely impact our business.\nMany of the jurisdictions in which our products are sold have experienced and could continue to experience uncertain or\nunfavorable economic conditions, such as recessions or economic slowdowns,\n16'

In [21]:
docs_ss[1].page_content

'Table of Contents\nbe unwilling or unable to increase our product prices or unable to effectively hedge against price increases to offset these\nincreased costs without suffering reduced volume, revenue, margins and operating results.\nPolitical and social conditions can adversely affect our business.\nPolitical and social conditions in the markets in which our products are sold have been and could continue to be difficult to\npredict, resulting in adverse effects on our business. The results of elections, referendums or other political conditions (including\ngovernment shutdowns or hostilities between countries) in these markets have in the past and could continue to impact how\nexisting laws, regulations and government programs or policies are implemented or result in uncertainty as to how such laws,\nregulations, programs or policies may change, including with respect to tariffs, sanctions, environmental and climate change\nregulations, taxes, benefit programs, the movement

In [22]:
docs_ss[2].page_content

'enhancements to our networks. \nAs we introduce new offerings and technologies, such as 5G technology, we must phase out outdated and unprofitable \ntechnologies and services. If we are unable to do so on a cost-effective basis, we could experience reduced profits. In addition, \nthere could be legal or regulatory restraints on our ability to phase out current services. \nAdverse conditions in the U.S. and international economies could impact our results of operations and \nfinancial condition. \nUnfavorable economic conditions, such as a recession or economic slowdown in the U.S. or elsewhere, or inflation in the \nmarkets in which we operate, could negatively affect the affordability of and demand for some of our products and services and \nour cost of doing business. In difficult economic conditions, consumers may seek to reduce discretionary spending by forgoing \npurchases of our products, electing to use fewer higher margin services, dropping down in price plans or obtaining low

Note the difference in results with `MMR`.

In [23]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [24]:
docs_mmr[0].page_content

'markets, or achieve the return on capital we expect from our investments in these markets.\nChanges in economic conditions can adversely impact our business.\nMany of the jurisdictions in which our products are sold have experienced and could continue to experience uncertain or\nunfavorable economic conditions, such as recessions or economic slowdowns,\n16'

In [25]:
docs_mmr[1].page_content

'FORWARD-LOOKING STATEMENTS\nThis report contains information that may constitute “forward-looking statements.” Generally, the words “believe,” “expect,” “intend,” “estimate,” “anticipate,” “project,”\n“will” and similar expressions identify forward-looking statements, which generally are not historical in nature. However, the absence of these words or similar expressions\ndoes not mean that a statement is not forward-looking. All statements that address operating performance, events or developments that we expect or anticipate will occur in the\nfuture — including statements relating to volume growth, share of sales and earnings per share growth, and statements expressing general views about future operating\nresults — are forward-looking statements. Management believes that these forward-looking statements are reasonable as and when made. However, caution should be taken not\nto place undue reliance on any such forward-looking statements because such statements speak only as of the d

### Addressing Specificity: working with metadata

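Similarity search alone can return chunks from any of the five filings, even when the question targets a single one. To address this, many vector stores support filtering on `metadata`.

`metadata` provides context for each embedded chunk, such as its source file and page.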
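
Conceptually, a metadata filter narrows the candidate chunks before (or while) they are ranked by similarity. The sketch below shows that idea with plain dictionaries. The chunk texts are made up, and `matches` and `filter_chunks` are hypothetical helpers for illustration, not part of Chroma or LangChain.

```python
# Toy chunks (made-up text) carrying the same metadata keys as the notebook's store
chunks = [
    {"text": "Net operating revenues grew.", "metadata": {"source": "COCACOLA_2021_10K.pdf", "page": 3}},
    {"text": "Streaming revenues increased.", "metadata": {"source": "NETFLIX_2017_10K.pdf", "page": 10}},
    {"text": "Revenue recognition policy.", "metadata": {"source": "COCACOLA_2021_10K.pdf", "page": 127}},
]


def matches(metadata, where):
    """True when every key in `where` has an equal value in `metadata`."""
    return all(metadata.get(key) == value for key, value in where.items())


def filter_chunks(chunks, where):
    """Keep only the chunks whose metadata satisfies the equality filter."""
    return [chunk for chunk in chunks if matches(chunk["metadata"], where)]


coca_cola_only = filter_chunks(chunks, {"source": "COCACOLA_2021_10K.pdf"})
for chunk in coca_cola_only:
    print(chunk["metadata"])
```

Only the surviving chunks would then compete in the similarity ranking, which is why a filtered search cannot return a passage from another filing.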

In [26]:
question = "what did they say about reveneu in the the Coca Cola 10-k report?"

In [27]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"../../data/financebench/COCACOLA_2021_10K.pdf"}
)

In [28]:
for d in docs:
    print(d.metadata)

{'page': 3, 'source': '../../data/financebench/COCACOLA_2021_10K.pdf', 'start_index': 1399}
{'page': 127, 'source': '../../data/financebench/COCACOLA_2021_10K.pdf', 'start_index': 0}
{'page': 137, 'source': '../../data/financebench/COCACOLA_2021_10K.pdf', 'start_index': 1372}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
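
The two outputs listed above can be pictured as a small structured object. The sketch below replaces the LLM with a naive keyword lookup purely to show the shape of the result. `StructuredQuery`, `KNOWN_SOURCES`, and `toy_self_query` are hypothetical names invented for this illustration, and a real self-query retriever relies on the LLM rather than string matching.

```python
from dataclasses import dataclass, field

# Hypothetical alias table mapping company names to the source paths used in this notebook
KNOWN_SOURCES = {
    "coca cola": "../../data/financebench/COCACOLA_2021_10K.pdf",
    "netflix": "../../data/financebench/NETFLIX_2017_10K.pdf",
}


@dataclass
class StructuredQuery:
    query: str                                   # text to embed for vector search
    filter: dict = field(default_factory=dict)   # metadata filter inferred from the question


def toy_self_query(question):
    """Naive stand-in for the LLM step: split a question into query text and filter."""
    lowered = question.lower()
    for alias, path in KNOWN_SOURCES.items():
        if alias in lowered:
            # Drop the company name from the search text and tidy the whitespace
            remaining = " ".join(lowered.replace(alias, "").split())
            return StructuredQuery(query=remaining, filter={"source": path})
    # No company recognized: search everything, no filter
    return StructuredQuery(query=question)


print(toy_self_query("What did Netflix say about inflation?"))
print(toy_self_query("What did they say about inflation?"))
```

The resulting `filter` dictionary has the same shape as the one passed by hand to `similarity_search` earlier, which is why no new database or index is needed.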

In [5]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [12]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The file the chunk is from, should be of '../../data/financebench/NETFLIX_2017_10K.pdf'",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page number within the source PDF",
        type="integer",
    ),
]

In [13]:
document_content_description = "10-K filings"
llm = AzureChatOpenAI(model_name="gtp35turbo-latest")
retriever = SelfQueryRetriever.from_llm(
    llm,     
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [14]:
question = "what did they say about inflation in the the report?"

In [15]:
docs = retriever.get_relevant_documents(question)

In [16]:
for d in docs:
    print(d.metadata)

{'page': 10, 'source': '../../data/financebench/NETFLIX_2017_10K.pdf', 'start_index': 4292}
{'page': 23, 'source': '../../data/financebench/NETFLIX_2017_10K.pdf', 'start_index': 0}
{'page': 25, 'source': '../../data/financebench/NETFLIX_2017_10K.pdf', 'start_index': 2868}
{'page': 6, 'source': '../../data/financebench/NETFLIX_2017_10K.pdf', 'start_index': 4313}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this: each retrieved document is reduced to the passages relevant to the query before it is passed on.
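
As a rough illustration of the idea, the sketch below keeps only the sentences of a document that mention a keyword from the question. It is a deliberately simple stand-in: the notebook uses an LLM for extraction, whereas `toy_compress`, the `STOPWORDS` set, and the sample text here are made up for this example.

```python
import re

# Small hand-picked stopword list (an assumption made for this toy example)
STOPWORDS = {"what", "did", "they", "say", "about", "the", "in", "a", "of"}


def toy_compress(text, question):
    """Return only the sentences of `text` that contain a keyword from `question`."""
    words = re.findall(r"[a-z]+", question.lower())
    keywords = {word for word in words if word not in STOPWORDS}
    # Split on whitespace that follows sentence-ending punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if any(k in s.lower() for k in keywords)]
    return " ".join(kept)


sample_text = (
    "We opened three new offices this year. "
    "Inflationary pressure raised our input costs. "
    "Our logo was redesigned."
)
compressed = toy_compress(sample_text, "what did they say about inflation?")
print(compressed)
```

Two of the three sentences are dropped, so a downstream LLM call would receive a shorter and more focused context.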

In [17]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [18]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [19]:
compressor = LLMChainExtractor.from_llm(llm)

In [20]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [21]:
question = "what did they say about inflation?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

In 2022, as a result of the inflationary environment in the U.S., we experienced increases in our direct costs... We expect the inflationary environment and these other pressures to continue into 2023.
----------------------------------------------------------------------------------------------------
Document 2:

the costs of raw materials, packaging materials, labor, energy, fuel, transportation and other inputs necessary for the production and distribution of our products have rapidly increased. We expect the inflationary pressures on input and other costs to continue to impact our business in 2022. Our attempts to offset these cost pressures, such as through price increases of some of our products, may not be successful. Higher product prices may result in reductions in
----------------------------------------------------------------------------------------------------
Document 3:

We experienced higher than anticipated commodity, packaging, and transportation costs du

### Combining various techniques

In [22]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [23]:
question = "what did they say about inflation?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

In 2022, as a result of the inflationary environment in the U.S., we experienced increases in our direct costs, including electricity and other energy-related costs for our network operations, and transportation and labor costs, as well as increased interest expenses related to rising interest rates. We believe that this inflationary environment and the resulting decline in real wages in the U.S. are altering consumer preferences and causing consumers to become more price conscious. These factors, along with impacts of the intense competition in our industries, resulted in increased costs and lower earnings per share during 2022, and caused us to lower our growth expectations and related financial guidance. We expect the inflationary environment and these other pressures to continue into 2023.
----------------------------------------------------------------------------------------------------
Document 2:

The consequences of these developments cannot be entirely predicted 