# Qdrant Playground 

This playground connects to a Qdrant vector database built with 5 documents:
- [ULTABEAUTY_2023Q4_EARNINGS.pdf](../../data/financebench/ULTABEAUTY_2023Q4_EARNINGS.pdf)   
- [COCACOLA_2022_10K.pdf](../../data/financebench/COCACOLA_2022_10K.pdf)  
- [GENERALMILLS_2022_10K.pdf](../../data/financebench/GENERALMILLS_2022_10K.pdf)   
- [JPMORGAN_2022_10K.pdf](../../data/financebench/JPMORGAN_2022_10K.pdf)  
- [AMCOR_2022_8K_dated-2022-07-01.pdf](../../data/financebench/AMCOR_2022_8K_dated-2022-07-01.pdf)  


In [1]:
import pandas as pd
import os
import requests
import sys
import time
from datasets import load_dataset, DatasetDict
from dotenv import load_dotenv
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader, DirectoryLoader, UnstructuredURLLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import CTransformers
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant


# Load OpenAI access and other custom paths
sys.path.append(os.path.abspath('../../src'))
from azure_openai_conn import OpenAIembeddings, qdrant_load_by_chunks

# Load environment variables
load_dotenv()

if os.path.isfile('../../data/financebench_sample_150.csv'):
    df = pd.read_csv('../../data/financebench_sample_150.csv')
else:    
    ds = load_dataset("PatronusAI/financebench")
    df = pd.DataFrame(ds)
    all_dicts = []
    for index, row in df.iterrows():    
        dictionary = row['train']    
        all_dicts.append(dictionary)
    df = pd.DataFrame(all_dicts)


destination_folder = '../../data/financebench'

if not os.path.exists(destination_folder):

    os.makedirs(destination_folder)

    for index, row in df.iterrows():
        url = row['doc_link']
        doc_name = row['doc_name']
        doc_name_with_extension = doc_name + '.pdf'        
        file_path = os.path.join(destination_folder, doc_name_with_extension)
        response = requests.get(url)
        if response.status_code == 200:            
            with open(file_path, 'wb') as file:
                file.write(response.content)
            print(f"Downloaded: {doc_name_with_extension}")
        else:
            print(f"Failed to download: {doc_name_with_extension} ({url})")


pdf_folder_path = destination_folder
documents = []
for file in os.listdir(pdf_folder_path)[:5]:
    print(file)
    if file.endswith('.pdf'):
        pdf_path = os.path.join(pdf_folder_path, file)
        loader = PyPDFLoader(pdf_path)
        documents.extend(loader.load())


# Load embeddings (these calls frequently exceed the API rate limit)
embeddings = OpenAIembeddings()
# Chunk size and overlap
chunk_size = 1500
overlap = 100
# Splitter
# TODO: smarter splitter
# https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/scripts/prepdocslib/textsplitter.py
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap, add_start_index=True)
# Generate splits
chunked_documents = text_splitter.split_documents(documents)
# Initialize Qdrant database
vectordb = Qdrant.from_documents(
    documents=chunked_documents,
    embedding=embeddings, 
    location=":memory:", 
    collection_name="financebench")

COCACOLA_2021_10K.pdf
PFIZER_2021_10K.pdf
VERIZON_2022_10K.pdf
PEPSICO_2021_10K.pdf
NETFLIX_2017_10K.pdf


In [2]:
query = "What is the Coca Cola Balance Sheet?"
docs = vectordb.similarity_search(query)
print(docs[0].page_content)

THE COCA-COLA COMPANY AND SUBSIDIARIES
CONSOLIDATED BALANCE SHEETS
(In millions except par value)
December 31, 2021 2020
ASSETS
Current Assets   
Cash and cash equivalents $ 9,684 $ 6,795 
Short-term investments 1,242 1,771 
Total Cash, Cash Equivalents and Short-Term Investments 10,926 8,566 
Marketable securities 1,699 2,348 
Trade accounts receivable, less allowances of $516 and $526, respectively 3,512 3,144 
Inventories 3,414 3,266 
Prepaid expenses and other current assets 2,994 1,916 
Total Current Assets 22,545 19,240 
Equity method investments 17,598 19,273 
Other investments 818 812 
Other noncurrent assets 6,731 6,184 
Deferred income tax assets 2,129 2,460 
Property, plant and equipment — net 9,920 10,777 
Trademarks with indefinite lives 14,465 10,395 
Goodwill 19,363 17,506 
Other intangible assets 785 649 
Total Assets $ 94,354 $ 87,296 
LIABILITIES AND EQUITY
Current Liabilities   
Accounts payable and accrued expenses $ 14,619 $ 11,145 
Loans and notes payable 3,307 2,

In [3]:
print(docs[0].metadata)

{'source': '../../data/financebench/COCACOLA_2021_10K.pdf', 'page': 63, 'start_index': 0}


### Addressing Diversity: Maximum marginal relevance


A plain similarity search can return several near-duplicate chunks. This section shows how to enforce diversity in the search results and avoid repetition.
 
`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.
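
To make the trade-off concrete, here is a minimal sketch of a greedy MMR selection over toy 2-D vectors. It is not the code Qdrant or LangChain runs: `mmr_select`, `cosine_similarity`, the toy vectors, and the weighting value are all assumptions chosen for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy.

    lambda_mult close to 1 favors relevance to the query;
    lambda_mult close to 0 favors diversity among the selected documents.
    """
    query_sims = [cosine_similarity(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy data (hypothetical): documents 0 and 1 are near-duplicates, document 2 differs.
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

by_similarity = sorted(
    range(len(toy_docs)),
    key=lambda i: -cosine_similarity(toy_query, toy_docs[i]),
)[:2]
by_mmr = mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.3)
print(by_similarity, by_mmr)
```

In this toy setup, plain similarity ranking keeps both near-duplicates, while the MMR selection swaps the second one for the less similar but more distinct document.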

In [4]:
question = "what these companies say about market conditions?"
docs_ss = vectordb.similarity_search(question,k=3)

In [5]:
docs_ss[0].page_content

'markets, or achieve the return on capital we expect from our investments in these markets.\nChanges in economic conditions can adversely impact our business.\nMany of the jurisdictions in which our products are sold have experienced and could continue to experience uncertain or\nunfavorable economic conditions, such as recessions or economic slowdowns,\n16'

In [6]:
docs_ss[1].page_content

'Table of Contents\nbe unwilli ng or unable to increase our product prices or unable to effective ly hedge against price increases to offset these\nincreased costs without suf fering reduced volume, revenue, mar gins and operating results.\nPolitical and social conditions can adversely af fect our business.\nPolitical and social conditions in the markets in which our products are sold have been and could continue to be difficult to\npredict, resulting in adverse effects on our business. The results of elections, referendums or other political conditions (including\ngovernment shutdow ns or hostilities between countries) in these markets have in the past and could continue to impact how\nexisting laws, regulations and government programs or policies are implemen ted or result in uncertainty as to how such laws,\nregulations, program s or policies may change, including with respect to tariffs, sanctions, environmental and climate change\nregulations, taxes, benefit programs, the movement

In [7]:
docs_ss[2].page_content

'enhancements to our networks. \nAs we introduce new offerings and technologies, such as 5G technology, we must phase out outdated and unprofitable \ntechnologies and services. If we are unable to do so on a cost-effective basis, we could experience reduced profits. In addition, \nthere could be legal or regulatory restraints on our ability to phase out current services. \nAdverse conditions in the U.S. and international economies could impact our results of operations and \nfinancial condition. \nUnfavorable economic conditions, such as a recession or economic slowdown in the U.S. or elsewhere, or inflation in the \nmarkets in which we operate, could negatively affect the affordability of and demand for some of our products and services and \nour cost of doing business. In difficult economic conditions, consumers may seek to reduce discretionary spending by forgoing \npurchases of our products, electing to use fewer higher margin services, dropping down in price plans or obtaining low

Note the difference in results with `MMR`.

In [8]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [9]:
docs_mmr[0].page_content

'markets, or achieve the return on capital we expect from our investments in these markets.\nChanges in economic conditions can adversely impact our business.\nMany of the jurisdictions in which our products are sold have experienced and could continue to experience uncertain or\nunfavorable economic conditions, such as recessions or economic slowdowns,\n16'

In [10]:
docs_mmr[1].page_content

'the U.S. are altering consumer preferences, and causing certain consumers to become more price conscious. We expect the \ninflationary environment and the resulting pressures to continue in 2023. For a discussion of the risks relating to unfavorable \neconomic conditions and inflation to our business, refer to Item 1A Risk Factors. \n2023 Connection Trends \nIn our Consumer segment, we expect to continue to attract new customers and maintain high-quality retail postpaid customers, \ncapitalizing on demand for data services and providing our customers new ways of using wireless services in their daily lives. We \nexpect that future connection growth will be driven by FWA, as well as smartphones, tablets and other connected devices such \nas wearables. We believe the combination of our wireless network performance and service offerings provides a superior \nVerizon 2022 Annual Report on Form 10-K                                36'

### Addressing Specificity: working with metadata

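A question may target one specific document, yet a plain similarity search can return chunks from any of them. To address this, many vector stores support filtering on `metadata`.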

`metadata` provides context for each embedded chunk.
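
As a rough illustration of what a metadata filter does, the sketch below keeps only the chunks whose metadata matches every key in the filter and then ranks the survivors by score. The chunk records, file names, and scores are hypothetical toy data, and `filtered_top_k` is an illustrative helper, not part of any library.

```python
def filtered_top_k(chunks, metadata_filter, k=3):
    """Keep chunks whose metadata matches every filter entry, then rank by score."""
    matching = [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return sorted(matching, key=lambda chunk: chunk["score"], reverse=True)[:k]


# Hypothetical chunks: "score" stands in for the similarity to the query.
toy_chunks = [
    {"text": "chunk A1", "score": 0.91, "metadata": {"source": "docs/a.pdf", "page": 3}},
    {"text": "chunk B1", "score": 0.85, "metadata": {"source": "docs/b.pdf", "page": 10}},
    {"text": "chunk B2", "score": 0.88, "metadata": {"source": "docs/b.pdf", "page": 42}},
    {"text": "chunk A2", "score": 0.40, "metadata": {"source": "docs/a.pdf", "page": 7}},
]

results = filtered_top_k(toy_chunks, {"source": "docs/b.pdf"}, k=3)
for chunk in results:
    print(chunk["text"], chunk["metadata"])
```

Without the filter, the highest-scoring chunk would come from `docs/a.pdf`; with it, only chunks from the requested source are considered.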

In [12]:
question = "what did they say about reveneu in the the Netflix 10-k report?"

In [13]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"../../data/financebench/NETFLIX_2017_10K.pdf"}
)

In [14]:
for d in docs:
    print(d.metadata)

{'source': '../../data/financebench/NETFLIX_2017_10K.pdf', 'page': 69, 'start_index': 1391}
{'source': '../../data/financebench/NETFLIX_2017_10K.pdf', 'page': 2, 'start_index': 1362}
{'source': '../../data/financebench/NETFLIX_2017_10K.pdf', 'page': 40, 'start_index': 0}


### Addressing Specificity: working with metadata using self-query retriever

Writing the filter by hand does not scale, which raises an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
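
The sketch below only illustrates the shape of those two outputs. The LLM call is replaced by a hypothetical keyword lookup, so `StructuredQuery`, `toy_self_query`, and `KNOWN_SOURCES` are assumptions made for this example and say nothing about how `SelfQueryRetriever` is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    """The two pieces a self-query step produces."""
    query: str                                   # text used for the vector search
    filter: dict = field(default_factory=dict)   # metadata filter, empty if none inferred


# Hypothetical mapping from a company keyword to a source path in the collection.
KNOWN_SOURCES = {
    "netflix": "../../data/financebench/NETFLIX_2017_10K.pdf",
    "verizon": "../../data/financebench/VERIZON_2022_10K.pdf",
}


def toy_self_query(question, known_sources):
    """Stand-in for the LLM: infer a source filter from keywords in the question."""
    lowered = question.lower()
    for keyword, source in known_sources.items():
        if keyword in lowered:
            return StructuredQuery(query=question, filter={"source": source})
    return StructuredQuery(query=question)


with_filter = toy_self_query("what did they say about revenue in the Netflix 10-K report?", KNOWN_SOURCES)
without_filter = toy_self_query("what did they say about revenue in the report?", KNOWN_SOURCES)
print(with_filter.filter)
print(without_filter.filter)
```

When no document can be inferred from the question, the filter stays empty and the search runs over the whole collection.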

In [15]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [40]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The file the chunk is from, given as a path such as '../../data/financebench/PFIZER_2021_10K.pdf'",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page of the PDF document the chunk is from",
        type="integer",
    ),
]

In [42]:
from langchain.chat_models import AzureChatOpenAI
document_content_description = "10-K filings"
llm = AzureChatOpenAI(model_name="gtp35turbo-latest")
retriever = SelfQueryRetriever.from_llm(
    llm,     
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [43]:
question = "what did they say about revenue in the the report?"

In [44]:
docs = retriever.get_relevant_documents(question)

In [45]:
docs[0].metadata

{'source': '../../data/financebench/VERIZON_2022_10K.pdf',
 'page': 66,
 'start_index': 1325}

In [47]:
for d in docs:
    print(d.metadata)

{'source': '../../data/financebench/VERIZON_2022_10K.pdf', 'page': 66, 'start_index': 1325}
{'source': '../../data/financebench/VERIZON_2022_10K.pdf', 'page': 60, 'start_index': 5001}
{'source': '../../data/financebench/VERIZON_2022_10K.pdf', 'page': 58, 'start_index': 4635}
{'source': '../../data/financebench/PFIZER_2021_10K.pdf', 'page': 39, 'start_index': 2644}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. 
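
To show the idea without an LLM, here is a toy sketch that keeps only the sentences of a chunk that share a keyword with the question. The keyword overlap is a deliberately crude stand-in for the LLM-based extraction; `toy_compress`, `STOPWORDS`, and the sample text are hypothetical.

```python
import re

# Hypothetical stopword list, just large enough for the example question.
STOPWORDS = {"what", "did", "they", "say", "about", "the", "a", "in"}


def toy_compress(page_content, question):
    """Return only the sentences that share at least one keyword with the question."""
    terms = {w for w in re.findall(r"[a-z]+", question.lower()) if w not in STOPWORDS}
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", page_content.strip())
    kept = [s for s in sentences if terms & set(re.findall(r"[a-z]+", s.lower()))]
    return " ".join(kept)


toy_chunk = "The company opened two offices. Inflation raised our input costs. We hired new staff."
compressed = toy_compress(toy_chunk, "what did they say about inflation?")
print(compressed)
```

Only the compressed text would be passed on to the LLM that answers the question, which is where the savings in tokens come from.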

In [33]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [34]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [35]:
compressor = LLMChainExtractor.from_llm(llm)

In [36]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [37]:
question = "what did they say about inflation?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

In 2022, as a result of the inflationary environment in the U.S., we experienced increases in our direct costs, including electricity and other energy-related costs for our network operations, and transportation and labor costs, as well as increased interest expenses related to rising interest rates. We believe that this inflationary environment and the resulting decline in real wages in the U.S. are altering consumer preferences and causing consumers to become more price conscious. These factors, along with impacts of the intense competition in our industries, resulted in increased costs and lower earnings per share during 2022, and caused us to lower our growth expectations and related financial guidance. We expect the inflationary environment and these other pressures to continue into 2023.
----------------------------------------------------------------------------------------------------
Document 2:

The costs of raw materials, packaging materials, labor, energy, fuel

### Combining various techniques

In [38]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [39]:
question = "what did they say about inflation?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

In 2022, as a result of the inflationary environment in the U.S., we experienced increases in our direct costs, including electricity and other energy-related costs for our network operations, and transportation and labor costs, as well as increased interest expenses related to rising interest rates. We believe that this inflationary environment and the resulting decline in real wages in the U.S. are altering consumer preferences and causing consumers to become more price conscious. These factors, along with impacts of the intense competition in our industries, resulted in increased costs and lower earnings per share during 2022, and caused us to lower our growth expectations and related financial guidance. We expect the inflationary environment and these other pressures to continue into 2023.
----------------------------------------------------------------------------------------------------
Document 2:

The following levels of inflation protection may be provided to any 