
Certification Challenge - RAG for Effective Information Retrieval

Task 1 - 
Business Use Case

Problem Definitin :

 10-K reports from different industry players contain valuable financial information,risk factors, market trends and strategic initiatives. Extracting all this information from reports is manual and labor intensive. Going through the document to find the exact information needed takes a long time.

User Base: 
Many fintecs specialize in providing advanced analytics and insights for investment management and financial planning. The company handles an extensive collection of 10-K reports from various industry players, which contain detailed information about financial performance, risk factors, market trends, and strategic initiatives. Despite the richness of these documents, financial analysts struggle with extracting actionable insights efficiently in a short span due to the manual and labor-intensive nature of the analysis. Going through the document to find the exact information needed at the moment takes too long. This bottleneck hampers the company's ability to deliver timely and accurate recommendations to its clients.  

Task 2 -

Proposed Solution :

The objective is to develop an advanced RAG-based system to streamline the extraction and analysis of key information from 10-K reports.

The project will involve testing the RAG system on a current business problem. The Financial analysts are asked to research major cloud and AI platforms such as Amazon AWS, Google Cloud, Microsoft Azure, Meta AI, and IBM Watson to determine the most effective platform for this application. The primary goals include improving the efficiency of data extraction. Once the project is deployed, the system will be tested by a financial analyst with the following questions. Accurate text retrieval for these questions will imply the project's success.

Questions:
1.Has the company made any significant acquisitions in the AI space, and how are these acquisitions being integrated into the company's strategy?
2.How much capital has been allocated towards AI research and development?
3.What initiatives has the company implemented to address ethical concerns surrounding AI, such as fairness, accountability, and privacy?
4.How does the company plan to differentiate itself in the AI space relative to competitors?
5.Each Question must be asked for each of the five companies.
By successfully developing this project, we aim to:
Improve the productivity of financial analysts by providing a competent tool.
Provide timely insights to improve client recommendations.


### Setup

In [1]:
!pip install -q openai==1.55.3 \
                tiktoken==0.6.0 \
                pypdf==4.0.1 \
                langchain==0.1.20 \
                langchain-community==0.0.38 \
                chromadb==0.4.22 \
                sentence-transformers==2.3.1

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/67.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m389.6/389.6 kB[0m [31m17.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m49.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m284.0/284.0 kB[0m [31m19.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m47.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m71.5 MB/s[0m eta [36m0:00:00

## Set Environment Variables

In [224]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [225]:
# Import the necessary Libraries
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_openai.embeddings import OpenAIEmbeddings



from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings
)
from langchain_community.vectorstores import Chroma


## Impementing RAG

## Prepare Data

Let's start by loading the dataset.

In [8]:
#Upload Dataset-10k.zip and unzip it dataset folder using -d option
!unzip Dataset-10k.zip -d dataset

Archive:  Dataset-10k.zip
replace dataset/IBM-10-k-2023.pdf? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


### Chunking

In [226]:
# Provide pdf_folder_location
pdf_folder_location = "dataset"

In [227]:
# Load the directory to pdf_loader
pdf_loader = PyPDFDirectoryLoader(pdf_folder_location)

Let's split the contents of the pdf into chunks of size 512 (as this is the max size allowed by the embedding model we have choosen. Leet's also have some overlap between the chunks. 16 token should give us 2 sentences of overlap.

In [228]:
# Create text_splitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=512,
    chunk_overlap=16
)

In [229]:
# Create chunks
report_chunks = pdf_loader.load_and_split(text_splitter)

In [230]:
# Check the total number of chunks
len(report_chunks)

908

In [231]:
# Check the first object in report_chunks and print it
report_chunks[500]

Document(metadata={'source': 'dataset/IBM-10-k-2023.pdf', 'page': 17, 'page_label': '18'}, page_content='the event of the death of a participant or in the event a participant is deemed by the company to be disabled and eligible for \nbenefits under the terms of the IBM Long-Term Disability Plan (or any successor plan or similar plan of another \nemployer), the participant’s estate, beneficiaries or representative, as the case may be, shall have the rights and duties of the \nparticipant under the applicable award agreement. In addition, unless the award agreement specifies otherwise, the \nCommittee may cancel, rescind, suspend, withhold or otherwise limit or restrict any unexpired, unpaid, or deferred award \nat any time if the participant is not in compliance with all applicable provisions of the awards agreement and the 2001 Plan. \nIn addition, awards may be cancelled if the participant engages in any conduct or act determined to be injurious, \ndetrimental or prejudicial to any in

In [232]:
print(report_chunks[0].metadata)


{'source': 'dataset/msft-10-k-2023.pdf', 'page': 1, 'page_label': '2'}


In [233]:
sources = set(chunk.metadata.get("source") for chunk in report_chunks)
print(sources)


{'dataset/IBM-10-k-2023.pdf', 'dataset/google-10-k-2023.pdf', 'dataset/Meta-10-k-2023.pdf', 'dataset/aws-10-k-2023.pdf', 'dataset/msft-10-k-2023.pdf'}


# Embeddings

In [234]:
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

In [235]:
embedding_dim = 1536

### Database Creation

In [236]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

In [237]:
client.create_collection(
    collection_name="reports_collection",
    vectors_config=VectorParams(size=embedding_dim, distance=Distance.COSINE),
)

True

In [249]:
collection_name = 'reports_collection'

In [250]:
# Create the vector Database
vector_store = QdrantVectorStore(
    client=client,
    collection_name="reports_collection",
    embedding=embedding_model,
)

In [155]:
texts = [doc.page_content for doc in report_chunks]

_ = vector_store.add_texts(texts=texts)


In [251]:
vector_store.add_texts(
    texts=[chunk.page_content for chunk in report_chunks],
    metadatas=[chunk.metadata for chunk in report_chunks]
)


['5d57c98b1bd64e14aee6bce54dc08c44',
 '2b393e450b754e98a2e398e17d8625c4',
 'bf4279197df44da1840864623cfb6240',
 'f6cdd94ed8a04f2a832bef36850f10cf',
 '5c817bcdc556428fb0e9f6c8b2f38e0a',
 '9daf298a4c2d4039ad55b327ff800a84',
 '6d3890e6f75542d7b6638e2535820003',
 '2d5ffc4496ae447a94ab1dee0db4f472',
 '8a2454e468cc45d4baf6ba8aa1e73438',
 '596f063e143143bb80b8ab1c8b8bce6a',
 '0841ab89d9304d6ab6ec1050f81d6b93',
 '3e9f227e64864941864867c7ad878ce6',
 '9e64cd8bd3ad488da41af9d82aa34bef',
 '0b2cf519ec414876a96f4aece1ba5083',
 'b16e31223e974d3d97899d29c8fcfa96',
 '8591f765bb1a442eaf6561ed3ccaa35c',
 '5f9242b6a6bb49679fc6cda56a314b97',
 'e65cbbb472c4413bad4ae2ec20ef96fc',
 '5e35c70d5b774280b9e1655e1867aaec',
 '0a024ae69e1941f9a7cef499b4d5c032',
 'edaf231e4c744d9e922a794ffc859f02',
 'adaa40512b284d9e9488d845e06a1037',
 'c1ccae31e7fc42d28b6c4fcee3fbe376',
 '650e18438aed4e3d8d872a9e486f37d5',
 'fffb766442784402a5ac4e3d8c71441c',
 'eb0334e27e2c46039e34801db30a1e54',
 '09924fa61c9549b88db211fe6add3f0c',
 

In [252]:
for i in range(9):
    print(report_chunks[i].metadata)


{'source': 'dataset/msft-10-k-2023.pdf', 'page': 1, 'page_label': '2'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 1, 'page_label': '2'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 1, 'page_label': '2'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 2, 'page_label': '3'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 2, 'page_label': '3'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 3, 'page_label': '4'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 3, 'page_label': '4'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 4, 'page_label': '5'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 4, 'page_label': '5'}


In [287]:
user_question = "Who are the executives"

In [254]:
docs = vector_store.similarity_search(user_question, k=3)
for doc in docs:
    print(doc.metadata)
print(docs)


{'source': 'dataset/msft-10-k-2023.pdf', 'page': 11, 'page_label': '12', '_id': 'fd94745efd6e42fea9461fa586bae6b0', '_collection_name': 'reports_collection'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 11, 'page_label': '12', '_id': '650e18438aed4e3d8d872a9e486f37d5', '_collection_name': 'reports_collection'}
{'source': 'dataset/msft-10-k-2023.pdf', 'page': 2, 'page_label': '3', '_id': '9bf2b6fe01884d1e8afeb225991ae544', '_collection_name': 'reports_collection'}
[Document(metadata={'source': 'dataset/msft-10-k-2023.pdf', 'page': 11, 'page_label': '12', '_id': 'fd94745efd6e42fea9461fa586bae6b0', '_collection_name': 'reports_collection'}, page_content='lower cost per unit than smaller ones; datacenters that coordinate and aggregate diverse c ustomer, geographic, and \napplication demand patterns, improving the utilization of computing, storage, and network resources; and multi -tenancy \nlocations that lower application maintenance labor costs.  \nThe Microsoft Cloud provides the be

In [257]:
from qdrant_client.models import Filter, FieldCondition, MatchValue

# Define the correct filter
qfilter = Filter(
    must=[
        FieldCondition(
            key="source",
            match=MatchValue(value="dataset/msft-10-k-2023.pdf")
        )
    ]
)

# Perform similarity search with the correct filter
docs = vector_store.similarity_search(
    query=user_question,
    k=3,
    filter=qfilter
)

# Print result to confirm
for doc in docs:
    print(doc.page_content[:100])  # preview content
    print(doc.metadata)


In [298]:
from qdrant_client.models import FieldCondition, MatchValue
from qdrant_client import models as qdrant_models  # Import the entire models submodule

print(f"Type of Filter: {type(qdrant_models.Filter)}")

qdrant_filter = qdrant_models.Filter(
    must=[
        qdrant_models.FieldCondition(
            key="source",
            match=qdrant_models.MatchValue(value="dataset/msft-10-k-2023.pdf".lower())
        )
    ]
)

docs = vector_store.similarity_search(user_question, k=3, filter=qdrant_filter)

Type of Filter: <class 'pydantic._internal._model_construction.ModelMetaclass'>


In [299]:
docs = vector_store.similarity_search(user_question, k=3)

In [296]:
print(docs)

[]


In [271]:
results = vector_store.similarity_search(user_question, k=3)


In [273]:
# Perform similarity search on the user_question
# You must add an extra parameter to the similarity search  function so that you can filter the response based on the 'source'  in the metadata of the doc
# The filter can be added as a parameter to the similarity search function
# This will allow you to retrieve chunks from a particular document
# Use the same format to filter your response based on the company.
results = vector_store.similarity_search(user_question,k=3,filter = filter)

In [274]:
print(results)

[]


## RAG Q&A

### Prompt Design

In [300]:
# Create a system message for the LLM
qna_system_message = """
You are an assistant to a Financial Analyst. Your task is to summarize and provide relevant information to the financial analyst's question based on the provided context.

User input will include the necessary context for you to answer their questions. This context will begin with the token: ###Context.
The context contains references to specific portions of documents relevant to the user's query, along with page number from the report.
The source for the context will begin with the token ###Page

When crafting your response:
1. Select only context relevant to answer the question.
2. Include the source links in your response.
3. User questions will begin with the token: ###Question.
4. If the question is irrelevant or if the context is empty - "Sorry, this is out of my knowledge base"

Please adhere to the following guidelines:
- Your response should only be about the question asked and nothing else.
- Answer only using the context provided.
- Do not mention anything about the context in your final answer."
- If the answer is not found in the context, it is very very important for you to respond with "Sorry, this is out of my knowledge base"
- If NO CONTEXT is provided, it is very important for you to respond with "Sorry, this is out of my knowledge base"

Here is an example of how to structure your response:

Answer:
[Answer]

Page:
[Page number]
"""

In [301]:
# Create a message template
qna_user_message_template = """
###Context
Here are some documents and their page number that are relevant to the question mentioned below.
{context}

###Question
{question}
"""

### Composing the response

In [277]:
# Create a variable company to store the source of the context so that you can filter the similarity search
company = "dataset/google-10-k-2023.pdf" # We shall change this programmatically later when we test on multiple queries for each of the company

In [302]:
# Print the retrieved docs, their source and the page number
# (page number can be accessed using doc.metadata['page'] )
for i, doc in enumerate(docs):
    print(f"Retrieved chunk {i+1}: \n")
    print(doc)
    print(doc.page_content.replace('\t', ' '))
    print("Source: ", doc.metadata['source'],"\n ")
    print("Page Number: ",doc.metadata['page'],"\n===================================================== \n")
    print('\n')

Retrieved chunk 1: 

page_content='82 DIRECTORS AND EXECUTIVE OFFICERS OF MICROSOFT CORPORATION  
  
DIRECTORS  
  
Satya Nadella  
Chairman and Chief Executive Officer,  
Microsoft Corporation   Sandra E. Peterson 2,3 
Lead Independent Director,  
Microsoft Corporation  John W. Stanton 1,4 
Founder and Chairman, Trilogy 
Partnerships  
      
Reid G. Hoffman 4 
Partner, Greylock Partners  Penny S. Pritzker 4 
Founder and Chairman, PSP Partners, 
LLC John W. Thompson 3,4 
Partner, Lightspeed Venture Partners  
      
Hugh F. Johnston 1 
Vice Chairman and Executive  Vice 
President and Chief Financial Officer, 
PepsiCo, Inc.  Carlos A. Rodriguez 1,2 
Executive Chair, ADP, Inc.  Emma N. Walmsley 2,4 
Chief Executive Officer, GSK plc  
      
Teri L. List 1,3 
Former Executive Vice President and 
Chief Financial Officer, The Ga p, Inc.  Charles W. Scharf 2,3 
Chief Executive Officer and President, 
Wells Fargo  & Company  Padmasree Warrior 2 
Founder, President and Chief Executive 
Office

In [303]:
# Create context for query by joining page_content and page number of the retrieved docs
relevant_document_chunks = results = vector_store.similarity_search(user_question, k=3)

context_list = [d.page_content + "\n ###Page: " + str(d.metadata['page']) + "\n\n " for d in relevant_document_chunks]

context_for_query = ". ".join(context_list)

print(context_for_query) # Print the whole context_for_query (after joining all the chunks. It should contain page number of every chunk)


82 DIRECTORS AND EXECUTIVE OFFICERS OF MICROSOFT CORPORATION  
  
DIRECTORS  
  
Satya Nadella  
Chairman and Chief Executive Officer,  
Microsoft Corporation   Sandra E. Peterson 2,3 
Lead Independent Director,  
Microsoft Corporation  John W. Stanton 1,4 
Founder and Chairman, Trilogy 
Partnerships  
      
Reid G. Hoffman 4 
Partner, Greylock Partners  Penny S. Pritzker 4 
Founder and Chairman, PSP Partners, 
LLC John W. Thompson 3,4 
Partner, Lightspeed Venture Partners  
      
Hugh F. Johnston 1 
Vice Chairman and Executive  Vice 
President and Chief Financial Officer, 
PepsiCo, Inc.  Carlos A. Rodriguez 1,2 
Executive Chair, ADP, Inc.  Emma N. Walmsley 2,4 
Chief Executive Officer, GSK plc  
      
Teri L. List 1,3 
Former Executive Vice President and 
Chief Financial Officer, The Ga p, Inc.  Charles W. Scharf 2,3 
Chief Executive Officer and President, 
Wells Fargo  & Company  Padmasree Warrior 2 
Founder, President and Chief Executive 
Officer, Fable Group, Inc.  
Board Commit

In [304]:
# Perform similarity search
relevant_document_chunks = vector_store.similarity_search(user_question, k=3)

# Create context list with page content and page number
context_list = [
    f"{doc.page_content}\n### Page: {doc.metadata.get('page', 'N/A')}\n"
    for doc in relevant_document_chunks
]

# Join all chunks into one context string
context_for_query = "\n".join(context_list)

# Print the full context
print(context_for_query)


82 DIRECTORS AND EXECUTIVE OFFICERS OF MICROSOFT CORPORATION  
  
DIRECTORS  
  
Satya Nadella  
Chairman and Chief Executive Officer,  
Microsoft Corporation   Sandra E. Peterson 2,3 
Lead Independent Director,  
Microsoft Corporation  John W. Stanton 1,4 
Founder and Chairman, Trilogy 
Partnerships  
      
Reid G. Hoffman 4 
Partner, Greylock Partners  Penny S. Pritzker 4 
Founder and Chairman, PSP Partners, 
LLC John W. Thompson 3,4 
Partner, Lightspeed Venture Partners  
      
Hugh F. Johnston 1 
Vice Chairman and Executive  Vice 
President and Chief Financial Officer, 
PepsiCo, Inc.  Carlos A. Rodriguez 1,2 
Executive Chair, ADP, Inc.  Emma N. Walmsley 2,4 
Chief Executive Officer, GSK plc  
      
Teri L. List 1,3 
Former Executive Vice President and 
Chief Financial Officer, The Ga p, Inc.  Charles W. Scharf 2,3 
Chief Executive Officer and President, 
Wells Fargo  & Company  Padmasree Warrior 2 
Founder, President and Chief Executive 
Officer, Fable Group, Inc.  
Board Commit

In [305]:
# Craft the messages to pass to chat.completions.create
prompt = [
    {'role':'system', 'content': qna_system_message},
    {'role': 'user', 'content': qna_user_message_template.format(
         context=context_for_query,
         question=user_question
        )
    }
]

In [306]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0.0, model="gpt-3.5-turbo")

response = llm.invoke(context_for_query)
print(response.content)


Signature Title Date
/S/    S UNDAR  PICHAIChief Executive Officer and Director (Principal 
Executive Officer) January 30, 2024
Sundar Pichai
/S/    R UTH M. P ORAT        President and Chief Investment Officer; Chief 
Financial Officer (Principal Financial Officer) January 30, 2024
Ruth M. Porat
/S/    A MIE THUENER  O'T OOLE        Vice President, Corporate Controller and 
Principal Accounting Officer January 30, 2024
Amie Thuener O'Toole
/S/    F RANCES H. A RNOLD        Director January 30, 2024
Frances H. Arnold
/S/    S ERGEY  BRIN        Co-Founder and Director January 30, 2024
Sergey Brin
/S/   R. M ARTIN  CHAVEZ       Director January 30, 2024
R. Martin Chávez
/S/    L. J OHN DOERR        Director January 30, 2024
L. John Doerr
/S/    R OGER  W. F ERGUSON JR.       Director January 30, 2024
Roger W. Ferguson Jr.
/S/    J OHN L. H ENNESSY         Director, Chair January 30, 2024
John L. Hennessy
/S/    L ARRY  PAGE        Co-Founder and Director January 30, 2024
Larry Page
/S/ 