# Investment Analyst Assistant Retrieval Notebook

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

Copyright 2024 Amazon Web Services, Inc.

##### This notebook generates metadata from the question. The metadata is used to retrieve specific chunks from the OpenSearch index, which are then sent to the LLM to generate an answer to the question. This notebook is configured to work with OpenAI GPT models.
#### If you do not have an OpenAI API key or access, you can use the Mistral AI models available on Amazon Bedrock.

### STEP 0:  Reset And Install missing Packages

NOTE: Warnings and, in some cases, version errors during package installation can be ignored. They are caused by version updates. Only change versions if necessary.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
#!python -m pip install --upgrade pip --quiet
!pip install requests_toolbelt --quiet
# !pip install transformers --quiet  #install only if necessary
# !pip install llama_index --quiet   #install only if necessary
!pip install requests_aws4auth --quiet
!pip install openai --quiet
!pip install --upgrade openai --quiet
!pip install "pydantic>=1.8.2,<1.10.0" --quiet

!pip install langchain==0.1.9 --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyter-scheduler 2.7.1 requires pydantic<3,>=1.10, but you have pydantic 1.9.2 which is incompatible.
langchain 0.1.9 requires langchain-core<0.2,>=0.1.26, but you have langchain-core 0.2.41 which is incompatible.
langchain-text-splitters 0.3.0 requires langchain-core<0.4.0,>=0.3.0, but you have langchain-core 0.2.41 which is incompatible.
ragas 0.0.0 requires boto3==1.33.9, but you have boto3 1.34.162 which is incompatible.
ragas 0.0.0 requires botocore==1.33.9, but you have botocore 1.34.162 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-text-splitters 0.3.0 requires langchain-core<0.4.0,>=0.3.0, but you have langchain-core 0.1.52

#### Adding Project Directory to Path

In [3]:
import sys
import os
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))  # Adjust this path as needed
sys.path.append(project_root)

### STEP 1: Import Necessary Modules

In [4]:
from libraries.iaa.experiments.utils_exp import get_titan_text_embedding, results_fusion
from libraries.iaa.query_transformation.ExtractQueryMetadata import ExtractQueryMetadata
from libraries.iaa.reranker.SearchRanker import SearchRanker
from libraries.iaa.retrieval.OpenSearchRetrieval import OpenSearchRetrieval
import time
from dataclasses import dataclass, field
from libraries.iaa.query_transformation.QueryMetaExtractor import QueryMetaExtractor
import json
import boto3
import re
from bs4 import BeautifulSoup
from typing import List
from openai import OpenAI, AzureOpenAI
import os
import requests
import csv
import libraries.iaa.configs as configs

### STEP 2: Notebook Configuration

#### Please choose the LLM from the options provided.
#### To experiment with GPT models, add your API key in the field indicated.
#### If you do NOT have an OpenAI key, use 'mistral.mistral-large-2402-v1:0' in 'us-west-2' (default for this workshop).

In [5]:
exp_config = {
    'type_rets': 'fusion', #fusion (text and embb) is selected for this workshop because it generated the best results
    'opensearch_host': configs.OPEN_SEARCH_HOST, # copy the url created in the index creation notebook
    'generation_llm_model': 'mistral.mistral-large-2402-v1:0', #mistral.mistral-small-2402-v1:0  or gpt-4-turbo or gpt-3.5-turbo or gpt-XX-turbo or mistral.mistral-large-2402-v1:0 (for us-west-2)
    #'OPENAI_API_KEY': '----***----', #Use your OpenAI API Key
    #'AZURE_OPENAI_API_KEY': '----***----', #Use your Azure OpenAI API Key
    'index_name_embb': 'expt_index', #name of the index created in the index creation notebook
    'index_name_text': 'expt_index', #name of the index created in the index creation notebook
    'top_k_ranking': 10,
    'top_k_retrieval': 20,
    'emb_name': 'vector_field',
    'region': configs.REGION
            }
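
Before running the remaining cells, it can help to confirm that the configuration contains everything the chosen model needs. The following is a minimal sketch, not part of the workshop libraries: `missing_config_keys` and `REQUIRED_KEYS` are hypothetical names, and the sample values are placeholders.

```python
# Hypothetical helper: the key names mirror exp_config above.
REQUIRED_KEYS = [
    "opensearch_host", "generation_llm_model", "index_name_embb",
    "index_name_text", "top_k_ranking", "top_k_retrieval", "emb_name", "region",
]


def missing_config_keys(config, use_azure=False):
    """Return the required keys that are absent or empty in config."""
    required = list(REQUIRED_KEYS)
    # GPT models also need an API key; which one depends on the Azure switch.
    if "gpt" in str(config.get("generation_llm_model", "")):
        required.append("AZURE_OPENAI_API_KEY" if use_azure else "OPENAI_API_KEY")
    return [key for key in required if config.get(key) in (None, "")]


# Placeholder values for illustration only.
sample_config = {
    "opensearch_host": "https://example-opensearch-host",
    "generation_llm_model": "gpt-4-turbo",
    "index_name_embb": "expt_index",
    "index_name_text": "expt_index",
    "top_k_ranking": 10,
    "top_k_retrieval": 20,
    "emb_name": "vector_field",
    "region": "us-west-2",
}
print(missing_config_keys(sample_config))  # the OpenAI key is not set in this sample
```

An empty list means the configuration is complete for the selected model.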


#### Azure OpenAI Configuration

Set `invoke_azure_openai` to `True` to use Azure OpenAI instead of OpenAI for GPT models. The default is `False`, which uses OpenAI.

In [6]:
invoke_azure_openai = False

if invoke_azure_openai == True:
    # Download the certificate
    !curl -o ca-bundle-full.crt https://link-to-certificate
    # Set the SSL certificate file environment variable
    os.environ['SSL_CERT_FILE'] = 'ca-bundle-full.crt' # Add path to the certificate file

#### Input Question

In [7]:
question = "What was the revenue for 3M in 2022?"

### STEP 3 : Rephrase And Answer Generation

#### Prompt to Extract Metadata from the Question

In [8]:
PROMPT_METADATA_GENERATION_time = """
You are a financial editor responsible for rephrasing user questions accurately for better search and retrieval tasks related to yearly and quarterly financial reports. The current year is {most_recent_year}, and the current quarter is {most_recent_quarter}.
Task: Given a user question:{query}, identify the following metadata as per the instructions below:
1. time_keyword_type: Identifies what type of time range the user is requesting - a range of years, a range of quarters, specific years, specific quarters, or none.
2. time_keywords: If time_keyword_type is "range of periods," these keywords expand the year or quarter period. Otherwise, it will be the formatted version of the year in YYYY format or the quarter in Q'YY format.
Instructions:
1. Identify whether the user is asking for a date range or specific set of years or quarters. If there is no year or quarter mentioned, leave time_keyword blank.
2. If the user is requesting specific years, return the year(s) in YYYY format.
3. If the user is requesting specific quarters, return the quarter(s) in Q'YY format (e.g., Q2'24, Q1'23).
4. If the user is requesting documents within a specific time range between two periods, fill in the year or quarter information between the time ranges.
5. If the user is requesting the last N years, count backward from the current year, {most_recent_year}.
6. If the user is requesting the last N quarters, count backward from the current quarter and year, {most_recent_quarter}.
Examples:
what was Google's net profit?
time_keyword_type: none
time_keywords: none
explanation: no quarter or year mentioned
What was Amazon's total sales in 2022?
time_keyword_type: specific_year
time_keywords: 2022
What was Apple's revenue in 2019 compared to 2018?
time_keyword_type: specific_year
time_keywords: 2018, 2019
explanation: the user is requesting to compare 2 different years
Which of Disney's business segments had the highest growth in sales in Q4 F2023?
time_keyword_type: specific_quarter
time_keywords: Q4 2023
How did Netflix's quarterly spending on research change as a percentage of quarterly revenue change between Q2 2019 and Q4 2019?
time_keyword_type: range_quarter
time_keywords: Q2 2019, Q3 2019, Q4 2019
explanation: the quarters between Q2 2019 and Q4 2019 are Q2 2019, Q3 2019 and Q4 2019
What was Spotify's growth in the last 5 quarters?
time_keyword_type: range_quarter
time_keywords: Q4 2023, Q3 2023, Q2 2023, Q1 2023, Q4 2022
explanation: Since the current quarter is Q1 2024, the last 5 quarters are Q4 2023, Q3 2023, Q2 2023, Q1 2023, and Q4 2022.
In their 10-K filings, has Norwegian Cruise mentioned any negative environmental or weather-related impacts to their business in the last four years?
time_keyword_type: range_year
time_keywords: 2020, 2021, 2022, 2023
explanation: Since the current year is {most_recent_year}, the last four years are 2020, 2021, 2022, and 2023.
Return a JSON object with the following fields:
- 'time_keyword_type': The type of time range the user is requesting.
- 'time_keywords': The specific time-related keywords identified in the user's question.
- 'explanation': An explanation of why you chose a certain time_keyword_type and time_keywords.
"""


In [9]:
PROMPT_METADATA_TECHNICAL_KWD = """

Imagine you are a financial analyst looking to answer the question: {query}
Your task is to generate a list of 5-6 important keywords that you would use for searching relevant sections in companies 10-K and 10-Q documents to find information related to the given question.
Instructions:
1. Do not include company names, document names, or timelines in the keywords.
2. Generate a list of 5-6 comma-separated keywords.
3. Focus on identifying the sections of the documents you would look at, and include those section names or topics in the keywords.
4. Do not add keywords that are not part of or directly related to the given question.
Your response should be a comma-separated list of keywords without any additional formatting or tags.
For example, if the question is 'What was Google's net profit?', a possible response could be:
net profit, income statement, revenues, expenses, earnings 
"""

In [10]:
PROMPT_METADATA_AND_QUERY_ReWr = """
Human: You are a financial editor that looks at a user question and rephrases it accurately for better search and retrieval tasks. The question is related to yearly and quarterly financial reports.
Task: Given a user question, identify the following metadata: {query}
Technical Keywords: Provide a list of relevant keywords extracted from the question that are typically found in financial documents.
Generate a comprehensive list of all possible keywords relevant to financial sections.
Include different alternatives and variations of these keywords.
Exclude company names and document titles from this list.
Company Keywords: Extract a list of company names mentioned in the question.
Rephrased Question: Rephrase the question to make it clearer.
Expand any acronyms or abbreviations found in the original question, including both the abbreviated and expanded versions.
Make the rephrased question as clear as possible.
Output: Return a JSON object with the following fields:
'technical_keywords': A list of relevant keywords from the question.
'company_keywords': A list of company names mentioned in the question.
'rephrased_question': The fully rephrased question.
"""

#### Metadata Generation Using prompts and OpenAI GPT llm

In [11]:
def get_azure_openai_response(prompt, model_id, token_count=512, temperature=0, topP=1):
    client = AzureOpenAI(
                api_version="2024-02-01",
                azure_endpoint="https://your-azure-openai-endpoint",
                api_key=exp_config['AZURE_OPENAI_API_KEY']
                )
    gpt_assistant_prompt = "You are a financial analyst that looks at a user question, related time and technical keywords and potentially relevant context."
    message=[{"role": "system", "content": gpt_assistant_prompt}, {"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model = str(model_id),
        # deployment_id= str(model_id),
        messages=message,
        temperature=temperature,
        max_tokens=token_count,
    )
    # Read the token counts directly from the usage object instead of parsing its string form
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    response_final = {'resp':response.choices[0].message.content, 'input':input_tokens, 'output':output_tokens}
    return response_final

In [12]:
def get_openai_response(prompt, model_id, token_count=512, temperature=0, topP=1):
    os.environ['OPENAI_API_KEY'] = exp_config['OPENAI_API_KEY']
    client = OpenAI()
    gpt_assistant_prompt = "You are a financial analyst that looks at a user question, related time and technical keywords and potentially relevant context."
    message=[{"role": "system", "content": gpt_assistant_prompt}, {"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model= str(model_id),
        messages=message,
        temperature=temperature,
        max_tokens=token_count,
    )
    # Read the token counts directly from the usage object instead of parsing its string form
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    response_final = {'resp':response.choices[0].message.content, 'input':input_tokens, 'output':output_tokens}
    return response_final

#### Metadata Generation Using prompts and MistralAI llm

In [13]:
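def get_mistralai_response(
    prompt,
    model_id,
    token_count=500,
    temp=0,
    topP=1,
    max_tokens: int = 800,
    temperature: float = 0):
    # NOTE: only token_count and temp are passed to the model. topP, max_tokens and
    # temperature are accepted so callers can use the same keywords as the GPT helpers,
    # but they are not used here.
    inference_config = {
        "maxTokens": token_count,
        "temperature": temp,
    }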
    messages = [
        {"role": "user", "content": [{"text": prompt + "Do not show the steps or examples, only give the answer specific to the question"}]},
    ]

    bedrock_client = boto3.client("bedrock-runtime", region_name=exp_config['region'])
    response = bedrock_client.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig=inference_config
    )
    output = response['output']['message']['content'][0]['text']
    input_tokens = int(response['usage']['inputTokens'])
    output_tokens = int(response['usage']['outputTokens'])
    response_final = {'resp':output, 'input':input_tokens, 'output':output_tokens}
    return response_final

#### LLM Selection Based on config

In [14]:
def get_llm_response(model_id, prompt, invoke_azure_openai):
    if "gpt" in model_id:
        if not invoke_azure_openai:
            response = get_openai_response(prompt, model_id, token_count=512, temperature=0, topP=0.5)
        else:
            response = get_azure_openai_response(prompt, model_id, token_count=512, temperature=0, topP=0.5)
    else:
        response = get_mistralai_response(prompt, model_id, token_count=512, temperature=0, topP=0.5)
    return response

In [15]:
# Helper Function to get the LLM response in desired format
def llm_ouput_to_json_time(llm_output):
    # Use regular expressions to find content between curly braces
    pattern = r"\{([^}]*)\}"
    matches = re.findall(pattern, llm_output)
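    if len(matches) < 1:
        # Return three values so the caller's three-way unpacking still works
        return "", [], ""
    try:
        json_obj = json.loads("{" + matches[0] + "}")
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {str(e)}")
        return "", [], ""
    # Use .get so a response that omits a field does not raise KeyError
    return (json_obj.get("time_keyword_type", ""), json_obj.get("time_keywords", []), json_obj.get("explanation", ""))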
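
The regular expression in the helper above stops at the first closing brace, so a response that contains a nested object would be cut short. As a hedged alternative, the sketch below scans for the first decodable JSON object with the standard library's `json.JSONDecoder.raw_decode`. The function name `extract_first_json_object` is hypothetical and is not used elsewhere in this notebook.

```python
import json


def extract_first_json_object(text):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        # Try the next opening brace if this one did not start valid JSON.
        start = text.find("{", start + 1)
    return None


sample = 'Answer: {"time_keyword_type": "range_year", "detail": {"count": 2}} Done.'
print(extract_first_json_object(sample))
```

Surrounding text before or after the object is ignored, which matches how LLM responses often wrap JSON in extra prose.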

#### Time Keywords extraction

In [16]:
def time_keyword_extraction(prompt, question):
    prompt_format = prompt.format(
        query=question, most_recent_quarter="Q1'24", most_recent_year=2024)
    response = get_llm_response(model_id = exp_config['generation_llm_model'],prompt=prompt_format, invoke_azure_openai= invoke_azure_openai)
    return response
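
The prompt asks the model to count backward from the current quarter for "last N quarters" questions. The expected keywords can also be computed deterministically, which is useful for checking the model's output. This is a minimal sketch: `last_n_quarters` is a hypothetical helper, and excluding the current quarter from the result is an assumption based on the examples in the prompt.

```python
def last_n_quarters(current_quarter, current_year, n):
    """List the n quarters before the current one, most recent first."""
    quarter, year = current_quarter, current_year
    periods = []
    for _ in range(n):
        quarter -= 1
        if quarter == 0:
            # Stepping back from Q1 lands on Q4 of the previous year.
            quarter, year = 4, year - 1
        periods.append(f"Q{quarter} {year}")
    return periods


# With Q1 2024 as the current quarter, as in the cell above.
print(last_n_quarters(1, 2024, 5))
# → ['Q4 2023', 'Q3 2023', 'Q2 2023', 'Q1 2023', 'Q4 2022']
```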

In [17]:
response = time_keyword_extraction(prompt=PROMPT_METADATA_GENERATION_time, question=question)
time_keyword_type, time_kwds, explanation = llm_ouput_to_json_time(response['resp'])
print(f"Time keywords generated: {time_keyword_type, time_kwds, explanation}")

print(f"Total Input Token: {response['input']}")
print(f"Total Output Token: {response['output']}")


Time keywords generated: ('specific_year', '2022', 'The user is requesting the revenue for 3M in a specific year, which is 2022.')
Total Input Token: 1037
Total Output Token: 64


In [18]:
# Helper Function to get the LLM response in desired format
def llm_output_kwd(llm_output):
    llm_output_list = [llm_output]
    keywords = llm_output_list
    return keywords
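
The helper above wraps the whole response in a single-element list. If individual keywords are needed, for example for per-keyword matching, the comma-separated response can be split instead. The sketch below is an optional variant, not the notebook's method; `split_keywords` is a hypothetical name.

```python
def split_keywords(llm_output):
    """Split a comma-separated LLM response into clean, de-duplicated keywords."""
    seen = set()
    keywords = []
    for raw in llm_output.split(","):
        keyword = raw.strip().strip(".")
        # Skip empty fragments and case-insensitive repeats, keeping first-seen order.
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            keywords.append(keyword)
    return keywords


print(split_keywords("net profit, income statement, revenues, expenses, earnings \n"))
# → ['net profit', 'income statement', 'revenues', 'expenses', 'earnings']
```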

#### Technical Keywords extraction

In [19]:
def technical_keyword_extraction(prompt, question):
    prompt_format = prompt.format(query=question)

    response = get_llm_response(model_id = exp_config['generation_llm_model'],prompt=prompt_format, invoke_azure_openai= invoke_azure_openai)
    return response
response = technical_keyword_extraction(prompt=PROMPT_METADATA_TECHNICAL_KWD, question=question)
kwds = llm_output_kwd(response['resp'])
print(kwds)

print(f"Total Input Token: {response['input']}")
print(f"Total Output Token: {response['output']}")

['Revenue, Consolidated Statements of Income, Total Sales, Operating Income, Gross Income, Income Statement']
Total Input Token: 245
Total Output Token: 29


In [20]:
# Helper Functions to get the LLM response in desired format
def llm_ouput_to_json(llm_output):
    # Use regular expressions to find content between curly braces
    pattern = r"\{([^}]*)\}"
    matches = re.findall(pattern, llm_output)
    if len(matches) < 1:
        # Return three values so the caller's three-way unpacking still works
        return "", [], []
    try:
        json_obj = json.loads("{" + matches[0] + "}")
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {str(e)}")
        return "", [], []
    # Use .get so a response that omits a field does not raise KeyError
    return (
        json_obj.get("rephrased_question", ""),
        json_obj.get("technical_keywords", []),
        json_obj.get("company_keywords", []),
    )

def convert_quarter_format(input_string):
    # Ensure both quarter formats are included eg: Q2'22 <--> Q2 2022

    # Use regular expression to match "Q<quarter>'<year>"
    match = re.match(r"Q(\d)\'(\d{2})", input_string.strip())

    if match:
        quarter = match.group(1)
        year = "20" + match.group(2)
        return f"Q{quarter} {year}"

    match = re.match(r"Q(\d) (\d{4})", input_string.strip())

    if match:
        quarter = match.group(1)
        year = match.group(2)[-2:]
        return f"Q{quarter}'{year}"

    # For case of e.g., Q1 F2023
    match = re.match(r"Q(\d) F(\d{4})", input_string.strip())

    if match:
        quarter = match.group(1)
        year = match.group(2)[-2:]
        return f"Q{quarter}'{year}"

    # For case of e.g., F2023
    match = re.match(r"F(\d{4})", input_string.strip())

    if match:
        year = match.group(1)
        return f"{year}"

    return ""

#### Combined Metadata Generation

In [21]:
def metadata_extraction(prompt, question, time_kwds):
    prompt_format = prompt.format(query=question, most_recent_quarter="Q1'24", time_kwds=time_kwds)

    response = get_llm_response(model_id = exp_config['generation_llm_model'],prompt=prompt_format, invoke_azure_openai= invoke_azure_openai)
    llm_output = response['resp']
    rephrased_q, kwds, company_keywords = llm_ouput_to_json(llm_output)
    # Normalise the time keywords to a list of stripped strings
    time_kwds = time_kwds.split(",") if type(time_kwds) == str else list(time_kwds or [])
    time_kwds = [str(time_kwd).strip() for time_kwd in time_kwds]
    # Drop empty entries and the literal "none" returned when no period is mentioned
    time_kwds = [time_kwd for time_kwd in time_kwds if time_kwd and time_kwd.lower() != "none"]

    new_time_kwds = []
    for time_kwd in time_kwds:
        new_time_kwd = convert_quarter_format(time_kwd)
        if new_time_kwd:
            new_time_kwds.append(new_time_kwd)
    time_kwds.extend(new_time_kwds)
    time_kwds = list(set(time_kwds))

    kwds = list(set(kwds))
    if len(time_kwds) == 0:
        # Default years are strings because parse_time_kwds calls .strip() on each keyword
        time_kwds = ["2022", "2023", "2024"]
    return company_keywords, time_kwds, kwds , rephrased_q, response

company_keywords, time_kwds, kwds, rephrased_q, response = metadata_extraction(prompt=PROMPT_METADATA_AND_QUERY_ReWr, question =question, time_kwds=time_kwds)
print(company_keywords, time_kwds, kwds, rephrased_q)
print(f"Total Input Token: {response['input']}")
print(f"Total Output Token: {response['output']}")

['3M'] ['2022'] ['turnover', 'yearly', 'gross income', 'annual', 'net income', 'sales', 'financial report', '2022', 'income', 'earnings', 'revenue'] What was the total annual revenue reported by 3M (MMM) in the financial year of 2022?
Total Input Token: 294
Total Output Token: 110


In [22]:
# Helper Functions to get the LLM response in desired format
def parse_time_kwds(time_kwds) -> list:
    """Given LLM generated time keywords, extract quarter and year

    Args:
        time_kwds (list): LLM generated time keywords

    Returns:
        list: List of tuples with year and quarter
    """
    set_tuple = set([])
    for kwd in time_kwds:
        match = re.match(r"Q(\d)\'(\d{2})", kwd.strip())
        if match:
            quarter = "q" + match.group(1)
            year = "20" + match.group(2)
            set_tuple.add((year, quarter))
            continue

        match = re.match(r"Q(\d) (\d{4})", kwd.strip())
        if match:
            quarter = "q" + match.group(1)
            year = match.group(2)
            set_tuple.add((year, quarter))
            continue

        match = re.match(r"Q(\d) F(\d{4})", kwd.strip())
        if match:
            quarter = "q" + match.group(1)
            year = match.group(2)
            set_tuple.add((year, quarter))
            continue

        match = re.search(r"(20\d{2})", kwd.strip())
        if match:
            quarter = ""
            year = match.group(1)
            set_tuple.add((year, quarter))
            continue

    return list(set_tuple)

def get_list_variables(company_kwds, time_kwds, kwds):
    time_kwds_tuples = parse_time_kwds(time_kwds)
    time_key_in_years = list(set([x[0] for x in time_kwds_tuples]))
    q_kwds = []

    return time_key_in_years, q_kwds

time_key_in_years, q_kwds = get_list_variables(company_keywords, time_kwds, kwds)



#### Chunk Retrieval from OpenSearch

In [23]:
def get_context(q, q_kwds, time_keyword_type, time_key_in_years, kwds, time_kwds, company_kwds):
    doc_type = []
    contexts = []
    retriever_embb = OpenSearchRetrieval(exp_config['opensearch_host'], exp_config['index_name_embb'])
    embedding = get_titan_text_embedding(q)
    contexts_sem = retriever_embb.retrieve_semantic(
        exp_config['emb_name'],
        embedding,
        time_kwds,
        company_kwds,
        doc_type,
        q_kwds,
        top_k=exp_config['top_k_retrieval'],
        use_company_kwds=True,
        use_doc_type=False,
    )

    retriever_text = OpenSearchRetrieval(exp_config['opensearch_host'], exp_config['index_name_text'])
    contexts_text = retriever_text.retrieve_text(
        kwds,
        time_kwds,
        company_kwds,
        doc_type,
        exp_config['top_k_retrieval'],
        True,
        False,
    )
    contexts.extend(
        [
            context["paragraph"]
            for context in results_fusion([contexts_sem, contexts_text], [0.7, 0.3], top_k=exp_config['top_k_retrieval'])
        ]
    )

    return contexts
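
The cell above merges the semantic and text results with `results_fusion` and the weights 0.7 and 0.3. The implementation of `results_fusion` is not shown in this notebook, so the sketch below is only one plausible way to combine two ranked lists: weighted reciprocal rank fusion. The function name `weighted_rank_fusion` and the smoothing constant `k=60` are assumptions.

```python
def weighted_rank_fusion(result_lists, weights, top_k, k=60):
    """Merge ranked lists of {"paragraph": ...} dicts by weighted reciprocal rank."""
    scores = {}
    first_seen = {}
    for results, weight in zip(result_lists, weights):
        for rank, item in enumerate(results, start=1):
            key = item["paragraph"]
            # A higher rank (smaller number) contributes a larger share of the list's weight.
            scores[key] = scores.get(key, 0.0) + weight / (k + rank)
            first_seen.setdefault(key, item)
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [first_seen[key] for key in ordered[:top_k]]


semantic_hits = [{"paragraph": "A"}, {"paragraph": "B"}]
text_hits = [{"paragraph": "B"}, {"paragraph": "C"}]
fused = weighted_rank_fusion([semantic_hits, text_hits], [0.7, 0.3], top_k=2)
print([hit["paragraph"] for hit in fused])
# → ['B', 'A']
```

A paragraph that appears in both lists accumulates score from each, which is why "B" outranks "A" in this example.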

In [24]:
contexts = get_context( rephrased_q, q_kwds, time_keyword_type, time_key_in_years, kwds, time_kwds, company_keywords)

Using exact matching in semantic search with ['2022'], ['3M']
Using exact matching in text search with ['2022'], ['3M']


### Rerank the Retrieved Chunks

##### Keyword Reranker used for this demo

In [25]:
def rerank_context(rephrased_query, kwds_list, time_kwds_list, company_kwds_list, contexts):
    # Start the timer for the reranking step
    t0 = time.time()
    print("With keyword reranker")
    # Rank the contexts
    ranker = SearchRanker()
    combined_kwds = []
    combined_kwds.extend(set(kwds_list))
    combined_kwds.extend(set(time_kwds_list))
    combined_kwds.extend(set(company_kwds_list))
    ranked_contexts = ranker.rank_by_word_frequency(combined_kwds, contexts)
    top_k_contexts = ranked_contexts[: exp_config['top_k_ranking']]
    end_time = time.time()
    print(f"**** Time taken for reranking {end_time - t0}")
    return ranked_contexts, top_k_contexts
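
`SearchRanker.rank_by_word_frequency` comes from the workshop libraries and its implementation is not shown here. To illustrate the general idea of keyword reranking, the sketch below orders paragraphs by how often the keywords occur in them. It is an assumption about the technique, not the library's code, and `rank_by_keyword_frequency` is a hypothetical name.

```python
import re


def rank_by_keyword_frequency(keywords, contexts):
    """Order context strings by total keyword occurrences (case-insensitive)."""
    def score(context):
        text = str(context).lower()
        return sum(
            len(re.findall(re.escape(str(keyword).lower()), text))
            for keyword in keywords
            if str(keyword).strip()
        )
    # sorted() is stable, so contexts with equal scores keep their original order.
    return sorted(contexts, key=score, reverse=True)


paragraphs = [
    "Unrelated paragraph about weather.",
    "3M 2022 filing.",
    "3M revenue in 2022 was strong; revenue grew.",
]
ranked = rank_by_keyword_frequency(["revenue", "2022", "3M"], paragraphs)
print(ranked[0])
# → 3M revenue in 2022 was strong; revenue grew.
```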

In [26]:
ranked_contexts, top_k_contexts = rerank_context(rephrased_q, q_kwds,time_kwds, company_keywords, contexts )

With keyword reranker
**** Time taken for reranking 0.0024771690368652344


In [27]:
# Checkpoint to see if contexts/chunks are retrieved
print (f"Number of Chunks retrieved = {len(ranked_contexts)}")
print (f"Number of Ranked Chunks to be sent to llm = {len(top_k_contexts)}")

Number of Chunks retrieved = 20
Number of Ranked Chunks to be sent to llm = 10


#### Prompt to generate Answer

In [28]:
PROMPT_ANS_GENERATION  = """
To answer the financial question, think step-by-step:
1. Carefully read the question and any provided context paragraphs related to yearly and quarterly document reports to find all relevant paragraphs. Prioritize context paragraphs with CSV tables.
2. If needed, analyze financial trends and quarter-over-quarter (Q/Q) performance over the detected time spans mentioned in the related time keywords. Calculate rates of change between quarters to identify growth or decline.
3. Perform any required calculations to get the final answer, such as sums or divisions. Show the math steps.
4. Provide a complete, correct answer based on the given information. If information is missing, state what is needed to answer the question fully.
5. Present numerical values in rounded format using easy-to-read units.
6. Do not preface the answer with "Based on the provided context" or anything similar. Just provide the answer directly.
7. Include the answer with relevant and exhaustive information across all contexts. Substantiate your answer with explanations grounded in the provided context. Conclude with a precise, concise, honest, and to-the-point answer.
8. Add the page source and number.
9. Add all source files from where the contexts were used to generate the answers.
context = {CONTEXT}
query = {QUERY}
rephrased_query = {REPHARSED_QUERY}
time_kwds = {TIME_KWDS}
"""


In [29]:
ans_gen_prompt = PROMPT_ANS_GENERATION.format(
            QUERY=question,
            CONTEXT=top_k_contexts,
            TIME_KWDS=time_kwds,
            REPHARSED_QUERY=rephrased_q,
            most_recent_quarter="Q1'24",
        )

### Answer Generation Using the Selected LLM

In [30]:
# Helper Function to get the LLM response in desired format
def parse_generation(llm_prediction):
    soup = BeautifulSoup(llm_prediction, "html.parser")
    answer = soup.find("answer".lower())
    page_source = soup.find("pages".lower())
    source = soup.find("src".lower())

    answer = re.sub("<[^<]+>", "", str(answer))
    page_source = re.sub("<[^<]+>", "", str(page_source))
    source = re.sub("<[^<]+>", "", str(source))

    return answer+'\n'+page_source, source

##### <strong><u> NOTE: Possible reasons for errors in this step</u></strong>
    1. If using OpenAI: a model access or API key issue.
    2. For GPT-3.5 models, reduce "top_k_ranking" to 5 in exp_config. This might affect the answer quality.
    3. "top_k_ranking" is the number of chunks sent to the LLM; exceeding the model's token limit can cause an error.
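
One way to reduce token-limit errors without hand-tuning "top_k_ranking" is to keep only as many ranked chunks as fit a token budget. The sketch below assumes roughly four characters per token, which is a coarse heuristic rather than a real tokenizer, and `trim_contexts_to_budget` is a hypothetical helper.

```python
def trim_contexts_to_budget(contexts, max_tokens, chars_per_token=4):
    """Keep leading contexts while their estimated token count fits the budget."""
    kept, used = [], 0
    for context in contexts:
        # Rough estimate only; actual token counts depend on the model's tokenizer.
        estimate = len(str(context)) // chars_per_token + 1
        if used + estimate > max_tokens:
            break
        kept.append(context)
        used += estimate
    return kept


chunks = ["a" * 40, "b" * 40, "c" * 40]  # about 11 estimated tokens each
print(len(trim_contexts_to_budget(chunks, max_tokens=25)))
# → 2
```

Because the chunks are already ranked, trimming from the end drops the least relevant ones first.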

In [31]:
prediction = get_llm_response(model_id = exp_config['generation_llm_model'],prompt=ans_gen_prompt, invoke_azure_openai= invoke_azure_openai)
print(prediction['resp'])
print(f"Total Input Token: {prediction['input']}")
print(f"Total Output Token: {prediction['output']}")

The provided context does not include the specific revenue figure for 3M in 2022. To answer this question, I would need additional information or a context that contains the 2022 revenue for 3M.
Total Input Token: 6534
Total Output Token: 49


### Bulk Processing - Proceed ONLY After Indexing is completed for all the files in notebook 01 

In [32]:
questions = ["What is the FY2018 capital expenditure amount (in USD millions) for 3M? Give a response to the question by relying on the details shown in the cash flow statement.",
             "Is 3M a capital-intensive business based on FY2022 data?",
             "Does Adobe have an improving operating margin profile as of FY2022? If operating margin is not a useful metric for a company like this, then state that and explain why.",
             "Does Adobe have an improving Free cashflow conversion as of FY2022?",
             "Answer the following question as if you are an equity research analyst and have lost internet connection so you do not have access to financial metric providers. According to the details clearly outlined within the P&L statement and the statement of cash flows, what is the FY2015 depreciation and amortization (D&A from cash flow statement) % margin for AMD?",
             "From FY21 to FY22, excluding Embedded, in which AMD reporting segment did sales proportionally increase the most?",
             "How much has the effective tax rate of American Express changed between FY2021 and FY2022?",
             "What was the largest liability in American Express's Balance Sheet in 2022?",
             "What is the year end FY2019 total amount of inventories for Best Buy? Answer in USD millions. Base your judgments on the information provided primarily in the balance sheet.",
             "Are Best Buy's gross margins historically consistent (not fluctuating more than roughly 2% each year)? If gross margins are not a relevant metric for a company like this, then please state that and explain why."]

In [33]:
llm_answers =[]
llm_contexts =[]
latency_meta_time = []
latency_meta_kwd = []
latency_meta_comb = []
latency_meta_ans_gen = []
input_tokens = []
output_tokens = []
for question in questions:   
    print(question)
    t0=time.time()
    response = time_keyword_extraction(prompt=PROMPT_METADATA_GENERATION_time, question=question)
    t1=time.time()
    input_token1 = response['input']
    output_token1 = response['output']
    time_keyword_type, time_kwds, explanation = llm_ouput_to_json_time(response['resp'])
    t2=time.time()
    response = technical_keyword_extraction(prompt=PROMPT_METADATA_TECHNICAL_KWD, question=question)
    t3=time.time()
    input_token2 = response['input']
    output_token2 = response['output']
    kwds = llm_output_kwd(response['resp'])
    t4=time.time()
    company_keywords, time_kwds, kwds, rephrased_q, response = metadata_extraction(
        prompt=PROMPT_METADATA_AND_QUERY_ReWr,
        question =question,
        time_kwds=time_kwds)
    t5=time.time()
    input_token3 = response['input']
    output_token3 = response['output']
    time_key_in_years, q_kwds = get_list_variables(
        company_keywords,
        time_kwds,
        kwds)
    contexts = get_context( rephrased_q,
                           q_kwds,
                           time_keyword_type,
                           time_key_in_years,
                           kwds,
                           time_kwds,
                           company_keywords)
    ranked_contexts, top_k_contexts = rerank_context(rephrased_q,
                                                     q_kwds,time_kwds,
                                                     company_keywords,
                                                     contexts )
    ans_gen_prompt = PROMPT_ANS_GENERATION.format(
        QUERY=question,
        CONTEXT=top_k_contexts,
        TIME_KWDS=time_kwds,
        REPHARSED_QUERY=rephrased_q,
        most_recent_quarter="Q1'24")
    # Generate the answer and time the LLM call
    t6 = time.time()
    prediction = get_llm_response(
        model_id=exp_config['generation_llm_model'],
        prompt=ans_gen_prompt,
        invoke_azure_openai=invoke_azure_openai)
    t7 = time.time()
    input_token4 = prediction['input']
    output_token4 = prediction['output']
    total_input_token = input_token1 + input_token2 + input_token3 + input_token4
    total_output_token = output_token1 + output_token2 + output_token3 + output_token4
    input_tokens.append(total_input_token)
    output_tokens.append(total_output_token)
    print(f"****Total Input Token : {total_input_token}")
    print(f"****Total Output Token : {total_output_token}")
    answer = prediction['resp']
    llm_answers.append(answer.replace("\n", ""))
    llm_contexts.append(ranked_contexts)
    latency_meta_time.append(t1 - t0)
    latency_meta_kwd.append(t3 - t2)
    latency_meta_comb.append(t5 - t4)
    latency_meta_ans_gen.append(t7 - t6)
    print(answer)
    print(f"****Total Time taken for Metadata extraction - time : {(t1-t0):.2f}")
    print(f"****Total Time taken for Metadata extraction - keywords : {(t3-t2):.2f}")
    print(f"****Total Time taken for Metadata extraction - combined : {(t5-t4):.2f}")
    print(f"****Total Time taken for Answer Generation : {(t7-t6):.2f}")

What is the FY2018 capital expenditure amount (in USD millions) for 3M? Give a response to the question by relying on the details shown in the cash flow statement.
Using exact matching in semantic search with ['2018'], ['3M']
Using exact matching in text search with ['2018'], ['3M']
With keyword reranker
**** Time taken for reranking 0.002820253372192383
****Total Input Token : 12622
****Total Output Token : 250
The capital expenditure amount for 3M in FY2018 was $1,577 million. This is reported as "Purchases of property, plant and equipment (PP&E)" in the "Cash Flows from Investing Activities" section of the cash flow statement.
****Total Time taken for Metadata extraction - time : 1.50
****Total Time taken for Metadata extraction - keywords : 0.72
****Total Time taken for Metadata extraction - combined : 2.51
****Total Time taken for Answer Generation : 4.73
Is 3M a capital-intensive business based on FY2022 data?
Using exact matching in semantic search with ['2022'], ['3M']
Using ex
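
The loop above collects per-question latencies and token counts in plain lists. A minimal sketch of how those lists could be summarized before they are written out is shown below. The `summarize` helper and the sample numbers are hypothetical and only illustrate the idea; in the notebook, the real lists (`latency_meta_time`, `latency_meta_ans_gen`, `input_tokens`, and so on) would take their place.

```python
import statistics


def summarize(values):
    """Return count, mean, and max for a list of numbers (safe for empty lists)."""
    if not values:
        return {"count": 0, "mean": 0.0, "max": 0.0}
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


# Hypothetical sample values, NOT results from the notebook run.
# Replace them with the lists filled by the loop above.
sample_metrics = {
    "latency_meta_time": [1.5, 0.5],
    "latency_meta_ans_gen": [4.0, 6.0],
    "input_tokens": [12000, 10000],
}

summary = {name: summarize(vals) for name, vals in sample_metrics.items()}
for name, stats in summary.items():
    print(f"{name}: n={stats['count']} mean={stats['mean']:.2f} max={stats['max']:.2f}")
```

Keeping one list per stage, as the loop does, makes this kind of per-stage comparison straightforward: the answer-generation latency can be compared directly with the three metadata-extraction stages.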

### Write to CSV for evaluation

In [34]:
if "gpt" in exp_config['generation_llm_model']:
    output_file = '../outputs/rag_outputs/output_openai.csv'
else:
    output_file = '../outputs/rag_outputs/output_mistral.csv'
# Per-question results collected by the loop above, keyed by output column
new_values = {
    'llm_answer': llm_answers,
    'llm_contexts': llm_contexts,
    'latency_meta_time': latency_meta_time,
    'latency_meta_kwd': latency_meta_kwd,
    'latency_meta_comb': latency_meta_comb,
    'latency_meta_ans_gen': latency_meta_ans_gen,
    'input_tokens': input_tokens,
    'output_tokens': output_tokens,
}

# Open the input CSV file for reading
with open('../data/selected_samples.csv', 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    # Copy the input header (without mutating reader.fieldnames) and
    # append any result columns it does not already contain
    fieldnames = list(reader.fieldnames)
    fieldnames.extend(col for col in new_values if col not in fieldnames)

    # Open the output CSV file for writing
    with open(output_file, 'w', newline='') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()  # Write the header row

        # Fill in the result columns for every input row that has a result
        for row_index, row in enumerate(reader):
            for column, values in new_values.items():
                if row_index < len(values):
                    row[column] = values[row_index]

            # Write the updated row to the output file
            writer.writerow(row)


### <center>----------EOF---------</center>