# Generative AI-powered search with Amazon OpenSearch Service

---
### Use Case
Form 10-K is a comprehensive report filed annually by a publicly traded company about its financial performance and is required by the U.S. Securities and Exchange Commission (SEC). Some of the information a company is required to document in the 10-K includes its history, organizational structure, financial statements, earnings per share, subsidiaries, executive compensation, and any other relevant data.

The SEC requires all public companies to file 10-Ks regularly so that investors stay informed about a company's financial condition and have enough information before they buy or sell securities issued by that company. The 10-K can appear overly complex at first glance, with tables full of data and figures. However, because it is so comprehensive, this filing is critical for investors who want to understand a company's financial position and prospects.

The Form 10-K consists of several parts, listed below by item number:

- 1 - Business (describes the company's operations)
- 1A - Risk Factors
- 1B - Unresolved Staff Comments
- 2 - Properties
- 3 - Legal Proceedings
- 4 - Mine Safety Disclosures
- 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
- 6 - Selected Financial Data (prior to February 2021)
- 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
- 7A - Quantitative and Qualitative Disclosures about Market Risk
- 8 - Financial Statements and Supplementary Data
- 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
- 9A - Controls and Procedures
- 9B - Other Information
- 10 - Directors, Executive Officers and Corporate Governance
- 11 - Executive Compensation
- 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
- 13 - Certain Relationships and Related Transactions, and Director Independence
- 14 - Principal Accountant Fees and Services
- 15 - Exhibits and Financial Statement Schedules

---

Many investors rely on SEC filings to analyze the financial health of a company, and the filings can be a treasure trove of valuable information. However, keyword-based search may return irrelevant information, and even with semantic search the volume of information can be overwhelming. Can we leverage generative AI to help interpret company financial statements?


In this code talk session, we will show you how to modernize your search application to improve search relevance with Amazon OpenSearch Service while using generative AI to improve search productivity. The code covers the following topics:
- Comparing search relevance between keyword search and semantic search with Amazon OpenSearch Service.
- Using Retrieval Augmented Generation (RAG) to improve search productivity.
- Building an intelligent agent that orchestrates and executes multistep tasks to automate 10-K filing analysis.
- OpenSearch vector store best practices.

---


### Code Structure


The code includes the following sections:
- [Initialize](#Initialize)
- [Part 1: Ingest unstructured data into OpenSearch](#Part-1:-Ingest-unstructured-data-into-OpenSearch)
- [Part 2: Different appoach to search](#Part-2:-Different-appoach-to-search)
    - [2.1 Keyword search](#2.1-Keyword-search)
    - [2.2 Semantic/Vector search](#2.2-Semantic/Vector-search)
    - [2.3 Retrieval Augmented Generation(RAG)](#2.3-Retrieval-Augmented-Generation(RAG))
- [Part 3: AI agent powered search](#Part-3:-AI-agent-powered-search)
    - [3.1 Prepare other tools used by AI agent](#3.1-Prepare-other-tools-used-by-AI-agent)
        - [3.1.1 Ingest and query structured data in Redshift](#3.1.1-Ingest-and-query-structured-data-in-Redshift)
        - [3.1.2 Download 10-K filing from SEC](#3.1.2-Download-10-K-filing-from-SEC)
    - [3.2 Create AI agent](#3.2-Create-AI-agent)
    - [3.3 Use AI agent](#3.3-Use-AI-agent)


## Initialize




### Install Python dependencies for OpenSearch, Redshift, and LangChain

In [None]:
%pip install opensearch-py
%pip install torch
%pip install requests-aws4auth
%pip install boto3
%pip install sqlalchemy
%pip install sqlalchemy-redshift
%pip install redshift_connector
%pip install ipython-sql==0.4.1
%pip install langchain==0.3.1
%pip install langchain-aws==0.2.1
%pip install langchain-community==0.3.1


### Import library



In [None]:
import boto3
import re
import time
import sagemaker,json
from sagemaker.session import Session
import pandas as pd
import os

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## Part 1: Ingest unstructured data into OpenSearch

### Get SEC 10-K form files

Let's download the dataset and unzip it.

In [None]:
!wget https://ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com/df655552-1e61-4a6b-9dc4-c03eb94c6f75/10k-financial-filing.zip
!unzip 10k-financial-filing.zip

Read the dataset in JSON format and construct a pandas DataFrame.

### Load the data to OpenSearch
OpenSearch supports dynamic type inference and can perform full-text search with fuzziness and typo tolerance. Let's fetch the Amazon OpenSearch Serverless (AOSS) endpoint from the outputs of the deployed CloudFormation stack.

In [None]:
import json
import boto3

cfn = boto3.client('cloudformation')
kms = boto3.client('secretsmanager')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "generative-ai-powered-search"

outputs = get_cfn_outputs(cloudformation_stack_name)

aoss_endpoint = outputs['OpenSearchServerlessCollectionEndpoint']

aoss_host = aoss_endpoint.split("//")[1]

outputs


Let's create a client for OpenSearch Serverless. We use this client for the entire workshop.

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

service = "aoss"
aws_region = boto3.Session().region_name
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, aws_region, service)

aos_client = OpenSearch(
    hosts = [{"host": aoss_host, "port": 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

Let's create a new index named "10k_financial". Because you will recreate this index with different techniques, delete any existing index with that name before creating it. This step creates an empty index; with dynamic type inference, OpenSearch extends the index schema whenever it encounters a new attribute.
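
To make the idea of dynamic type inference more concrete, here is a minimal sketch of how a field type could be chosen from a JSON value. It is a simplification for illustration only, not OpenSearch's implementation: `infer_field_type` and the sample document are hypothetical, and date detection for date-like strings is ignored. The part worth noticing is that strings typically become `text` fields with a `keyword` sub-field, which is why later queries in this notebook can filter on `state_location.keyword`.

```python
def infer_field_type(value):
    """Roughly mimic how dynamic mapping picks a field type (simplified sketch)."""
    # bool must be checked before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        # Strings become full-text fields with an exact-match keyword sub-field
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, dict):
        # Nested objects are mapped field by field
        return {"properties": {k: infer_field_type(v) for k, v in value.items()}}
    raise TypeError(f"unsupported value type in this sketch: {type(value).__name__}")


# Hypothetical document; the field values are made up for illustration
sample_doc = {"company": "Example Corp", "state_location": "IL", "employee_count": 1200}
inferred_mapping = {field: infer_field_type(value) for field, value in sample_doc.items()}
print(inferred_mapping["state_location"])
```

After loading real documents, you can compare this with the mapping that OpenSearch actually generated by inspecting the index.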

In [None]:
index_name="10k_financial"

exist=False
try:
    aos_client.indices.get(index=index_name)
    exist=True
except Exception as e:
    exist=False

if exist:
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=index_name)
else:
    print("index does not exist.")
    
aos_client.indices.create(index=index_name,ignore=400)

Now you can load all the financial reports to the index you just created.

In [None]:
import os
from opensearchpy import helpers
# Set the directory path
directory_path =  "extracted"
batch_size = 10
# Initialize a list to store the documents
documents = []

# Iterate through the files in the directory
for filename in os.listdir(directory_path):
    file_path = os.path.join(directory_path, filename)

    # Read the file contents
    with open(file_path, 'r') as file:
        file_contents = file.read()
    docJson = json.loads(file_contents)
    docJson["_index"] = index_name
    documents.append(docJson)
    # If the batch size is reached, index the documents
    if len(documents) == batch_size:
        aos_response= helpers.bulk(aos_client, documents)
        print(f"Indexed {len(documents)} documents.")
        documents = []

# Index the remaining documents
if documents:
    aos_response= helpers.bulk(aos_client, documents)
    print(f"Indexed {len(documents)} documents.")

You can check the total number of documents indexed into the OpenSearch index.

In [None]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])


## Part 2: Different appoach to search

### 2.1 Keyword search
---
Keyword search finds information in a large body of text using terms or words, called a "query". It uses tokenization to split the text and scores results with token statistics such as how many times a word occurs in the document, how common it is across the entire corpus, and how close the terms are to each other. With these data structures, OpenSearch can handle fuzziness and tolerate typos, which lets users search with similarly spelled terms or without knowing the exact spelling (for example, of scientific names). It works well for exact matches.
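
The typo tolerance mentioned above is usually requested through the `fuzziness` option of a `match` query. The following sketch only builds the query body; `build_fuzzy_query` is a hypothetical helper, and the misspelled search term is made up for illustration.

```python
def build_fuzzy_query(field, text, fuzziness="AUTO", size=10):
    """Build a match query body that tolerates small spelling mistakes."""
    return {
        "size": size,
        "query": {
            "match": {
                field: {
                    "query": text,
                    # "AUTO" lets the engine choose the allowed edit distance by term length
                    "fuzziness": fuzziness,
                }
            }
        },
    }


# "microchps" is a deliberate misspelling of "microchips"
fuzzy_query = build_fuzzy_query("item_1", "microchps")
print(fuzzy_query)

# With the client and index created earlier, this could be run as:
# res = aos_client.search(index=index_name, body=fuzzy_query)
```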

Let's search for companies in the state of Illinois.

In [None]:
query = {
    "_source" : ["company", "filing_date", "state_location"],
    "query": {
        "bool" :{
            "filter" : [{
                    "match" :{
                        "state_location.keyword" : "IL"
                    }
                }]
        }
    }
}

In [None]:
import pandas as pd
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

You can search across fields and highlight the text that caused each document to match.

In [None]:
query = {
    "_source" : ["company", "filing_date", "state_location", "item_1"],
    "query": {
        "bool": {
            "must" :[
                {
                    "multi_match" :{
                        "query" : "microchips",
                        "fields" :["item_1"]
                    }
                }
            ]
        }
    },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}

In [None]:
import pandas as pd
res = aos_client.search(index=index_name, body=query)
query_result=[]
print (f"Total documents found: {res['hits']['total']['value']}")
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location'],hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location", "item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

You can also perform a phrase match, where the words must appear together, and add filters such as companies located in California:

In [None]:
query = {
    "_source" : ["company", "filing_date", "state_location", "item_1"],
    "query": {
        "bool": {
            "must" :[
                {
                    "match_phrase" :{
                        "item_1" : "digital media"
                    }
                }
            ],
            "filter" :[
                {
                    "term": {
                        "state_location.keyword" : "CA"
                    }
                }
            ]
        }
    },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}

In [None]:
import pandas as pd
res = aos_client.search(index=index_name, body=query)
query_result=[]
print (f"Total documents found: {res['hits']['total']['value']}")
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location'],hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location", "item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Full-text search works well for structured data, with fuzziness, typo tolerance, proximity search, and highlighting. However, when it comes to natural-language questions, pure keyword search can return less relevant results and a long tail of noise.

Let's run the query and check the search results. Some irrelevant documents are returned.

In [None]:
query_text="What Microsoft's research and development organization is responsible for?"
query={
  "size": 10,
  "query": {
    "match": {
      "item_1": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_*" : {}
    }
  }
}
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['item_1'], hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","item_1","item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Run the query and check the search result.

In [None]:
# lets execute another query example
query_text="What is Microsoft's main revenue?"
query={
  "size": 10,
  "query": {
    "match": {
      "item_1": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['item_1'], hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","item_1","item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

# you can notice some irrelevant results

Retrieval based on term statistics over a large text corpus produced these results.
![Keyword Search](./static/keyword-search-flow.png)
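
To see why term statistics alone can surface irrelevant documents, here is a toy TF-IDF scorer. It is a simplified stand-in, not the BM25 ranking that OpenSearch uses by default, and the three documents and the query are made up for illustration. The second document repeats common question words and ends up with the highest score, even though the first document is the one that answers the question.

```python
import math
from collections import Counter

PUNCTUATION = ".,?!'\""


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]


def tf_idf_scores(query, docs):
    """Score each document by summing tf * idf over the query terms."""
    tokenized_docs = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    scores = []
    for tokens in tokenized_docs:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in term_freq:
                idf = math.log(1 + n_docs / doc_freq[term])
                score += term_freq[term] * idf
        scores.append(score)
    return scores


toy_docs = [
    "Our main revenue comes from cloud subscriptions.",
    "What is the main risk? The main risk is what customers do.",
    "Research and development is responsible for new products.",
]
toy_query = "What is the main revenue?"

toy_scores = tf_idf_scores(toy_query, toy_docs)
for doc, score in sorted(zip(toy_docs, toy_scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.2f}  {doc}")
```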

This is a segue into the world of vector search, or semantic search.

### Initialize embedding model to vectorize text data

### 2.2 Semantic/Vector search

---
In vector search, documents and queries are represented as high-dimensional numerical vectors, rather than just as strings of text.

The key idea behind vector search is that semantically similar documents or queries can be mapped to vectors that are close to each other in the vector space, even if the textual content doesn't have much lexical overlap.

Here's a bit more detail on how vector search works:

- Documents are converted into numerical vector representations using machine learning models like word embeddings or document embeddings. These models learn to map words, phrases, or entire documents into a high-dimensional vector space.
- Queries are also converted into vector form, allowing them to be compared to the document vectors in this mathematical vector space.
- Rather than doing a simple keyword match, the search engine calculates the similarity between the query vector and each document vector, often using a metric like cosine similarity.
- The most similar document vectors are then returned as the search results, even if they don't contain the exact words from the query.

This allows vector search to uncover semantically relevant content that would be missed by traditional text-based searches. It's especially useful for tasks like question answering, e-commerce search, and retrieval of similar documents or images.

The underlying machine learning models need to be trained on large datasets, but vector search can significantly improve the quality and relevance of search results compared to purely lexical approaches.
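
The ranking step described above can be sketched in a few lines. The three-dimensional vectors below are hypothetical stand-ins for real embeddings, which typically have hundreds or thousands of dimensions, and the document names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical document embeddings keyed by a short description
doc_vectors = {
    "cloud revenue growth": [0.9, 0.1, 0.0],
    "data center leases": [0.1, 0.9, 0.1],
    "executive compensation": [0.0, 0.2, 0.9],
}
# Hypothetical embedding of the user's query
query_vector = [0.8, 0.2, 0.1]

# Rank documents from most to least similar to the query
ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranked)
```

A vector engine performs the same comparison conceptually, but uses approximate nearest neighbor indexes so that it does not have to score every document.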


![Semantic Search](./static/semantic-search-flow.png)

---


![Semantic Search Architecture](./static/semantic-search-architecture.png)



**Embeddings** are numerical representations of data, typically used to convert complex, high-dimensional data into a lower-dimensional space where similar data points are closer together. In the context of natural language processing (NLP), embeddings are used to represent words, phrases, or sentences as vectors of real numbers. These vectors capture semantic relationships, meaning that words with similar meanings are represented by vectors that are close together in the embedding space.

**Embedding models** are machine learning models that are trained to create these numerical representations. They learn to encode various types of data into embeddings that capture the essential characteristics and relationships within the data. For example, in NLP, embedding models like Word2Vec, GloVe, and BERT are trained on large text corpora to produce word embeddings. These embeddings can then be used for various downstream tasks, such as text classification, sentiment analysis, or machine translation. In this case we'll be using them for semantic similarity.

We use an embedding model to convert questions into vectors and use vector similarity to find semantically similar 10-K data. The following diagram shows the flow:

<!-- ![Convert Text to Vector](./static/text2vector.png) -->

---

![opensearch vector store](./static/opensearch-vector-store.png)


In [None]:
import os
import pandas as pd

# Specify the path to the folder containing the JSON files
folder_path = "extracted"

# Initialize an empty list to store list of company 10-K filing file names
company_filing_file_name_list = []

# For this session, we only ingest filings from a few companies.
company_list=["Alteryx, Inc.",
              "MICROSTRATEGY Inc", 
              "Elastic N.V.", 
              "MongoDB, Inc.", 
              "Palo Alto Networks Inc", 
              "Okta, Inc.",
              "Datadog, Inc.", 
              "Snowflake Inc.",
              "SALESFORCE.COM, INC.", 
              "ORACLE CORP",
              "MICROSOFT CORP", 
              "Palantir Technologies Inc."
             ]

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        df = pd.DataFrame([pd.read_json(file_path,typ='series')])
        if df.iloc[0]['company'] in company_list:
            company_filing_file_name_list.append(file_path)

company_filing_file_name_list

In [None]:
#from langchain.embeddings import BedrockEmbeddings
from langchain_aws import BedrockEmbeddings

aws_region = boto3.Session().region_name

boto3_bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=f"https://bedrock-runtime.{aws_region}.amazonaws.com")
bedrock_embeddings = BedrockEmbeddings(model_id='amazon.titan-embed-text-v2:0',client=boto3_bedrock)
#bedrock_embeddings = BedrockEmbeddings(model_id='cohere.embed-multilingual-v3',client=boto3_bedrock)

In [None]:
result = bedrock_embeddings.embed_query("This is a content of the document")
result[0:20]

In [None]:
len(result)

#### Get Cloud Formation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store this in "outputs" to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

### Create an index in the Amazon OpenSearch Service collection

The OpenSearch k-NN plugin introduces a custom data type, the knn_vector, that allows users to ingest their k-NN vectors into an OpenSearch index and perform different kinds of k-NN search. 
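
The index definition used later in this section sets `"space_type": "l2"`. As far as the k-NN documentation describes it, the search score for this space type is derived from the squared Euclidean distance `d` as `1 / (1 + d)`, so closer vectors get scores nearer to 1. The helper below is a hypothetical sketch of that conversion for intuition, not code from the plugin.

```python
def l2_knn_score(query_vector, doc_vector):
    """Convert squared L2 distance into a score where higher means closer."""
    squared_distance = sum((q - d) ** 2 for q, d in zip(query_vector, doc_vector))
    return 1.0 / (1.0 + squared_distance)


identical_score = l2_knn_score([1.0, 0.0], [1.0, 0.0])  # identical vectors, maximum score
distant_score = l2_knn_score([1.0, 0.0], [0.0, 1.0])    # farther apart, lower score
print(identical_score, distant_score)
```

This helps when reading the `_score` column of the k-NN query results: the values are not cosine similarities, and they shrink as the distance grows.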

<!-- ---
#### OpenSearch Approximate Nearest Neighbor Algorithms and Engines
![ANN algorithm](./static/ann-algorithm.png)

--- -->

<!-- #### HNSW parameter tuning
![hnsw parameter tuning](./static/hnsw-parameter-tuning.png) -->

<!-- ---

#### IVF parameter tuning
![ivf parameter tuning](./static/ivf-parameter-tuning.png)

#### How to select the engine and algorithms
![opensearch ann comparison](./static/opensearch-ann-selection.png)  -->

In [None]:
knn_index = {
    "settings": {
        "index.knn": True,
        #"index.knn.space_type": "cosinesimil"
    },
    "mappings": {
        "properties": {
            "item_vector": {
                "type": "knn_vector",
                "dimension": 1024,
                "store": True,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "nmslib",
                    "parameters": {
                      "ef_construction": 128,
                      "m": 24
                    }
                }
            },
            "item_content": {
                "type": "text",
                "store": True
            },
            "company_name": {
                "type": "text",
                "store": True
            }
        }
    }
}


Using the above index definition, we now create the index in Amazon OpenSearch.

In [None]:
index_name="10k_financial"

exist=False
try:
    aos_client.indices.get(index=index_name)
    exist=True
except Exception as e:
    exist=False

if exist:
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=index_name)
else:
    print("index does not exist.")
    
aos_client.indices.create(index=index_name,body=knn_index,ignore=400)

In [None]:
aos_client.indices.get(index=index_name)
# you can verify the mappings

###  Load the raw data into the Index
Next, let's load the financial filing data into the index you've just created.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
import pandas

from typing import Any, Dict, List, Optional, Sequence

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class PandasDataFrameLoader(BaseLoader):
    
    def __init__(self,dataframe:pandas.DataFrame):
        self.dataframe=dataframe
        
    def load(self) -> List[Document]:
        docs = []
        items=["item_1","item_1A","item_1B","item_2","item_3","item_4","item_5","item_6","item_7","item_7A","item_8","item_9","item_9A", "item_9B", "item_10", "item_11", "item_12", "item_13", "item_14", "item_15"]
        
        for index, row in self.dataframe.iterrows():
            # Metadata shared by every item of this filing; add more fields as needed
            base_metadata = {
                "cik": row['cik'],
                "company_name": row['company'],
                "filing_date": row['filing_date'],
            }
            for item in items:
                content = row[item]
                # Build a fresh dict per document so the 'item' value is never shared between documents
                metadata = {**base_metadata, "item": item}
                doc = Document(page_content=content, metadata=metadata)
                docs.append(doc)
        return docs

Use the Bedrock embedding model to convert item content into vectors, and use the OpenSearch bulk API to store the data in the OpenSearch index.
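
Before embedding, the ingestion function below splits each item into chunks with `chunk_size = 8000` and `chunk_overlap = 200`. The sketch here illustrates only what those two parameters mean, using a plain sliding window over characters. It is an assumption-laden simplification: `chunk_text` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Small numbers so the overlap is easy to see
sample_text = "".join(str(i % 10) for i in range(25))
sample_chunks = chunk_text(sample_text, chunk_size=10, chunk_overlap=3)
for chunk in sample_chunks:
    print(chunk)
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.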

In [None]:
import time
from opensearchpy import helpers

def ingest_downloaded_10k_into_opensearch(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    item_contents=[]
    company_name=splitted_documents[0].metadata['company_name']
    for doc in splitted_documents:
        item_contents.append(doc.page_content)
    
    print("\ncompany:" + company_name + ", item count:" + str(len(item_contents)))
    start = time.time()
    embedding_results = bedrock_embeddings.embed_documents(item_contents)
    end = time.time()
    elapsed = end - start
    #print(f"total time elapsed for Bedrock embedding: {elapsed:.2f} seconds")
        
    data = []
    # Pair each chunk with its embedding vector
    for content, vector in zip(item_contents, embedding_results):
        data.append({"_index": index_name,  "company_name": company_name, "item_content":content, "item_vector":vector})
    aos_response= helpers.bulk(aos_client, data)
    print(f"Bulk-inserted {aos_response[0]} items.")

In [None]:
# Load the data into the OpenSearch Serverless collection. Note: this takes some time to complete.
for file in company_filing_file_name_list:
    ingest_downloaded_10k_into_opensearch(file)
    print("Ingested :" + file)

To validate the load, you can query the number of documents in the index.

In [None]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])

In [None]:
print(res['hits']['hits'][1]['_source'])

In [None]:
# Now you can reuse the queries for which full-text search did not return relevant results.
query_text="What Microsoft's research and development organization is responsible for?"

Run the query and check the search result. 

In [None]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

In [None]:
query_text="What is Microsoft's main revenue?"

Run the query and check the search result. 

In [None]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

### 2.3 Retrieval Augmented Generation(RAG)

You can use large language models (LLMs) to generate answers rather than returning the documents to the user. Providing the retrieved documents as context for answer generation reduces hallucination. This method is called Retrieval Augmented Generation, or simply RAG. In RAG, external data can be sourced from various data sources, such as document repositories, databases, or APIs. The first step is to convert the documents and the user query into a format that enables comparison and relevancy search. To achieve this, a document collection (knowledge library) and the user-submitted query are transformed into numerical representations using embedding language models. These embeddings are essentially numerical representations of concepts in text.

Next, based on the embedding of the user query, relevant text is identified in the document collection through similarity search in the embedding space. The prompt provided by the user is then combined with the retrieved text, which is added as context. This updated prompt, which includes relevant external data along with the original prompt, is sent to the LLM (large language model) for processing. As a result, the model output becomes more relevant and accurate because the context contains the relevant external data.

The major components of RAG include embedding, vector databases, augmentation, and generation:


- **Embedding**: Purpose: Embeddings transform text data into numerical vectors in a high-dimensional space. These vectors represent the semantic meaning of the text. Process: The embedding process typically uses pre-trained models (like BERT or a variant) to convert both the input queries and the documents in the database into dense vectors. Role in RAG: Embeddings are crucial for the retrieval component as they allow the model to compute the similarity between the query and the documents in the database efficiently.
- **Vector Database**: Function: A vector database stores the embeddings of a large collection of documents or passages. Construction: It is created by processing a vast corpus (like Wikipedia or other specialized datasets) through an embedding model. Usage in RAG: When a query comes in, the model searches this database to find the documents whose embeddings are most similar to the embedding of the query.
- **Retrieval (Augmentation)**: Mechanism: The retrieval part of RAG functions by taking the input query, converting it into a vector using embeddings, and then searching the vector database to retrieve relevant documents. Result: It augments the original query with additional context by selecting documents or passages that are semantically related to the query. This augmented information is essential for generating more informed responses.
- **Generation**: Integration with a Language Model: The generative component, often a large language model like Amazon Titan Text, receives both the original query and the retrieved documents. Response Generation: It synthesizes information from these inputs to produce a coherent and contextually appropriate response. 
- **Training and Fine-Tuning**: The generative model is generally pre-trained on vast amounts of text and may be further fine-tuned to optimize its performance for specific tasks or datasets.
- **End-to-End Training (Optional)**: Joint Optimization: In RAG, both retrieval and generation components can be fine-tuned together, allowing the system to optimize the selection of documents and the generation of responses simultaneously. Feedback Loop: The model learns not only to generate relevant responses but also to retrieve the most useful documents for any given query.
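
The augmentation step, combining the user's question with the retrieved text, can be sketched as plain string assembly. This is a minimal illustration under stated assumptions: `build_rag_prompt`, the prompt wording, and the sample chunks are all hypothetical, and frameworks such as LangChain provide their own prompt templates for this.

```python
def build_rag_prompt(question, retrieved_chunks, max_chunks=3):
    """Combine the top retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks[:max_chunks])
    )
    return (
        "Use only the context below to answer the question. "
        "If the context is not sufficient, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical chunks standing in for the results of a vector search
retrieved = [
    "Example Corp generates most of its revenue from cloud subscriptions.",
    "Example Corp also sells professional services.",
]
rag_prompt = build_rag_prompt("What is the company's main revenue source?", retrieved)
print(rag_prompt)
```

The resulting string is what gets sent to the LLM in the generation step, so the model answers from the retrieved 10-K text instead of relying only on what it learned during training.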

---
### Architecture

![RAG](./static/RAG_Architecture.png)

---

In [None]:
langchain_index_name="10k_financial_embedding"

In [None]:
exist=False
try:
    aos_client.indices.get(index=langchain_index_name)
    exist=True
except Exception as e:
    exist=False

if exist:
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=langchain_index_name)
else:
    print("index does not exist.")
    

In [None]:
# langchain.vectorstores is deprecated; the class lives in langchain_community
from langchain_community.vectorstores import OpenSearchVectorSearch
from typing import Callable
from requests_aws4auth import AWS4Auth

os_domain_ep = 'https://'+aoss_host

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, aws_region, service, session_token=credentials.token)


def ingest_10k_into_opensearch_with_langchain(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    OpenSearchVectorSearch.from_documents(
                index_name = langchain_index_name,
                documents=splitted_documents,
                embedding=bedrock_embeddings,
                opensearch_url=os_domain_ep,
                http_auth=awsauth,
                timeout=600,
                use_ssl=True,
                verify_certs=True,
                connection_class=RequestsHttpConnection,
    )

In [None]:
## In this step, you're indexing the selected companies 10K files in to Amazon OpenSearch Serverless with LangChain
## This will take a while to load all the chunks

for file in company_filing_file_name_list:
    ingest_10k_into_opensearch_with_langchain(file)
    print("Ingested :" + file)


In [None]:
aos_client.indices.get(index=langchain_index_name)

In [None]:

class SimiliarOpenSearchVectorSearch(OpenSearchVectorSearch):
    
    def relevance_score(self, distance: float) -> float:
        return distance
    
    def _select_relevance_score_fn(self) -> Callable[[float], float]:
        return self.relevance_score


open_search_vector_store = SimiliarOpenSearchVectorSearch(
                                    index_name=langchain_index_name,
                                    embedding_function=bedrock_embeddings,
                                    opensearch_url=os_domain_ep,
                                    http_auth=awsauth,
                                    timeout=600,
                                    use_ssl=True,
                                    verify_certs=True,
                                    connection_class=RequestsHttpConnection,
                                    ) 

Initialize Bedrock LLM model with Claude

In [None]:
from langchain_aws import BedrockLLM, ChatBedrock

#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0", client=boto3_bedrock)
bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=boto3_bedrock)
#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-opus-20240229-v1:0", client=boto3_bedrock)

bedrock_llm.model_kwargs = {"temperature":0.001,"top_k":300,"top_p":1}


#### Note: This session's prompt is desinged for Claude 3. Output result may be different if use other LLMs, for example guardrails impact.

In [None]:
from langchain.chains import RetrievalQA
import langchain

bedrock_retriever = open_search_vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        'k': 5,
        'score_threshold': 0.005
    }
)
rag_qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm,
    retriever=bedrock_retriever,
    chain_type="stuff" #stuff, refine, map_reduce, and map_rerank
)

In [None]:
question="What Microsoft's research and development organization is responsible for?"

langchain.debug=True
result = rag_qa({"query": question})


In [None]:
print("Result:" + result["result"])

In [None]:
question="What is Microsoft main revenue?"

langchain.debug=False
result = rag_qa({"query": question})


In [None]:
print("Result:" + result["result"])

## Part 3: AI agent powered search

![standard rag limitation](./static/rag-limitation.png)




![advanced rag ](./static/advanced-rag.png)

### What is an AI agent ?
An agentic employs a chain-of-thought reasoning process, where the LLM is prompted to think gradually through a question, interleaving its reasoning with the ability to use external tools such as search engines and APIs. This allows the LLM to retrieve relevant information that can help answer partial aspects of the question, ultimately leading to a more comprehensive and accurate final response. This approach is inspired by the "Reason and Act" (ReAct) design introduced in the paper [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/pdf/2210.03629)  which aims to synergize the reasoning capabilities of language models with the ability to interact with external resources and take actions. By combining these two facets, an agentic LLM assistant can provide more informed and well-rounded answers to complex user queries.

![agent components](./static/agent-components.png)

### Why build an AI agent?
In today's digital landscape, enterprises are inundated with a vast array of data sources, ranging from traditional PDF documents to complex SQL and NoSQL databases, and everything in between. While this wealth of information holds immense potential for gaining valuable insights and driving operational efficiency, the sheer volume and diversity of data can often pose significant challenges in terms of accessibility and utilization.

This is where the power of agentic LLM assistants comes into play. By leveraging progress in LLM design patterns such as Reason and Act (ReAct) and other traditional or novel design patterns, these intelligent assistants are capable of integrating with an enterprise's diverse data sources. Through the development of specialized tools tailored to each data source and the ability of LLM agents to identify the right tool for a given question, agentic LLM assistants can simplify how you navigate and extract relevant information, regardless of its origin or structure.

This enables a rich, multi-source conversation that promises to unlock the full potential of the entreprise data, enabling data-driven decision-making, enhancing operational efficiency, and ultimately driving productivity and growth.

![agent powered search advantage](./static/agent-powered-search-advantage.png)


### AI Agent powered search reference architecture

![AI Agent powered search reference architecture](./static/reference-architecture.png)


### Lab Architecture

For demo purpose, we use SageMaker notebook run the code, the following is the overall architecture of this lab:

![AI Agent powered search architecture](./static/architecture.png)

---


### Data Flow

The user submit query, first AI agent will judge if the query is financial statements related. If so, AI agent will use vector search to get similiar financial statements for this company from OpenSearch. If there is no financial statements for this company, AI agent will download the data from internet by calling SEC API, ingest the data into OpenSearch and do the search again. If there is related financial statements, AI agent will see if the query is stock price related question. If so, AI agent will query Redshift database to get company's stock price data. LLM will generate the response with all collected data. Overall data flow is like following:

![AI Agent powered search data flow](./static/ai-agent-search-data-flow.png)

### 3.1 Prepare other tools used by AI agent

#### 3.1.1 Ingest and query structured data in Redshift

Get Redshift Serverless username, password and endpoint

In [None]:
kms = boto3.client('secretsmanager')

redshift_serverless_credentials = json.loads(kms.get_secret_value(SecretId=outputs['RedshiftServerlessSecret'])['SecretString'])
redshift_serverless_username=redshift_serverless_credentials['username']
redshift_serverless_password=redshift_serverless_credentials['password']
redshift_serverless_endpoint =  outputs['RedshiftServerlessEndpoint']

Create `stock_symbol` table and populate the table from S3. We will use this table to query company stock ticker by company name.

In [None]:
import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import Session
%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = URL.create(
drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
host=redshift_serverless_endpoint, 
port=5439,
database='dev',
username=redshift_serverless_username,
password=redshift_serverless_password
)

%sql $connect_to_db
%sql select current_user, version();

%sql CREATE TABLE IF NOT EXISTS public.stock_symbol (stock_symbol text PRIMARY KEY, company_name text NOT NULL);

stock_price_bucket = outputs["s3BucketStock"]
s3_location = f's3://{stock_price_bucket}/stock-price/'
print(s3_location)
!aws s3 sync ./stock-price/ $s3_location

stock_symbol_s3_location = f's3://{stock_price_bucket}/stock-price/stock_symbol.csv'
quoted_stock_symbol_s3_location = "'" + stock_symbol_s3_location + "'"

%sql COPY STOCK_SYMBOL FROM $quoted_stock_symbol_s3_location iam_role default IGNOREHEADER 1 CSV;


url = URL.create(
    drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
    host=redshift_serverless_endpoint, 
    port=5439,
    database='dev',
    username=redshift_serverless_username,
    password=redshift_serverless_password
)

engine = sa.create_engine(url)
redshift_connection = engine.connect()
    
def query_stock_ticker(company_name):
    strSQL = "SELECT stock_symbol FROM stock_symbol WHERE lower(company_name) ILIKE '%" + company_name + "%'"
    stock_ticker=''
    try:
        result = redshift_connection.execute(strSQL)
        df = pd.DataFrame(result)
        stock_ticker=df['stock_symbol'][0]
    except Exception as e:
        print(e)
    return stock_ticker


In [None]:
query_stock_ticker("Amazon")

In [None]:
%sql CREATE TABLE IF NOT EXISTS public.stock_price (stock_date DATE, stock_symbol text, open_price DECIMAL, high_price DECIMAL, low_price DECIMAL, close_price DECIMAL, adjusted_close_price DECIMAL, volume DECIMAL);

msft_s3_location = f's3://{stock_price_bucket}/stock-price/MSFT.csv'
quoted_msft_s3_location = "'" + msft_s3_location + "'"
print(quoted_msft_s3_location)
print("---------")

crm_s3_location = f's3://{stock_price_bucket}/stock-price/CRM.csv'
quoted_crm_s3_location = "'" + crm_s3_location + "'"
print(quoted_crm_s3_location)
print("---------")

orcl_s3_location = f's3://{stock_price_bucket}/stock-price/ORCL.csv'
quoted_orcl_s3_location = "'" + orcl_s3_location + "'"
print(quoted_orcl_s3_location)
print("---------")

snow_s3_location = f's3://{stock_price_bucket}/stock-price/SNOW.csv'
quoted_snow_s3_location = "'" + snow_s3_location + "'"
print(quoted_snow_s3_location)
print("---------")

%sql COPY STOCK_PRICE FROM $quoted_msft_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_crm_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_orcl_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_snow_s3_location iam_role default IGNOREHEADER 1 CSV;


In [None]:
%sql select * from public.stock_price

In [None]:
def query_stock_price(stock_ticker):
    strSQL = "SELECT stock_date, stock_symbol, open_price, high_price, low_price, close_price FROM public.stock_price WHERE stock_symbol ='" + stock_ticker + "' limit 100"
    try:
        result = redshift_connection.execute(strSQL)
        stock_price = pd.DataFrame(result)
    except Exception as e:
        print(e)
    return stock_price

In [None]:
query_stock_price('MSFT')

#### 3.1.2 Download 10-K filing from SEC
---
https://sec-api.io

Create a new account and get free API key.



In [None]:
!pip install sec-api

##### Replace your sec-api key in the following line

In [None]:
sec_api_key="{security_api_key}"

In [None]:
from sec_api import ExtractorApi, QueryApi
import json
import os

def get_filings(ticker):
    global sec_api_key

    # Finding Recent Filings with QueryAPI
    queryApi = QueryApi(api_key=sec_api_key)
    query = {
      "query": f"ticker:{ticker} AND formType:\"10-K\"",
      "from": "0",
      "size": "1",
      "sort": [{ "filedAt": { "order": "desc" } }]
    }
    response = queryApi.get_filings(query)

    # Getting 10-K URL
    filing_url = response["filings"][0]["linkToFilingDetails"]
    filing_type=response['filings'][0]['formType']
    cik=response['filings'][0]['cik']
    company=response['filings'][0]['companyName']
    filing_date=response['filings'][0]['filedAt']
    period_of_report=response['filings'][0]['periodOfReport']
    filing_html_index=response['filings'][0]['linkToFilingDetails']
    complete_text_filing_link=response['filings'][0]['linkToTxt']

    # Extracting Text with ExtractorAPI
    extractorApi = ExtractorApi(api_key=sec_api_key)
    
    one_text = extractorApi.get_section(filing_url, "1", "text")       #Section 1 - Business
    onea_text = extractorApi.get_section(filing_url, "1A", "text")     # Section 1A - Risk Factors
    oneb_text = extractorApi.get_section(filing_url, "1B", "text")     # Section 1B - Unresolved Staff Comments
    two_text = extractorApi.get_section(filing_url, "2", "text")       # Section 2 - Properties
    three_text = extractorApi.get_section(filing_url, "3", "text")     # Section 3 - Legal Proceedings
    four_text = extractorApi.get_section(filing_url, "4", "text")      # Section 4 - Mine Safety Disclosures
    five_text = extractorApi.get_section(filing_url, "5", "text")      # Section 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
    six_text = extractorApi.get_section(filing_url, "6", "text")       # Section 6 - Selected Financial Data (prior to February 2021)
    seven_text = extractorApi.get_section(filing_url, "7", "text")     # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
    sevena_text = extractorApi.get_section(filing_url, "7A", "text")   # Section 7A - Quantitative and Qualitative Disclosures about Market Risk
    eight_text = extractorApi.get_section(filing_url, "8", "text")     # Section 8 - Financial Statements and Supplementary Data
    nine_text = extractorApi.get_section(filing_url, "9", "text")      # Section 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
    ninea_text = extractorApi.get_section(filing_url, "9A", "text")    # Section 9A - Controls and Procedures
    nineb_text = extractorApi.get_section(filing_url, "9B", "text")    # Section 9B - Other Information
    ten_text = extractorApi.get_section(filing_url, "10", "text")      # Section 10 - Directors, Executive Officers and Corporate Governance
    eleven_text = extractorApi.get_section(filing_url, "11", "text")   # Section 11 - Executive Compensation
    twelve_text = extractorApi.get_section(filing_url, "12", "text")   # Section 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
    thirteen_text = extractorApi.get_section(filing_url, "13", "text") # Section 13 - Certain Relationships and Related Transactions, and Director Independence
    fourteen_text = extractorApi.get_section(filing_url, "14", "text") # Section 14 - Principal Accountant Fees and Services
    fifteen_text = extractorApi.get_section(filing_url, "15", "text")  # Section 15 - Exhibits and Financial Statement Schedules
    
    data = {}
    data['filing_url'] = filing_url
    data['filing_type'] = filing_type
    data['cik'] = cik
    data['company'] = company
    data['filing_date'] = filing_date
    data['period_of_report'] = period_of_report
    data['filing_html_index'] = filing_html_index
    data['complete_text_filing_link'] = complete_text_filing_link
    
    
    data['item_1'] = one_text
    data['item_1A'] = onea_text
    data['item_1B'] = oneb_text
    data['item_2'] = two_text
    data['item_3'] = three_text
    data['item_4'] = four_text
    data['item_5'] = five_text
    data['item_6'] = six_text
    data['item_7'] = seven_text
    data['item_7A'] = sevena_text
    data['item_8'] = eight_text
    data['item_9'] = nine_text
    data['item_9A'] = ninea_text
    data['item_9B'] = nineb_text
    data['item_10'] = ten_text
    data['item_11'] = eleven_text
    data['item_12'] = twelve_text
    data['item_13'] = thirteen_text
    data['item_14'] = fourteen_text
    data['item_15'] = fifteen_text
    
    json_data = json.dumps(data)
    
    
    if not os.path.exists("./download_filings"):
        os.makedirs("./download_filings")
    
    try:
        file_name = filing_url.split("/")[-2] + "-" + filing_url.split("/")[-1].split(".")[0]+".json"
        download_to = "./download_filings/" + file_name
        with open(download_to, "w") as f:
          json.dump(data, f, ensure_ascii=False, indent=4)
    except Exception as e:
        print("Problem with {url}".format(url=url))
        print(e)
    
    return file_name

In [None]:
#downloaded_file=get_filings("AMZN")
#ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file)

### 3.2 Create AI agent

#### Define methods used by AI agent

One popular architecture for building agents is ReAct. ReAct combines reasoning and acting in an iterative process - in fact the name "ReAct" stands for "Reason" and "Act".

The general flow looks like this:

- The model will "think" about what step to take in response to an input and any previous observations.
- The model will then choose an action from available tools (or choose to respond to the user).
- The model will generate arguments to that tool.
- The agent runtime (executor) will parse out the chosen tool and call it with the generated arguments.
- The executor will return the results of the tool call back to the model as an observation.
- This process repeats until the agent chooses to respond.

In [None]:
from langchain.prompts.chat import ChatPromptTemplate
from langchain.chains import LLMChain
import time

def is_financial_statement_related_query(human_input):
    #template = """You are a helpful assistant to judge if the human input is stock related question.
    #If it is stock related, answer \"yes\". Otherwise answer \"no\"."""
    template = """You are a helpful assistant to judge if the human input is trying to analyze company financial statement.
    If the human input is financial statement related question, answer \"yes\". Otherwise answer \"no\".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related

def is_stock_related_query(human_input):
    #template = """You are a helpful assistant to judge if the human input is stock related question.
    #If it is stock related, answer \"yes\". Otherwise answer \"no\"."""
    template = """
    You are a helpful assistant to judge if the human input is stock related question. 
    If the human innput is stock related question, return "yes".Otherwise return "no".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related

def get_company_name(human_input):
    template = """You are a helpful assistant who extract company name from the human input.Please only output the company"""
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )

    company_name=llm_chain({"text":human_input})['text'].strip()
    return company_name
    
def semantic_search_and_check(human_input, k=10,with_post_filter=True):

    company_name=get_company_name(human_input)
    
    search_vector = bedrock_embeddings.embed_query(human_input)

    no_post_filter_search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        }
    }

    post_filter_search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        },
        "post_filter": {
           "match": { 
               "company_name":company_name
           }
        }
    }
    
    search_query=no_post_filter_search_query
    if with_post_filter:
        search_query=post_filter_search_query
    
    res = aos_client.search(index=index_name, 
                       body=search_query,
                       stored_fields=["company_name","item_category","item_content"])
    
    query_result=[]
    for hit in res['hits']['hits']:
        hit_company=hit['fields']['company_name'][0]
        print("\nsemantic search hit company: " + hit_company)
        row=[hit['fields']['company_name'][0], hit['fields']['item_content'][0]]
        query_result.append(row)

    query_result_df = pd.DataFrame(data=query_result,columns=["company_name","company_financial_statements"])
    return query_result_df

def search_for_similiar_content_in_10k_filing(human_input):
    company_statements = semantic_search_and_check(human_input)
    return company_statements

def search_financial_statements_for_company(company_financial_statements_query):
    company_statements = semantic_search_and_check(company_financial_statements_query)
    return company_statements

def get_stock_ticker(human_input):
    company_name=get_company_name(human_input)
    company_ticker = query_stock_ticker(company_name)
    return company_ticker

def get_stock_price(stock_ticker):
    stock_price = query_stock_price(stock_ticker)
    return stock_price

def download_10k_filing_from_sec_and_ingest_into_opensearch(stock_ticker):
    result = "download failed."
    try:
        #downloaded_file=get_filings(stock_ticker)
        downloaded_file="000101872424000008-amzn-20231231.json"
        ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file)
        time.sleep(60) #wait the data can be searchable
        result="download succeeded."
    except Exception as e:
        result = "download failed."
    return result

### Note

Uncomment the line `downloaded_file=get_filings(company_stock_ticker)` if you have sec-api security key so that you can download 10-K from SEC. In the meanwhile, comment the line 'downloaded_file="000101872424000008-amzn-20231231.json"`


#### Define tools for financial statements analysis AI agent

In [None]:
from langchain.agents import Tool

annual_report_tools=[
    Tool(
        name="is_financial_statement_related_query",
        func=is_financial_statement_related_query,
        description="""
        Use this tool when you need to know whether user input query is financial statement analysis related query. Human orginal query is the input to this tool. This tool output is whether human input is financial statement analysis related or not. 
        If the query is not finance statement related, please answer \"I am finiancial statement ansysis assitant. I can not answer question which is not finance related.\" and terminate the dialog.
        """
    ),
    Tool(
        name="search_financial_statements_for_company",
        func=search_financial_statements_for_company,
        description="""
        Use this tool to get financial statement of the company. This tool output is company financial statements.
        """
    ),
    Tool(
        name="get_stock_ticker",
        func=get_stock_ticker,
        description="Use this tool when you need to get the company stock ticker. Human orginal query is the input to this tool. This tool will output company stock ticker."
    ),
    Tool(
        name="download_10k_filing_from_sec_and_ingest_into_opensearch",
        func=download_10k_filing_from_sec_and_ingest_into_opensearch,
        description="""
        Use this tool to download company financial statements from internet. Company stock ticker is the input to this tool. The tool output is download succeed or not.
        Use this tool if and only if "search_financial_statements_for_company" output result is empty. After downloading financial statements, you must use "search_financial_statements_for_company" tool to search financial statements again.
        """
    ),
    Tool(
        name="is_stock_related_query",
        func=is_stock_related_query,
        description="Use this tool when you need to know whether user input query is stock related query. Human orginal query is the input to this tool. This tool output is whether human input is stock related or not."
    ),
    Tool(
        name="get_stock_price",
        func=get_stock_price,
        description="""
        Use this tool to get company stock price data. Company stock ticker is the input to this tool. This tool will output company historic stock price. The output includes 'stock_date', 'stock_ticker', 'open_price', 'high_price', 'low_price', 'close_price' of the company in the latest 100 days.
        This tool is mandatory to use if the input query is both finance statement related and stock related. If the output of "get_stock_price" is empty, please answer \"I cannot provide stock analysis without stock price information.\" and terminate the dialog.
        """
    )
]

#### Define prompt for financial statements analysis AI agent 

In [None]:
from langchain_core.prompts import ChatPromptTemplate


system_message = f"""
You are finiancial analyst assistant and you will analyze company financial statements and stock data. 
Leverage the <conversation_history> to avoid duplicating work when answering questions.

Available tools:
<tools>
{{tools}}
</tools>


To answer, first review the <conversation_history>. If insufficient use tool(s) with the following format:
<thinking>Think about which tool(s) to use and why. "get_stock_price" tool is mandatory to use if the input query is both finance statements related and stock related.</thinking>
<tool>tool_name</tool>
<tool_input>input</tool_input>
<observation>response</observation>

When you are done, provide a final answer in markdown within <final_answer></final_answer>.
If <user_input> is stock related and the output of "get_stock_price" tool is empty, respond directly within <final_answer> with the exact content \"I cannot provide stock analysis without stock price information.\".
Otherwise, use the following format to organize your <final_answer>:

Summary:
...

Support points:
Support point 1: ...
Support point 2: ...
Support point 3: ...


"""

user_message = """
Begin!

Previous conversation history:
<conversation_history>
{chat_history}
</conversation_history>

User input message:
<user_input>
{input}
</user_input>

{agent_scratchpad}
"""

# Construct the prompt from the messages
messages = [
    ("system", system_message),
    ("human", user_message),
]

financial_statements_analysis_prompt = ChatPromptTemplate.from_messages(messages)

#### Define memory for financial statements analysis AI agent 

In [None]:
from langchain_community.chat_message_histories import DynamoDBChatMessageHistory
from uuid import uuid4

dynamo = boto3.client('dynamodb')

history_table_name = 'conversation-history-memory'

try:
    response = dynamo.describe_table(TableName=history_table_name)
    print("The table "+history_table_name+" exists")
except dynamo.exceptions.ResourceNotFoundException:
    print("The table "+history_table_name+" does not exist")
    
    dynamo.create_table(
    TableName=history_table_name,
    AttributeDefinitions=[
        {
            'AttributeName': 'SessionId',
            'AttributeType': 'S',
        }
    ],
    KeySchema=[
        {
            'AttributeName': 'SessionId',
            'KeyType': 'HASH',
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5,
    }
    )

    response = dynamo.describe_table(TableName=history_table_name) 
    
    while response["Table"]["TableStatus"] == 'CREATING':
        time.sleep(1)
        print('.', end='')
        response = dynamo.describe_table(TableName=history_table_name) 

    print("\ndynamo DB Table, '"+response['Table']['TableName']+"' is created")



#### Create financial statements analysis AI agent AI using defined Memory,  LLM, tools and prompt

In [None]:
from langchain.agents import create_xml_agent
from langchain.memory import ConversationBufferMemory
from langchain.agents import AgentExecutor


def create_new_memory_with_session(session_id):
    chat_memory = DynamoDBChatMessageHistory(table_name=history_table_name,session_id=session_id)    
    return chat_memory

def get_agentic_chatbot_conversation_chain(session_id, verbose=True):
    chat_memory=create_new_memory_with_session(session_id)
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        # Change the human_prefix from Human to something else
        # to not conflict with Human keyword in Anthropic Claude model.
        human_prefix="Hu",
        chat_memory=chat_memory,
        return_messages=False)

    agent = create_xml_agent(
        bedrock_llm,
        annual_report_tools,
        financial_statements_analysis_prompt,
        stop_sequence=["</tool_input>", "</final_answer>"]
    )

    agent_chain = AgentExecutor(
        agent=agent,
        tools=annual_report_tools,
        return_intermediate_steps=False,
        verbose=True,
        memory=memory,
        handle_parsing_errors="Check your output and make sure it conforms!"
    )
    return agent_chain

### 3.3 Use financial statements analysis AI agent

#### Example 1:

Query is "Per Microsoft financial statements, what Microsoft's research and development organization is responsible for?". 

The data flow is like following:

![example 1](./static/example-1-data-flow.png)


In [None]:
import warnings


langchain.debug=False
warnings.filterwarnings("ignore")

session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Per Microsoft financial statements, what Microsoft's research and development organization is responsible for?"})

In [None]:
print(response["output"])

#### Example 2

Query is "Is Microsoft a good investment choice right now?". 

The data flow is like following:

![example 2](./static/example-2-data-flow.png)

In [None]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Is Microsoft a good investment choice right now?"})


In [None]:
print(response["output"])

#### Example 3

Query is "Compare Oracle and Microsoft company financial statements"

The data flow is like following:

![example 3](./static/example-3-data-flow.png)

In [None]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Compare Oracle and Microsoft company financial statements"})

In [None]:
print(response["output"])

#### Example 4

Query is "Is Amazon a good investment choice right now?"

The data flow is like following:

![example 4](./static/example-4-data-flow.png)

In [None]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Is Amazon a good investment choice right now?"})

In [None]:
print(response["output"])

#### Example 5

Query is "What is OpenSearch?"

The data flow is like following:

![example 5](./static/example-5-data-flow.png)

In [None]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "What is OpenSearch?"})

In [None]:
print(response["output"])