# Generative AI-powered search with Amazon OpenSearch Service

---
### Usage Scenario
Form 10-K is a comprehensive report filed annually by a publicly traded company about its financial performance and is required by the U.S. Securities and Exchange Commission (SEC). Some of the information a company is required to document in the 10-K includes its history, organizational structure, financial statements, earnings per share, subsidiaries, executive compensation, and any other relevant data.

The SEC requires all public companies to file 10-Ks regularly so that investors stay informed about a company's financial condition and have enough information before they buy or sell its securities. A 10-K can appear overly complex at first glance, with tables full of data and figures. However, because it is so comprehensive, the filing is critical for investors who want to understand a company's financial position and prospects.

A Form 10-K is organized into the following items:

- 1 - Business (describes the company's operations)
- 1A - Risk Factors
- 1B - Unresolved Staff Comments
- 2 - Properties
- 3 - Legal Proceedings
- 4 - Mine Safety Disclosures
- 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
- 6 - Selected Financial Data (prior to February 2021)
- 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
- 7A - Quantitative and Qualitative Disclosures about Market Risk
- 8 - Financial Statements and Supplementary Data
- 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
- 9A - Controls and Procedures
- 9B - Other Information
- 10 - Directors, Executive Officers and Corporate Governance
- 11 - Executive Compensation
- 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
- 13 - Certain Relationships and Related Transactions, and Director Independence
- 14 - Principal Accountant Fees and Services
- 15 - Exhibits and Financial Statement Schedules

---

Many investors rely on SEC filings to analyze the financial health of a company, and the filings can be a treasure trove of valuable information. However, keyword-based search may return irrelevant results, and even with semantic search the volume of information can be overwhelming. Can we leverage generative AI to help interpret company financial statements?


In this code talk session, we show how to modernize a search application: improving search relevance with Amazon OpenSearch Service and using generative AI to improve search productivity. The code covers the following topics:
- Comparing search relevance between keyword search and semantic search with Amazon OpenSearch Service.
- How to use Retrieval Augmented Generation (RAG) to improve search productivity.
- How to build an intelligent agent that orchestrates and executes multistep tasks to automate 10-K filing analysis.
- OpenSearch vector store best practices.

---


### Code Structure


The code includes the following sections:
- [Initialize](#Initialize)
- [Part 1: Ingest unstructured data into OpenSearch](#Part-1:-Ingest-unstructured-data-into-OpenSearch)
- [Part 2: Different approaches to search](#Part-2:-Different-appoach-to-search)
    - [2.1 Keyword search](#2.1-Keyword-search)
    - [2.2 Semantic/Vector search](#2.2-Semantic/Vector-search)
    - [2.3 Retrieval Augmented Generation(RAG)](#2.3-Retrieval-Augmented-Generation(RAG))
- [Part 3: AI agent powered search](#Part-3:-AI-agent-powered-search)
    - [3.1 Prepare other tools used by AI agent](#3.1-Prepare-other-tools-used-by-AI-agent)
        - [3.1.1 Ingest and query structured data in Redshift](#3.1.1-Ingest-and-query-structured-data-in-Redshift)
        - [3.1.2 Download 10-K filing from SEC](#3.1.2-Download-10-K-filing-from-SEC)
    - [3.2 Create AI agent](#3.2-Create-AI-agent)
    - [3.3 Use AI agent](#3.3-Use-AI-agent)


## Initialize




Make sure the PyTorch version is greater than or equal to 2.2.0.

In [1]:
import torch
print(torch.__version__)

2.2.0
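
The check above relies on reading the printed version by eye. A minimal sketch of an automated guard follows, assuming version strings of the form `major.minor.patch` with an optional local suffix such as `+cu121`. The helpers `parse_version` and `meets_minimum` are hypothetical names introduced here for illustration; they are not part of PyTorch.

```python
import re

def parse_version(version: str) -> tuple:
    """Extract the leading numeric components, e.g. '2.2.0+cu121' -> (2, 2, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())

def meets_minimum(version: str, minimum: str = "2.2.0") -> bool:
    """Return True if `version` is at least `minimum` (numeric comparison)."""
    return parse_version(version) >= parse_version(minimum)

# In the notebook this would be called as meets_minimum(torch.__version__).
print(meets_minimum("2.2.0"))        # True
print(meets_minimum("2.1.2+cu121"))  # False
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where `"2.10.0"` would sort before `"2.2.0"`.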


### Install dependent Python libraries for OpenSearch, Redshift, and LangChain

In [2]:
%pip install -q opensearch-py
%pip install boto3
%pip install sqlalchemy
%pip install sqlalchemy-redshift
%pip install redshift_connector
%pip install ipython-sql==0.4.1
%pip install langchain==0.3.6
%pip install langchain-aws==0.2.1
%pip install langchain-community==0.3.1

Note: you may need to restart the kernel to use updated packages.
ERROR: Could not find a version that satisfies the requirement langchain==0.3.6 (from versions: 0.0.1, 0.0.2, 0.0.3, 0.0.4, 0.0.5, 0.0.6, 0.0.7, 0.0.8, 0.0.9, 0.0.10, 0.0.11, 0.0.12, 0.0.13, 0.0.14, 0.0.15, 0.0.16, 0.0.17, 0.0.18, 0.0.19, 0.0.20, 0.0.21, 0.0.22, 0.0.23, 0.0.24, 0.0.25, 0.0.26, 0.0.27, 0.0.28, 0.0.29, 0.0.30, 0.0.31, 0.0.32, 0.0.33, 0.0.34, 0.0.35, 0.0.36, 0.0.37, 0.0.38, 0.0.39, 0.0.40, 0.0.41, 0.0.42, 0.0.43

### Import library



In [3]:
import boto3
import re
import time
import json
import sagemaker
from sagemaker.session import Session
import pandas as pd
import os

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Part 1: Ingest unstructured data into OpenSearch

### Get SEC 10-K form files

In [4]:
!wget https://ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com/df655552-1e61-4a6b-9dc4-c03eb94c6f75/10k-financial-filing.zip

--2024-09-28 13:18:08--  https://ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com/df655552-1e61-4a6b-9dc4-c03eb94c6f75/10k-financial-filing.zip
Resolving ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com (ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com)... 3.5.160.145, 52.219.220.178, 3.5.163.143, ...
Connecting to ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com (ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com)|3.5.160.145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 65823210 (63M) [application/zip]
Saving to: ‘10k-financial-filing.zip’


2024-09-28 13:18:10 (37.2 MB/s) - ‘10k-financial-filing.zip’ saved [65823210/65823210]



Unzip the dataset

In [5]:
!unzip 10k-financial-filing.zip

Archive:  10k-financial-filing.zip
  inflating: extracted/1001601_10K_2020_0001493152-21-008913.json  
  inflating: extracted/1002517_10K_2021_0001002517-21-000052.json  
  inflating: extracted/1013462_10K_2020_0001013462-21-000009.json  
  inflating: extracted/1013857_10K_2020_0001013857-21-000032.json  
  inflating: extracted/1015739_10K_2020_0001564590-21-006373.json  
  inflating: extracted/1017655_10K_2020_0001654954-21-003649.json  
  inflating: extracted/1019034_10K_2020_0001437749-21-007486.json  
  inflating: extracted/1019671_10K_2021_0001564590-21-018886.json  
  inflating: extracted/1021435_10K_2020_0001493152-21-007517.json  
  inflating: extracted/1022505_10K_2021_0001493152-21-032215.json  
  inflating: extracted/1023731_10K_2021_0001023731-21-000091.json  
  inflating: extracted/1029744_10K_2021_0001437749-21-028230.json  
  inflating: extracted/1031308_10K_2020_0001628280-21-003727.json  
  inflating: extracted/1036188_10K_2021_0001437749-21-008761.json  
  inflating: 

Read the dataset in JSON format and construct a pandas DataFrame

In [6]:
# Specify the path to the folder containing the JSON files
folder_path = "extracted"

# Initialize an empty list to store list of company 10-K filing file names
company_filing_file_name_list = []

# For this session, we ingest filings for only a few companies.
company_list=["Alteryx, Inc.",
              "MICROSTRATEGY Inc", 
              "Elastic N.V.", 
              "MongoDB, Inc.", 
              "Palo Alto Networks Inc", 
              "Okta, Inc.",
              "Datadog, Inc.", 
              "Snowflake Inc.",
              "SALESFORCE.COM, INC.", 
              "ORACLE CORP",
              "MICROSOFT CORP", 
              "Palantir Technologies Inc."
             ]

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        df = pd.DataFrame([pd.read_json(file_path,typ='series')])
        if df.iloc[0]['company'] in company_list:
            company_filing_file_name_list.append(file_path)


In [7]:
company_filing_file_name_list

['extracted/1321655_10K_2020_0001193125-21-060650.json',
 'extracted/1561550_10K_2020_0001564590-21-009770.json',
 'extracted/1707753_10K_2021_0001707753-21-000026.json',
 'extracted/1660134_10K_2021_0001660134-21-000007.json',
 'extracted/1640147_10K_2021_0001640147-21-000073.json',
 'extracted/1050446_10K_2020_0001564590-21-005783.json',
 'extracted/789019_10K_2021_0001564590-21-039151.json',
 'extracted/1341439_10K_2021_0001564590-21-033616.json',
 'extracted/1689923_10K_2020_0001689923-21-000024.json',
 'extracted/1327567_10K_2021_0001327567-21-000029.json',
 'extracted/1108524_10K_2021_0001108524-22-000008.json',
 'extracted/1441816_10K_2021_0001441816-21-000051.json']
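
The filtering loop above builds a DataFrame for every file just to read one field. A lighter sketch using only the standard library is shown below. It assumes each file holds a single JSON object with a `company` key, which is what the `typ='series'` read above implies. `find_filings` is a hypothetical helper name, and it sorts the file names so the result order is deterministic (unlike `os.listdir`).

```python
import json
import os

def find_filings(folder_path, companies):
    """Return paths of JSON filings whose 'company' field is in `companies`."""
    wanted = set(companies)
    matches = []
    for filename in sorted(os.listdir(folder_path)):
        if not filename.endswith(".json"):
            continue
        file_path = os.path.join(folder_path, filename)
        with open(file_path, encoding="utf-8") as handle:
            record = json.load(handle)
        # .get() avoids a KeyError if a file has no 'company' field
        if record.get("company") in wanted:
            matches.append(file_path)
    return matches

# In the notebook this would be called as:
# company_filing_file_name_list = find_filings(folder_path, company_list)
```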

### Initialize embedding model to vectorize text data

**Embeddings** are numerical representations of data, typically used to convert complex, high-dimensional data into a lower-dimensional space where similar data points are closer together. In the context of natural language processing (NLP), embeddings are used to represent words, phrases, or sentences as vectors of real numbers. These vectors capture semantic relationships, meaning that words with similar meanings are represented by vectors that are close together in the embedding space.

**Embedding models** are machine learning models that are trained to create these numerical representations. They learn to encode various types of data into embeddings that capture the essential characteristics and relationships within the data. For example, in NLP, embedding models like Word2Vec, GloVe, and BERT are trained on large text corpora to produce word embeddings. These embeddings can then be used for various downstream tasks, such as text classification, sentiment analysis, or machine translation. In this notebook, we use them for semantic similarity search.

We use an embedding model to convert questions into vectors and use vector similarity to find semantically similar 10-K content. The following diagrams show the flow:

![Convert Text to Vector](./static/text2vector.png)

---

![opensearch vector store](./static/opensearch-vector-store.png)
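
To make "close together in the embedding space" concrete, here is a toy sketch of two common similarity measures, L2 (Euclidean) distance and cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings have far more dimensions, and the function names are chosen here rather than taken from any library.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: one passage points in nearly the same direction as the question.
question = [0.9, 0.1, 0.0]
passage_close = [0.8, 0.2, 0.1]
passage_far = [0.0, 0.1, 0.9]

for name, passage in [("close", passage_close), ("far", passage_far)]:
    print(name,
          f"l2={l2_distance(question, passage):.3f}",
          f"cosine={cosine_similarity(question, passage):.3f}")
```

The "close" passage has the smaller L2 distance and the larger cosine similarity, so either measure would rank it first for this question.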


In [8]:
#from langchain.embeddings import BedrockEmbeddings
from langchain_aws import BedrockEmbeddings

boto3_bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=f"https://bedrock-runtime.{aws_region}.amazonaws.com")
bedrock_embeddings = BedrockEmbeddings(model_id='amazon.titan-embed-text-v1',client=boto3_bedrock)
#bedrock_embeddings = BedrockEmbeddings(model_id='cohere.embed-multilingual-v3',client=boto3_bedrock)

In [9]:
result = bedrock_embeddings.embed_query("This is a content of the document")
result[0:20]

[0.75390625,
 -0.34375,
 -0.37890625,
 -0.51171875,
 -0.2734375,
 -0.1435546875,
 -0.2470703125,
 -0.000766754150390625,
 0.34765625,
 -0.2734375,
 0.1416015625,
 -0.453125,
 -0.1767578125,
 -0.037109375,
 -0.6875,
 0.0595703125,
 0.28515625,
 -0.1865234375,
 -0.2734375,
 0.3046875]

In [10]:
len(result)

1536

### Create an OpenSearch cluster connection.
Next, we'll use the Python API to set up a connection to the OpenSearch cluster.

Note: if you're using a region other than us-east-1, please update the region in the code below.

#### Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store this in "outputs" to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

In [11]:
import json
region = aws_region

cfn = boto3.client('cloudformation')
# Note: despite the name `kms`, this is an AWS Secrets Manager client.
kms = boto3.client('secretsmanager')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "generative-ai-powered-search"

outputs = get_cfn_outputs(cloudformation_stack_name)
aos_host = outputs['OpenSearchDomainEndpoint']
aos_credentials = json.loads(kms.get_secret_value(SecretId=outputs['OpenSearchSecret'])['SecretString'])

outputs

{'RedshiftClusterSecurityGroupName': 'sg-0320207541f370a92',
 'RedshiftServerlessWorkroup': 'workgroup-63906330',
 's3BucketStock': 'generative-ai-powered-search-s3bucketstock-cmtvs8onrpd7',
 'SageMakerNotebookURL': 'https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/notebook-instances/openNotebook/generative-ai-powered-search?view=classic',
 'Workgroupname': 'workgroup-63906330',
 'VPC': 'vpc-0ff2845a2a8788f36',
 'RedshiftRoleName': 'RedshiftServerlessImmersionRole',
 'OpenSearchDomainName': 'opensearchservi-hcmvdgngawop',
 'OpenSearchSecret': 'arn:aws:secretsmanager:us-east-1:522880334446:secret:OpenSearchSecret-generative-ai-powered-search-YmqeMU',
 'RedshiftRoleNameArn': 'arn:aws:iam::522880334446:role/RedshiftServerlessImmersionRole',
 'NamespaceName': 'namespace-63906330',
 'AdminUsername': 'awsuser',
 'OpenSearchDomainEndpoint': 'vpc-opensearchservi-hcmvdgngawop-wyavi3nuw53o7nzlzl4jficaeu.us-east-1.es.amazonaws.com',
 'RedshiftServerlessEndpoint': 'workgroup-6390633

In [12]:
from opensearchpy import OpenSearch, RequestsHttpConnection

auth = (aos_credentials['username'], aos_credentials['password'])
aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

### Create an index in Amazon OpenSearch Service

The OpenSearch k-NN plugin introduces a custom data type, the knn_vector, that allows users to ingest their k-NN vectors into an OpenSearch index and perform different kinds of k-NN search. 
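
As a preview of how a `knn_vector` field is queried, the sketch below builds the request body for an approximate k-NN search. The `knn` query clause with `vector` and `k` is part of the OpenSearch query DSL. The helper name `build_knn_query` is hypothetical, and the default field name `item_vector` is an assumption that matches the index mapping used in this notebook.

```python
def build_knn_query(query_vector, k=10, field="item_vector"):
    """Build an approximate k-NN search body for a knn_vector field."""
    return {
        "size": k,  # number of hits returned to the caller
        "query": {
            "knn": {
                field: {
                    "vector": list(query_vector),  # the embedded question
                    "k": k,                        # neighbors to retrieve
                }
            }
        },
    }

# Toy 2-dimensional vector for illustration only; the real field expects
# a vector with the same dimension as the index mapping.
print(build_knn_query([0.1, 0.2], k=3))
```

In the notebook, the vector would come from the embedding model, for example `aos_client.search(index=index_name, body=build_knn_query(bedrock_embeddings.embed_query(question)))`.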

---
#### OpenSearch Approximate Nearest Neighbor Algorithms and Engines
![ANN algorithm](./static/ann-algorithm.png)

---

#### HNSW parameter tuning
![hnsw parameter tuning](./static/hnsw-parameter-tuning.png)

---

#### IVF parameter tuning
![ivf parameter tuning](./static/ivf-parameter-tuning.png)

#### How to select the engine and algorithms
![opensearch ann comparison](./static/opensearch-ann-selection.png)

In [13]:
knn_index = {
    "settings": {
        "index.knn": True,
        #"index.knn.space_type": "cosinesimil"
    },
    "mappings": {
        "properties": {
            "item_vector": {
                "type": "knn_vector",
                "dimension": 1536,
                "store": True,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "nmslib",
                    "parameters": {
                      "ef_construction": 128,
                      "m": 24
                    }
                }
            },
            "item_content": {
                "type": "text",
                "store": True
            },
            "company_name": {
                "type": "text",
                "store": True
            }
        }
    }
}
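
When choosing `m` and the vector dimension, it helps to estimate how much native memory the HNSW graph will need. The OpenSearch k-NN documentation gives a rule of thumb of `1.1 * (4 * dimension + 8 * M)` bytes per vector. The sketch below applies that estimate to the settings above; `hnsw_memory_bytes` is a hypothetical helper, and the result is an approximation rather than an exact figure.

```python
def hnsw_memory_bytes(num_vectors, dimension, m):
    """Rule-of-thumb native memory for HNSW: 1.1 * (4 * d + 8 * M) bytes per vector."""
    return 1.1 * (4 * dimension + 8 * m) * num_vectors

# Settings from the index definition above: dimension=1536, m=24
per_vector = hnsw_memory_bytes(1, dimension=1536, m=24)
print(f"{per_vector:.1f} bytes per vector")  # 6969.6 bytes per vector

# Estimate for one million vectors, expressed in GiB
million = hnsw_memory_bytes(1_000_000, dimension=1536, m=24)
print(f"{million / 1024**3:.2f} GiB for 1M vectors")
```

Most of the cost comes from the `4 * dimension` term (four bytes per float), so for 1536-dimensional embeddings the choice of `m` has a comparatively small effect on memory.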


Using the index definition above, we now create the index in Amazon OpenSearch Service.

In [14]:
index_name="10k_financial"

# Drop any existing index so the notebook can be re-run from a clean state
if aos_client.indices.exists(index=index_name):
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=index_name)
else:
    print("index does not exist.")

aos_client.indices.create(index=index_name,body=knn_index,ignore=400)

delete existing index before creating new one


{'acknowledged': True, 'shards_acknowledged': True, 'index': '10k_financial'}

Let's verify the created index information

In [15]:
aos_client.indices.get(index=index_name)


{'10k_financial': {'aliases': {},
  'mappings': {'properties': {'company_name': {'type': 'text', 'store': True},
    'item_content': {'type': 'text', 'store': True},
    'item_vector': {'type': 'knn_vector',
     'store': True,
     'dimension': 1536,
     'method': {'engine': 'nmslib',
      'space_type': 'l2',
      'name': 'hnsw',
      'parameters': {'ef_construction': 128, 'm': 24}}}}},
  'settings': {'index': {'replication': {'type': 'DOCUMENT'},
    'number_of_shards': '5',
    'provided_name': '10k_financial',
    'knn': 'true',
    'creation_date': '1727529505127',
    'number_of_replicas': '1',
    'uuid': 'CxLm1ueMS3SKCuuBxpHdvw',
    'version': {'created': '136367827'}}}}}

### Load the raw data into the index
Next, let's load the financial filing data into the index we've just created.

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
import pandas

from typing import Any, Dict, List, Optional, Sequence

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class PandasDataFrameLoader(BaseLoader):
    
    def __init__(self,dataframe:pandas.DataFrame):
        self.dataframe=dataframe
        
    def load(self) -> List[Document]:
        """Create one Document per 10-K item for every row of the DataFrame."""
        docs = []
        items=["item_1","item_1A","item_1B","item_2","item_3","item_4","item_5","item_6","item_7","item_7A","item_8","item_9","item_9A", "item_9B", "item_10", "item_11", "item_12", "item_13", "item_14", "item_15"]
        
        for index, row in self.dataframe.iterrows():
            # Metadata shared by every item of this filing
            base_metadata={
                "cik": row['cik'],
                "company_name": row['company'],
                "filing_date": row['filing_date'],
            }
            for item in items:
                content=row[item]
                # Build a fresh dict per Document so each one keeps its own 'item'
                # value instead of sharing (and overwriting) a single metadata dict.
                metadata={**base_metadata, "item": item}
                doc = Document(page_content=content,metadata=metadata)
                docs.append(doc)
        return docs

Use the Bedrock embedding model to convert item content into vectors, then use the OpenSearch bulk API to store the data in the index.

In [17]:
import time
from opensearchpy import helpers

def ingest_downloaded_10k_into_opensearch(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    item_contents=[]
    company_name=splitted_documents[0].metadata['company_name']
    for doc in splitted_documents:
        item_contents.append(doc.page_content)
    
    print("\ncompany:" + company_name + ", item count:" + str(len(item_contents)))
    start = time.time()
    embedding_results = bedrock_embeddings.embed_documents(item_contents)
    end = time.time()
    elapsed = end - start
    #print(f"total time elapsed for Bedrock embedding: {elapsed:.2f} seconds")
        
    # Pair each chunk with its embedding to build the bulk actions
    data = [
        {"_index": index_name, "company_name": company_name, "item_content": content, "item_vector": vector}
        for content, vector in zip(item_contents, embedding_results)
    ]
    aos_response= helpers.bulk(aos_client, data)
    print(f"Bulk-inserted {aos_response[0]} items.")
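
The ingest function splits each item with `chunk_size = 8000` and `chunk_overlap = 200`. The sketch below illustrates what those two parameters mean using a plain fixed-width sliding window. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its actual chunk boundaries will differ. `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks where neighbors share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Small numbers so the overlap is easy to see
sample = "".join(str(i % 10) for i in range(25))
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. The overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.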

In [18]:
for file in company_filing_file_name_list:
    ingest_downloaded_10k_into_opensearch(file)
    print("Ingested :" + file)


company:Palantir Technologies Inc., item count:88
Bulk-inserted 88 items.
Ingested :extracted/1321655_10K_2020_0001193125-21-060650.json

company:Datadog, Inc., item count:67
Bulk-inserted 67 items.
Ingested :extracted/1561550_10K_2020_0001564590-21-009770.json

company:Elastic N.V., item count:79
Bulk-inserted 79 items.
Ingested :extracted/1707753_10K_2021_0001707753-21-000026.json

company:Okta, Inc., item count:96
Bulk-inserted 96 items.
Ingested :extracted/1660134_10K_2021_0001660134-21-000007.json

company:Snowflake Inc., item count:67
Bulk-inserted 67 items.
Ingested :extracted/1640147_10K_2021_0001640147-21-000073.json

company:MICROSTRATEGY Inc, item count:65
Bulk-inserted 65 items.
Ingested :extracted/1050446_10K_2020_0001564590-21-005783.json

company:MICROSOFT CORP, item count:57
Bulk-inserted 57 items.
Ingested :extracted/789019_10K_2021_0001564590-21-039151.json

company:ORACLE CORP, item count:63
Bulk-inserted 63 items.
Ingested :extracted/1341439_10K_2021_0001564590-21-

To validate the load, we'll query the number of documents in the index.

In [19]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])

Records found: 873.
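
A `match_all` search works for this small index, but by default search totals are tracked accurately only up to 10,000 hits. For larger indexes, the `_count` API is a more direct way to validate a load. The sketch below wraps the client's `count` call. `count_documents` and `FakeClient` are hypothetical names; the fake client stands in for `aos_client` so the example runs without a cluster, and its response value is only an example.

```python
def count_documents(client, index):
    """Return the document count reported by the index's _count API."""
    response = client.count(index=index)
    return response["count"]

class FakeClient:
    """Stand-in for an OpenSearch client so the sketch runs without a cluster."""
    def count(self, index):
        # Shape of a _count response; the number is an example value.
        return {"count": 873, "_shards": {"total": 5, "successful": 5, "failed": 0}}

print("Records found: %d." % count_documents(FakeClient(), "10k_financial"))
```

In the notebook, the call would be `count_documents(aos_client, index_name)`.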


## Part 2: Different appoach to search

### 2.1 Keyword search
---
Keyword search finds information by matching the terms of a "query" against a large body of text. It relies on exact term matching, commonly called "keyword match", without considering the meaning or context behind the words.


![Keyword Search](./static/keyword-search-flow.png)
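
To see why exact term matching can surface irrelevant text, consider the toy sketch below. It counts distinct query terms that appear in a document. This is a deliberate simplification, not the BM25 scoring that OpenSearch applies, and the two example documents are made up.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_overlap(query, document):
    """Count how many distinct query terms appear in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))

query = "What is the research organization responsible for?"
docs = {
    "relevant": "Our research organization is responsible for new product design.",
    "irrelevant": "What the finance team is accountable for is the budget.",
}

for name, text in docs.items():
    print(name, term_overlap(query, text))
```

The irrelevant document still matches four query terms ("what", "is", "the", "for"), all of them common words that say nothing about the topic. Real scoring down-weights frequent terms, but the underlying limitation is the same: matching words is not the same as matching meaning.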



In [20]:
query_text="What is Microsoft's research and development organization responsible for?"

Run the query and check the search result. Some irrelevant documents are returned.

In [21]:
query={
  "size": 10,
  "query": {
    "match": {
      "item_content": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_content" : {}
    }
  }
}
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content'], hit['highlight']['item_content'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content","item_content_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Unnamed: 0,_id,_score,company_name,item_content,item_content_highlight
0,aJjIOJIBwhwjFiEHpeEY,14.089218,"Datadog, Inc.","Research and Development\nOur research and development organization is responsible for the design, development, testing and delivery of new technologies, features and integrations of our platform, as well as the continued improvement and iteration of our existing products. It is also responsible for operating and scaling our platform including the underlying cloud infrastructure. Our research and development investments seek to drive core technology innovation and bring new products to marke...","<em>Research</em> <em>and</em> <em>Development</em>\nOur <em>research</em> <em>and</em> <em>development</em> <em>organization</em> <em>is</em> <em>responsible</em> <em>for</em> the design, <em>development</em>"
1,WJjJOJIBwhwjFiEHpeP5,13.378643,"Alteryx, Inc.","Competitive Pay and Benefits\nWe strive to provide pay, comprehensive benefits and services that help meet the varying needs of our associates. Our total rewards package includes market-competitive pay, including equity compensation, paid time off, and other comprehensive and competitive global benefits. For example, in the United States, we provide 12 weeks of paid parental leave for all new parents (either through birth or adoption). And, for all of our associates, we offer competitive fin...","<em>is</em> recognized by corporate executives, IT personnel, <em>and</em> <em>organization</em> leaders as the solution to their"
2,WpjJOJIBwhwjFiEHI-LG,11.62011,Snowflake Inc.,"◦Cloning. Our architecture enables us to offer zero-copy cloning, an operation by which entire tables, schemas, or databases can be duplicated-or cloned-without having to copy or duplicate the underlying data. Our platform leverages the separation between cloud services and storage to be able to track independent clones of objects sharing the same physical copy of the underlying data. This enables a variety of customer use cases such as making copies of production data for data scientists, c...","<em>Research</em> <em>and</em> <em>Development</em>\nOur <em>research</em> <em>and</em> <em>development</em> <em>organization</em> <em>is</em> <em>responsible</em> <em>for</em> the design, <em>development</em>"
3,-pjJOJIBwhwjFiEHAeH-,10.365611,"Okta, Inc.","Robust Security\nSecurity is a mission-critical issue for Okta and for our customers. Our approach to security spans day-to-day operational practices to the design and development of our software to how customer data is segmented and secured within our multi-tenant platform. We ensure that access to our platform is securely delegated across an organization. Our source code is updated weekly, and there are audited and verifiable security checkpoints to ensure source code fidelity and continuo...",Robust Security\nSecurity <em>is</em> a mission-critical issue <em>for</em> Okta <em>and</em> <em>for</em> our customers.
4,g5jJOJIBwhwjFiEHI-LG,10.353295,Snowflake Inc.,"4) Allocate the transaction price to performance obligations in the contract. If the contract contains a single performance obligation, the entire transaction price is allocated to the single performance obligation. Contracts that contain multiple performance obligations require an allocation of the transaction price to each performance obligation based on a relative SSP. The determination of a relative SSP for each distinct performance obligation requires judgment. The Company determines SS...","<em>responsible</em> <em>for</em> maintaining the Company's service availability <em>and</em> security of its platform, <em>and</em> professional"
5,25jJOJIBwhwjFiEHX-K4,9.406379,MICROSOFT CORP,"Item 1\nThe investments we make in sustainability carry through to our products, services, and devices. We design our devices, from Surface to Xbox, to minimize their impact on the environment. Our cloud and AI services and datacenters help businesses cut energy consumption, reduce physical footprints, and design sustainable products. We also pledged a $50 million investment in AI for Earth to accelerate innovation by putting AI in the hands of those working to directly address sustainabilit...",our mission to empower every person <em>and</em> every <em>organization</em> on the planet to achieve more.
6,D5jIOJIBwhwjFiEHguGm,9.159437,Palantir Technologies Inc.,"Our software brings government-grade security to industry, and the breadth of private sector experience to government.\nOur roots in the intelligence community and defense sector introduced us to a set of unique challenges when it comes to building software that other companies in Silicon Valley and elsewhere either did not or could not address.\nOur software had to be secure enough to handle national secrets, stable enough to support soldiers’ wartime decisions, and transparent enough to en...",<em>and</em> <em>development</em> as well as features built based on the operational needs of nearly every major industry
7,Z5jIOJIBwhwjFiEHpeEY,9.106748,"Datadog, Inc.","•\nSaaS Platform. Our cloud based multi-tenant SaaS platform allows for real-time ingestion, and analysis of massive amounts of data, without our customers needing to worry about the provisioning, sizing and capacity of their monitoring platform.\n•\nOne Data Model. Every piece of data that is ingested by our platform is consistently tagged with metadata regardless of its type. This allows for different kinds of performance data, such as a log event and an application trace, to be queried to...",into <em>what</em> <em>is</em> occurring in a customer’s IT environment <em>and</em> power faster troubleshooting.\n•\nOut-Of-The-Box
8,CZjJOJIBwhwjFiEH6eR_,9.106746,"SALESFORCE.COM, INC.","The topics covered in this section are informed by an internal ESG materiality assessment completed in fiscal 2020, which assessed both the impact on our business and the importance to our stakeholders, as well as by relevant topics identified through third-party ESG reporting frameworks, standards and metrics, such as the Sustainability Accounting Standards Board (“SASB”), and the Task Force on Climate-Related Financial Disclosures (“TCFD”). More information on our key ESG programs, goals a...",world <em>and</em> provided an allowance <em>for</em> employees to use <em>for</em> equipment to improve their ability to work from
9,J5jIOJIBwhwjFiEHguGm,9.043163,Palantir Technologies Inc.,"Our platforms contain “open source” software, and any failure to comply with the terms of one or more of these open source licenses could negatively affect our business.\nOur platforms are distributed with software licensed by its authors or other third parties under “open source” licenses. Some of these licenses contain requirements that we make available source code for modifications or derivative works we create based upon the open source software, and that we license these modifications ...","requests from our <em>development</em> <em>organization</em> <em>for</em> the use of open source software, <em>and</em> the use of software"
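
Many highlights above match only common words such as "is", "and", and "for". One option within keyword search is to set the `operator` of the `match` query to `and`, so that every query term must appear in a document. This is a trade-off rather than a fix: for a natural-language question it may return few or no hits. The sketch below also reads the highlight defensively, since a hit is not guaranteed to carry one. `build_match_query` and `hits_to_rows` are hypothetical helper names; the field names follow the index used in this notebook.

```python
def build_match_query(text, operator="or", size=10, field="item_content"):
    """Build a match query with highlighting; operator='and' requires all terms."""
    return {
        "size": size,
        "query": {"match": {field: {"query": text, "operator": operator}}},
        "highlight": {
            "pre_tags": ["<em>"],
            "post_tags": ["</em>"],
            "fields": {field: {}},
        },
    }

def hits_to_rows(response, field="item_content"):
    """Flatten search hits into rows, tolerating hits without a highlight."""
    rows = []
    for hit in response["hits"]["hits"]:
        highlights = hit.get("highlight", {}).get(field) or [""]
        source = hit["_source"]
        rows.append([hit["_id"], hit["_score"], source["company_name"],
                     source[field], highlights[0]])
    return rows

# Made-up response with a single hit that has no highlight section
fake_response = {"hits": {"hits": [
    {"_id": "1", "_score": 1.0,
     "_source": {"company_name": "Example Corp", "item_content": "text"}}
]}}
print(hits_to_rows(fake_response))
```

In the notebook, the rows could be loaded into a DataFrame with the same column list used above, for example `hits_to_rows(aos_client.search(index=index_name, body=build_match_query(query_text, operator="and")))`.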


In [22]:
query_text="What is Microsoft's main revenue?"

Run the query and check the search result. Some irrelevant documents are returned.

In [23]:
query={
  "size": 10,
  "query": {
    "match": {
      "item_content": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_content" : {}
    }
  }
}
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content'], hit['highlight']['item_content'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content","item_content_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Unnamed: 0,_id,_score,company_name,item_content,item_content_highlight
0,iZjIOJIBwhwjFiEHpeEY,7.370281,"Datadog, Inc.","Quarterly Cost of Revenue Trends\nOur quarterly cost of revenue has generally increased quarter-over-quarter in each period presented above primarily as a result of third-party cloud infrastructure hosting and software costs, as well as increase headcount, which resulted in increased personnel expenses.\nQuarterly Gross Margin Trends\nOur quarterly gross margins have fluctuated between 73% and 80% in each period presented. Our gross margins decreased in the last three quarters ended December...",Quarterly Cost of <em>Revenue</em> Trends\nOur quarterly cost of <em>revenue</em> has generally increased quarter-over-quarter
1,4JjJOJIBwhwjFiEHX-K4,6.010574,MICROSOFT CORP,"•\nExperiences and Devices, focuses on instilling a unifying product ethos across our end-user experiences and devices, including Office, Windows, Enterprise Mobility + Security, and Surface.\n•\nAI and Research, focuses on our AI innovations and other forward-looking research and development efforts spanning infrastructure, services, applications, and search.\n•\nLinkedIn, focuses on our services that transform the way customers hire, market, sell, and learn.\n•\nGaming, focuses on developi...","While our <em>main</em> product research and development facilities are located in Redmond, Washington, we also"
2,DpjIOJIBwhwjFiEHguGm,5.878239,Palantir Technologies Inc.,"Building flexible applications on this data asset allows the data to be made accessible and understandable to the people who need it. We bring real-time operational data to decision makers at the world’s leading institutions.\nOur Software Does Not Displace Existing Systems, It Augments Them\nFlexibility and openness are core tenets of our software. By integrating their existing solutions into our central operating system, organizations can choose to maintain key historic investments without...",Security <em>is</em> always our first priority.
3,eZjJOJIBwhwjFiEHI-LG,5.483123,Snowflake Inc.,"We intend to continue to make significant investments in research and development as we enhance our platform. We also intend to invest in our sales and marketing organization to drive future revenue growth. As a result of the closing of our IPO, we have incurred and expect to continue to incur additional expenses as a result of operating as a public company, including costs to comply with the rules and regulations applicable to companies listed on a national securities exchange, costs relate...",We also intend to invest in our sales and marketing organization to drive future <em>revenue</em> growth.
4,DZjIOJIBwhwjFiEHguGm,5.339501,Palantir Technologies Inc.,"ITEM 1. BUSINESS\nOverview\nWe build software platforms for large institutions whose work is essential to our way of life. Those institutions must be able to function in times of stability as well as crisis and uncertainty. To do so, they need software that works.\nWe were founded in 2003 and started building software for the intelligence community in the United States to assist in counterterrorism investigations and operations. We later began working with commercial enterprises.\nWe have bu...","sources to reports from confidential informants, and helps U.S. and allied military personnel find <em>what</em>"
5,epjJOJIBwhwjFiEHI-LG,5.253618,Snowflake Inc.,"For the fiscal year ended January 31, 2020, net cash used in operating activities was $176.6 million, primarily consisting of our net loss of $348.5 million, adjusted for non-cash charges of $122.6 million, and net cash inflows of $49.3 million provided by changes in our operating assets and liabilities, net of effect of acquisitions. The main drivers of the changes in operating assets and liabilities, net of effect of acquisitions, were a $223.0 million increase in deferred revenue, resulti...","The <em>main</em> drivers of the changes in operating assets and liabilities, net of effect of acquisitions, were"
6,L5jIOJIBwhwjFiEHguGm,5.21816,Palantir Technologies Inc.,"Evolving government procurement policies and increased emphasis on cost over performance could adversely affect our business.\nFederal, state, local, and foreign governments and government agencies could implement procurement policies that negatively impact our profitability. Changes in procurement policy favoring more non-commercial purchases, different pricing, or evaluation criteria or government contract negotiation offers based upon the customer’s view of what our pricing should be may ...",or evaluation criteria or government contract negotiation offers based upon the customer’s view of <em>what</em>
7,pZjJOJIBwhwjFiEHx-Nf,5.180108,Palo Alto Networks Inc,"Other than fiscal 2012, we have incurred losses in all fiscal years since our inception. As a result, we had an accumulated deficit of $1.7 billion as of July 31, 2021. We anticipate that our operating expenses will continue to increase in the foreseeable future as we continue to grow our business. Our growth efforts may prove more expensive than we currently anticipate, and we may not succeed in increasing our revenues sufficiently, or at all, to offset increasing expenses. Revenue growth m...","<em>Revenue</em> growth may slow or <em>revenue</em> may decline for a number of possible reasons, including the downturn"
8,qJjIOJIBwhwjFiEHzeHF,5.168217,Elastic N.V.,"Item 1. Business.\nElastic is a search company.\nSearch is foundational to a wide variety of experiences. Elastic makes the power of search-the ability to instantly find relevant information and insights from large amounts of data-available for a diverse set of applications and solutions, including Enterprise Search, Observability, and Security.\nElastic powers the search behind a ride sharing app to help locate nearby riders and drivers. Elastic powers the search for finding the right produ...","But how and <em>what</em> we search changes over time, from the Dewey Decimal System for libraries to Google for"
9,EZjIOJIBwhwjFiEHguGm,4.92216,Palantir Technologies Inc.,"These failures have made both software buyers and vendors highly risk-averse. Institutions often doubt that any vendor can implement a working solution and are unwilling to invest. On the other hand, smaller technology companies are often unable to compete for complex, large-scale opportunities because installation costs and the risks of failure are too high, and the sales cycles too long.\nThese are precisely the opportunities we target. Rather than reject projects with risky and resource i...",Each one <em>is</em> now an existing or potential customer.


### 2.2 Semantic/Vector search

---
Semantic search uses machine learning to capture the meaning of a query rather than only matching its literal terms. By accounting for the intent and context behind the words, it aims to return results that are more relevant than those of plain keyword search.
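As a minimal sketch of the idea, the snippet below ranks a few documents by how close their vectors are to a query vector. The three-dimensional vectors are hand-made, hypothetical values, and cosine similarity is an assumption chosen for readability; a real index uses embeddings from a model and may use a different distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query_vector, doc_vectors, k=2):
    """Return the k (doc_id, similarity) pairs closest to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real ones come from an embedding model.
doc_vectors = {
    "research_and_development": [0.9, 0.1, 0.0],
    "revenue_by_segment": [0.1, 0.9, 0.1],
    "legal_proceedings": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

top = knn(query_vector, doc_vectors, k=2)
print(top)
```

The query shares no literal words with the document names; the ranking depends only on vector proximity, which is what lets semantic search match on meaning.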

![Semantic Search](./static/semantic-search-flow.png)

---


![Semantic Search Architecture](./static/semantic-search-architecture.png)



In [24]:
query_text="What Microsoft's research and development organization is responsible for?"

Run the query and check the search result. 

In [25]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

Unnamed: 0,_id,_score,company_name,item_content
0,4JjJOJIBwhwjFiEHX-K4,0.006448,MICROSOFT CORP,"•\nExperiences and Devices, focuses on instilling a unifying product ethos across our end-user experiences and devices, including Office, Windows, Enterprise Mobility + Security, and Surface.\n•\nAI and Research, focuses on our AI innovations and other forward-looking research and development efforts spanning infrastructure, services, applications, and search.\n•\nLinkedIn, focuses on our services that transform the way customers hire, market, sell, and learn.\n•\nGaming, focuses on developi..."
1,25jJOJIBwhwjFiEHX-K4,0.005694,MICROSOFT CORP,"Item 1\nThe investments we make in sustainability carry through to our products, services, and devices. We design our devices, from Surface to Xbox, to minimize their impact on the environment. Our cloud and AI services and datacenters help businesses cut energy consumption, reduce physical footprints, and design sustainable products. We also pledged a $50 million investment in AI for Earth to accelerate innovation by putting AI in the hands of those working to directly address sustainabilit..."
2,4ZjJOJIBwhwjFiEHX-K4,0.005662,MICROSOFT CORP,"Distributors and Resellers\nOrganizations also license our products and services indirectly, primarily through licensing solution partners (“LSP”), distributors, value-added resellers (“VAR”), and retailers. Although each type of reselling partner may reach organizations of all sizes, LSPs are primarily engaged with large organizations, distributors resell primarily to VARs, and VARs typically reach small and medium organizations. ESAs are also typically authorized as LSPs and operate as res..."
3,35jJOJIBwhwjFiEHX-K4,0.005599,MICROSOFT CORP,"•\nConstraints in the supply chain of device components.\n•\nPiracy.\nWindows Commercial revenue, which includes volume licensing of the Windows operating system and Windows cloud services such as Microsoft Defender Advanced Threat Protection, is affected mainly by the demand from commercial customers for volume licensing and Software Assurance (“SA”), as well as advanced security offerings. Windows Commercial revenue often reflects the number of information workers in a licensed enterprise ..."
4,3JjJOJIBwhwjFiEHX-K4,0.005441,MICROSOFT CORP,"We strive to include others by holding ourselves accountable for diversity, driving global systemic change in our workplace and workforce, and creating an inclusive work environment. Through this commitment we can allow everyone the chance to be their authentic selves and do their best work every day. We support multiple highly active Employee Resource Groups for women, families, racial and ethnic minorities, military, people with disabilities, or who identify as LGBTQI+, where employees can..."
5,4pjJOJIBwhwjFiEHX-K4,0.005328,MICROSOFT CORP,"Ms. Hogan was appointed Executive Vice President, Human Resources in November 2014. Prior to that Ms. Hogan was Corporate Vice President of Microsoft Services. She also served as Corporate Vice President of Customer Service and Support. Ms. Hogan joined Microsoft in 2003. Ms. Hogan also serves on the Board of Directors of Alaska Air Group, Inc.\nMs. Hood was appointed Executive Vice President and Chief Financial Officer in July 2013, subsequent to her appointment as Chief Financial Officer i..."
6,55jJOJIBwhwjFiEHX-K4,0.005163,MICROSOFT CORP,"Issues in the use of AI in our offerings may result in reputational harm or liability. We are building AI into many of our offerings and we expect this element of our business to grow. We envision a future in which AI operating in our devices, applications, and the cloud helps our customers be more productive in their work and personal lives. As with many disruptive innovations, AI presents risks and challenges that could affect its adoption, and therefore our business. AI algorithms may be ..."
7,3pjJOJIBwhwjFiEHX-K4,0.004929,MICROSOFT CORP,"•\nEnterprise Services, including Premier Support Services and Microsoft Consulting Services.\nServer Products and Cloud Services\nAzure is a comprehensive set of cloud services that offer developers, IT professionals, and enterprises freedom to build, deploy, and manage applications on any platform or device. Customers can use Azure through our global network of datacenters for computing, networking, storage, mobile and web application services, AI, IoT, cognitive services, and machine lear..."
8,6JjJOJIBwhwjFiEHX-K4,0.004592,MICROSOFT CORP,"Government regulatory actions and court decisions such as these may result in fines or hinder our ability to provide the benefits of our software to consumers and businesses, reducing the attractiveness of our products and the revenue that come from them. New competition law actions could be initiated, potentially using previous actions as precedent. The outcome of such actions, or steps taken to avoid them, could adversely affect us in a variety of ways, including:\n•\nWe may have to choose..."
9,D5jIOJIBwhwjFiEHguGm,0.004576,Palantir Technologies Inc.,"Our software brings government-grade security to industry, and the breadth of private sector experience to government.\nOur roots in the intelligence community and defense sector introduced us to a set of unique challenges when it comes to building software that other companies in Silicon Valley and elsewhere either did not or could not address.\nOur software had to be secure enough to handle national secrets, stable enough to support soldiers’ wartime decisions, and transparent enough to en..."


In [26]:
query_text="What is Microsoft's main revenue?"

Run the query and check the search result. 

In [27]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

Unnamed: 0,_id,_score,company_name,item_content
0,9ZjJOJIBwhwjFiEHX-K4,0.005685,MICROSOFT CORP,"Gross margin increased $18.9 billion or 20% driven by growth across each of our segments and the change in estimated useful lives of our server and network equipment. Gross margin percentage increased with the change in estimated useful lives of our server and network equipment. Excluding this impact, gross margin percentage decreased slightly driven by gross margin percentage reduction in More Personal Computing. Commercial cloud gross margin percentage increased 4 points to 71% driven by g..."
1,35jJOJIBwhwjFiEHX-K4,0.004918,MICROSOFT CORP,"•\nConstraints in the supply chain of device components.\n•\nPiracy.\nWindows Commercial revenue, which includes volume licensing of the Windows operating system and Windows cloud services such as Microsoft Defender Advanced Threat Protection, is affected mainly by the demand from commercial customers for volume licensing and Software Assurance (“SA”), as well as advanced security offerings. Windows Commercial revenue often reflects the number of information workers in a licensed enterprise ..."
2,CJjJOJIBwhwjFiEHX-O4,0.004747,MICROSOFT CORP,"Our More Personal Computing segment consists of products and services that put customers at the center of the experience with our technology. This segment primarily comprises:\n•\nWindows, including Windows OEM licensing and other non-volume licensing of the Windows operating system; Windows Commercial, comprising volume licensing of the Windows operating system, Windows cloud services, and other Windows commercial offerings; patent licensing; Windows Internet of Things; and MSN advertising...."
3,9JjJOJIBwhwjFiEHX-K4,0.004608,MICROSOFT CORP,"PART II\nItem 7\nChange in Accounting Estimate\nIn July 2020, we completed an assessment of the useful lives of our server and network equipment and determined we should increase the estimated useful life of server equipment from three years to four years and increase the estimated useful life of network equipment from two years to four years. This change in accounting estimate was effective beginning fiscal year 2021. Based on the carrying amount of server and network equipment included in ..."
4,85jJOJIBwhwjFiEHX-K4,0.0043,MICROSOFT CORP,"Item 7\nITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS\nThe following Management’s Discussion and Analysis of Financial Condition and Results of Operations (“MD&A”) is intended to help the reader understand the results of operations and financial condition of Microsoft Corporation. MD&A is provided as a supplement to, and should be read in conjunction with, our consolidated financial statements and the accompanying Notes to Financial Statements ..."
5,_ZjJOJIBwhwjFiEHX-K4,0.00411,MICROSOFT CORP,"The consolidated financial statements include the accounts of Microsoft Corporation and its subsidiaries. Intercompany transactions and balances have been eliminated.\nEstimates and Assumptions\nPreparing financial statements requires management to make estimates and assumptions that affect the reported amounts of assets, liabilities, revenue, and expenses. Examples of estimates and assumptions include: for revenue recognition, determining the nature and timing of satisfaction of performance..."
6,3pjJOJIBwhwjFiEHX-K4,0.003936,MICROSOFT CORP,"•\nEnterprise Services, including Premier Support Services and Microsoft Consulting Services.\nServer Products and Cloud Services\nAzure is a comprehensive set of cloud services that offer developers, IT professionals, and enterprises freedom to build, deploy, and manage applications on any platform or device. Customers can use Azure through our global network of datacenters for computing, networking, storage, mobile and web application services, AI, IoT, cognitive services, and machine lear..."
7,25jJOJIBwhwjFiEHX-K4,0.003901,MICROSOFT CORP,"Item 1\nThe investments we make in sustainability carry through to our products, services, and devices. We design our devices, from Surface to Xbox, to minimize their impact on the environment. Our cloud and AI services and datacenters help businesses cut energy consumption, reduce physical footprints, and design sustainable products. We also pledged a $50 million investment in AI for Earth to accelerate innovation by putting AI in the hands of those working to directly address sustainabilit..."
8,4JjJOJIBwhwjFiEHX-K4,0.003852,MICROSOFT CORP,"•\nExperiences and Devices, focuses on instilling a unifying product ethos across our end-user experiences and devices, including Office, Windows, Enterprise Mobility + Security, and Surface.\n•\nAI and Research, focuses on our AI innovations and other forward-looking research and development efforts spanning infrastructure, services, applications, and search.\n•\nLinkedIn, focuses on our services that transform the way customers hire, market, sell, and learn.\n•\nGaming, focuses on developi..."
9,BpjJOJIBwhwjFiEH6eR_,0.003811,"SALESFORCE.COM, INC.","not a significant driver of the increase in revenues for the period. Revenues from term and perpetual software licenses, which are recognized at a point in time, represent approximately six percent of total subscription and support revenues for fiscal 2021. Subscription and support revenues accounted for approximately 94 percent of our total revenues for both fiscal 2021 and fiscal 2020.\nThe acquisition of Tableau in August 2019 contributed approximately $1.5 billion and $652 million to tot..."


### 2.3 Retrieval Augmented Generation (RAG)

In RAG, external data can be sourced from various data sources, such as document repositories, databases, or APIs. The first step is to convert the documents and the user query into a format that enables comparison and allows for performing relevancy search. To achieve comparability for relevancy search, a document collection (knowledge library) and the user-submitted query are transformed into numerical representations using embedding language models. These embeddings are essentially numerical representations of concepts in text.

Next, based on the embedding of the user query, relevant text is identified in the document collection through similarity search in the embedding space. The retrieved text is then added to the user's prompt as context. This augmented prompt, which combines the original question with the relevant external data, is sent to the large language model (LLM). Because the context contains the relevant external data, the model's output is more likely to be relevant and accurate.

The major components of RAG are embedding, the vector database, retrieval (augmentation), and generation, with optional end-to-end training:


- Embedding
  - Purpose: Embeddings transform text data into numerical vectors in a high-dimensional space. These vectors represent the semantic meaning of the text.
  - Process: The embedding process typically uses pre-trained models (like BERT or a variant) to convert both the input queries and the documents in the database into dense vectors.
  - Role in RAG: Embeddings are crucial for the retrieval component as they allow the model to compute the similarity between the query and the documents in the database efficiently.
- Vector Database
  - Function: A vector database stores the embeddings of a large collection of documents or passages.
  - Construction: It is created by processing a vast corpus (like Wikipedia or other specialized datasets) through an embedding model.
  - Usage in RAG: When a query comes in, the model searches this database to find the documents whose embeddings are most similar to the embedding of the query.
- Retrieval (Augmentation)
  - Mechanism: The retrieval part of RAG functions by taking the input query, converting it into a vector using embeddings, and then searching the vector database to retrieve relevant documents.
  - Result: It augments the original query with additional context by selecting documents or passages that are semantically related to the query. This augmented information is essential for generating more informed responses.
- Generation
  - Integration with a Language Model: The generative component, often a large language model like Amazon Titan Text, receives both the original query and the retrieved documents.
  - Response Generation: It synthesizes information from these inputs to produce a coherent and contextually appropriate response.
  - Training and Fine-Tuning: This component is generally pre-trained on vast amounts of text and may be further fine-tuned to optimize its performance for specific tasks or datasets.
- End-to-End Training (Optional)
  - Joint Optimization: In RAG, both retrieval and generation components can be fine-tuned together, allowing the system to optimize the selection of documents and the generation of responses simultaneously.
  - Feedback Loop: The model learns not only to generate relevant responses but also to retrieve the most useful documents for any given query.
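The embed, retrieve, and augment steps above can be sketched without any external service. In this illustration, `embed` is a toy bag-of-words stand-in for a real embedding model, the two passages are short sample sentences, and the prompt wording is hypothetical; the generation step is omitted because it requires an LLM.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        passages, key=lambda p: cosine(query_vector, embed(p)), reverse=True
    )
    return ranked[:k]


def build_prompt(question, context_passages):
    """Augment the question with retrieved context (hypothetical template)."""
    context = "\n".join(context_passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


passages = [
    "azure is a comprehensive set of cloud services",
    "ms. hood was appointed chief financial officer in july 2013",
]
question = "who is the chief financial officer"

context = retrieve(question, passages, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

In a real pipeline the printed prompt would be sent to the LLM, and the passages would come from a vector database rather than an in-memory list.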

---
### Architecture

![RAG](./static/RAG_Architecture.png)

---

In [28]:
langchain_index_name="10k_financial_embedding"

In [29]:
# Delete the index if it already exists so that ingestion starts from a clean state.
if aos_client.indices.exists(index=langchain_index_name):
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=langchain_index_name)
else:
    print("index does not exist.")
    

delete existing index before creating new one


In [30]:
from langchain.vectorstores import OpenSearchVectorSearch
from typing import Callable

os_domain_ep = 'https://'+aos_host


def ingest_10k_into_opensearch_with_langchain(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    OpenSearchVectorSearch.from_documents(
                index_name = langchain_index_name,
                documents=splitted_documents,
                embedding=bedrock_embeddings,
                opensearch_url=os_domain_ep,
                http_auth=auth
    )
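To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified fixed-width splitter. It is only a sketch: `split_with_overlap` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it does not try to break on paragraph, line, or word boundaries first.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap repeats the tail of each chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.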

In [31]:
for file in company_filing_file_name_list:
    ingest_10k_into_opensearch_with_langchain(file)
    print("Ingested :" + file)


Ingested :extracted/1321655_10K_2020_0001193125-21-060650.json
Ingested :extracted/1561550_10K_2020_0001564590-21-009770.json
Ingested :extracted/1707753_10K_2021_0001707753-21-000026.json
Ingested :extracted/1660134_10K_2021_0001660134-21-000007.json
Ingested :extracted/1640147_10K_2021_0001640147-21-000073.json
Ingested :extracted/1050446_10K_2020_0001564590-21-005783.json
Ingested :extracted/789019_10K_2021_0001564590-21-039151.json
Ingested :extracted/1341439_10K_2021_0001564590-21-033616.json
Ingested :extracted/1689923_10K_2020_0001689923-21-000024.json
Ingested :extracted/1327567_10K_2021_0001327567-21-000029.json
Ingested :extracted/1108524_10K_2021_0001108524-22-000008.json
Ingested :extracted/1441816_10K_2021_0001441816-21-000051.json


In [32]:
aos_client.indices.get(index=langchain_index_name)

{'10k_financial_embedding': {'aliases': {},
  'mappings': {'properties': {'metadata': {'properties': {'cik': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'company_name': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'filing_date': {'type': 'date'},
      'item': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}},
    'text': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'vector_field': {'type': 'knn_vector',
     'dimension': 1536,
     'method': {'engine': 'nmslib',
      'space_type': 'l2',
      'name': 'hnsw',
      'parameters': {'ef_construction': 512, 'm': 16}}}}},
  'settings': {'index': {'replication': {'type': 'DOCUMENT'},
    'number_of_shards': '5',
    'knn.algo_param': {'ef_search': '512'},
    'provided_name': '10k_financial_embedding',
    'knn': 'true',
    'creation_date': '1727529632

In [33]:

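class SimiliarOpenSearchVectorSearch(OpenSearchVectorSearch):
    """OpenSearchVectorSearch variant that uses the raw OpenSearch score as the relevance score."""

    def relevance_score(self, distance: float) -> float:
        # Return the value unchanged so that score_threshold is compared against the OpenSearch score.
        return distance

    def _select_relevance_score_fn(self) -> Callable[[float], float]:
        return self.relevance_score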


open_search_vector_store = SimiliarOpenSearchVectorSearch(
                                    index_name=langchain_index_name,
                                    embedding_function=bedrock_embeddings,
                                    opensearch_url=os_domain_ep,
                                    http_auth=auth
                                    ) 


Initialize the Bedrock LLM with Claude.

In [34]:
from langchain_aws import BedrockLLM, ChatBedrock

#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0", client=boto3_bedrock)
bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=boto3_bedrock)
#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-opus-20240229-v1:0", client=boto3_bedrock)

bedrock_llm.model_kwargs = {"temperature":0.001,"top_k":300,"top_p":1}


#### Note: The prompts in this section are designed for Claude 3. Results may differ with other LLMs, for example because of guardrails.

In [35]:
from langchain.chains import RetrievalQA
import langchain

bedrock_retriever = open_search_vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        'k': 5,
        'score_threshold': 0.005
    }
)
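A `score_threshold` of 0.005 looks small, but it is plausible for the index mapping shown earlier, which uses the `l2` space type. The sketch below assumes the convention described in the OpenSearch k-NN documentation, where a squared L2 distance `d` is converted to a score of `1 / (1 + d)`. The helper names are hypothetical and the snippet does not query OpenSearch.

```python
def l2_score(squared_distance):
    """Score for a squared L2 distance, assuming score = 1 / (1 + d)."""
    return 1.0 / (1.0 + squared_distance)


def max_squared_distance(score_threshold):
    """Largest squared L2 distance whose score still meets the threshold."""
    return 1.0 / score_threshold - 1.0


threshold = 0.005
limit = max_squared_distance(threshold)
print(f"score >= {threshold} means squared L2 distance <= {limit:.1f}")
print(f"score at distance 0: {l2_score(0.0)}")
```

Under this assumption, raising the threshold keeps only closer matches. Scores from an index built with a different space type are on a different scale, so the threshold would need retuning.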

In [36]:
rag_qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm,
    retriever=bedrock_retriever,
    chain_type="stuff" #stuff, refine, map_reduce, and map_rerank
)

In [37]:
question="What Microsoft's research and development organization is responsible for?"

langchain.debug=True
result = rag_qa({"query": question})




[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "What Microsoft's research and development organization is responsible for?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "What Microsoft's research and development organization is responsible for?",
  "context": "•\nExperiences and Devices, focuses on instilling a unifying product ethos across our end-user experiences and devices, including Office, Windows, Enterprise Mobility + Security, and Surface.\n•\nAI and Research, focuses on our AI innovations and other forward-looking research and development efforts spanning infrastructure, services, applications, and search.\n•\nLinkedIn, focuses on our services that transform the way customers hire, market, 

In [38]:
print("Result:" + result["result"])

Result:According to the information provided, Microsoft has several engineering groups responsible for research and development of its products and services:

- Cloud and AI focuses on developing cloud infrastructure, server, database, CRM, ERP, management and development tools, AI cognitive services, and other business process applications and services for enterprises.

- Experiences and Devices focuses on end-user experiences and devices like Office, Windows, Enterprise Mobility + Security, and Surface.

- AI and Research focuses on AI innovations and other forward-looking research and development efforts spanning infrastructure, services, applications, and search.

- Gaming focuses on developing hardware, content, and services across platforms to grow the user base through game experiences and social interaction.

Additionally, Microsoft Research is described as one of the world's largest corporate research organizations that works closely with universities to advance computer scien

In [39]:
question="What is Microsoft main revenue?"

langchain.debug=False
result = rag_qa({"query": question})


In [40]:
print("Result:" + result["result"])

Result:Based on the information provided, Microsoft's main revenue sources come from its three reportable segments:

1. Productivity and Business Processes
This segment includes Office commercial products and cloud services (e.g. Office 365 Commercial), Office consumer products and cloud services, LinkedIn, and Dynamics products and cloud services. Key products driving revenue in this segment are Office 365 Commercial and LinkedIn.

2. Intelligent Cloud  
This segment includes server products and cloud services (with Azure being the key driver), as well as Enterprise Services like Premier Support Services. Azure revenue grew 50% and was the main growth driver for this segment.

3. More Personal Computing
This segment includes Windows OEM revenue, Windows Commercial products and cloud services, Gaming (Xbox content/services and hardware), Search advertising, and Surface devices. Key revenue drivers were Gaming (up 33% driven by Xbox content/services and hardware) and Search advertising.

## Part 3: AI agent powered search

![standard rag limitation](./static/rag-limitation.png)

### What is an AI agent?
An agentic LLM assistant employs a chain-of-thought reasoning process, in which the LLM is prompted to think step by step through a question, interleaving its reasoning with calls to external tools such as search engines and APIs. This allows the LLM to retrieve information that answers parts of the question, leading to a more comprehensive and accurate final response. The approach is inspired by the "Reason and Act" (ReAct) design introduced in the paper [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/pdf/2210.03629), which aims to combine the reasoning capabilities of language models with the ability to interact with external resources and take actions. By combining the two, an agentic LLM assistant can provide better-informed answers to complex user queries.
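The reason-then-act loop can be illustrated with a scripted stand-in for the model. Everything here is hypothetical: `scripted_llm` returns canned steps instead of calling an LLM, `lookup_ticker` is a toy tool, and the `Action: tool[argument]` text format is just one simple convention for this sketch.

```python
def lookup_ticker(company_name):
    """Hypothetical tool: map a company name to a ticker from a tiny table."""
    table = {"microsoft": "MSFT"}
    return table.get(company_name.lower(), "unknown")


TOOLS = {"lookup_ticker": lookup_ticker}


def scripted_llm(transcript):
    """Stand-in for an LLM: picks the next step from the transcript so far."""
    if "Observation:" not in transcript:
        return "Thought: I need the ticker first.\nAction: lookup_ticker[Microsoft]"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: I have what I need.\nFinal Answer: {observation}"


def react(question, llm, tools, max_steps=3):
    """Alternate model steps and tool calls until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip(), transcript
        # Parse "Action: tool_name[argument]" and run the named tool.
        action = step.split("Action:", 1)[1].strip()
        tool_name, argument = action.rstrip("]").split("[", 1)
        observation = tools[tool_name](argument)
        transcript += f"Observation: {observation}\n"
    return None, transcript


answer, transcript = react("What is Microsoft's stock ticker?", scripted_llm, TOOLS)
print(transcript)
```

Each tool result is appended to the transcript as an observation, so the next model step can reason over it. A real agent replaces `scripted_llm` with a model call and registers tools for each data source.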

### Why build an AI agent?
In today's digital landscape, enterprises are inundated with a vast array of data sources, ranging from traditional PDF documents to complex SQL and NoSQL databases, and everything in between. While this wealth of information holds immense potential for gaining valuable insights and driving operational efficiency, the sheer volume and diversity of data can often pose significant challenges in terms of accessibility and utilization.

This is where the power of agentic LLM assistants comes into play. By leveraging progress in LLM design patterns such as Reason and Act (ReAct) and other traditional or novel design patterns, these intelligent assistants are capable of integrating with an enterprise's diverse data sources. Through the development of specialized tools tailored to each data source and the ability of LLM agents to identify the right tool for a given question, agentic LLM assistants can simplify how you navigate and extract relevant information, regardless of its origin or structure.

This enables a rich, multi-source conversation that promises to unlock the full potential of the entreprise data, enabling data-driven decision-making, enhancing operational efficiency, and ultimately driving productivity and growth.


### Architecture

The following is the overall architecture on agent based finincial filings analysis:

![generative ai powered search](./static/architecture.png)

---

### 3.1 Prepare other tools used by AI agent

#### 3.1.1 Ingest and query structured data in Redshift

Get Redshift Serverless username, password and endpoint

In [41]:
redshift_serverless_credentials = json.loads(kms.get_secret_value(SecretId=outputs['RedshiftServerlessSecret'])['SecretString'])
redshift_serverless_username=redshift_serverless_credentials['username']
redshift_serverless_password=redshift_serverless_credentials['password']
redshift_serverless_endpoint =  outputs['RedshiftServerlessEndpoint']

Create `stock_symbol` table and populate the table from S3. We will use this table to query company stock ticker by company name.

In [42]:
import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import Session
%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = URL.create(
drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
host=redshift_serverless_endpoint, 
port=5439,
database='dev',
username=redshift_serverless_username,
password=redshift_serverless_password
)

%sql $connect_to_db
%sql select current_user, version();

%sql CREATE TABLE IF NOT EXISTS public.stock_symbol (stock_symbol text PRIMARY KEY, company_name text NOT NULL);

stock_price_bucket = outputs["s3BucketStock"]
s3_location = f's3://{stock_price_bucket}/stock-price/'
print(s3_location)
!aws s3 sync ./stock-price/ $s3_location

stock_symbol_s3_location = f's3://{stock_price_bucket}/stock-price/stock_symbol.csv'
quoted_stock_symbol_s3_location = "'" + stock_symbol_s3_location + "'"

%sql COPY STOCK_SYMBOL FROM $quoted_stock_symbol_s3_location iam_role default IGNOREHEADER 1 CSV;


url = URL.create(
    drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
    host=redshift_serverless_endpoint, 
    port=5439,
    database='dev',
    username=redshift_serverless_username,
    password=redshift_serverless_password
)

engine = sa.create_engine(url)
redshift_connection = engine.connect()
    
def query_stock_ticker(company_name):
    strSQL = "SELECT stock_symbol FROM stock_symbol WHERE lower(company_name) ILIKE '%" + company_name + "%'"
    stock_ticker=''
    try:
        result = redshift_connection.execute(strSQL)
        df = pd.DataFrame(result)
        stock_ticker=df['stock_symbol'][0]
    except Exception as e:
        print(e)
    return stock_ticker


 * redshift+redshift_connector://awsuser:***@workgroup-63906330.522880334446.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector://awsuser:***@workgroup-63906330.522880334446.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
s3://generative-ai-powered-search-s3bucketstock-cmtvs8onrpd7/stock-price/
 * redshift+redshift_connector://awsuser:***@workgroup-63906330.522880334446.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.


In [43]:
query_stock_ticker("Amazon")

'AMZN'

In [44]:
%sql CREATE TABLE IF NOT EXISTS public.stock_price (stock_date DATE, stock_symbol text, open_price DECIMAL, high_price DECIMAL, low_price DECIMAL, close_price DECIMAL, adjusted_close_price DECIMAL, volume DECIMAL);

msft_s3_location = f's3://{stock_price_bucket}/stock-price/MSFT.csv'
quoted_msft_s3_location = "'" + msft_s3_location + "'"
print(quoted_msft_s3_location)
print("---------")

crm_s3_location = f's3://{stock_price_bucket}/stock-price/CRM.csv'
quoted_crm_s3_location = "'" + crm_s3_location + "'"
print(quoted_crm_s3_location)
print("---------")

orcl_s3_location = f's3://{stock_price_bucket}/stock-price/ORCL.csv'
quoted_orcl_s3_location = "'" + orcl_s3_location + "'"
print(quoted_orcl_s3_location)
print("---------")

snow_s3_location = f's3://{stock_price_bucket}/stock-price/SNOW.csv'
quoted_snow_s3_location = "'" + snow_s3_location + "'"
print(quoted_snow_s3_location)
print("---------")

%sql COPY STOCK_PRICE FROM $quoted_msft_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_crm_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_orcl_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_snow_s3_location iam_role default IGNOREHEADER 1 CSV;


 * redshift+redshift_connector://awsuser:***@workgroup-63906330.522880334446.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
's3://generative-ai-powered-search-s3bucketstock-cmtvs8onrpd7/stock-price/MSFT.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-cmtvs8onrpd7/stock-price/CRM.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-cmtvs8onrpd7/stock-price/ORCL.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-cmtvs8onrpd7/stock-price/SNOW.csv'
---------
 * redshift+redshift_connector://awsuser:***@workgroup-63906330.522880334446.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector://awsuser:***@workgroup-63906330.522880334446.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector://awsuser:***@workgroup-63906330.522880334446.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector://awsuser:***@workgroup-63906330.522880334446.us-e

[]

In [45]:
%sql select * from public.stock_price

 * redshift+redshift_connector://awsuser:***@workgroup-63906330.522880334446.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.


stock_date,stock_symbol,open_price,high_price,low_price,close_price,adjusted_close_price,volume
2022-01-07,MSFT,314,316,310,314,309,32720000
2022-07-14,MSFT,250,255,245,254,251,25102800
2023-01-17,MSFT,237,240,237,240,238,29831300
2023-07-21,MSFT,349,350,339,343,343,69368900
2022-04-04,CRM,214,221,214,218,218,7843600
2022-10-06,CRM,155,158,154,155,155,3853900
2023-04-12,CRM,190,191,189,190,190,5033400
2023-10-13,CRM,205,207,203,204,204,3677500
2022-06-29,ORCL,69,70,68,69,67,8077900
2022-12-30,ORCL,81,81,80,81,80,5375700


In [46]:
def query_stock_price(stock_ticker):
    strSQL = "SELECT stock_date, stock_symbol, open_price, high_price, low_price, close_price FROM stock_price WHERE stock_symbol ='" + stock_ticker + "' limit 100"
    try:
        result = redshift_connection.execute(strSQL)
        stock_price = pd.DataFrame(result)
    except Exception as e:
        print(e)
    return stock_price

In [47]:
query_stock_price('MSFT')

Unnamed: 0,stock_date,stock_symbol,open_price,high_price,low_price,close_price
0,2022-02-17,MSFT,296,296,290,290
1,2022-08-23,MSFT,276,278,275,276
2,2023-02-27,MSFT,252,252,249,250
3,2023-08-30,MSFT,328,329,326,328
4,2022-01-20,MSFT,309,311,301,301
5,2022-07-26,MSFT,259,259,249,251
6,2023-01-27,MSFT,248,249,246,248
7,2023-08-02,MSFT,333,333,326,327
8,2022-06-27,MSFT,268,268,263,264
9,2022-12-28,MSFT,236,239,234,234


#### 3.1.2 Download 10-K filing from SEC
---
https://sec-api.io

Create a new account and get free API key.



In [48]:
!pip install sec-api



##### Replace your sec-api key in the following line

In [49]:
sec_api_key="{security_api_key}"

In [50]:
from sec_api import ExtractorApi, QueryApi
import json
import os

def get_filings(ticker):
    global sec_api_key

    # Finding Recent Filings with QueryAPI
    queryApi = QueryApi(api_key=sec_api_key)
    query = {
      "query": f"ticker:{ticker} AND formType:\"10-K\"",
      "from": "0",
      "size": "1",
      "sort": [{ "filedAt": { "order": "desc" } }]
    }
    response = queryApi.get_filings(query)

    # Getting 10-K URL
    filing_url = response["filings"][0]["linkToFilingDetails"]
    filing_type=response['filings'][0]['formType']
    cik=response['filings'][0]['cik']
    company=response['filings'][0]['companyName']
    filing_date=response['filings'][0]['filedAt']
    period_of_report=response['filings'][0]['periodOfReport']
    filing_html_index=response['filings'][0]['linkToFilingDetails']
    complete_text_filing_link=response['filings'][0]['linkToTxt']

    # Extracting Text with ExtractorAPI
    extractorApi = ExtractorApi(api_key=sec_api_key)
    
    one_text = extractorApi.get_section(filing_url, "1", "text")       #Section 1 - Business
    onea_text = extractorApi.get_section(filing_url, "1A", "text")     # Section 1A - Risk Factors
    oneb_text = extractorApi.get_section(filing_url, "1B", "text")     # Section 1B - Unresolved Staff Comments
    two_text = extractorApi.get_section(filing_url, "2", "text")       # Section 2 - Properties
    three_text = extractorApi.get_section(filing_url, "3", "text")     # Section 3 - Legal Proceedings
    four_text = extractorApi.get_section(filing_url, "4", "text")      # Section 4 - Mine Safety Disclosures
    five_text = extractorApi.get_section(filing_url, "5", "text")      # Section 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
    six_text = extractorApi.get_section(filing_url, "6", "text")       # Section 6 - Selected Financial Data (prior to February 2021)
    seven_text = extractorApi.get_section(filing_url, "7", "text")     # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
    sevena_text = extractorApi.get_section(filing_url, "7A", "text")   # Section 7A - Quantitative and Qualitative Disclosures about Market Risk
    eight_text = extractorApi.get_section(filing_url, "8", "text")     # Section 8 - Financial Statements and Supplementary Data
    nine_text = extractorApi.get_section(filing_url, "9", "text")      # Section 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
    ninea_text = extractorApi.get_section(filing_url, "9A", "text")    # Section 9A - Controls and Procedures
    nineb_text = extractorApi.get_section(filing_url, "9B", "text")    # Section 9B - Other Information
    ten_text = extractorApi.get_section(filing_url, "10", "text")      # Section 10 - Directors, Executive Officers and Corporate Governance
    eleven_text = extractorApi.get_section(filing_url, "11", "text")   # Section 11 - Executive Compensation
    twelve_text = extractorApi.get_section(filing_url, "12", "text")   # Section 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
    thirteen_text = extractorApi.get_section(filing_url, "13", "text") # Section 13 - Certain Relationships and Related Transactions, and Director Independence
    fourteen_text = extractorApi.get_section(filing_url, "14", "text") # Section 14 - Principal Accountant Fees and Services
    fifteen_text = extractorApi.get_section(filing_url, "15", "text")  # Section 15 - Exhibits and Financial Statement Schedules
    
    data = {}
    data['filing_url'] = filing_url
    data['filing_type'] = filing_type
    data['cik'] = cik
    data['company'] = company
    data['filing_date'] = filing_date
    data['period_of_report'] = period_of_report
    data['filing_html_index'] = filing_html_index
    data['complete_text_filing_link'] = complete_text_filing_link
    
    
    data['item_1'] = one_text
    data['item_1A'] = onea_text
    data['item_1B'] = oneb_text
    data['item_2'] = two_text
    data['item_3'] = three_text
    data['item_4'] = four_text
    data['item_5'] = five_text
    data['item_6'] = six_text
    data['item_7'] = seven_text
    data['item_7A'] = sevena_text
    data['item_8'] = eight_text
    data['item_9'] = nine_text
    data['item_9A'] = ninea_text
    data['item_9B'] = nineb_text
    data['item_10'] = ten_text
    data['item_11'] = eleven_text
    data['item_12'] = twelve_text
    data['item_13'] = thirteen_text
    data['item_14'] = fourteen_text
    data['item_15'] = fifteen_text
    
    json_data = json.dumps(data)
    
    
    if not os.path.exists("./download_filings"):
        os.makedirs("./download_filings")
    
    try:
        file_name = filing_url.split("/")[-2] + "-" + filing_url.split("/")[-1].split(".")[0]+".json"
        download_to = "./download_filings/" + file_name
        with open(download_to, "w") as f:
          json.dump(data, f, ensure_ascii=False, indent=4)
    except Exception as e:
        print("Problem with {url}".format(url=url))
        print(e)
    
    return file_name

In [51]:
#downloaded_file=get_filings("AMZN")
#ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file)

### 3.2 Create AI agent

#### Define methods used by AI agent

One popular architecture for building agents is ReAct. ReAct combines reasoning and acting in an iterative process - in fact the name "ReAct" stands for "Reason" and "Act".

The general flow looks like this:

- The model will "think" about what step to take in response to an input and any previous observations.
- The model will then choose an action from available tools (or choose to respond to the user).
- The model will generate arguments to that tool.
- The agent runtime (executor) will parse out the chosen tool and call it with the generated arguments.
- The executor will return the results of the tool call back to the model as an observation.
- This process repeats until the agent chooses to respond.

In [52]:
from langchain.prompts.chat import ChatPromptTemplate
from langchain.chains import LLMChain

def is_financial_statement_related_query(human_input):
    #template = """You are a helpful assistant to judge if the human input is stock related question.
    #If it is stock related, answer \"yes\". Otherwise answer \"no\"."""
    template = """You are a helpful assistant to judge if the human input is trying to analyze company financial statement.
    If the human input is financial statement related question, answer \"yes\". Otherwise answer \"no\".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related

def is_stock_related_query(human_input):
    #template = """You are a helpful assistant to judge if the human input is stock related question.
    #If it is stock related, answer \"yes\". Otherwise answer \"no\"."""
    template = """
    You are a helpful assistant to judge if the human input is stock related question. 
    If the human innput is stock related question, return "yes".Otherwise return "no".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related

def get_company_name(human_input):
    template = """You are a helpful assistant who extract company name from the human input.Please only output the company"""
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )

    company_name=llm_chain({"text":human_input})['text'].strip()
    return company_name
    
def semantic_search_and_check(human_input, k=10,with_post_filter=True):

    company_name=get_company_name(human_input)
    
    search_vector = bedrock_embeddings.embed_query(human_input)

    no_post_filter_search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        }
    }

    post_filter_search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        },
        "post_filter": {
           "match": { 
               "company_name":company_name
           }
        }
    }
    
    search_query=no_post_filter_search_query
    if with_post_filter:
        search_query=post_filter_search_query
    
    res = aos_client.search(index=index_name, 
                       body=search_query,
                       stored_fields=["company_name","item_category","item_content"])
    
    query_result=[]
    for hit in res['hits']['hits']:
        hit_company=hit['fields']['company_name'][0]
        print("\nsemantic search hit company: " + hit_company)
        row=[hit['fields']['company_name'][0], hit['fields']['item_content'][0]]
        query_result.append(row)

    query_result_df = pd.DataFrame(data=query_result,columns=["company_name","company_financial_statements"])
    return query_result_df

def search_for_similiar_content_in_10k_filing(human_input):
    company_statements = semantic_search_and_check(human_input)
    return company_statements

def search_financial_statements_for_company(company_financial_statements_query):
    company_statements = semantic_search_and_check(company_financial_statements_query)
    return company_statements

def get_stock_ticker(human_input):
    company_name=get_company_name(human_input)
    company_ticker = query_stock_ticker(company_name)
    return company_ticker

def get_stock_price(stock_ticker):
    stock_price = query_stock_price(stock_ticker)
    return stock_price

def download_10k_filing_from_sec_and_ingest_into_opensearch(stock_ticker):
    result = "download failed."
    try:
        #downloaded_file=get_filings(stock_ticker)
        downloaded_file="000101872424000008-amzn-20231231.json"
        ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file)
        result="download succeeded."
    except Exception as e:
        result = "download failed."
    return result

### Note

Uncomment the line `downloaded_file=get_filings(company_stock_ticker)` if you have sec-api security key so that you can download 10-K from SEC. In the meanwhile, comment the line 'downloaded_file="000101872424000008-amzn-20231231.json"`



---
![OpenSearch KNN Filter](./static/opensearch-knn-filter.png)

#### Define tools for financial statements analysis AI agent

In [53]:
from langchain.agents import Tool

annual_report_tools=[
    Tool(
        name="is_financial_statement_related_query",
        func=is_financial_statement_related_query,
        description="""
        Use this tool when you need to know whether user input query is financial statement analysis related query. Human orginal query is the input to this tool. This tool output is whether human input is financial statement analysis related or not. 
        If the query is not finance statement related, please answer \"I am finiancial statement ansysis assitant. I can not answer question which is not finance related.\" and terminate the dialog.
        """
    ),
    Tool(
        name="search_financial_statements_for_company",
        func=search_financial_statements_for_company,
        description="""
        Use this tool to get financial statement of the company. This tool output is company financial statements.
        """
    ),
    Tool(
        name="get_stock_ticker",
        func=get_stock_ticker,
        description="Use this tool when you need to get the company stock ticker. Human orginal query is the input to this tool. This tool will output company stock ticker."
    ),
    Tool(
        name="download_10k_filing_from_sec_and_ingest_into_opensearch",
        func=download_10k_filing_from_sec_and_ingest_into_opensearch,
        description="""
        Use this tool to download company financial statements from internet. Company stock ticker is the input to this tool. The tool output is download succeed or not.
        Use this tool if and only if "search_financial_statements_for_company" output result is empty. After downloading financial statements, you must use "search_financial_statements_for_company" tool to search financial statements again.
        """
    ),
    Tool(
        name="is_stock_related_query",
        func=is_stock_related_query,
        description="Use this tool when you need to know whether user input query is stock related query. Human orginal query is the input to this tool. This tool output is whether human input is stock related or not."
    ),
    Tool(
        name="get_stock_price",
        func=get_stock_price,
        description="""
        Use this tool to get company stock price data. Company stock ticker is the input to this tool. This tool will output company historic stock price. The output includes 'stock_date', 'stock_ticker', 'open_price', 'high_price', 'low_price', 'close_price' of the company in the latest 100 days.
        This tool is mandatory to use if the input query is both finance statement related and stock related. If the output of "get_stock_price" is empty, please answer \"I cannot provide stock analysis without stock price information.\" and terminate the dialog.
        """
    )
]

#### Define prompt for financial statements analysis AI agent 

In [54]:
from langchain_core.prompts import ChatPromptTemplate


system_message = f"""
You are finiancial analyst assistant and you will analyze company financial statements and stock data. 
Leverage the <conversation_history> to avoid duplicating work when answering questions.

Available tools:
<tools>
{{tools}}
</tools>


To answer, first review the <conversation_history>. If insufficient use tool(s) with the following format:
<thinking>Think about which tool(s) to use and why. "get_stock_price" tool is mandatory to use if the input query is both finance statements related and stock related.</thinking>
<tool>tool_name</tool>
<tool_input>input</tool_input>
<observation>response</observation>

When you are done, provide a final answer in markdown within <final_answer></final_answer>.
If <user_input> is stock related and the output of "get_stock_price" tool is empty, respond directly within <final_answer> with the exact content \"I cannot provide stock analysis without stock price information.\".
Otherwise, use the following format to organize your <final_answer>:

Summary:
...

Support points:
Support point 1: ...
Support point 2: ...
Support point 3: ...


"""

user_message = """
Begin!

Previous conversation history:
<conversation_history>
{chat_history}
</conversation_history>

User input message:
<user_input>
{input}
</user_input>

{agent_scratchpad}
"""

# Construct the prompt from the messages
messages = [
    ("system", system_message),
    ("human", user_message),
]

financial_statements_analysis_prompt = ChatPromptTemplate.from_messages(messages)

#### Define memory for financial statements analysis AI agent 

In [55]:
from langchain_community.chat_message_histories import DynamoDBChatMessageHistory
from uuid import uuid4

dynamo = boto3.client('dynamodb')

history_table_name = 'conversation-history-memory'

try:
    response = dynamo.describe_table(TableName=history_table_name)
    print("The table "+history_table_name+" exists")
except dynamo.exceptions.ResourceNotFoundException:
    print("The table "+history_table_name+" does not exist")
    
    dynamo.create_table(
    TableName=history_table_name,
    AttributeDefinitions=[
        {
            'AttributeName': 'SessionId',
            'AttributeType': 'S',
        }
    ],
    KeySchema=[
        {
            'AttributeName': 'SessionId',
            'KeyType': 'HASH',
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5,
    }
    )

    response = dynamo.describe_table(TableName=history_table_name) 
    
    while response["Table"]["TableStatus"] == 'CREATING':
        time.sleep(1)
        print('.', end='')
        response = dynamo.describe_table(TableName=history_table_name) 

    print("\ndynamo DB Table, '"+response['Table']['TableName']+"' is created")



The table conversation-history-memory exists


#### Create financial statements analysis AI agent AI using defined Memory,  LLM, tools and prompt

In [56]:
from langchain.agents import create_xml_agent
from langchain.memory import ConversationBufferMemory
from langchain.agents import AgentExecutor


def create_new_memory_with_session(session_id):
    chat_memory = DynamoDBChatMessageHistory(table_name=history_table_name,session_id=session_id)    
    return chat_memory

def get_agentic_chatbot_conversation_chain(session_id, verbose=True):
    chat_memory=create_new_memory_with_session(session_id)
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        # Change the human_prefix from Human to something else
        # to not conflict with Human keyword in Anthropic Claude model.
        human_prefix="Hu",
        chat_memory=chat_memory,
        return_messages=False)

    agent = create_xml_agent(
        bedrock_llm,
        annual_report_tools,
        financial_statements_analysis_prompt,
        stop_sequence=["</tool_input>", "</final_answer>"]
    )

    agent_chain = AgentExecutor(
        agent=agent,
        tools=annual_report_tools,
        return_intermediate_steps=False,
        verbose=True,
        memory=memory,
        handle_parsing_errors="Check your output and make sure it conforms!"
    )
    return agent_chain

### 3.3 Use financial statements analysis AI agent

#### Example 1:

Ask the queustion "Per Microsoft financial statements, what Microsoft's research and development organization is responsible for?". 



In [57]:
import warnings


langchain.debug=False
warnings.filterwarnings("ignore")

session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Per Microsoft financial statements, what Microsoft's research and development organization is responsible for?"})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m<thinking>
To answer this query, I need to:
1. Check if the query is related to financial statements using the is_financial_statement_related_query tool.
2. If it is related, search for Microsoft's financial statements using the search_financial_statements_for_company tool.
3. If the financial statements are not found, use the get_stock_ticker tool to get Microsoft's stock ticker, then use the download_10k_filing_from_sec_and_ingest_into_opensearch tool to download and ingest Microsoft's 10-K filing into the search engine.
4. Search the financial statements again using search_financial_statements_for_company.
5. Analyze the financial statements to find information about Microsoft's research and development organization.
</thinking>

<tool>is_financial_statement_related_query</tool>
<tool_input>Per Microsoft financial statements, what Microsoft's research and development organization is responsible for?[0m[36;1m[1;3mYes, th

In [58]:
print(response["output"])


Summary:
According to Microsoft's financial statements, the company's AI and Research organization is responsible for its forward-looking research and development efforts spanning infrastructure, services, applications, and search. This includes Microsoft's AI innovations and other R&D initiatives.

Support points:

1. In the section describing Microsoft's business segments, it states: "AI and Research, focuses on our AI innovations and other forward-looking research and development efforts spanning infrastructure, services, applications, and search."

2. Microsoft highlights the importance of AI in its business, stating "We are building AI into many of our offerings and we expect this element of our business to grow. We envision a future in which AI operating in our devices, applications, and the cloud helps our customers be more productive in their work and personal lives."

3. The financial statements mention Microsoft's investment in "AI for Earth" to accelerate AI innovation in a

#### Example 2

"Is Microsoft a good investment choice right now?". 

In [59]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Is Microsoft a good investment choice right now?"})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m<thinking>
To determine if Microsoft is a good investment choice, I need to analyze Microsoft's financial statements and stock data. I will first check if the query is related to financial statements and stocks.
</thinking>

<tool>is_financial_statement_related_query</tool>
<tool_input>Is Microsoft a good investment choice right now?[0m[36;1m[1;3mYes, this question is related to analyzing a company's financial statements and performance, which would be relevant for evaluating Microsoft as a potential investment choice.[0m[32;1m[1;3m<thinking>To evaluate if Microsoft is a good investment choice, I need to analyze Microsoft's financial statements and stock price data. I will first check if I have Microsoft's financial statements available. If not, I will download and ingest them from the SEC website using the stock ticker. I will also need to get Microsoft's stock price data using the "get_stock_price" tool since the quer

In [60]:
print(response["output"])



Summary:
Based on Microsoft's strong financial performance and growth prospects, the company appears to be a good investment choice at the current time. However, there are some risks and challenges to consider as well.

Support points:

1. Microsoft has demonstrated robust revenue growth across its key business segments like cloud services (Azure), productivity software (Office 365), and gaming. The company's commercial cloud revenue grew 35% year-over-year in the latest quarter, indicating strong demand for its cloud offerings. Microsoft's pivot towards cloud computing has positioned it well for future growth.

2. The company has a solid balance sheet with over $100 billion in cash and investments, allowing it to invest in strategic areas like artificial intelligence, cloud infrastructure, and acquisitions. Microsoft's operating margins have also been expanding, reflecting efficient cost management.

3. Microsoft has been returning capital to shareholders through share buybacks and 

#### Example 3

"Compare Oracle and Microsoft company financial statements"

In [61]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Compare Oracle and Microsoft company financial statements"})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m<thinking>
To compare the financial statements of Oracle and Microsoft, I will need to:
1. Check if the query is related to financial statement analysis using the is_financial_statement_related_query tool.
2. If it is related, search for the financial statements of Oracle and Microsoft using the search_financial_statements_for_company tool.
3. If the financial statements are not found, use the get_stock_ticker tool to get the stock tickers, then use the download_10k_filing_from_sec_and_ingest_into_opensearch tool to download the latest 10-K filings from the SEC.
4. Search for the financial statements again using search_financial_statements_for_company after downloading.
</thinking>

<tool>is_financial_statement_related_query</tool>
<tool_input>Compare Oracle and Microsoft company financial statements[0m[36;1m[1;3mYes, this input is related to analyzing and comparing the financial statements of Oracle and Microsoft companie

In [62]:
print(response["output"])



Summary:
This analysis compares the financial statements of Oracle and Microsoft, two major technology companies. Both companies have diversified product and service offerings spanning software, cloud services, hardware, and other technology solutions.

Support points:

1. Revenue sources:
- Oracle's main revenue sources include cloud services, on-premise software licenses, hardware products, and related services.
- Microsoft's revenue comes from productivity software and services (Office, Windows, etc.), intelligent cloud offerings (Azure, server products), personal computing (Windows OEM, devices like Surface, gaming), and LinkedIn services.

2. Financial performance:
- Oracle's fiscal 2021 total revenues were $40.5 billion, with cloud services and license support revenues being the largest components.
- Microsoft's fiscal 2022 total revenues were $198.3 billion, led by strong growth in its cloud offerings like Azure as well as continued demand for productivity software and service
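
The plan in the agent's `<thinking>` block for this example (check relevance, search the index, download the 10-K on a miss, search again) can be sketched as ordinary control flow. This is only an illustration under stated assumptions: the callables below are toy stand-ins for the notebook's tools, whose real signatures are not shown here, and in the notebook the model chooses the tool order itself rather than following fixed code.

```python
from typing import Callable, Dict, List, Optional


def find_or_download(
    company: str,
    search: Callable[[str], Optional[str]],
    get_ticker: Callable[[str], str],
    download_and_ingest: Callable[[str], None],
) -> Optional[str]:
    """Steps 2-4 of the plan: search, download on a miss, then search again."""
    statements = search(company)
    if statements is None:
        download_and_ingest(get_ticker(company))
        statements = search(company)
    return statements


def collect_statements(
    query: str,
    companies: List[str],
    is_related: Callable[[str], bool],
    search: Callable[[str], Optional[str]],
    get_ticker: Callable[[str], str],
    download_and_ingest: Callable[[str], None],
) -> Dict[str, Optional[str]]:
    """Step 1 gates the request; unrelated queries return an empty result."""
    if not is_related(query):
        return {}
    return {
        company: find_or_download(company, search, get_ticker, download_and_ingest)
        for company in companies
    }


# Toy stand-ins (hypothetical) so the sketch runs without OpenSearch or the SEC site.
index = {"Microsoft": "MSFT 10-K text"}  # pretend Oracle has not been ingested yet
tickers = {"Oracle": "ORCL", "Microsoft": "MSFT"}
downloads: List[str] = []


def fake_download(ticker: str) -> None:
    """Record the download and make the filing searchable in the toy index."""
    downloads.append(ticker)
    company = next(name for name, t in tickers.items() if t == ticker)
    index[company] = f"{ticker} 10-K text"


result = collect_statements(
    "Compare Oracle and Microsoft company financial statements",
    ["Oracle", "Microsoft"],
    is_related=lambda q: "financial" in q.lower(),
    search=index.get,
    get_ticker=tickers.__getitem__,
    download_and_ingest=fake_download,
)
print(result)
print(downloads)
```

Only the company missing from the toy index triggers a download, which mirrors the "search first, download if not found" order the agent describes.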

#### Example 4

"Is Amazon a good investment choice right now?"

In [63]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response = conversation_chain.invoke({"input": "Is Amazon a good investment choice right now?"})



> Entering new AgentExecutor chain...
<thinking>
To determine if Amazon is a good investment choice, I need to analyze both Amazon's financial statements and stock data. I will first check if the query is related to financial statements and stocks using the provided tools.
</thinking>

<tool>is_financial_statement_related_query</tool>
<tool_input>Is Amazon a good investment choice right now?
Yes, this question is related to analyzing a company's financial statements and performance, which would be relevant for evaluating Amazon as a potential investment choice.
<thinking>To evaluate if Amazon is a good investment choice, I need to analyze Amazon's financial statements and stock data. I will first check if I have Amazon's financial statements available. If not, I will need to download them from the SEC website using the stock ticker. I will also need to get Amazon's stock price data since the query is related to evaluating Amazon as

In [64]:
print(response["output"])


I cannot provide stock analysis without stock price information.



#### Example 5

"What is OpenSearch?"

In [65]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response = conversation_chain.invoke({"input": "What is OpenSearch?"})



> Entering new AgentExecutor chain...
<thinking>
To determine if this query is related to financial statements or stock analysis, I will use the is_financial_statement_related_query and is_stock_related_query tools.
</thinking>

<tool>is_financial_statement_related_query</tool>
<tool_input>What is OpenSearch?
No, this question is not related to analyzing company financial statements.
<final_answer>
I am a financial statement analysis assistant. I cannot answer questions that are not related to analyzing company financial statements.

> Finished chain.


In [66]:
print(response["output"])


I am a financial statement analysis assistant. I cannot answer questions that are not related to analyzing company financial statements.
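
The scope check seen in this trace can be approximated with a small router. The sketch below is an assumption, not the notebook's implementation: `is_affirmative` and `route` are hypothetical helpers, and the "starts with yes" rule is a simplification of what the model infers from the tool's free-text observation.

```python
from typing import Callable

REFUSAL = (
    "I am a financial statement analysis assistant. I cannot answer questions "
    "that are not related to analyzing company financial statements."
)


def is_affirmative(observation: str) -> bool:
    """Treat an observation that begins with 'yes' as an in-scope verdict."""
    return observation.strip().lower().startswith("yes")


def route(query: str, classify: Callable[[str], str], answer: Callable[[str], str]) -> str:
    """Answer in-scope queries and return the fixed refusal for everything else."""
    if not is_affirmative(classify(query)):
        return REFUSAL
    return answer(query)


# Toy classifier (hypothetical) that echoes the kind of observation shown in the trace.
def toy_classifier(query: str) -> str:
    if "financial" in query.lower():
        return "Yes, this input is related to analyzing company financial statements."
    return "No, this question is not related to analyzing company financial statements."


print(route("What is OpenSearch?", toy_classifier, lambda q: "analysis goes here"))
```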



# Test

In [67]:
delete_query = {
    "query": {
        "match": {
            "company_name": "Amazon"
        }
    }
}

aos_client.delete_by_query(index=index_name, body=delete_query)

{'took': 9,
 'timed_out': False,
 'total': 52,
 'deleted': 52,
 'batches': 1,
 'version_conflicts': 0,
 'noops': 0,
 'retries': {'bulk': 0, 'search': 0},
 'throttled_millis': 0,
 'requests_per_second': -1.0,
 'throttled_until_millis': 0,
 'failures': []}
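
A response like the one above can also be checked in code rather than by eye. The following is a minimal sketch under stated assumptions: `deletion_is_clean` and `delete_company_documents` are hypothetical helpers that are not part of the notebook, and `client` is expected to behave like `aos_client`. It passes `refresh=True` because `delete_by_query` does not refresh the index by default, so without it an immediate follow-up query is not guaranteed to reflect the deletion.

```python
from typing import Any, Dict


def deletion_is_clean(response: Dict[str, Any]) -> bool:
    """Return True when every matched document was deleted without problems."""
    return (
        not response.get("timed_out", False)
        and not response.get("failures")
        and response.get("version_conflicts", 0) == 0
        and response.get("deleted") == response.get("total")
    )


def delete_company_documents(client: Any, index_name: str, company_name: str) -> int:
    """Delete a company's documents and return how many still match afterwards."""
    body = {"query": {"match": {"company_name": company_name}}}
    # refresh=True asks OpenSearch to refresh the affected shards after the delete.
    response = client.delete_by_query(index=index_name, body=body, refresh=True)
    if not deletion_is_clean(response):
        raise RuntimeError(f"delete_by_query did not complete cleanly: {response}")
    # The count API accepts the same query body and returns {"count": N, ...}.
    return client.count(index=index_name, body=body)["count"]
```

With the notebook's objects this would be called as `delete_company_documents(aos_client, index_name, "Amazon")`, and a return value of 0 would indicate that no matching documents remain.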

In [68]:
search_query = {
    "query": {
        "match": {
            "company_name": "Amazon"
        }
    }
}

aos_client.search(index=index_name, body=search_query)

{'took': 2,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 52, 'relation': 'eq'},
  'max_score': 3.1733174,
  'hits': [{'_index': '10k_financial',
    '_id': 'eZjOOJIBwhwjFiEHVOTN',
    '_score': 3.1733174,
    '_source': {'company_name': 'AMAZON COM INC',
     'item_content': 'The People&#8217;s Republic of China (&#8220;PRC&#8221;) and India regulate Amazon&#8217;s and its affiliates&#8217; businesses and operations in country through regulations and license requirements that may restrict (i) foreign investment in and operation of the internet, IT infrastructure, data centers, retail, delivery, and other sectors, (ii) internet content, and (iii) the sale of media and other products and services. For example, in order to meet local ownership, regulatory licensing, and cybersecurity requirements, we provide certain technology services in China through contractual relationships with third parties that hold PRC lice