# Generative AI-powered search with Amazon OpenSearch Service

---
### Usage Scenario
Form 10-K is a comprehensive report filed annually by a publicly traded company about its financial performance and is required by the U.S. Securities and Exchange Commission (SEC). Some of the information a company is required to document in the 10-K includes its history, organizational structure, financial statements, earnings per share, subsidiaries, executive compensation, and any other relevant data.

The SEC mandates that all public companies file regular 10-Ks to keep investors aware of a company's financial condition and to allow them to have enough information before they buy or sell securities issued by that company. The 10-K can appear overly complex at first glance, complete with tables full of data and figures. However, because it is so comprehensive, this filing is critical for investors who want to understand a company's financial position and prospects.

Form 10-K is an annual report that provides a comprehensive analysis of the company's financial condition. The Form 10-K is comprised of several parts. These include:

- 1 - Business: describes the company's operations
- 1A - Risk Factors
- 1B - Unresolved Staff Comments
- 2 - Properties
- 3 - Legal Proceedings
- 4 - Mine Safety Disclosures
- 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
- 6 - Selected Financial Data (prior to February 2021)
- 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
- 7A - Quantitative and Qualitative Disclosures about Market Risk
- 8 - Financial Statements and Supplementary Data
- 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
- 9A - Controls and Procedures
- 9B - Other Information
- 10 - Directors, Executive Officers and Corporate Governance
- 11 - Executive Compensation
- 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
- 13 - Certain Relationships and Related Transactions, and Director Independence
- 14 - Principal Accountant Fees and Services
- 15 - Exhibits and Financial Statement Schedules

---

Many investors rely on SEC filings to analyze the financial health of a company, and these filings can be a treasure trove of valuable information. However, keyword-based search may return irrelevant information, and even with semantic search the volume of information can be overwhelming. Can we leverage generative AI to help interpret company financial statements?


In this code talk session, we will show you how to modernize your search application to improve search relevance with Amazon OpenSearch while leveraging generative AI to improve search productivity. The code covers the following topics:
- Comparing search relevance between keyword search and semantic search with Amazon OpenSearch.
- How to leverage Retrieval Augmented Generation (RAG) to improve search productivity.
- How to build an intelligent agent that orchestrates and executes multistep tasks to automate 10-K filing analysis.
- OpenSearch vector store best practices.

---


### Code Structure


The code includes the following sections:
- [Initialize](#Initialize)
- [Part 1: Ingest unstructured data into OpenSearch](#Part-1:-Ingest-unstructured-data-into-OpenSearch)
- [Part 2: Different approach to search](#Part-2:-Different-appoach-to-search)
    - [2.1 Keyword search](#2.1-Keyword-search)
    - [2.2 Semantic/Vector search](#2.2-Semantic/Vector-search)
    - [2.3 Retrieval Augmented Generation(RAG)](#2.3-Retrieval-Augmented-Generation(RAG))
- [Part 3: AI agent powered search](#Part-3:-AI-agent-powered-search)
    - [3.1 Prepare other tools used by AI agent](#3.1-Prepare-other-tools-used-by-AI-agent)
        - [3.1.1 Ingest and query structured data in Redshift](#3.1.1-Ingest-and-query-structured-data-in-Redshift)
        - [3.1.2 Download 10-K filing from SEC](#3.1.2-Download-10-K-filing-from-SEC)
    - [3.2 Create AI agent](#3.2-Create-AI-agent)
    - [3.3 Use AI agent](#3.3-Use-AI-agent)


## Initialize




### Install Python dependencies for OpenSearch, Redshift, and LangChain

In [None]:
%pip install opensearch-py
%pip install torch
%pip install requests-aws4auth
%pip install boto3
%pip install sqlalchemy
%pip install sqlalchemy-redshift
%pip install redshift_connector
%pip install ipython-sql==0.4.1
%pip install langchain==0.3.1
%pip install langchain-aws==0.2.1
%pip install langchain-community==0.3.1


### Import library



In [None]:
import boto3
import re
import time
import sagemaker
import json
from sagemaker.session import Session
import pandas as pd
import os

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## Part 1: Ingest unstructured data into OpenSearch

### Get SEC 10-K form files

Let's download the dataset and unzip it.

In [1]:
!wget https://ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com/df655552-1e61-4a6b-9dc4-c03eb94c6f75/10k-financial-filing.zip
!unzip 10k-financial-filing.zip

--2024-10-18 02:41:18--  https://ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com/df655552-1e61-4a6b-9dc4-c03eb94c6f75/10k-financial-filing.zip
Resolving ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com (ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com)... 52.219.121.50, 52.219.192.10, 52.219.112.145, ...
Connecting to ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com (ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com)|52.219.121.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 65823210 (63M) [application/zip]
Saving to: ‘10k-financial-filing.zip.3’


2024-10-18 02:41:20 (35.1 MB/s) - ‘10k-financial-filing.zip.3’ saved [65823210/65823210]

Archive:  10k-financial-filing.zip
  inflating: extracted/1001601_10K_2020_0001493152-21-008913.json  
  inflating: extracted/1002517_10K_2021_0001002517-21-000052.json  
  inflating: extracted/1013462_10K_2020_0001013462-21-0

Read the dataset in JSON format and construct a pandas DataFrame.
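
As a minimal sketch of this step, the helper below reads every JSON file in a directory into a list of records. The function name `load_filings` and the `source_file` key are illustrative assumptions, not part of the original notebook, and the sketch assumes each file holds one JSON object.

```python
import json
import os


def load_filings(directory):
    """Read every .json file in `directory` and return a list of dicts."""
    records = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue  # skip anything that is not a filing
        with open(os.path.join(directory, filename), "r", encoding="utf-8") as f:
            record = json.load(f)
        # Hypothetical extra key so each row can be traced back to its file;
        # setdefault avoids overwriting a key the dataset may already define.
        record.setdefault("source_file", filename)
        records.append(record)
    return records
```

Assuming pandas is installed, `pd.DataFrame(load_filings("extracted"))` would then build the DataFrame, with one row per filing and one column per JSON key.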

### Load the data to OpenSearch
OpenSearch handles dynamic type inference well and can perform full-text search with fuzziness and typo tolerance. Let's fetch the Amazon OpenSearch Serverless (AOSS) endpoint from the outputs of the deployed CloudFormation stack.

In [2]:
import json
import boto3

cfn = boto3.client('cloudformation')
kms = boto3.client('secretsmanager')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "generative-ai-powered-search"

outputs = get_cfn_outputs(cloudformation_stack_name)

aoss_endpoint = outputs['OpenSearchServerlessCollectionEndpoint']

aoss_host = aoss_endpoint.split("//")[1]

outputs


{'RedshiftClusterSecurityGroupName': 'sg-069460ac85a18e614',
 'RedshiftServerlessWorkroup': 'workgroup-3bdbab70',
 's3BucketStock': 'generative-ai-powered-search-s3bucketstock-jfii0zmfgix1',
 'SageMakerNotebookURL': 'https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/notebook-instances/openNotebook/generative-ai-powered-search?view=classic',
 'Workgroupname': 'workgroup-3bdbab70',
 'VPC': 'vpc-0bdfbb51a82cf3bc0',
 'RedshiftRoleName': 'RedshiftServerlessImmersionRole',
 'RedshiftRoleNameArn': 'arn:aws:iam::649735563824:role/RedshiftServerlessImmersionRole',
 'NamespaceName': 'namespace-3bdbab70',
 'AdminUsername': 'awsuser',
 'RedshiftServerlessEndpoint': 'workgroup-3bdbab70.649735563824.us-east-1.redshift-serverless.amazonaws.com',
 'Region': 'us-east-1',
 'RedshiftServerlessSecret': 'arn:aws:secretsmanager:us-east-1:649735563824:secret:RedshiftServerlessSecret-NckhW9',
 'AdminPassword': 'Awsuser123!',
 'OpenSearchServerlessCollectionEndpoint': 'https://o951f2ft4ck58q2wl0e
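
The `split("//")` call in the cell above assumes the endpoint always starts with a scheme. A slightly more defensive sketch, using only the standard library, might look like the following. `endpoint_to_host` is a hypothetical helper name, not something the notebook defines.

```python
from urllib.parse import urlparse


def endpoint_to_host(endpoint):
    """Return the bare host name from an endpoint URL (scheme://host[:port]/...)."""
    parsed = urlparse(endpoint)
    if not parsed.hostname:
        # urlparse finds no host when the scheme (https://) is missing.
        raise ValueError(f"Not a valid endpoint URL: {endpoint!r}")
    return parsed.hostname
```

With this helper, `aoss_host = endpoint_to_host(aoss_endpoint)` would replace the string split and fail loudly if the stack output is empty or malformed.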

Let's create an OpenSearch Serverless client, which we use for the rest of the workshop.

In [3]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

service = "aoss"
aws_region = boto3.Session().region_name
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, aws_region, service)

aos_client = OpenSearch(
    hosts = [{"host": aoss_host, "port": 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

Let's create a new index named "10k_financial". Because you will recreate this index with different techniques, delete any existing index of that name before creating it. This step creates an empty index only; with dynamic type inference, OpenSearch extends the index schema whenever it encounters a new attribute.

In [4]:
index_name="10k_financial"

exist=False
try:
    aos_client.indices.get(index=index_name)
    exist=True
except Exception as e:
    exist=False

if exist:
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=index_name)
else:
    print("index does not exist.")
    
aos_client.indices.create(index=index_name,ignore=400)

delete existing index before creating new one


{'acknowledged': True, 'shards_acknowledged': True, 'index': '10k_financial'}
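
The try/except probe in the cell above works, but it treats every exception, including permission or network errors, as "index missing". A possible alternative is sketched below, under the assumption that the `indices.exists` call of opensearch-py is permitted for your collection. `recreate_index` is a hypothetical helper name.

```python
def recreate_index(client, index_name):
    """Drop `index_name` if it exists, then create it empty.

    `client` is expected to expose `indices.exists`, `indices.delete`,
    and `indices.create`, as the opensearch-py client does.
    """
    if client.indices.exists(index=index_name):
        print("delete existing index before creating new one")
        client.indices.delete(index=index_name)
    else:
        print("index does not exist.")
    return client.indices.create(index=index_name)
```

Calling `recreate_index(aos_client, index_name)` would then stand in for the whole cell. Unlike `ignore=400`, this version raises if the create call fails, which may or may not be what you want in a demo.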

Now you can load all the financial reports to the index you just created.

In [6]:
import os
from opensearchpy import helpers
# Set the directory path
directory_path =  "extracted"
batch_size = 10
# Initialize a list to store the documents
documents = []

# Iterate through the files in the directory
for filename in os.listdir(directory_path):
    file_path = os.path.join(directory_path, filename)

    # Read the file contents
    with open(file_path, 'r') as file:
        file_contents = file.read()
    docJson = json.loads(file_contents)
    docJson["_index"] = index_name
    documents.append(docJson)
    # If the batch size is reached, index the documents
    if len(documents) == batch_size:
        aos_response= helpers.bulk(aos_client, documents)
        print(f"Indexed {len(documents)} documents.")
        documents = []

# Index the remaining documents
if documents:
    aos_response= helpers.bulk(aos_client, documents)
    print(f"Indexed {len(documents)} documents.")

Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 10 documents.
Indexed 1 documents.
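
The loop above sends full batches as they fill up and then flushes the remainder. The same batching logic can be isolated in a small generator, sketched here; `batched` is an illustrative name, and the integers stand in for documents.

```python
def batched(items, batch_size):
    """Yield lists of at most `batch_size` items, preserving order."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


# 191 placeholder documents with a batch size of 10
sizes = [len(batch) for batch in batched(range(191), 10)]
print(len(sizes), sizes[-1])
```

For 191 items this gives 19 batches of 10 and a final batch of 1, which matches the log lines above. In the notebook, each yielded batch could be passed to `helpers.bulk(aos_client, batch)`.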


You can check the total number of documents indexed into the OpenSearch index.

In [7]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])


Records found: 191.


## Part 2: Different appoach to search

### 2.1 Keyword search
---
Keyword search refers to finding information using terms or words, called a "query", in a large body of textual data. It uses various tokenization methods to split the text and scores results with token statistics such as how many times a word occurs in the document, how common it is across the entire corpus, and term proximity. With these data structures, OpenSearch can handle fuzziness and tolerate typos, which lets users search with phonetically similar terms or without knowing the exact spelling (for example, of scientific names). It works well for exact matches.
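
To make the role of these token statistics concrete, here is a toy scoring sketch in the style of BM25, the default similarity in OpenSearch. It is a simplified textbook formula with a whitespace tokenizer and three made-up documents, not the engine's actual implementation; `k1=1.2` and `b=0.75` are the commonly used defaults.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with simplified BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for t in tokenized for term in set(t))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            # Terms that are rare across the corpus get a higher weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


docs = [
    "sensors and microchips for cards",
    "cloud software for retailers",
    "cloud platform and cloud analytics",
]
print(bm25_scores("cloud", docs))
```

In this toy corpus the third document, which mentions "cloud" twice, scores above the second, and the first scores zero because it never mentions the term.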

Let's search for companies in the state of Illinois.

In [8]:
query = {
    "_source" : ["company", "filing_date", "state_location"],
    "query": {
        "bool" :{
            "filter" : [{
                    "match" :{
                        "state_location.keyword" : "IL"
                    }
                }]
        }
    }
}

In [9]:
import pandas as pd
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Unnamed: 0,_id,_score,company,filing_date,state_location
0,1%3A0%3A1W6HnZIB70AM0i6qNUEE,0.0,"Sprout Social, Inc.",2021-02-24,IL
1,1%3A0%3AiW6HnZIB70AM0i6qFEFJ,0.0,"ENVESTNET, INC.",2021-02-26,IL
2,1%3A0%3At82HnZIBE0VMSqoxIMv4,0.0,OneSpan Inc.,2021-02-25,IL
3,1%3A0%3AsW6HnZIB70AM0i6qJ0HG,0.0,"CDK Global, Inc.",2021-08-18,IL
4,1%3A0%3A382HnZIBE0VMSqoxMsvc,0.0,Paylocity Holding Corp,2021-08-06,IL


You can search across multiple fields and use highlighting to show why each document matched.

In [26]:
query = {
    "_source" : ["company", "filing_date", "state_location", "item_1"],
    "query": {
        "bool": {
            "must" :[
                {
                    "multi_match" :{
                        "query" : "microchips",
                        "fields" :["item_1"]
                    }
                }
            ]
        }
    },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}

In [27]:
import pandas as pd
res = aos_client.search(index=index_name, body=query)
query_result=[]
print (f"Total documents found: {res['hits']['total']['value']}")
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location'],hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location", "item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Total documents found: 1


Unnamed: 0,_id,_score,company,filing_date,state_location,item_1_highlight
0,1%3A0%3Aqc2HnZIBE0VMSqoxG8u0,4.506137,"SmartMetric, Inc.",2021-10-12,NV,"needed for manufacture of the SmartMetric Biometric Card include, but are not limited to, sensors, <em>microchips</em>"
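
The typo tolerance mentioned earlier can be tried on the same field by adding the `fuzziness` option to a `match` query. The builder below is a sketch; `fuzzy_match_query` is a hypothetical helper name, and the misspelled search term in the usage note is only an illustration.

```python
def fuzzy_match_query(field, text, fuzziness="AUTO", size=10):
    """Build a search body that matches `text` on `field` with typo tolerance."""
    return {
        "size": size,
        "query": {
            "match": {
                field: {
                    "query": text,
                    # "AUTO" picks the allowed edit distance from the term length.
                    "fuzziness": fuzziness,
                }
            }
        },
        "highlight": {
            "pre_tags": ["<em>"],
            "post_tags": ["</em>"],
            "fields": {field: {}},
        },
    }
```

For example, `aos_client.search(index=index_name, body=fuzzy_match_query("item_1", "microchps"))` would let you check whether the misspelled term still finds the document above.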


You can also perform a phrase match, where the words must appear together, and combine it with additional filters, such as restricting results to companies located in California, as shown below.

In [28]:
query = {
    "_source" : ["company", "filing_date", "state_location", "item_1"],
    "query": {
        "bool": {
            "must" :[
                {
                    "match_phrase" :{
                        "item_1" : "digital media"
                    }
                }
            ],
            "filter" :[
                {
                    "term": {
                        "state_location.keyword" : "CA"
                    }
                }
            ]
        }
    },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}

In [29]:
import pandas as pd
res = aos_client.search(index=index_name, body=query)
query_result=[]
print (f"Total documents found: {res['hits']['total']['value']}")
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location'],hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location", "item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Total documents found: 6


Unnamed: 0,_id,_score,company,filing_date,state_location,item_1_highlight
0,1%3A0%3Amm6HnZIB70AM0i6qGEH1,1.407576,ADOBE INC.,2021-01-15,CA,By combining the creativity of our <em>Digital</em> <em>Media</em> business with the science of our Digital Experience
1,1%3A0%3A0M2HnZIBE0VMSqoxLstj,1.30252,"Trade Desk, Inc.",2021-02-19,CA,We believe that the market is evolving and that advertisers will shift more spend to <em>digital</em> <em>media</em>.
2,1%3A0%3Ay26HnZIB70AM0i6qMEHH,1.232645,"J2 GLOBAL, INC.",2021-03-01,CA,"Our <em>Digital</em> <em>Media</em> business specializes in the technology, shopping, gaming, and healthcare markets, offering"
3,1%3A0%3Al26HnZIB70AM0i6qGEH1,1.185809,Max Sound Corp,2021-03-26,CA,"video, downloadable audio and video and high definition audio and video featuring improved HD audio; <em>Digital</em>"
4,1%3A0%3Ahm6HnZIB70AM0i6qD0GR,0.9435,"Autodesk, Inc.",2021-03-19,CA,"serve customers in architecture, engineering, and construction; product design and manufacturing; and <em>digital</em>"
5,1%3A0%3AxW6HnZIB70AM0i6qMEHH,0.930523,"Veritone, Inc.",2021-03-05,CA,.\n•\n<em>Digital</em> <em>Media</em> Hub.
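
The loop that turns hits into rows recurs in each search cell above. One way to factor it out is sketched below; `hits_to_rows` is a hypothetical helper, and it returns plain dictionaries so that it does not depend on pandas.

```python
def hits_to_rows(response, source_fields, highlight_field=None):
    """Flatten an OpenSearch search response into a list of row dicts."""
    rows = []
    for hit in response["hits"]["hits"]:
        row = {"_id": hit["_id"], "_score": hit["_score"]}
        for field in source_fields:
            row[field] = hit["_source"].get(field)
        if highlight_field is not None:
            # A hit may lack a highlight for the field, so fall back to None.
            fragments = hit.get("highlight", {}).get(highlight_field, [])
            row[f"{highlight_field}_highlight"] = fragments[0] if fragments else None
        rows.append(row)
    return rows
```

In the notebook, `pd.DataFrame(hits_to_rows(res, ["company", "filing_date", "state_location"], "item_1"))` would produce the same columns as the cells above.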


Full-text search works well on structured data and offers fuzziness, typo tolerance, proximity searching, and highlighting. However, when it comes to natural-language questions, pure keyword search can return less relevant results with a long tail of noise.

![Keyword Search](./static/keyword-search-flow.png)

Let's consider the query below.

In [30]:
query_text="What Microsoft's researh and development organization is responsible for?"

Run the query and check the search result. Some irrelevant documents are returned.

In [31]:
query={
  "size": 10,
  "query": {
    "match": {
      "item_1": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_*" : {}
    }
  }
}
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['item_1'], hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","item_1","item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Unnamed: 0,_id,_score,company,item_1,item_1_highlight
0,1%3A0%3A482HnZIBE0VMSqoxN8u3,4.165623,CHANNELADVISOR CORP,"ITEM 1. BUSINESS\nCOMPANY OVERVIEW\nOur mission is to connect and optimize the world's commerce. Our proprietary software-as-a-service, or SaaS, cloud platform helps brands and retailers worldwide improve their online performance by expanding sales channels, connecting with consumers around the world, optimizing their operations for peak performance and providing actionable analytics to improve competitiveness. More specifically, our platform allows our customers to manage their product list...","<em>Microsoft's</em> Bing, <em>and</em> social commerce sites such as Facebook, Instagram <em>and</em> Pinterest."
1,1%3A0%3AoW6HnZIB70AM0i6qHkG1,3.205613,"Asana, Inc.","Item 1. Business\nOverview\nOur mission is to help humanity thrive by enabling the world’s teams to work together effortlessly.\nAsana is a work management platform that helps teams orchestrate work, from daily tasks to cross-functional strategic initiatives. Over 93,000 paying customers use Asana to manage everything from product launches to marketing campaigns to organization-wide goal setting. Our platform adds structure to unstructured work, creating clarity, transparency, and accountabi...","who <em>is</em> doing <em>what</em>, by when."
2,1%3A0%3Aq82HnZIBE0VMSqoxG8u0,2.776259,"NEW RELIC, INC.","Item 1. Business\nOverview\nNew Relic delivers the observability platform for engineers to plan, build, deploy and operate more perfect software. We offer a comprehensive suite of products delivered on an open and extensible cloud-based platform that enables organizations to collect, store and analyze massive amounts of data in real time so they can better operate their applications and infrastructure and improve their digital customer experience.\nNew Relic One is our purpose-built offering...",Customers only pay us <em>for</em> <em>what</em> they use <em>and</em> our sales team’s interests are better aligned with the interests
3,1%3A0%3Akm6HnZIB70AM0i6qFEFJ,2.743868,GTY Technology Holdings Inc.,"Item 1. Business.\nGTY Business Overview\nGTY is a software-as-a-service (“SaaS”) company that offers a cloud-based suite of solutions for the public sector in North America. GTY brings government technology companies together to achieve a new standard in citizen engagement and resource management. GTY solutions provide public sector organizations with the ability to communicate, engage, interact, conduct business, and transact with their constituents in procurement, payments, grants managem...","Bonfire’s research <em>and</em> <em>development</em> team <em>is</em> primarily <em>responsible</em> <em>for</em> the design, <em>development</em>, testing"
4,1%3A0%3A2s2HnZIBE0VMSqoxMsvc,2.729312,"Sailpoint Technologies Holdings, Inc.","ITEM 1. BUSINESS\nOverview\nSailPoint Technologies Holdings, Inc. (“SailPoint,” “the Company” or “we”) is the leading provider of enterprise identity security solutions. SailPoint was launched by a team of visionary industry veterans to empower our customers to efficiently and securely govern the digital identities of employees, contractors, business partners, software bots and other human and non-human users, and manage their constantly changing access rights to enterprise applications and ...",The governance engine <em>is</em> <em>responsible</em> <em>for</em> managing the ongoing process of aligning these two states.
5,1%3A0%3A2G6HnZIB70AM0i6qNUEE,2.710184,"Cloudflare, Inc.","Item 1. Business\nOverview\nCloudflare’s mission is to help build a better Internet.\nIn recent years, the technology industry has undergone a massive transition from on-premise hardware and software that customers buy, to services in the cloud that they rent. Organizations find themselves at different points in this transition to the cloud. Regardless of where organizations are in their transition, they all face a common set of challenges: they exist in a complex, heterogeneous infrastructu...",<em>what</em> device they use or where they are located.
6,1%3A0%3An82HnZIBE0VMSqoxFstu,2.693367,"Sumo Logic, Inc.","Item 1. Business\nOverview\nSumo Logic empowers organizations to close the intelligence gap.\nSumo Logic is the pioneer of Continuous Intelligence, a new category of software, which enables organizations of all sizes to address the challenges and opportunities presented by digital transformation, modern applications, and cloud computing. Our Continuous Intelligence Platform enables organizations to automate the collection, ingestion, and analysis of application, infrastructure, security, and...",Organizations can succeed or fail based on how well they understand <em>and</em> respond to <em>what</em> <em>is</em> happening
7,1%3A0%3Axc2HnZIBE0VMSqoxKcu3,2.64407,Snowflake Inc.,"ITEM 1. BUSINESS\nWe believe in a data connected world where organizations have seamless access to explore, share, and unlock the value of data. To realize this vision, we deliver the Data Cloud, an ecosystem where Snowflake customers, partners, data providers, and data consumers can break down data silos and derive value from rapidly growing data sets in secure, governed, and compliant ways.\nOur platform is the innovative technology that powers the Data Cloud, enabling customers to consoli...",", reducing hidden costs <em>and</em> ensuring customers pay only <em>for</em> <em>what</em> they use."
8,1%3A0%3AwM2HnZIBE0VMSqoxJcud,2.642278,"DOMO, INC.","Item 1. Business\nOverview\nAt Domo, we believe people and data are an organization's most valuable assets in the cloud era. Our Business Cloud is a modern business intelligence software platform that enables processes that are critically dependent on business intelligence data - which historically could take weeks, months or longer - to be done on-the-fly, in as fast as minutes or seconds, at scale. From marketing to operations, HR to finance, IT to product development, supply chain to sale...","We thereby enable employees to be aware of <em>what</em> <em>is</em> happening on a real-time basis, <em>and</em> take appropriate"
9,1%3A0%3A0G6HnZIB70AM0i6qNUEE,2.560325,"Dynatrace, Inc.","ITEM 1. BUSINESS\nOverview\nWe offer the market-leading software intelligence platform, purpose-built for dynamic multicloud environments. As enterprises embrace the cloud to effect their digital transformation, our all-in-one intelligence platform is designed to address the growing complexity faced by technology and digital business teams. With automatic and intelligent observability, our all-in-one platform delivers precise answers about the performance and security of applications, the un...",<em>and</em> <em>what</em> to do about it.
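
The highlights above hint at why irrelevant documents appear: by default, a `match` query combines the analyzed query tokens with OR, so a document can match on common words alone. The toy function below approximates this with a rough whitespace tokenizer (an assumption; the real analyzer differs) and lists which query tokens a text shares.

```python
def matched_terms(query_text, document_text):
    """Return the query tokens that also occur in the document (toy analyzer)."""
    def tokens(text):
        # Lowercase and strip surrounding punctuation from each word.
        return [t.strip(".,?!\"'") for t in text.lower().split()]

    doc_tokens = set(tokens(document_text))
    return [t for t in tokens(query_text) if t in doc_tokens]


query_text = "What Microsoft's researh and development organization is responsible for?"
print(matched_terms(query_text, "who is doing what, by when."))
```

For the Asana fragment this returns `['what', 'is']`, the same two words that are highlighted in the result table. The `match` query also accepts options such as `operator` and `minimum_should_match` that require more of the tokens to match, which can reduce this kind of noise.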


In [32]:
# Let's run another example query
query_text="What is Microsoft's main revenue?"

Run the query and check the search result. Some irrelevant documents are returned.

In [33]:
query={
  "size": 10,
  "query": {
    "match": {
      "item_1": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['item_1'], hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","item_1","item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

# you can notice some irrelevant results

Unnamed: 0,_id,_score,company,item_1,item_1_highlight
0,1%3A0%3A482HnZIBE0VMSqoxN8u3,3.763052,CHANNELADVISOR CORP,"ITEM 1. BUSINESS\nCOMPANY OVERVIEW\nOur mission is to connect and optimize the world's commerce. Our proprietary software-as-a-service, or SaaS, cloud platform helps brands and retailers worldwide improve their online performance by expanding sales channels, connecting with consumers around the world, optimizing their operations for peak performance and providing actionable analytics to improve competitiveness. More specifically, our platform allows our customers to manage their product list...",The remaining portion of GMV-based or advertising spend-based fee <em>is</em> typically variable and <em>is</em> based
1,1%3A0%3Apm6HnZIB70AM0i6qHkG1,3.489703,ISSUER DIRECT CORP,"ITEM 1. DESCRIPTION OF BUSINESS.\nCompany Overview\nIssuer Direct Corporation and its subsidiaries are hereinafter collectively referred to as “Issuer Direct”, the “Company”, “We” or “Our” unless otherwise noted. Our corporate offices are located at One Glenwood Ave., Suite 1001, Raleigh, North Carolina, 27603.\nWe announce material financial information to our investors using our investor relations website, SEC filings, investor events, news and earnings releases, public conference calls, w...",In the past we have disclosed revenues in two <em>main</em> categories: (i) Platform and Technology and (ii) Services
2,1%3A0%3An82HnZIBE0VMSqoxFstu,3.206255,"Sumo Logic, Inc.","Item 1. Business\nOverview\nSumo Logic empowers organizations to close the intelligence gap.\nSumo Logic is the pioneer of Continuous Intelligence, a new category of software, which enables organizations of all sizes to address the challenges and opportunities presented by digital transformation, modern applications, and cloud computing. Our Continuous Intelligence Platform enables organizations to automate the collection, ingestion, and analysis of application, infrastructure, security, and...",Organizations can succeed or fail based on how well they understand and respond to <em>what</em> <em>is</em> happening
3,1%3A0%3Aqc2HnZIBE0VMSqoxG8u0,3.158566,"SmartMetric, Inc.","Item 1. Business\nCorporate History and Overview\nSmartMetric, Inc. (“SmartMetric” or the “Company”) is a company that was incorporated pursuant to the laws of Nevada on December 18, 2002 and is engaged in the biometric technology manufacturing industry. SmartMetric has an issued patent covering technology that involves connection to networks using data cards (smart cards and EMV cards). In addition, SmartMetric holds the sole license to five issued patents covering features of its biometric...",SmartMetric’s <em>main</em> products are a fingerprint sensor activated payments card for use in the credit and
4,1%3A0%3AhW6HnZIB70AM0i6qD0GR,3.044114,"DOCUSIGN, INC.","ITEM 1. BUSINESS\nOverview\nDocuSign helps organizations do business faster with less risk, lower costs, and better experiences for customers and employees. We accomplish this by transforming the foundational element of business: the agreement.\nAgreements are everywhere. In the regular course of doing business, organizations sign contracts, offer letters, and hundreds of other types of agreements with customers, employees, and business partners. This is true for every size of organization, ...",Our <em>main</em> infrastructure <em>is</em> powered by near real-time data synchronization across a ring of three geo-dispersed
5,1%3A0%3Akm6HnZIB70AM0i6qFEFJ,3.034235,GTY Technology Holdings Inc.,"Item 1. Business.\nGTY Business Overview\nGTY is a software-as-a-service (“SaaS”) company that offers a cloud-based suite of solutions for the public sector in North America. GTY brings government technology companies together to achieve a new standard in citizen engagement and resource management. GTY solutions provide public sector organizations with the ability to communicate, engage, interact, conduct business, and transact with their constituents in procurement, payments, grants managem...","Sherpa’s team has seen <em>what</em> has worked and <em>what</em> has not, so Sherpa can offer counsel on business processes"
6,1%3A0%3AiW6HnZIB70AM0i6qFEFJ,2.958108,"ENVESTNET, INC.","Item 1. Business\nGeneral\nEnvestnet, through its subsidiaries, is transforming the way financial advice and wellness are delivered. Its mission is to empower advisors and financial service providers with innovative technology, solutions and intelligence to make financial wellness a reality for everyone. Envestnet has been a leader in helping transform wealth management, working towards its goal of building a holistic financial wellness ecosystem to improve the financial lives of millions of...",Envestnet Data & Analytics serves two <em>main</em> customer groups: financial institutions (“FI”) and financial
7,1%3A0%3ArM2HnZIBE0VMSqoxG8u0,2.813218,MICROSOFT CORP,"Item 1\nThe investments we make in sustainability carry through to our products, services, and devices. We design our devices, from Surface to Xbox, to minimize their impact on the environment. Our cloud and AI services and datacenters help businesses cut energy consumption, reduce physical footprints, and design sustainable products. We also pledged a $50 million investment in AI for Earth to accelerate innovation by putting AI in the hands of those working to directly address sustainabilit...",Office Commercial <em>revenue</em> <em>is</em> mainly affected by a combination of continued installed base growth and
8,1%3A0%3Arm6HnZIB70AM0i6qI0Fr,2.794993,PAID INC,"Item 1. Business\nOverview\nPAID, Inc. (the “Company” or “PAID”) was incorporated in Delaware on August 9, 1995. The Company has multiple web addresses, www.paid.com, which offers updated information on various aspects of our operations and www.shiptime.com which showcases our online label generation software. Information contained in the Company's website shall not be deemed to be a part of this Annual Report. The Company's principal executive offices are located at 225 Cedar Hill Street, M...","In addition to these features, ShipTime also provides <em>what</em> it refers to as “Heroic Multilingual Customer"
9,1%3A0%3Ag26HnZIB70AM0i6qD0GR,2.757499,BRIDGEWAY NATIONAL CORP.,"ITEM 1. BUSINESS\nGeneral\nThe Company is structured as a holding company with a business strategy focused on owning subsidiaries engaged in a number of diverse business activities. We are not a “blank check company” as defined in Rule 419 under the Securities Act of 1933, as amended (the “Securities Act”). We conduct and plan to continue to conduct our activities in such a manner as not to be deemed an investment company under the Investment Company Act of 1940, as amended (the “Investment ...",We will seek to focus on acquiring operating businesses and securities that (a) can be purchased at <em>what</em>


### Initialize embedding model to vectorize text data

### 2.2 Semantic/Vector search

---
Semantic search uses machine learning to understand the meaning of a query. By taking the intent and contextual meaning of the query terms into account, it aims to return results that are more relevant than those of simple keyword matching.

![Semantic Search](./static/semantic-search-flow.png)

---


![Semantic Search Architecture](./static/semantic-search-architecture.png)



**Embeddings** are numerical representations of data, typically used to convert complex, high-dimensional data into a lower-dimensional space where similar data points are closer together. In the context of natural language processing (NLP), embeddings are used to represent words, phrases, or sentences as vectors of real numbers. These vectors capture semantic relationships, meaning that words with similar meanings are represented by vectors that are close together in the embedding space.

**Embedding models** are machine learning models that are trained to create these numerical representations. They learn to encode various types of data into embeddings that capture the essential characteristics and relationships within the data. For example, in NLP, embedding models like Word2Vec, GloVe, and BERT are trained on large text corpora to produce word embeddings. These embeddings can then be used for various downstream tasks, such as text classification, sentiment analysis, or machine translation. In this workshop, we use embeddings to measure semantic similarity.
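
To make "close together in the embedding space" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are invented for illustration only; they are not produced by any real embedding model, which would return hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", made up for this example.
toy_embeddings = {
    "revenue": [0.9, 0.1, 0.0],
    "sales":   [0.8, 0.2, 0.1],
    "lawsuit": [0.0, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["revenue"], toy_embeddings["sales"])
sim_unrelated = cosine_similarity(toy_embeddings["revenue"], toy_embeddings["lawsuit"])

print(f"revenue vs sales:   {sim_related:.3f}")
print(f"revenue vs lawsuit: {sim_unrelated:.3f}")
```

The related pair scores much higher than the unrelated pair, which is the property a vector search relies on.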

We use an embedding model to convert each question into a vector, then use vector similarity to find semantically similar 10-K content. The following diagram shows the flow:

<!-- ![Convert Text to Vector](./static/text2vector.png) -->

---

![opensearch vector store](./static/opensearch-vector-store.png)


In [34]:
import os
import pandas as pd

# Specify the path to the folder containing the JSON files
folder_path = "extracted"

# Initialize an empty list to store list of company 10-K filing file names
company_filing_file_name_list = []

# For this session, we ingest filings for only a few companies.
company_list=["Alteryx, Inc.",
              "MICROSTRATEGY Inc", 
              "Elastic N.V.", 
              "MongoDB, Inc.", 
              "Palo Alto Networks Inc", 
              "Okta, Inc.",
              "Datadog, Inc.", 
              "Snowflake Inc.",
              "SALESFORCE.COM, INC.", 
              "ORACLE CORP",
              "MICROSOFT CORP", 
              "Palantir Technologies Inc."
             ]

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        df = pd.DataFrame([pd.read_json(file_path,typ='series')])
        if df.iloc[0]['company'] in company_list:
            company_filing_file_name_list.append(file_path)

company_filing_file_name_list

['extracted/1108524_10K_2021_0001108524-22-000008.json',
 'extracted/1561550_10K_2020_0001564590-21-009770.json',
 'extracted/1321655_10K_2020_0001193125-21-060650.json',
 'extracted/1441816_10K_2021_0001441816-21-000051.json',
 'extracted/789019_10K_2021_0001564590-21-039151.json',
 'extracted/1327567_10K_2021_0001327567-21-000029.json',
 'extracted/1660134_10K_2021_0001660134-21-000007.json',
 'extracted/1707753_10K_2021_0001707753-21-000026.json',
 'extracted/1640147_10K_2021_0001640147-21-000073.json',
 'extracted/1050446_10K_2020_0001564590-21-005783.json',
 'extracted/1689923_10K_2020_0001689923-21-000024.json',
 'extracted/1341439_10K_2021_0001564590-21-033616.json']

In [35]:
#from langchain.embeddings import BedrockEmbeddings
from langchain_aws import BedrockEmbeddings

aws_region = boto3.Session().region_name

boto3_bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=f"https://bedrock-runtime.{aws_region}.amazonaws.com")
bedrock_embeddings = BedrockEmbeddings(model_id='amazon.titan-embed-text-v2:0',client=boto3_bedrock)
#bedrock_embeddings = BedrockEmbeddings(model_id='cohere.embed-multilingual-v3',client=boto3_bedrock)

In [36]:
result = bedrock_embeddings.embed_query("This is a content of the document")
result[0:20]

[-0.07963293790817261,
 0.022934285923838615,
 0.03599408641457558,
 -0.00426036212593317,
 0.005773387849330902,
 -0.0063308184035122395,
 0.03153464198112488,
 -0.017678512260317802,
 0.03408289700746536,
 0.024049146100878716,
 -0.028030794113874435,
 0.07166964560747147,
 0.026278868317604065,
 -0.004519169218838215,
 -0.0232528168708086,
 0.0576542466878891,
 -0.04140912741422653,
 0.02723446488380432,
 0.03169390931725502,
 0.041090596467256546]

In [37]:
len(result)

1024

#### Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store this in "outputs" to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

### Create an index in the Amazon OpenSearch Service collection

The OpenSearch k-NN plugin introduces a custom data type, the knn_vector, that allows users to ingest their k-NN vectors into an OpenSearch index and perform different kinds of k-NN search. 
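
As a point of reference, a k-NN search returns the `k` stored vectors closest to a query vector. The sketch below performs an exact, brute-force search over a tiny in-memory dictionary using squared Euclidean (L2) distance. It illustrates the idea only and does not reflect how OpenSearch stores or searches vectors; approximate methods such as HNSW avoid comparing the query against every vector. The names `toy_index` and `exact_knn` are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_knn(query, vectors, k):
    """Return the ids of the k vectors closest to query, nearest first."""
    ranked = sorted(vectors, key=lambda doc_id: squared_l2(query, vectors[doc_id]))
    return ranked[:k]

# Hypothetical two-dimensional vectors keyed by document id.
toy_index = {
    "doc-a": [0.0, 0.0],
    "doc-b": [1.0, 1.0],
    "doc-c": [0.2, 0.1],
    "doc-d": [5.0, 5.0],
}

nearest = exact_knn([0.1, 0.1], toy_index, k=2)
print(nearest)
```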

<!-- ---
#### OpenSearch Approximate Nearest Neighbor Algorithms and Engines
![ANN algorithm](./static/ann-algorithm.png)

--- -->

<!-- #### HNSW parameter tuning
![hnsw parameter tuning](./static/hnsw-parameter-tuning.png) -->

<!-- ---

#### IVF parameter tuning
![ivf parameter tuning](./static/ivf-parameter-tuning.png)

#### How to select the engine and algorithms
![opensearch ann comparison](./static/opensearch-ann-selection.png)  -->

In [38]:
knn_index = {
    "settings": {
        "index.knn": True,
        #"index.knn.space_type": "cosinesimil"
    },
    "mappings": {
        "properties": {
            "item_vector": {
                "type": "knn_vector",
                "dimension": 1024,
                "store": True,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "nmslib",
                    "parameters": {
                      "ef_construction": 128,
                      "m": 24
                    }
                }
            },
            "item_content": {
                "type": "text",
                "store": True
            },
            "company_name": {
                "type": "text",
                "store": True
            }
        }
    }
}


Using the index definition above, we now create the index in Amazon OpenSearch.

In [39]:
index_name="10k_financial"

exist=False
try:
    aos_client.indices.get(index=index_name)
    exist=True
except Exception as e:
    exist=False

if exist:
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=index_name)
else:
    print("index does not exist.")
    
aos_client.indices.create(index=index_name,body=knn_index,ignore=400)

delete existing index before creating new one


{'acknowledged': True, 'shards_acknowledged': True, 'index': '10k_financial'}

In [42]:
aos_client.indices.get(index=index_name)
# you can verify the mappings

{'10k_financial': {'aliases': {},
  'mappings': {'properties': {'company_name': {'type': 'text', 'store': True},
    'item_content': {'type': 'text', 'store': True},
    'item_vector': {'type': 'knn_vector',
     'store': True,
     'dimension': 1024,
     'method': {'engine': 'nmslib',
      'space_type': 'l2',
      'name': 'hnsw',
      'parameters': {'ef_construction': 128, 'm': 24}}}}},
  'settings': {'index': {'number_of_shards': '2',
    'provided_name': '10k_financial',
    'knn': 'true',
    'creation_date': '1729220740822',
    'number_of_replicas': '0',
    'uuid': 'qcqWnZIBsSQ232iRhraL',
    'version': {'created': '136327827'}}}}}

###  Load the raw data into the Index
Next, let's load the 10-K filing data into the index you've just created.

In [45]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
import pandas

from typing import Any, Dict, List, Optional, Sequence

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class PandasDataFrameLoader(BaseLoader):
    
    def __init__(self,dataframe:pandas.DataFrame):
        self.dataframe=dataframe
        
    def load(self) -> List[Document]:
        docs = []
        items=["item_1","item_1A","item_1B","item_2","item_3","item_4","item_5","item_6","item_7","item_7A","item_8","item_9","item_9A", "item_9B", "item_10", "item_11", "item_12", "item_13", "item_14", "item_15"]
        
        for index, row in self.dataframe.iterrows():
            # Metadata shared by every item of this filing; add more fields as needed
            base_metadata = {
                "cik": row['cik'],
                "company_name": row['company'],
                "filing_date": row['filing_date'],
            }
            for item in items:
                content = row[item]
                # Build a separate metadata dict per document so each one keeps its own item label
                metadata = {**base_metadata, "item": item}
                doc = Document(page_content=content, metadata=metadata)
                docs.append(doc)
        return docs

Use the Bedrock embedding model to convert each chunk of item content into a vector, then use the OpenSearch bulk API to store the data in the OpenSearch index.

In [46]:
import time
from opensearchpy import helpers

def ingest_downloaded_10k_into_opensearch(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    item_contents=[]
    company_name=splitted_documents[0].metadata['company_name']
    for doc in splitted_documents:
        item_contents.append(doc.page_content)
    
    print("\ncompany:" + company_name + ", item count:" + str(len(item_contents)))
    start = time.time()
    embedding_results = bedrock_embeddings.embed_documents(item_contents)
    end = time.time()
    elapsed = end - start
    #print(f"total time elapsed for Bedrock embedding: {elapsed:.2f} seconds")
        
    # Pair each chunk with its embedding and build the bulk-ingest actions
    data = []
    for content, vector in zip(item_contents, embedding_results):
        data.append({"_index": index_name, "company_name": company_name, "item_content": content, "item_vector": vector})
    aos_response = helpers.bulk(aos_client, data)
    print(f"Bulk-inserted {aos_response[0]} items.")
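
The ingest function above splits each item into chunks of at most 8000 characters with a 200-character overlap before embedding. The sketch below shows what `chunk_size` and `chunk_overlap` mean using a simplified fixed-window splitter. It is not the `RecursiveCharacterTextSplitter` algorithm, which prefers to break on separators such as paragraph and line breaks; `split_with_overlap` is a hypothetical helper written for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears intact in at least one chunk.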

In [47]:
# Load the data into the OpenSearch Serverless collection. Note: this takes some time to complete.
for file in company_filing_file_name_list:
    ingest_downloaded_10k_into_opensearch(file)
    print("Ingested :" + file)


company:SALESFORCE.COM, INC., item count:70
Bulk-inserted 70 items.
Ingested :extracted/1108524_10K_2021_0001108524-22-000008.json

company:Datadog, Inc., item count:67
Bulk-inserted 67 items.
Ingested :extracted/1561550_10K_2020_0001564590-21-009770.json

company:Palantir Technologies Inc., item count:88
Bulk-inserted 88 items.
Ingested :extracted/1321655_10K_2020_0001193125-21-060650.json

company:MongoDB, Inc., item count:76
Bulk-inserted 76 items.
Ingested :extracted/1441816_10K_2021_0001441816-21-000051.json

company:MICROSOFT CORP, item count:57
Bulk-inserted 57 items.
Ingested :extracted/789019_10K_2021_0001564590-21-039151.json

company:Palo Alto Networks Inc, item count:67
Bulk-inserted 67 items.
Ingested :extracted/1327567_10K_2021_0001327567-21-000029.json

company:Okta, Inc., item count:96
Bulk-inserted 96 items.
Ingested :extracted/1660134_10K_2021_0001660134-21-000007.json

company:Elastic N.V., item count:79
Bulk-inserted 79 items.
Ingested :extracted/1707753_10K_2021_0

To validate the load, you can query the number of documents in the index.

In [48]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])

Records found: 464.


In [49]:
# Now you can rerun the same queries for which full-text search did not return relevant results.
query_text="What Microsoft's research and development organization is responsible for?"

Run the query and check the search result. 

In [50]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

Unnamed: 0,_id,_score,company_name,item_content
0,1%3A0%3AkM2ZnZIBE0VMSqox9MxP,0.471287,MICROSOFT CORP,"•\nExperiences and Devices, focuses on instilling a unifying product ethos across our end-user experiences and devices, including Office, Windows, Enterprise Mobility + Security, and Surface.\n•\nAI and Research, focuses on our AI innovations and other forward-looking research and development efforts spanning infrastructure, services, applications, and search.\n•\nLinkedIn, focuses on our services that transform the way customers hire, market, sell, and learn.\n•\nGaming, focuses on developi..."
1,1%3A0%3A3W6ZnZIB70AM0i6qg0Fr,0.454781,"Datadog, Inc.","Research and Development\nOur research and development organization is responsible for the design, development, testing and delivery of new technologies, features and integrations of our platform, as well as the continued improvement and iteration of our existing products. It is also responsible for operating and scaling our platform including the underlying cloud infrastructure. Our research and development investments seek to drive core technology innovation and bring new products to marke..."
2,1%3A0%3Aj82ZnZIBE0VMSqox9MxP,0.437216,MICROSOFT CORP,"•\nConstraints in the supply chain of device components.\n•\nPiracy.\nWindows Commercial revenue, which includes volume licensing of the Windows operating system and Windows cloud services such as Microsoft Defender Advanced Threat Protection, is affected mainly by the demand from commercial customers for volume licensing and Software Assurance (“SA”), as well as advanced security offerings. Windows Commercial revenue often reflects the number of information workers in a licensed enterprise ..."
3,1%3A0%3Ai82ZnZIBE0VMSqox9MxP,0.434114,MICROSOFT CORP,"Item 1\nThe investments we make in sustainability carry through to our products, services, and devices. We design our devices, from Surface to Xbox, to minimize their impact on the environment. Our cloud and AI services and datacenters help businesses cut energy consumption, reduce physical footprints, and design sustainable products. We also pledged a $50 million investment in AI for Earth to accelerate innovation by putting AI in the hands of those working to directly address sustainabilit..."
4,1%3A0%3AjM2ZnZIBE0VMSqox9MxP,0.431347,MICROSOFT CORP,"We strive to include others by holding ourselves accountable for diversity, driving global systemic change in our workplace and workforce, and creating an inclusive work environment. Through this commitment we can allow everyone the chance to be their authentic selves and do their best work every day. We support multiple highly active Employee Resource Groups for women, families, racial and ethnic minorities, military, people with disabilities, or who identify as LGBTQI+, where employees can..."
5,1%3A0%3Al82ZnZIBE0VMSqox9MxP,0.421236,MICROSOFT CORP,"Issues in the use of AI in our offerings may result in reputational harm or liability. We are building AI into many of our offerings and we expect this element of our business to grow. We envision a future in which AI operating in our devices, applications, and the cloud helps our customers be more productive in their work and personal lives. As with many disruptive innovations, AI presents risks and challenges that could affect its adoption, and therefore our business. AI algorithms may be ..."
6,1%3A0%3Aks2ZnZIBE0VMSqox9MxP,0.418061,MICROSOFT CORP,"Ms. Hogan was appointed Executive Vice President, Human Resources in November 2014. Prior to that Ms. Hogan was Corporate Vice President of Microsoft Services. She also served as Corporate Vice President of Customer Service and Support. Ms. Hogan joined Microsoft in 2003. Ms. Hogan also serves on the Board of Directors of Alaska Air Group, Inc.\nMs. Hood was appointed Executive Vice President and Chief Financial Officer in July 2013, subsequent to her appointment as Chief Financial Officer i..."
7,1%3A0%3Ajs2ZnZIBE0VMSqox9MxP,0.417,MICROSOFT CORP,"•\nEnterprise Services, including Premier Support Services and Microsoft Consulting Services.\nServer Products and Cloud Services\nAzure is a comprehensive set of cloud services that offer developers, IT professionals, and enterprises freedom to build, deploy, and manage applications on any platform or device. Customers can use Azure through our global network of datacenters for computing, networking, storage, mobile and web application services, AI, IoT, cognitive services, and machine lear..."
8,1%3A0%3Ao82ZnZIBE0VMSqox9MxP,0.416729,MICROSOFT CORP,"Item 7\nITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS\nThe following Management’s Discussion and Analysis of Financial Condition and Results of Operations (“MD&A”) is intended to help the reader understand the results of operations and financial condition of Microsoft Corporation. MD&A is provided as a supplement to, and should be read in conjunction with, our consolidated financial statements and the accompanying Notes to Financial Statements ..."
9,1%3A0%3AbM2anZIBE0VMSqox080m,0.412077,"Alteryx, Inc.","Competitive Pay and Benefits\nWe strive to provide pay, comprehensive benefits and services that help meet the varying needs of our associates. Our total rewards package includes market-competitive pay, including equity compensation, paid time off, and other comprehensive and competitive global benefits. For example, in the United States, we provide 12 weeks of paid parental leave for all new parents (either through birth or adoption). And, for all of our associates, we offer competitive fin..."
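
The `_score` values above are not cosine similarities. To the best of our understanding of the OpenSearch k-NN documentation, an index that uses the `l2` space type reports `score = 1 / (1 + d)`, where `d` is the squared Euclidean distance between the query vector and the document vector. The sketch below applies that formula under this assumption; the helper names are hypothetical, and you should confirm the formula against the documentation for your OpenSearch version.

```python
def l2_score(a, b):
    """Score for the l2 space type, assuming score = 1 / (1 + squared L2 distance)."""
    d = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + d)

def score_to_squared_distance(score):
    """Invert the formula above to recover the squared distance from a score."""
    return 1.0 / score - 1.0

identical = l2_score([1.0, 0.0], [1.0, 0.0])    # distance 0
orthogonal = l2_score([1.0, 0.0], [0.0, 1.0])   # squared distance 2

# Score of the first hit in the result table above.
top_hit_distance = score_to_squared_distance(0.471287)

print(identical, orthogonal, round(top_hit_distance, 3))
```

Under this formula a higher score means a smaller distance, and a score of 1.0 would mean the query and document vectors are identical.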


In [51]:
query_text="What is Microsoft's main revenue?"

Run the query and check the search result. 

In [52]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

Unnamed: 0,_id,_score,company_name,item_content
0,1%3A0%3AuM2ZnZIBE0VMSqox9MxP,0.55572,MICROSOFT CORP,"Our More Personal Computing segment consists of products and services that put customers at the center of the experience with our technology. This segment primarily comprises:\n•\nWindows, including Windows OEM licensing and other non-volume licensing of the Windows operating system; Windows Commercial, comprising volume licensing of the Windows operating system, Windows cloud services, and other Windows commercial offerings; patent licensing; Windows Internet of Things; and MSN advertising...."
1,1%3A0%3Ao82ZnZIBE0VMSqox9MxP,0.534775,MICROSOFT CORP,"Item 7\nITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS\nThe following Management’s Discussion and Analysis of Financial Condition and Results of Operations (“MD&A”) is intended to help the reader understand the results of operations and financial condition of Microsoft Corporation. MD&A is provided as a supplement to, and should be read in conjunction with, our consolidated financial statements and the accompanying Notes to Financial Statements ..."
2,1%3A0%3Arc2ZnZIBE0VMSqox9MxP,0.531813,MICROSOFT CORP,"The consolidated financial statements include the accounts of Microsoft Corporation and its subsidiaries. Intercompany transactions and balances have been eliminated.\nEstimates and Assumptions\nPreparing financial statements requires management to make estimates and assumptions that affect the reported amounts of assets, liabilities, revenue, and expenses. Examples of estimates and assumptions include: for revenue recognition, determining the nature and timing of satisfaction of performance..."
3,1%3A0%3Aj82ZnZIBE0VMSqox9MxP,0.529577,MICROSOFT CORP,"•\nConstraints in the supply chain of device components.\n•\nPiracy.\nWindows Commercial revenue, which includes volume licensing of the Windows operating system and Windows cloud services such as Microsoft Defender Advanced Threat Protection, is affected mainly by the demand from commercial customers for volume licensing and Software Assurance (“SA”), as well as advanced security offerings. Windows Commercial revenue often reflects the number of information workers in a licensed enterprise ..."
4,1%3A0%3ApM2ZnZIBE0VMSqox9MxP,0.510958,MICROSOFT CORP,"PART II\nItem 7\nChange in Accounting Estimate\nIn July 2020, we completed an assessment of the useful lives of our server and network equipment and determined we should increase the estimated useful life of server equipment from three years to four years and increase the estimated useful life of network equipment from two years to four years. This change in accounting estimate was effective beginning fiscal year 2021. Based on the carrying amount of server and network equipment included in ..."
5,1%3A0%3Apc2ZnZIBE0VMSqox9MxP,0.498688,MICROSOFT CORP,"Gross margin increased $18.9 billion or 20% driven by growth across each of our segments and the change in estimated useful lives of our server and network equipment. Gross margin percentage increased with the change in estimated useful lives of our server and network equipment. Excluding this impact, gross margin percentage decreased slightly driven by gross margin percentage reduction in More Personal Computing. Commercial cloud gross margin percentage increased 4 points to 71% driven by g..."
6,1%3A0%3Ajs2ZnZIBE0VMSqox9MxP,0.468894,MICROSOFT CORP,"•\nEnterprise Services, including Premier Support Services and Microsoft Consulting Services.\nServer Products and Cloud Services\nAzure is a comprehensive set of cloud services that offer developers, IT professionals, and enterprises freedom to build, deploy, and manage applications on any platform or device. Customers can use Azure through our global network of datacenters for computing, networking, storage, mobile and web application services, AI, IoT, cognitive services, and machine lear..."
7,1%3A0%3ArM2ZnZIBE0VMSqox9MxP,0.463083,MICROSOFT CORP,"Item 8\nITEM 8. FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA\nINCOME STATEMENTS\n(In millions, except per share amounts)\nYear Ended June 30,\nRevenue:\nProduct\n$\n71,074\n$\n68,041\n$\n66,069\nService and other\n97,014\n74,974\n59,774\nTotal revenue\n168,088\n143,015\n125,843\nCost of revenue:\nProduct\n18,219\n16,017\n16,273\nService and other\n34,013\n30,061\n26,637\nTotal cost of revenue\n52,232\n46,078\n42,910\nGross margin\n115,856\n96,937\n82,933\nResearch and development\n20,716\n19,..."
8,1%3A0%3AC82ZnZIBE0VMSqoxX8x3,0.44714,"SALESFORCE.COM, INC.","Highlights from the Fiscal Year 2021.\n•Revenue: Total fiscal 2021 revenue was $21.3 billion, an increase of 24 percent year-over-year.\n•Earnings per Share: Fiscal 2021 diluted earnings per share was $4.38 as compared to earnings per share of $0.15 from a year ago, and was benefited by approximately $2.0 billion from the one-time discrete tax benefit resulting from the recognition of deferred tax assets related to an intra-entity transfer of intangible property and an unrealized gain of $1...."
9,1%3A0%3Atc2ZnZIBE0VMSqox9MxP,0.446558,MICROSOFT CORP,"As of June 30, 2021, we had federal, state, and foreign net operating loss carryforwards of $304 million, $1.3 billion, and $2.0 billion, respectively. The federal and state net operating loss carryforwards will expire in various years from fiscal 2022 through 2041, if not utilized. The majority of our foreign net operating loss carryforwards do not expire. Certain acquired net operating loss carryforwards are subject to an annual limitation but are expected to be realized with the exception..."


### 2.3 Retrieval Augmented Generation (RAG)

In RAG, external data can be sourced from various data sources, such as document repositories, databases, or APIs. The first step is to convert the documents and the user query into a format that enables comparison and allows for performing relevancy search. To achieve comparability for relevancy search, a document collection (knowledge library) and the user-submitted query are transformed into numerical representations using embedding language models. These embeddings are essentially numerical representations of concepts in text.

Next, based on the embedding of the user query, relevant text is identified in the document collection through similarity search in the embedding space. The prompt provided by the user is then combined with the retrieved text, which is added as context. This updated prompt, which includes the relevant external data along with the original prompt, is sent to the large language model (LLM) for processing. Because the context contains the relevant external data, the model output tends to be more relevant and accurate.

The major components of RAG include embedding, the vector database, retrieval (augmentation), and generation:


- **Embedding**: Purpose: Embeddings transform text data into numerical vectors in a high-dimensional space. These vectors represent the semantic meaning of the text. Process: The embedding process typically uses pre-trained models (like BERT or a variant) to convert both the input queries and the documents in the database into dense vectors. Role in RAG: Embeddings are crucial for the retrieval component as they allow the model to compute the similarity between the query and the documents in the database efficiently.
- **Vector Database**: Function: A vector database stores the embeddings of a large collection of documents or passages. Construction: It is created by processing a vast corpus (like Wikipedia or other specialized datasets) through an embedding model. Usage in RAG: When a query comes in, the model searches this database to find the documents whose embeddings are most similar to the embedding of the query.
- **Retrieval (Augmentation)**: Mechanism: The retrieval part of RAG functions by taking the input query, converting it into a vector using embeddings, and then searching the vector database to retrieve relevant documents. Result: It augments the original query with additional context by selecting documents or passages that are semantically related to the query. This augmented information is essential for generating more informed responses.
- **Generation**: Integration with a Language Model: The generative component, often a large language model like Amazon Titan Text, receives both the original query and the retrieved documents. Response Generation: It synthesizes information from these inputs to produce a coherent and contextually appropriate response. 
- **Training and Fine-Tuning**: The generative model is generally pre-trained on vast amounts of text and may be further fine-tuned to optimize its performance for specific tasks or datasets.
- **End-to-End Training (Optional)**: Joint Optimization: In RAG, both retrieval and generation components can be fine-tuned together, allowing the system to optimize the selection of documents and the generation of responses simultaneously. Feedback Loop: The model learns not only to generate relevant responses but also to retrieve the most useful documents for any given query.
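
The embed, retrieve, augment flow described above can be sketched end to end in a few lines. This is a toy illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the passages are invented rather than taken from any filing, and the generation step is left out (the assembled prompt is what would be sent to the LLM).

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, passages, k):
    """Retrieval: rank passages by similarity to the question and keep the top k."""
    q = toy_embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, toy_embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_passages):
    """Augmentation: place the retrieved passages in the prompt as context."""
    context = "\n\n".join(context_passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Invented passages, for illustration only.
passages = [
    "The cloud platform offers services for developers and enterprises.",
    "Research and development expenses include payroll and engineering costs.",
    "The company leases office space in several cities.",
]
question = "What do research and development expenses include?"

top = retrieve(question, passages, k=1)
prompt = build_prompt(question, top)
print(prompt)
```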

---
### Architecture

![RAG](./static/RAG_Architecture.png)

---

In [53]:
langchain_index_name="10k_financial_embedding"

In [54]:
exist=False
try:
    aos_client.indices.get(index=langchain_index_name)
    exist=True
except Exception as e:
    exist=False

if exist:
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=langchain_index_name)
else:
    print("index does not exist.")
    

delete existing index before creating new one


In [55]:
from langchain.vectorstores import OpenSearchVectorSearch
from typing import Callable
from requests_aws4auth import AWS4Auth

os_domain_ep = 'https://'+aoss_host

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, aws_region, service, session_token=credentials.token)


def ingest_10k_into_opensearch_with_langchain(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    OpenSearchVectorSearch.from_documents(
                index_name = langchain_index_name,
                documents=splitted_documents,
                embedding=bedrock_embeddings,
                opensearch_url=os_domain_ep,
                http_auth=awsauth,
                timeout=600,
                use_ssl=True,
                verify_certs=True,
                connection_class=RequestsHttpConnection,
    )

In [56]:
## In this step, you index the selected companies' 10-K files into the vector store (Amazon OpenSearch Serverless collection) using LangChain
## This will take a while to load all the chunks

for file in company_filing_file_name_list:
    ingest_10k_into_opensearch_with_langchain(file)
    print("Ingested :" + file)


Ingested :extracted/1108524_10K_2021_0001108524-22-000008.json
Ingested :extracted/1561550_10K_2020_0001564590-21-009770.json
Ingested :extracted/1321655_10K_2020_0001193125-21-060650.json
Ingested :extracted/1441816_10K_2021_0001441816-21-000051.json
Ingested :extracted/789019_10K_2021_0001564590-21-039151.json
Ingested :extracted/1327567_10K_2021_0001327567-21-000029.json
Ingested :extracted/1660134_10K_2021_0001660134-21-000007.json
Ingested :extracted/1707753_10K_2021_0001707753-21-000026.json
Ingested :extracted/1640147_10K_2021_0001640147-21-000073.json
Ingested :extracted/1050446_10K_2020_0001564590-21-005783.json
Ingested :extracted/1689923_10K_2020_0001689923-21-000024.json
Ingested :extracted/1341439_10K_2021_0001564590-21-033616.json


In [57]:
aos_client.indices.get(index=langchain_index_name)

{'10k_financial_embedding': {'aliases': {},
  'mappings': {'properties': {'id': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'metadata': {'properties': {'cik': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'company_name': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'filing_date': {'type': 'date'},
      'item': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}},
    'text': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'vector_field': {'type': 'knn_vector',
     'dimension': 1024,
     'method': {'engine': 'nmslib',
      'space_type': 'l2',
      'name': 'hnsw',
      'parameters': {'ef_construction': 512, 'm': 16}}}}},
  'settings': {'index': {'number_of_shards': '2',
    'knn.algo_param': {'ef_search': '512'},
    'provided_name': '10k_financial_embedd

In [58]:

class SimiliarOpenSearchVectorSearch(OpenSearchVectorSearch):
    
    def relevance_score(self, distance: float) -> float:
        return distance
    
    def _select_relevance_score_fn(self) -> Callable[[float], float]:
        return self.relevance_score


open_search_vector_store = SimiliarOpenSearchVectorSearch(
                                    index_name=langchain_index_name,
                                    embedding_function=bedrock_embeddings,
                                    opensearch_url=os_domain_ep,
                                    http_auth=awsauth,
                                    timeout=600,
                                    use_ssl=True,
                                    verify_certs=True,
                                    connection_class=RequestsHttpConnection,
                                    ) 

Initialize Bedrock LLM model with Claude

In [59]:
from langchain_aws import BedrockLLM, ChatBedrock

#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0", client=boto3_bedrock)
bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=boto3_bedrock)
#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-opus-20240229-v1:0", client=boto3_bedrock)

bedrock_llm.model_kwargs = {"temperature":0.001,"top_k":300,"top_p":1}


#### Note: The prompts in this section are designed for Claude 3. Results may differ with other LLMs, for example due to the impact of guardrails.

In [60]:
from langchain.chains import RetrievalQA
import langchain

bedrock_retriever = open_search_vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        'k': 5,
        'score_threshold': 0.005
    }
)
rag_qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm,
    retriever=bedrock_retriever,
    chain_type="stuff" #stuff, refine, map_reduce, and map_rerank
)
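
As a rough illustration of what the `similarity_score_threshold` search type does with `k` and `score_threshold`, the sketch below keeps the top `k` scored chunks and then drops any whose score falls below the threshold. The function name and the scores are hypothetical, and this is not LangChain's implementation; it only assumes that a higher score means a closer match.

```python
from typing import List, Tuple


def filter_by_score_threshold(
    scored_docs: List[Tuple[str, float]], k: int, score_threshold: float
) -> List[Tuple[str, float]]:
    """Keep at most k (doc, score) pairs whose score is >= score_threshold."""
    # Highest score first, then cut to the k best candidates
    top_k = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:k]
    # Drop candidates that do not reach the threshold
    return [(doc, score) for doc, score in top_k if score >= score_threshold]


# Hypothetical scores, chosen only for illustration
candidates = [("chunk-a", 0.012), ("chunk-b", 0.004), ("chunk-c", 0.009)]
kept = filter_by_score_threshold(candidates, k=5, score_threshold=0.005)
print(kept)
```

With a threshold this low, most retrieved chunks pass through, so `k` is what mainly bounds the context handed to the LLM.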

In [61]:
question="What Microsoft's research and development organization is responsible for?"

langchain.debug=True
result = rag_qa.invoke({"query": question})


[chain/start] [chain:RetrievalQA] Entering Chain run with input:
{
  "query": "What Microsoft's research and development organization is responsible for?"
}
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "What Microsoft's research and development organization is responsible for?",
  "context": "•\nExperiences and Devices, focuses on instilling a unifying product ethos across our end-user experiences and devices, including Office, Windows, Enterprise Mobility + Security, and Surface.\n•\nAI and Research, focuses on our AI innovations and other forward-looking research and development efforts spanning infrastructure, services, applications, and search.\n•\nLinkedIn, focuses on our services that transform the way customers hire, market, 

In [62]:
print("Result:" + result["result"])

Result:According to the information provided, Microsoft's research and development organization is responsible for:

- The design, development, testing and delivery of new technologies, features and integrations of Microsoft's platform.
- The continued improvement and iteration of Microsoft's existing products.
- Operating and scaling Microsoft's platform including the underlying cloud infrastructure.

Specifically, it mentions that Microsoft's research and development team consists of software engineering, product management, development and site reliability engineering teams. The research and development investments seek to drive core technology innovation and bring new products to market.


In [63]:
question="What is Microsoft main revenue?"

langchain.debug=False
result = rag_qa.invoke({"query": question})


In [64]:
print("Result:" + result["result"])

Result:Microsoft's main revenue comes from its commercial cloud business, which includes Azure, Office 365 Commercial, the commercial portion of LinkedIn, Dynamics 365, and other commercial cloud properties.

Specifically, in fiscal year 2021:

- Commercial cloud revenue was $69.1 billion, up 34% year-over-year. This was Microsoft's largest revenue stream.

- Within the commercial cloud, the biggest drivers were Azure revenue (up 50%) and Office 365 Commercial revenue (up 22%).

- Other major revenue streams included Windows OEM, Server products, Office Consumer/Microsoft 365 Consumer subscriptions, Xbox content and services, Search advertising, and Surface devices.

So while Microsoft has a diversified revenue base across productivity software, operating systems, devices, gaming, search advertising etc., the commercial cloud business centered around Azure, Office 365 Commercial, and other cloud offerings for enterprises is now Microsoft's biggest and fastest growing revenue driver.


## Part 3: AI agent powered search

![standard rag limitation](./static/rag-limitation.png)

### What is an AI agent?
An agentic LLM assistant employs a chain-of-thought reasoning process, in which the LLM is prompted to think step by step through a question, interleaving its reasoning with the use of external tools such as search engines and APIs. This allows the LLM to retrieve relevant information that answers partial aspects of the question, ultimately leading to a more comprehensive and accurate final response. This approach is inspired by the "Reason and Act" (ReAct) design introduced in the paper [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/pdf/2210.03629), which aims to synergize the reasoning capabilities of language models with the ability to interact with external resources and take actions. By combining these two facets, an agentic LLM assistant can provide more informed and well-rounded answers to complex user queries.

### Why build an AI agent?
In today's digital landscape, enterprises are inundated with a vast array of data sources, ranging from traditional PDF documents to complex SQL and NoSQL databases, and everything in between. While this wealth of information holds immense potential for gaining valuable insights and driving operational efficiency, the sheer volume and diversity of data can often pose significant challenges in terms of accessibility and utilization.

This is where the power of agentic LLM assistants comes into play. By leveraging progress in LLM design patterns such as Reason and Act (ReAct) and other traditional or novel design patterns, these intelligent assistants are capable of integrating with an enterprise's diverse data sources. Through the development of specialized tools tailored to each data source and the ability of LLM agents to identify the right tool for a given question, agentic LLM assistants can simplify how you navigate and extract relevant information, regardless of its origin or structure.

This enables a rich, multi-source conversation that promises to unlock the full potential of enterprise data, supporting data-driven decision-making, enhancing operational efficiency, and ultimately driving productivity and growth.


### Architecture

The following is the overall architecture of agent-based financial filings analysis:

![AI Agent powered search architecture](./static/architecture.png)

---

### Data Flow

When the user submits a query, the AI agent first judges whether the query is related to financial statements. If so, the agent uses vector search to retrieve similar financial statements for the company from OpenSearch. If OpenSearch holds no financial statements for the company, the agent downloads the data from the internet by calling the SEC API, ingests it into OpenSearch, and runs the search again. Once related financial statements are found, the agent checks whether the query is also a stock price related question. If so, the agent queries the Redshift database to get the company's stock price data. Finally, the LLM generates the response from all the collected data. The overall data flow is as follows:

![AI Agent powered search data flow](./static/ai-agent-search-data-flow.png)
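
The data flow described above can be sketched as plain control flow. In the real agent the LLM decides which tool to call at each step, so the fixed branching, the `route_query` function, and the stub tools below are all hypothetical and only meant to make the ordering of the steps concrete.

```python
from typing import Callable, Dict, List


def route_query(query: str, tools: Dict[str, Callable[[str], bool]]) -> List[str]:
    """Return the ordered list of steps the agent would take for a query."""
    # Queries that are not about financial statements are rejected up front
    if not tools["is_financial"](query):
        return ["reject"]
    steps = ["vector_search"]
    # No statements indexed yet: download, ingest, then search again
    if not tools["has_statements"](query):
        steps += ["download_and_ingest", "vector_search"]
    # Stock related questions additionally need price data
    if tools["is_stock"](query):
        steps.append("query_stock_price")
    steps.append("generate_answer")
    return steps


# Stub tools: keyword checks stand in for the LLM-based classifiers and the search
stub_tools = {
    "is_financial": lambda q: "invest" in q.lower() or "financial" in q.lower(),
    "has_statements": lambda q: "microsoft" in q.lower(),  # pretend only Microsoft is indexed
    "is_stock": lambda q: "invest" in q.lower(),
}

print(route_query("Is Amazon a good investment choice right now?", stub_tools))
print(route_query("What is OpenSearch?", stub_tools))
```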

### 3.1 Prepare other tools used by AI agent

#### 3.1.1 Ingest and query structured data in Redshift

Get Redshift Serverless username, password and endpoint

In [None]:
# This is an AWS Secrets Manager client (not KMS), so name it accordingly
secrets_client = boto3.client('secretsmanager')

redshift_serverless_credentials = json.loads(secrets_client.get_secret_value(SecretId=outputs['RedshiftServerlessSecret'])['SecretString'])
redshift_serverless_username=redshift_serverless_credentials['username']
redshift_serverless_password=redshift_serverless_credentials['password']
redshift_serverless_endpoint =  outputs['RedshiftServerlessEndpoint']

Create the `stock_symbol` table and populate it from S3. This table is used to look up a company's stock ticker by company name.

In [None]:
import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import Session
%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = URL.create(
drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
host=redshift_serverless_endpoint, 
port=5439,
database='dev',
username=redshift_serverless_username,
password=redshift_serverless_password
)

%sql $connect_to_db
%sql select current_user, version();

%sql CREATE TABLE IF NOT EXISTS public.stock_symbol (stock_symbol text PRIMARY KEY, company_name text NOT NULL);

stock_price_bucket = outputs["s3BucketStock"]
s3_location = f's3://{stock_price_bucket}/stock-price/'
print(s3_location)
!aws s3 sync ./stock-price/ $s3_location

stock_symbol_s3_location = f's3://{stock_price_bucket}/stock-price/stock_symbol.csv'
quoted_stock_symbol_s3_location = "'" + stock_symbol_s3_location + "'"

%sql COPY STOCK_SYMBOL FROM $quoted_stock_symbol_s3_location iam_role default IGNOREHEADER 1 CSV;


url = URL.create(
    drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
    host=redshift_serverless_endpoint, 
    port=5439,
    database='dev',
    username=redshift_serverless_username,
    password=redshift_serverless_password
)

engine = sa.create_engine(url)
redshift_connection = engine.connect()
    
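def query_stock_ticker(company_name):
    # Bind the search pattern as a parameter instead of concatenating it into the SQL
    # string, so quotes in the company name cannot break or alter the query.
    # ILIKE is already case-insensitive, so lower() is not needed.
    sql = sa.text("SELECT stock_symbol FROM stock_symbol WHERE company_name ILIKE :pattern")
    stock_ticker=''
    try:
        result = redshift_connection.execute(sql, {"pattern": "%" + company_name + "%"})
        row = result.fetchone()
        if row is not None:
            stock_ticker = row[0]
    except Exception as e:
        print(e)
    return stock_ticker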


In [None]:
query_stock_ticker("Amazon")

In [None]:
%sql CREATE TABLE IF NOT EXISTS public.stock_price (stock_date DATE, stock_symbol text, open_price DECIMAL, high_price DECIMAL, low_price DECIMAL, close_price DECIMAL, adjusted_close_price DECIMAL, volume DECIMAL);

msft_s3_location = f's3://{stock_price_bucket}/stock-price/MSFT.csv'
quoted_msft_s3_location = "'" + msft_s3_location + "'"
print(quoted_msft_s3_location)
print("---------")

crm_s3_location = f's3://{stock_price_bucket}/stock-price/CRM.csv'
quoted_crm_s3_location = "'" + crm_s3_location + "'"
print(quoted_crm_s3_location)
print("---------")

orcl_s3_location = f's3://{stock_price_bucket}/stock-price/ORCL.csv'
quoted_orcl_s3_location = "'" + orcl_s3_location + "'"
print(quoted_orcl_s3_location)
print("---------")

snow_s3_location = f's3://{stock_price_bucket}/stock-price/SNOW.csv'
quoted_snow_s3_location = "'" + snow_s3_location + "'"
print(quoted_snow_s3_location)
print("---------")

%sql COPY STOCK_PRICE FROM $quoted_msft_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_crm_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_orcl_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_snow_s3_location iam_role default IGNOREHEADER 1 CSV;


In [None]:
%sql select * from public.stock_price

In [None]:
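def query_stock_price(stock_ticker):
    # Bind the ticker as a parameter rather than concatenating it into the SQL string.
    # ORDER BY makes "limit 100" return the most recent rows instead of an arbitrary 100.
    sql = sa.text(
        "SELECT stock_date, stock_symbol, open_price, high_price, low_price, close_price "
        "FROM stock_price WHERE stock_symbol = :ticker ORDER BY stock_date DESC limit 100"
    )
    # Start from an empty frame so the function still returns a value if the query fails
    stock_price = pd.DataFrame()
    try:
        result = redshift_connection.execute(sql, {"ticker": stock_ticker})
        stock_price = pd.DataFrame(result.fetchall(), columns=list(result.keys()))
    except Exception as e:
        print(e)
    return stock_price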

In [None]:
query_stock_price('MSFT')
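
Because the company name passed to the ticker lookup is extracted from free-form user input, it is worth passing it to the database as a bound parameter rather than splicing it into the SQL text. The sketch below shows the idea with the standard-library `sqlite3` module standing in for Redshift (an assumption made so the example runs anywhere); the `lookup_ticker` helper and the two sample rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for the Redshift stock_symbol table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_symbol (stock_symbol TEXT PRIMARY KEY, company_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO stock_symbol VALUES (?, ?)",
    [("MSFT", "Microsoft Corporation"), ("ORCL", "Oracle Corporation")],
)


def lookup_ticker(company_name: str) -> str:
    """Return the first matching ticker, or '' when nothing matches."""
    # The pattern is passed as a bound parameter, never concatenated into the SQL
    row = conn.execute(
        "SELECT stock_symbol FROM stock_symbol WHERE company_name LIKE ?",
        (f"%{company_name}%",),
    ).fetchone()
    return row[0] if row else ""


print(lookup_ticker("microsoft"))
# A quote-laden input is treated as plain text and simply matches nothing
print(repr(lookup_ticker("x' OR '1'='1")))
```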

#### 3.1.2 Download 10-K filing from SEC
---
Create a new account at https://sec-api.io and get a free API key.



In [None]:
!pip install sec-api

##### Replace the placeholder in the following line with your sec-api key

In [None]:
sec_api_key="{security_api_key}"

In [None]:
from sec_api import ExtractorApi, QueryApi
import json
import os

def get_filings(ticker):
    global sec_api_key

    # Finding Recent Filings with QueryAPI
    queryApi = QueryApi(api_key=sec_api_key)
    query = {
      "query": f"ticker:{ticker} AND formType:\"10-K\"",
      "from": "0",
      "size": "1",
      "sort": [{ "filedAt": { "order": "desc" } }]
    }
    response = queryApi.get_filings(query)

    # Getting 10-K URL
    filing_url = response["filings"][0]["linkToFilingDetails"]
    filing_type=response['filings'][0]['formType']
    cik=response['filings'][0]['cik']
    company=response['filings'][0]['companyName']
    filing_date=response['filings'][0]['filedAt']
    period_of_report=response['filings'][0]['periodOfReport']
    filing_html_index=response['filings'][0]['linkToFilingDetails']
    complete_text_filing_link=response['filings'][0]['linkToTxt']

    # Extracting Text with ExtractorAPI
    extractorApi = ExtractorApi(api_key=sec_api_key)
    
    one_text = extractorApi.get_section(filing_url, "1", "text")       #Section 1 - Business
    onea_text = extractorApi.get_section(filing_url, "1A", "text")     # Section 1A - Risk Factors
    oneb_text = extractorApi.get_section(filing_url, "1B", "text")     # Section 1B - Unresolved Staff Comments
    two_text = extractorApi.get_section(filing_url, "2", "text")       # Section 2 - Properties
    three_text = extractorApi.get_section(filing_url, "3", "text")     # Section 3 - Legal Proceedings
    four_text = extractorApi.get_section(filing_url, "4", "text")      # Section 4 - Mine Safety Disclosures
    five_text = extractorApi.get_section(filing_url, "5", "text")      # Section 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
    six_text = extractorApi.get_section(filing_url, "6", "text")       # Section 6 - Selected Financial Data (prior to February 2021)
    seven_text = extractorApi.get_section(filing_url, "7", "text")     # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
    sevena_text = extractorApi.get_section(filing_url, "7A", "text")   # Section 7A - Quantitative and Qualitative Disclosures about Market Risk
    eight_text = extractorApi.get_section(filing_url, "8", "text")     # Section 8 - Financial Statements and Supplementary Data
    nine_text = extractorApi.get_section(filing_url, "9", "text")      # Section 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
    ninea_text = extractorApi.get_section(filing_url, "9A", "text")    # Section 9A - Controls and Procedures
    nineb_text = extractorApi.get_section(filing_url, "9B", "text")    # Section 9B - Other Information
    ten_text = extractorApi.get_section(filing_url, "10", "text")      # Section 10 - Directors, Executive Officers and Corporate Governance
    eleven_text = extractorApi.get_section(filing_url, "11", "text")   # Section 11 - Executive Compensation
    twelve_text = extractorApi.get_section(filing_url, "12", "text")   # Section 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
    thirteen_text = extractorApi.get_section(filing_url, "13", "text") # Section 13 - Certain Relationships and Related Transactions, and Director Independence
    fourteen_text = extractorApi.get_section(filing_url, "14", "text") # Section 14 - Principal Accountant Fees and Services
    fifteen_text = extractorApi.get_section(filing_url, "15", "text")  # Section 15 - Exhibits and Financial Statement Schedules
    
    data = {}
    data['filing_url'] = filing_url
    data['filing_type'] = filing_type
    data['cik'] = cik
    data['company'] = company
    data['filing_date'] = filing_date
    data['period_of_report'] = period_of_report
    data['filing_html_index'] = filing_html_index
    data['complete_text_filing_link'] = complete_text_filing_link
    
    
    data['item_1'] = one_text
    data['item_1A'] = onea_text
    data['item_1B'] = oneb_text
    data['item_2'] = two_text
    data['item_3'] = three_text
    data['item_4'] = four_text
    data['item_5'] = five_text
    data['item_6'] = six_text
    data['item_7'] = seven_text
    data['item_7A'] = sevena_text
    data['item_8'] = eight_text
    data['item_9'] = nine_text
    data['item_9A'] = ninea_text
    data['item_9B'] = nineb_text
    data['item_10'] = ten_text
    data['item_11'] = eleven_text
    data['item_12'] = twelve_text
    data['item_13'] = thirteen_text
    data['item_14'] = fourteen_text
    data['item_15'] = fifteen_text
    
    # Make sure the download directory exists
    os.makedirs("./download_filings", exist_ok=True)

    # Build the file name from the last two segments of the filing URL
    file_name = filing_url.split("/")[-2] + "-" + filing_url.split("/")[-1].split(".")[0]+".json"
    download_to = "./download_filings/" + file_name
    try:
        with open(download_to, "w") as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
    except Exception as e:
        # Report the filing URL (the original referenced an unrelated variable, url)
        print("Problem with {url}".format(url=filing_url))
        print(e)

    return file_name

In [None]:
#downloaded_file=get_filings("AMZN")
#ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file)
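
The per-item extraction in `get_filings` follows one pattern for all twenty items, so it could also be written as a loop over the item codes. The sketch below is one way to do that; `fetch_section` is a hypothetical stand-in for `extractorApi.get_section`, assumed to take the same `(filing_url, item_code, "text")` arguments used above, and `fake_fetch` only exists to make the example runnable without an API key.

```python
from typing import Callable, Dict

# Item codes of a 10-K, in the order they are extracted above
ITEM_CODES = ["1", "1A", "1B", "2", "3", "4", "5", "6", "7", "7A",
              "8", "9", "9A", "9B", "10", "11", "12", "13", "14", "15"]


def extract_items(filing_url: str,
                  fetch_section: Callable[[str, str, str], str]) -> Dict[str, str]:
    """Fetch every item as text and key it as item_<code>, e.g. item_1A."""
    return {f"item_{code}": fetch_section(filing_url, code, "text") for code in ITEM_CODES}


def fake_fetch(url: str, code: str, return_type: str) -> str:
    # Placeholder that echoes the requested item instead of calling the API
    return f"text of item {code}"


sections = extract_items("https://example.com/filing.htm", fake_fetch)
print(len(sections), sections["item_7A"])
```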

### 3.2 Create AI agent

#### Define methods used by AI agent

One popular architecture for building agents is ReAct. ReAct combines reasoning and acting in an iterative process - in fact the name "ReAct" stands for "Reason" and "Act".

The general flow looks like this:

- The model will "think" about what step to take in response to an input and any previous observations.
- The model will then choose an action from available tools (or choose to respond to the user).
- The model will generate arguments to that tool.
- The agent runtime (executor) will parse out the chosen tool and call it with the generated arguments.
- The executor will return the results of the tool call back to the model as an observation.
- This process repeats until the agent chooses to respond.
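
The steps above can be sketched as a small loop. This is a toy executor under stated assumptions, not LangChain's `AgentExecutor`: `run_react_loop`, `scripted_model`, and `toy_tools` are hypothetical names, the model is a scripted function rather than an LLM, and the XML-style tags mirror the ones used in the agent prompt later in this section.

```python
import re
from typing import Callable, Dict


def run_react_loop(model: Callable[[str], str],
                   tools: Dict[str, Callable[[str], str]],
                   question: str,
                   max_steps: int = 5) -> str:
    """Alternate between model output and tool calls until a final answer appears."""
    scratchpad = ""
    for _ in range(max_steps):
        output = model(question + scratchpad)
        final = re.search(r"<final_answer>(.*?)</final_answer>", output, re.S)
        if final:
            return final.group(1).strip()
        tool = re.search(r"<tool>(.*?)</tool>", output, re.S)
        tool_input = re.search(r"<tool_input>(.*?)</tool_input>", output, re.S)
        if not (tool and tool_input):
            # Nothing to parse: treat the raw output as the answer
            return output.strip()
        # Call the chosen tool and feed the result back as an observation
        observation = tools[tool.group(1).strip()](tool_input.group(1).strip())
        scratchpad += f"{output}<observation>{observation}</observation>"
    return "Stopped: step limit reached."


def scripted_model(prompt: str) -> str:
    # First turn: ask for a tool. Once an observation is present: answer.
    if "<observation>" not in prompt:
        return "<tool>get_stock_ticker</tool><tool_input>Microsoft</tool_input>"
    return "<final_answer>The ticker is MSFT.</final_answer>"


toy_tools = {"get_stock_ticker": lambda name: "MSFT"}
answer = run_react_loop(scripted_model, toy_tools, "What is Microsoft's ticker?")
print(answer)
```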

In [None]:
from langchain.prompts.chat import ChatPromptTemplate
from langchain.chains import LLMChain
import time

def is_financial_statement_related_query(human_input):
    #template = """You are a helpful assistant to judge if the human input is stock related question.
    #If it is stock related, answer \"yes\". Otherwise answer \"no\"."""
    template = """You are a helpful assistant to judge if the human input is trying to analyze company financial statement.
    If the human input is financial statement related question, answer \"yes\". Otherwise answer \"no\".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related

def is_stock_related_query(human_input):
    #template = """You are a helpful assistant to judge if the human input is stock related question.
    #If it is stock related, answer \"yes\". Otherwise answer \"no\"."""
    template = """
    You are a helpful assistant to judge if the human input is a stock related question.
    If the human input is a stock related question, return "yes". Otherwise return "no".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related

def get_company_name(human_input):
    template = """You are a helpful assistant who extracts the company name from the human input. Please only output the company name."""
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )

    company_name=llm_chain({"text":human_input})['text'].strip()
    return company_name
    
def semantic_search_and_check(human_input, k=10,with_post_filter=True):

    company_name=get_company_name(human_input)
    
    search_vector = bedrock_embeddings.embed_query(human_input)

    # Build the k-NN query once; the post filter is the only difference between the two modes
    search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        }
    }
    if with_post_filter:
        # Keep only hits whose company_name matches the company extracted from the query
        search_query["post_filter"] = {
            "match": {
                "company_name":company_name
            }
        }
    
    res = aos_client.search(index=index_name, 
                       body=search_query,
                       stored_fields=["company_name","item_category","item_content"])
    
    query_result=[]
    for hit in res['hits']['hits']:
        hit_company=hit['fields']['company_name'][0]
        print("\nsemantic search hit company: " + hit_company)
        row=[hit['fields']['company_name'][0], hit['fields']['item_content'][0]]
        query_result.append(row)

    query_result_df = pd.DataFrame(data=query_result,columns=["company_name","company_financial_statements"])
    return query_result_df

def search_for_similiar_content_in_10k_filing(human_input):
    company_statements = semantic_search_and_check(human_input)
    return company_statements

def search_financial_statements_for_company(company_financial_statements_query):
    company_statements = semantic_search_and_check(company_financial_statements_query)
    return company_statements

def get_stock_ticker(human_input):
    company_name=get_company_name(human_input)
    company_ticker = query_stock_ticker(company_name)
    return company_ticker

def get_stock_price(stock_ticker):
    stock_price = query_stock_price(stock_ticker)
    return stock_price

def download_10k_filing_from_sec_and_ingest_into_opensearch(stock_ticker):
    result = "download failed."
    try:
        #downloaded_file=get_filings(stock_ticker)
        downloaded_file="000101872424000008-amzn-20231231.json"
        ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file)
        time.sleep(60) #wait the data can be searchable
        result="download succeeded."
    except Exception as e:
        # Surface the error instead of silently swallowing it
        print(e)
        result = "download failed."
    return result

### Note

Uncomment the line `downloaded_file=get_filings(stock_ticker)` if you have a sec-api key, so that the 10-K is downloaded from SEC. At the same time, comment out the line `downloaded_file="000101872424000008-amzn-20231231.json"`.



---
![OpenSearch KNN Filter](./static/opensearch-knn-filter.png)

#### Define tools for financial statements analysis AI agent

In [None]:
from langchain.agents import Tool

annual_report_tools=[
    Tool(
        name="is_financial_statement_related_query",
        func=is_financial_statement_related_query,
        description="""
        Use this tool when you need to know whether the user input query is a financial statement analysis related query. The original human query is the input to this tool. This tool output is whether the human input is financial statement analysis related or not.
        If the query is not financial statement related, please answer \"I am a financial statement analysis assistant. I cannot answer questions that are not finance related.\" and terminate the dialog.
        """
    ),
    Tool(
        name="search_financial_statements_for_company",
        func=search_financial_statements_for_company,
        description="""
        Use this tool to get financial statement of the company. This tool output is company financial statements.
        """
    ),
    Tool(
        name="get_stock_ticker",
        func=get_stock_ticker,
        description="Use this tool when you need to get the company stock ticker. The original human query is the input to this tool. This tool will output the company stock ticker."
    ),
    Tool(
        name="download_10k_filing_from_sec_and_ingest_into_opensearch",
        func=download_10k_filing_from_sec_and_ingest_into_opensearch,
        description="""
        Use this tool to download company financial statements from the internet. The company stock ticker is the input to this tool. The tool output states whether the download succeeded or failed.
        Use this tool if and only if "search_financial_statements_for_company" output result is empty. After downloading financial statements, you must use "search_financial_statements_for_company" tool to search financial statements again.
        """
    ),
    Tool(
        name="is_stock_related_query",
        func=is_stock_related_query,
        description="Use this tool when you need to know whether the user input query is a stock related query. The original human query is the input to this tool. This tool output is whether the human input is stock related or not."
    ),
    Tool(
        name="get_stock_price",
        func=get_stock_price,
        description="""
        Use this tool to get company stock price data. The company stock ticker is the input to this tool. This tool will output the company's historic stock price. The output includes 'stock_date', 'stock_symbol', 'open_price', 'high_price', 'low_price', 'close_price' for up to 100 days of the company's price history.
        This tool is mandatory to use if the input query is both finance statement related and stock related. If the output of "get_stock_price" is empty, please answer \"I cannot provide stock analysis without stock price information.\" and terminate the dialog.
        """
    )
]

#### Define prompt for financial statements analysis AI agent 

In [None]:
from langchain_core.prompts import ChatPromptTemplate


system_message = f"""
You are a financial analyst assistant and you will analyze company financial statements and stock data.
Leverage the <conversation_history> to avoid duplicating work when answering questions.

Available tools:
<tools>
{{tools}}
</tools>


To answer, first review the <conversation_history>. If it is insufficient, use tool(s) with the following format:
<thinking>Think about which tool(s) to use and why. "get_stock_price" tool is mandatory to use if the input query is both finance statements related and stock related.</thinking>
<tool>tool_name</tool>
<tool_input>input</tool_input>
<observation>response</observation>

When you are done, provide a final answer in markdown within <final_answer></final_answer>.
If <user_input> is stock related and the output of "get_stock_price" tool is empty, respond directly within <final_answer> with the exact content \"I cannot provide stock analysis without stock price information.\".
Otherwise, use the following format to organize your <final_answer>:

Summary:
...

Support points:
Support point 1: ...
Support point 2: ...
Support point 3: ...


"""

user_message = """
Begin!

Previous conversation history:
<conversation_history>
{chat_history}
</conversation_history>

User input message:
<user_input>
{input}
</user_input>

{agent_scratchpad}
"""

# Construct the prompt from the messages
messages = [
    ("system", system_message),
    ("human", user_message),
]

financial_statements_analysis_prompt = ChatPromptTemplate.from_messages(messages)
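
One detail in the cell above is easy to miss: `system_message` is an f-string, so the doubled braces around `tools` collapse into single braces and leave a `{tools}` placeholder in the text for the prompt template to fill in later. The sketch below shows the effect in isolation; `str.format` stands in for the template's own rendering step, which is an assumption made to keep the example free of dependencies.

```python
# An f-string collapses doubled braces into single ones, so the text keeps a
# literal {tools} placeholder that can be filled in later.
system_text = f"""
<tools>
{{tools}}
</tools>
"""
print(system_text)

# str.format stands in here for the prompt template's rendering step (assumption)
rendered = system_text.format(tools="get_stock_price: returns price history")
print(rendered)
```

By contrast, `user_message` is a plain string, which is why its placeholders are written with single braces.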

#### Define memory for financial statements analysis AI agent 

In [None]:
from langchain_community.chat_message_histories import DynamoDBChatMessageHistory
from uuid import uuid4

dynamo = boto3.client('dynamodb')

history_table_name = 'conversation-history-memory'

try:
    response = dynamo.describe_table(TableName=history_table_name)
    print("The table "+history_table_name+" exists")
except dynamo.exceptions.ResourceNotFoundException:
    print("The table "+history_table_name+" does not exist")
    
    dynamo.create_table(
    TableName=history_table_name,
    AttributeDefinitions=[
        {
            'AttributeName': 'SessionId',
            'AttributeType': 'S',
        }
    ],
    KeySchema=[
        {
            'AttributeName': 'SessionId',
            'KeyType': 'HASH',
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5,
    }
    )

    response = dynamo.describe_table(TableName=history_table_name) 
    
    while response["Table"]["TableStatus"] == 'CREATING':
        time.sleep(1)
        print('.', end='')
        response = dynamo.describe_table(TableName=history_table_name) 

    print("\ndynamo DB Table, '"+response['Table']['TableName']+"' is created")



#### Create financial statements analysis AI agent using the defined memory, LLM, tools and prompt

In [None]:
from langchain.agents import create_xml_agent
from langchain.memory import ConversationBufferMemory
from langchain.agents import AgentExecutor


def create_new_memory_with_session(session_id):
    chat_memory = DynamoDBChatMessageHistory(table_name=history_table_name,session_id=session_id)    
    return chat_memory

def get_agentic_chatbot_conversation_chain(session_id, verbose=True):
    chat_memory=create_new_memory_with_session(session_id)
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        # Change the human_prefix from Human to something else
        # to not conflict with Human keyword in Anthropic Claude model.
        human_prefix="Hu",
        chat_memory=chat_memory,
        return_messages=False)

    agent = create_xml_agent(
        bedrock_llm,
        annual_report_tools,
        financial_statements_analysis_prompt,
        stop_sequence=["</tool_input>", "</final_answer>"]
    )

    agent_chain = AgentExecutor(
        agent=agent,
        tools=annual_report_tools,
        return_intermediate_steps=False,
        verbose=verbose,
        memory=memory,
        handle_parsing_errors="Check your output and make sure it conforms!"
    )
    return agent_chain

### 3.3 Use financial statements analysis AI agent

#### Example 1

Query is "Per Microsoft financial statements, what Microsoft's research and development organization is responsible for?". 

The data flow is as follows:

![example 1](./static/example-1-data-flow.png)


In [None]:
import warnings


langchain.debug=False
warnings.filterwarnings("ignore")

session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Per Microsoft financial statements, what Microsoft's research and development organization is responsible for?"})

In [None]:
print(response["output"])

#### Example 2

Query is "Is Microsoft a good investment choice right now?". 

The data flow is as follows:

![example 2](./static/example-2-data-flow.png)

In [None]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Is Microsoft a good investment choice right now?"})


In [None]:
print(response["output"])

#### Example 3

Query is "Compare Oracle and Microsoft company financial statements"

The data flow is as follows:

![example 3](./static/example-3-data-flow.png)

In [None]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Compare Oracle and Microsoft company financial statements"})

In [None]:
print(response["output"])

#### Example 4

Query is "Is Amazon a good investment choice right now?"

The data flow is as follows:

![example 4](./static/example-4-data-flow.png)

In [None]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "Is Amazon a good investment choice right now?"})

In [None]:
print(response["output"])

#### Example 5

Query is "What is OpenSearch?"

The data flow is as follows:

![example 5](./static/example-5-data-flow.png)

In [None]:
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": "What is OpenSearch?"})

In [None]:
print(response["output"])