# Generative AI-powered search with Amazon OpenSearch Service

---
### Use case
Form 10-K is a comprehensive report filed annually by a publicly traded company about its financial performance and is required by the U.S. Securities and Exchange Commission (SEC). Some of the information a company is required to document in the 10-K includes its history, organizational structure, financial statements, earnings per share, subsidiaries, executive compensation, and any other relevant data.

The SEC mandates that all public companies file regular 10-Ks to keep investors aware of a company's financial condition and to give them enough information before they buy or sell securities issued by that company. The 10-K can appear overly complex at first glance, complete with tables full of data and figures. However, because it is so comprehensive, the filing is critical for investors who want to understand a company's financial position and prospects.

The Form 10-K is organized into numbered items:

- 1 - Business (describes the company's operations)
- 1A - Risk Factors
- 1B - Unresolved Staff Comments
- 2 - Properties
- 3 - Legal Proceedings
- 4 - Mine Safety Disclosures
- 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
- 6 - Selected Financial Data (prior to February 2021)
- 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
- 7A - Quantitative and Qualitative Disclosures about Market Risk
- 8 - Financial Statements and Supplementary Data
- 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
- 9A - Controls and Procedures
- 9B - Other Information
- 10 - Directors, Executive Officers and Corporate Governance
- 11 - Executive Compensation
- 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
- 13 - Certain Relationships and Related Transactions, and Director Independence
- 14 - Principal Accountant Fees and Services
- 15 - Exhibits and Financial Statement Schedules

---

Many investors rely on SEC filings to analyze the financial health of a company, and the filings can be a treasure trove of valuable information. However, keyword-based search may return irrelevant information, and even with semantic search the volume of information can be overwhelming. Can we leverage generative AI to help interpret company financial statements?


In this code talk session, we will show you how to modernize your search application to improve search relevance with Amazon OpenSearch Service while leveraging generative AI to improve search productivity. The code covers the following topics:
- A comparison of search relevance between keyword search and semantic search with Amazon OpenSearch Service.
- How to leverage Retrieval Augmented Generation (RAG) to improve search productivity.
- How to build an intelligent agent that orchestrates and executes multistep tasks to automate 10-K filing analysis.
- OpenSearch vector store best practices.

---


### Code Structure


The code includes the following sections:
- [Initialize](#Initialize)
- [Part 1: Ingest unstructured data into OpenSearch](#Part-1:-Ingest-unstructured-data-into-OpenSearch)
- [Part 2: Different approaches to search](#Part-2:-Different-appoach-to-search)
    - [2.1 Keyword search](#2.1-Keyword-search)
    - [2.2 Semantic/Vector search](#2.2-Semantic/Vector-search)
    - [2.3 Retrieval Augmented Generation(RAG)](#2.3-Retrieval-Augmented-Generation(RAG))
- [Part 3: AI agent powered search](#Part-3:-AI-agent-powered-search)
    - [3.1 Prepare other tools used by AI agent](#3.1-Prepare-other-tools-used-by-AI-agent)
        - [3.1.1 Ingest and query structured data in Redshift](#3.1.1-Ingest-and-query-structured-data-in-Redshift)
        - [3.1.2 Download 10-K filing from SEC](#3.1.2-Download-10-K-filing-from-SEC)
    - [3.2 Create AI agent](#3.2-Create-AI-agent)
    - [3.3 Use AI agent](#3.3-Use-AI-agent)


## Initialize




### Install the Python dependencies for OpenSearch, Redshift, and LangChain

In [1]:
%pip install opensearch-py
%pip install torch
%pip install requests-aws4auth
%pip install boto3
%pip install sqlalchemy
%pip install sqlalchemy-redshift
%pip install redshift_connector
%pip install ipython-sql==0.4.1
%pip install langchain==0.3.1
%pip install langchain-aws==0.2.1
%pip install langchain-community==0.3.1


Note: you may need to restart the kernel to use updated packages.


### Import library



In [2]:
import boto3
import re
import time
import sagemaker
from sagemaker.session import Session
import pandas as pd
import os
import uuid
import json

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)



## Part 1: Ingest unstructured data into OpenSearch

### Get SEC 10-K form files

Let's download the dataset and unzip it.

In [None]:
!wget https://ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com/df655552-1e61-4a6b-9dc4-c03eb94c6f75/10k-financial-filing.zip
!unzip 10k-financial-filing.zip

Read the dataset in JSON format and construct a pandas DataFrame.
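
The cell for this step is not shown here, so the following is only a minimal sketch of one way to do it. It assumes the archive unpacks into a directory named `extracted` (the path used by the ingestion cell later on) and that each file holds one JSON document; `load_filing_records` is a hypothetical helper name, not part of the workshop code.

```python
import json
import os


def load_filing_records(directory_path):
    """Read each file in directory_path as one JSON document."""
    records = []
    for filename in sorted(os.listdir(directory_path)):
        file_path = os.path.join(directory_path, filename)
        # Skip sub-directories; every regular file is assumed to be JSON
        if not os.path.isfile(file_path):
            continue
        with open(file_path, "r", encoding="utf-8") as file:
            records.append(json.load(file))
    return records


# Hypothetical usage in the notebook (pandas is imported as pd above):
# filings_df = pd.DataFrame(load_filing_records("extracted"))
# filings_df[["company", "filing_date", "state_location"]].head()
```

Returning plain dictionaries keeps the helper independent of pandas, so the same records can be passed to `pd.DataFrame` for inspection or reused for indexing.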

### Load the data to OpenSearch
OpenSearch handles dynamic type inference well and can perform full-text search with fuzziness and typo tolerance. Let's fetch the Amazon OpenSearch Serverless (AOSS) endpoint from the outputs of the deployed CloudFormation stack.

In [3]:
cfn = boto3.client('cloudformation')
kms = boto3.client('secretsmanager')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "generative-ai-powered-search"

outputs = get_cfn_outputs(cloudformation_stack_name)

aoss_endpoint = outputs['OpenSearchServerlessCollectionEndpoint']

aoss_host = aoss_endpoint.split("//")[1]

outputs


{'RedshiftClusterSecurityGroupName': 'sg-0267ba9bb5badfc63',
 'RedshiftServerlessWorkroup': 'workgroup-8210c140',
 's3BucketStock': 'generative-ai-powered-search-s3bucketstock-i34dvcmdobut',
 'SageMakerNotebookURL': 'https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/notebook-instances/openNotebook/generative-ai-powered-search?view=classic',
 'Workgroupname': 'workgroup-8210c140',
 'VPC': 'vpc-0e3a85ced309f52eb',
 'RedshiftRoleName': 'RedshiftServerlessImmersionRole',
 'RedshiftRoleNameArn': 'arn:aws:iam::649735563824:role/RedshiftServerlessImmersionRole',
 'NamespaceName': 'namespace-8210c140',
 'AdminUsername': 'awsuser',
 'RedshiftServerlessEndpoint': 'workgroup-8210c140.649735563824.us-east-1.redshift-serverless.amazonaws.com',
 'Region': 'us-east-1',
 'RedshiftServerlessSecret': 'arn:aws:secretsmanager:us-east-1:649735563824:secret:RedshiftServerlessSecret-N47JCZ',
 'AdminPassword': 'Awsuser123!',
 'OpenSearchServerlessCollectionEndpoint': 'https://tii7rfbr6upem93azm0

Let's create a client for OpenSearch Serverless; we will use it for the entire workshop.

In [4]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

service = "aoss"
aws_region = boto3.Session().region_name
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, aws_region, service)

aos_client = OpenSearch(
    hosts = [{"host": aoss_host, "port": 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

Let's create a new index named "10k_financial_raw". Because you will recreate this index with different techniques, delete any existing index with that name before creating it. The index is created empty; OpenSearch extends the index schema through dynamic type inference whenever it encounters a new attribute.

In [6]:
raw_index_name="10k_financial_raw"

exist=False
try:
    aos_client.indices.get(index=raw_index_name)
    exist=True
except Exception as e:
    exist=False

if exist:
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=raw_index_name)
else:
    print("index does not exist.")
    
aos_client.indices.create(index=raw_index_name,ignore=400)

delete existing index before creating new one


{'acknowledged': True,
 'shards_acknowledged': True,
 'index': '10k_financial_raw'}

Now you can load all the financial reports to the index you just created.

In [7]:
from opensearchpy import helpers
# Set the directory path
directory_path =  "extracted"
batch_size = 50
# Initialize a list to store the documents
documents = []

# Iterate through the files in the directory
for filename in os.listdir(directory_path):
    file_path = os.path.join(directory_path, filename)

    # Read the file contents
    with open(file_path, 'r') as file:
        file_contents = file.read()
    docJson = json.loads(file_contents)
    docJson["_index"] = raw_index_name
    documents.append(docJson)
    # If the batch size is reached, index the documents
    if len(documents) == batch_size:
        aos_response= helpers.bulk(aos_client, documents)
        print(f"Indexed {len(documents)} documents.")
        documents = []

# Index the remaining documents
if documents:
    aos_response= helpers.bulk(aos_client, documents)
    print(f"Indexed {len(documents)} documents.")

Indexed 50 documents.
Indexed 50 documents.
Indexed 50 documents.
Indexed 41 documents.


You can check the total number of documents indexed into the OpenSearch index.

In [33]:
raw_index_name="10k_financial_raw"
res = aos_client.search(index=raw_index_name, body={"size": 0, "query": { "match_all": {}}})

print("Records found: %d." % res['hits']['total']['value'])

Records found: 191.
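
Now that documents are loaded, it can be useful to see what dynamic type inference produced. The sketch below is one possible way to summarize a mapping response; `summarize_mapping` is a hypothetical helper, and `sample_response` only illustrates the general response shape (it is not output captured from this workshop). String fields are dynamically mapped as `text` with a `keyword` sub-field by default, which is why the queries in Part 2 can filter on `state_location.keyword`.

```python
def summarize_mapping(mapping_response, index_name):
    """Return {field: type} for top-level fields and their sub-fields."""
    properties = mapping_response[index_name]["mappings"].get("properties", {})
    summary = {}
    for field, spec in properties.items():
        # Fields without an explicit type are nested objects
        summary[field] = spec.get("type", "object")
        for sub_name, sub_spec in spec.get("fields", {}).items():
            summary[f"{field}.{sub_name}"] = sub_spec.get("type")
    return summary


# Illustrative response shape only, not real workshop output
sample_response = {
    "10k_financial_raw": {
        "mappings": {
            "properties": {
                "state_location": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                },
                "filing_date": {"type": "date"},
            }
        }
    }
}
print(summarize_mapping(sample_response, "10k_financial_raw"))

# Against the live collection:
# summarize_mapping(aos_client.indices.get_mapping(index=raw_index_name), raw_index_name)
```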


## Part 2: Different appoach to search

### 2.1 Keyword search
---
Keyword search means finding information in a large body of text using terms or words, called the "query". It uses tokenization to split the text into terms and scores results using term statistics, such as how many times a word occurs in the document, how common it is across the entire corpus, and proximity. With these data structures, OpenSearch can also handle fuzziness and tolerate typos, which lets users search with approximately spelled terms when they do not know the exact spelling (for example, of scientific names). It works great with exact matches.
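
To make the role of term statistics concrete, here is a simplified sketch of BM25, the default similarity in OpenSearch. It is a textbook version for illustration, not Lucene's actual implementation: the whitespace tokenizer stands in for a real analyzer, the parameter values `k1=1.2` and `b=0.75` are the commonly quoted defaults, and the three short documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_count = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / doc_count
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Terms that are rare across the corpus get a higher weight
            idf = math.log(1 + (doc_count - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "travel restrictions delayed travel for our employees",
    "our platform offers secure storage",
    "our customers include travel and retail companies",
]
print(bm25_scores("travel", docs))
```

The first document mentions "travel" twice and scores highest, the third mentions it once, and the second does not match at all. Note that every query term contributes to the score, including very common words.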

Let's search for companies in the state of Illinois.

In [13]:
query = {
    "_source" : ["company", "filing_date", "state_location"],
    "query": {
        "bool" :{
            "filter" : [{
                    "match" :{
                        "state_location.keyword" : "IL"
                    }
                }]
        }
    }
}

In [14]:
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Unnamed: 0,_id,_score,company,filing_date,state_location
0,1%3A0%3AtiwoKZMBZ23FzQVe241Y,0.0,"ENVESTNET, INC.",2021-02-26,IL
1,1%3A0%3Ad9YoKZMB-aPiNMN84ITa,0.0,OneSpan Inc.,2021-02-25,IL
2,1%3A0%3AjdYoKZMB-aPiNMN84ITa,0.0,"Sprout Social, Inc.",2021-02-24,IL
3,1%3A0%3AxCwoKZMBZ23FzQVe5o1A,0.0,Paylocity Holding Corp,2021-08-06,IL
4,1%3A0%3A1iwoKZMBZ23FzQVe5o1B,0.0,"CDK Global, Inc.",2021-08-18,IL
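
The typo tolerance described at the start of this section can be tried with the `fuzziness` option of a `match` query. The following is a sketch under assumptions: `build_fuzzy_match_query` is a hypothetical helper, the misspelled search term is made up, and the results it would return on this dataset have not been verified.

```python
def build_fuzzy_match_query(field, text, fuzziness="AUTO", source_fields=None):
    """Build a match query that tolerates small spelling differences."""
    query = {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # "AUTO" picks the allowed edit distance from the term length
                    "fuzziness": fuzziness,
                }
            }
        }
    }
    if source_fields:
        query["_source"] = source_fields
    return query


# "storge" is a deliberate misspelling of "storage"
fuzzy_query = build_fuzzy_match_query(
    "item_1", "storge", source_fields=["company", "state_location"]
)
print(fuzzy_query)

# Hypothetical usage against the index created earlier:
# res = aos_client.search(index=raw_index_name, body=fuzzy_query)
```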


You can also search across fields and highlight why each document matched.

In [15]:
query = {
    "_source" : ["company", "filing_date", "state_location", "item_1"],
    "query": {
        "bool": {
            "must" :[
                {
                    "multi_match" :{
                        "query" : "travel",
                        "fields" :["item_1"]
                    }
                }
            ]
        }
    },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}

In [16]:
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
print (f"Total documents found: {res['hits']['total']['value']}")
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location'],hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location", "item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Total documents found: 51


Unnamed: 0,_id,_score,company,filing_date,state_location,item_1_highlight
0,1%3A0%3AkdYoKZMB-aPiNMN84ITa,2.815863,EBIX INC,2021-04-27,GA,"encompasses leadership in the areas of domestic & international money remittance, foreign exchange (Forex), <em>travel</em>"
1,1%3A0%3AkNYoKZMB-aPiNMN84ITa,2.639751,SVMK Inc.,2021-02-18,CA,"COVID-19 pandemic, we have modified certain aspects of our business, including restricting employee <em>travel</em>"
2,1%3A0%3AlywoKZMBZ23FzQVe241Y,2.247283,"PROS Holdings, Inc.",2021-02-12,TX,Airline Revenue Optimization\nPROS revenue optimization solutions enable enterprises in the <em>travel</em> industry
3,1%3A0%3AviwoKZMBZ23FzQVe5o1A,2.201491,LIVEPERSON INC,2021-03-08,NY,"to focus primarily on key target markets: consumer/retail, telecommunications, financial services, <em>travel</em>"
4,1%3A0%3AmiwoKZMBZ23FzQVe241Y,2.183523,Coupa Software Inc,2021-03-18,CA,"Coupa also provides additional <em>travel</em> management capabilities, such as <em>travel</em> price assurance that helps"
5,1%3A0%3AWNYoKZMB-aPiNMN804S2,2.138577,"FireEye, Inc.",2021-02-26,CA,"various local, state and federal government public health orders, facility and business closures, and <em>travel</em>"
6,1%3A0%3AQdYoKZMB-aPiNMN804S2,2.058728,EVOLVING SYSTEMS INC,2021-03-17,CO,The inability to <em>travel</em> has delayed interactions with our clients on projects and in the traditional
7,1%3A0%3AftYoKZMB-aPiNMN84ITa,2.030981,"EVERBRIDGE, INC.",2021-02-26,MA,".\n•\nIntegration of physical security data with location awareness data gathered from <em>travel</em>, network"
8,1%3A0%3AWdYoKZMB-aPiNMN804S2,1.955579,IMAGEWARE SYSTEMS INC,2021-04-05,CA,"are highly uncertain and cannot be predicted with confidence, such as the duration of the outbreak, <em>travel</em>"
9,1%3A0%3AY9YoKZMB-aPiNMN804S2,1.922584,AUDIOEYE INC,2021-03-11,AZ,"Our typical market sectors include, but are not limited to:\n· Finance and banking institutions;\n· <em>Travel</em>"


You can also perform a phrase match, where the words must appear together, and combine it with additional filters, such as restricting results to companies located in California, as shown below.

In [17]:
query = {
    "_source" : ["company", "filing_date", "state_location", "item_1"],
    "query": {
        "bool": {
            "must" :[
                {
                    "match_phrase" :{
                        "item_1" : "Storage"
                    }
                }
            ],
            "filter" :[
                {
                    "term": {
                        "state_location.keyword" : "CA"
                    }
                }
            ]
        }
    },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}

In [18]:
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
print (f"Total documents found: {res['hits']['total']['value']}")
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location'],hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location", "item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Total documents found: 32


Unnamed: 0,_id,_score,company,filing_date,state_location,item_1_highlight
0,1%3A0%3AP9YoKZMB-aPiNMN804S2,1.91214,"Nutanix, Inc.",2021-09-21,CA,•<em>Storage</em> Capabilities.
1,1%3A0%3ArSwoKZMBZ23FzQVe241Y,1.624939,PROOFPOINT INC,2021-02-19,CA,"The key components of our security-as-a-service platform, including services for secure <em>storage</em>, content"
2,1%3A0%3AbNYoKZMB-aPiNMN804S2,1.589684,"DROPBOX, INC.",2021-02-19,CA,"With Smart Sync, users can access all of their content natively on their computers without taking up <em>storage</em>"
3,1%3A0%3ApiwoKZMBZ23FzQVe241Y,1.508496,JFrog Ltd,2021-02-12,CA,Our platform supports a wide variety of enterprise-scale <em>storage</em> capabilities and also accommodates spikes
4,1%3A0%3AhiwoKZMBZ23FzQVe241Y,1.47256,"Cloudera, Inc.",2021-03-25,CA,"provides better analytic experiences for users, is easier to manage and optimizes data center server and <em>storage</em>"
5,1%3A0%3AXtYoKZMB-aPiNMN804S2,1.462097,"QUALYS, INC.",2021-02-22,CA,"handles, registries, and network connections, and uploads the data to the Qualys Cloud Platform for <em>storage</em>"
6,1%3A0%3A1ywoKZMBZ23FzQVe5o1B,1.414713,"DOCUSIGN, INC.",2021-03-31,CA,"our systems and processes also exceed industry practices for data protection, transmission and secure <em>storage</em>-including"
7,1%3A0%3Ac9YoKZMB-aPiNMN84ITa,1.344486,"Anaplan, Inc.",2021-03-12,CA,"Powered by our proprietary Hyperblock® technology, our platform’s in-memory data <em>storage</em> and calculation"
8,1%3A0%3ASdYoKZMB-aPiNMN804S2,1.335236,"Veritone, Inc.",2021-03-05,CA,"increasing at an annual growth rate of 30-60% per year, according to Gartner (2020 Strategic Roadmap for <em>Storage</em>"
9,1%3A0%3AkiwoKZMBZ23FzQVe241Y,1.292474,"Cloudflare, Inc.",2021-02-25,CA,This has opened up an entirely new market for us: <em>storage</em> and compute.
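
The query above passes a single word to `match_phrase`, so it behaves much like a plain match. With a multi-word phrase, the words must appear next to each other and in order unless a `slop` is allowed. The sketch below shows one way to build such a query; `build_phrase_query` is a hypothetical helper and the phrase "data storage" is only an example whose results on this dataset have not been verified.

```python
def build_phrase_query(field, phrase, state=None, slop=0):
    """Build a bool query with a phrase match and an optional state filter."""
    bool_query = {
        # slop is the number of positions the words may be moved apart
        "must": [{"match_phrase": {field: {"query": phrase, "slop": slop}}}]
    }
    if state is not None:
        # Filters narrow the result set without contributing to the score
        bool_query["filter"] = [{"term": {"state_location.keyword": state}}]
    return {"query": {"bool": bool_query}}


phrase_query = build_phrase_query("item_1", "data storage", state="CA", slop=1)
print(phrase_query)

# Hypothetical usage:
# res = aos_client.search(index=raw_index_name, body=phrase_query)
```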


Full-text search works well on structured data and supports fuzziness, typo tolerance, proximity searching, and highlighting. However, when it comes to natural-language questions, pure keyword search can return less relevant results and a long tail of noise.

Let's run a natural-language query and check the search results. Some irrelevant documents are returned.

In [22]:
query_text="What are the operating expenses of Adobe?"
query={
  "size": 10,
  "query": {
    "match": {
      "item_1": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_*" : {}
    }
  }
}
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['item_1'], hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","item_1","item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

Unnamed: 0,_id,_score,company,item_1,item_1_highlight
0,1%3A0%3ATdYoKZMB-aPiNMN804S2,6.423738,"Medallia, Inc.","Item 1. Business.\nOverview\nMedallia, Inc. was founded in 2001 to help the world’s largest companies understand and improve customer experiences at scale. In doing so, we created a new category of enterprise software, experience management.\nOur SaaS (software-as-a-service) platform, the Medallia Experience Cloud, is built on modern technology and open architecture, utilizing artificial intelligence (AI) and machine learning to analyze massive amounts of data. We capture experience data fro...","•In addition to integrations with <em>Adobe</em>, Salesforce, ServiceNow, and many more, we have hundreds <em>of</em> connectors"
1,1%3A0%3AhtYoKZMB-aPiNMN84ITa,6.026036,ADOBE INC.,"ITEM 1. BUSINESS\nFounded in 1982, Adobe Inc. is one of the largest and most diversified software companies in the world. We offer a line of products and services used by creative professionals including photographers, video editors, designers and developers; communicators including content creators, students, marketers and knowledge workers; businesses of all sizes; and consumers for creating, managing, delivering, measuring, optimizing, engaging and transacting with compelling content and ...","This is <em>the</em> core <em>of</em> <em>what</em> we have delivered for decades, and we have evolved our business model to provide"
2,1%3A0%3AiCwoKZMBZ23FzQVe241Y,5.856948,iSign Solutions Inc.,"Item 1. Business\nGeneral\niSign Solutions Inc. (the “Company” or “iSign”), was incorporated in Delaware in October 1986. iSign is a leading supplier of digital transaction management (DTM) software enabling the paperless, secure and cost-effective management and authentication of document-based transactions. iSign’s solutions encompass a wide array of functionality and services, including electronic signatures, simple-to-complex workflow management and various options for biometric authenti...","For <em>the</em> year ended December 31, 2020, <em>operating</em> <em>expenses</em> were $1,619, a decrease <em>of</em> $8, or 0.5%, compared"
3,1%3A0%3ATtYoKZMB-aPiNMN804S2,5.564781,"Verb Technology Company, Inc.","ITEM 1. BUSINESS\nOverview\nWe are a Software-as-a-Service (“SaaS”) applications platform developer. Our platform is comprised of a suite of interactive, video-based sales enablement business software products marketed on a subscription basis. Our applications, available in both mobile and desktop versions, are offered as a fully integrated suite, as well as on a standalone basis, and include verbCRM, our white-labelled Customer Relationship Management (“CRM”) application for large, sales-ba...","viewers watched <em>the</em> video, how many times they watched it, and <em>what</em> they clicked on, in addition to"
4,1%3A0%3AtSwoKZMBZ23FzQVe241Y,5.251533,MARIN SOFTWARE INC,"ITEM 1.\nBUSINESS\nWe are a leading provider of digital marketing software for search, social and eCommerce advertising channels, offered as a unified software-as-a-service, or SaaS, advertising management platform for performance-driven advertisers and agencies. Our platform is an analytics, workflow and optimization solution for marketing professionals, allowing them to effectively manage their digital advertising spend. We market and sell our solutions to advertisers directly and through ...",Our platforms <em>are</em> comprised <em>of</em> <em>the</em> following modules:\n•\nOptimization.
5,1%3A0%3AktYoKZMB-aPiNMN84ITa,4.898939,BOX INC,"Item 1. BUSINESS\nOverview\nBox is the Content Cloud: one secure, cloud-native platform for managing the entire content journey. Content - from blueprints to wireframes, videos to documents, proprietary formats to PDFs - is the source of an organization’s unique value. Our cloud content management platform enables our customers, including 67% of the Fortune 500, to securely manage the entire content lifecycle, from the moment a file is created or ingested to when it’s shared, edited, publish...","Sales and marketing <em>expenses</em> were $275.7 million, $317.6 million and $312.2 million for <em>the</em> years ended"
6,1%3A0%3AnNYoKZMB-aPiNMN84ITb,4.735422,BSQUARE CORP /WA,"Item 1.\nBusiness.\nOverview\nBsquare is a software and services company that designs, configures, and deploys technologies that solve difficult problems for manufacturers and operators of connected devices. Our customers choose Bsquare to help realize the promise of the Internet of Things (IoT) to transform their businesses. Our products include software that connect devices to create intelligent systems that are cloud-enabled, contribute critical data, and facilitate distributed control an...","We <em>are</em> also authorized to sell Windows IoT <em>operating</em> systems in Canada, <em>the</em> United States, Argentina,"
7,1%3A0%3A1ywoKZMBZ23FzQVe5o1B,4.367266,"DOCUSIGN, INC.","ITEM 1. BUSINESS\nOverview\nDocuSign helps organizations do business faster with less risk, lower costs, and better experiences for customers and employees. We accomplish this by transforming the foundational element of business: the agreement.\nAgreements are everywhere. In the regular course of doing business, organizations sign contracts, offer letters, and hundreds of other types of agreements with customers, employees, and business partners. This is true for every size of organization, ...","In addition to <em>what</em> we do, we believe we <em>are</em> distinguished by how we do it:\n•Stringent security standards"
8,1%3A0%3AYNYoKZMB-aPiNMN804S2,4.338048,"BigCommerce Holdings, Inc.","Item 1. Business.\nOverview\nBigCommerce is leading a new era of ecommerce. Our software-as-a-service (“SaaS”) platform simplifies the creation of beautiful, engaging online stores by delivering a unique combination of ease-of-use, enterprise functionality, and flexibility. We power both our customers’ branded ecommerce stores and their cross-channel connections to popular online marketplaces, social networks, and offline point-of-sale (“POS”) systems.\nBigCommerce empowers businesses to tur...","Businesses must address <em>the</em> breadth <em>of</em> touch points influencing <em>what</em> and where shoppers buy, including"
9,1%3A0%3AidYoKZMB-aPiNMN84ITa,4.265956,"Vertex, Inc.","Item 1. Business\nOverview\nVertex is a leading provider of enterprise tax technology solutions. Our vision is to accelerate global commerce, one transaction at a time. Companies with complex tax operations rely on Vertex to automate their end-to-end indirect tax processes. Our software, content and services address the increasing complexities of global commerce and compliance by reducing friction, enhancing transparency, and enabling greater confidence in meeting indirect tax obligations. A...","<em>The</em> majority <em>of</em> our integrations <em>are</em> designed, tested and supported by us."


Run another query and check the search results.

In [23]:
# lets execute another query example
query_text="What is Adobe's main revenue source?"
query={
  "size": 10,
  "query": {
    "match": {
      "item_1": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['item_1'], hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","item_1","item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

# you can notice some irrelevant results

Unnamed: 0,_id,_score,company,item_1,item_1_highlight
0,1%3A0%3AhtYoKZMB-aPiNMN84ITa,5.409208,ADOBE INC.,"ITEM 1. BUSINESS\nFounded in 1982, Adobe Inc. is one of the largest and most diversified software companies in the world. We offer a line of products and services used by creative professionals including photographers, video editors, designers and developers; communicators including content creators, students, marketers and knowledge workers; businesses of all sizes; and consumers for creating, managing, delivering, measuring, optimizing, engaging and transacting with compelling content and ...","This <em>is</em> the core of <em>what</em> we have delivered for decades, and we have evolved our business model to provide"
1,1%3A0%3AaNYoKZMB-aPiNMN804S2,3.595602,"Sumo Logic, Inc.","Item 1. Business\nOverview\nSumo Logic empowers organizations to close the intelligence gap.\nSumo Logic is the pioneer of Continuous Intelligence, a new category of software, which enables organizations of all sizes to address the challenges and opportunities presented by digital transformation, modern applications, and cloud computing. Our Continuous Intelligence Platform enables organizations to automate the collection, ingestion, and analysis of application, infrastructure, security, and...",Organizations can succeed or fail based on how well they understand and respond to <em>what</em> <em>is</em> happening
2,1%3A0%3A3iwoKZMBZ23FzQVe5o1B,3.49186,BRIDGEWAY NATIONAL CORP.,"ITEM 1. BUSINESS\nGeneral\nThe Company is structured as a holding company with a business strategy focused on owning subsidiaries engaged in a number of diverse business activities. We are not a “blank check company” as defined in Rule 419 under the Securities Act of 1933, as amended (the “Securities Act”). We conduct and plan to continue to conduct our activities in such a manner as not to be deemed an investment company under the Investment Company Act of 1940, as amended (the “Investment ...",We will seek to focus on acquiring operating businesses and securities that (a) can be purchased at <em>what</em>
3,1%3A0%3AY9YoKZMB-aPiNMN804S2,3.356695,AUDIOEYE INC,"Item 1. Business\nOverview\nAudioEye is an industry-leading software solution provider delivering website accessibility compliance at all price points to businesses of all sizes. Our solutions advance accessibility with patented technology that reduces barriers, expands access for individuals with disabilities, and enhances the user experience for a broader audience. We believe that, when implemented, our solution offers businesses and organizations the opportunity to reach more customers, i...",AudioEye primarily generates <em>revenue</em> through the sale of subscriptions for our software-as-a-service
4,1%3A0%3AyCwoKZMBZ23FzQVe5o1A,3.35396,"SmartMetric, Inc.","Item 1. Business\nCorporate History and Overview\nSmartMetric, Inc. (“SmartMetric” or the “Company”) is a company that was incorporated pursuant to the laws of Nevada on December 18, 2002 and is engaged in the biometric technology manufacturing industry. SmartMetric has an issued patent covering technology that involves connection to networks using data cards (smart cards and EMV cards). In addition, SmartMetric holds the sole license to five issued patents covering features of its biometric...",SmartMetric’s <em>main</em> products are a fingerprint sensor activated payments card for use in the credit and
5,1%3A0%3AniwoKZMBZ23FzQVe241Y,3.309396,GTY Technology Holdings Inc.,"Item 1. Business.\nGTY Business Overview\nGTY is a software-as-a-service (“SaaS”) company that offers a cloud-based suite of solutions for the public sector in North America. GTY brings government technology companies together to achieve a new standard in citizen engagement and resource management. GTY solutions provide public sector organizations with the ability to communicate, engage, interact, conduct business, and transact with their constituents in procurement, payments, grants managem...","CityBase SaaS integrates its platform to underlying systems of record, billing, and other <em>source</em> systems"
6,1%3A0%3AuywoKZMBZ23FzQVe5o1A,3.297866,MICROSOFT CORP,"Item 1\nThe investments we make in sustainability carry through to our products, services, and devices. We design our devices, from Surface to Xbox, to minimize their impact on the environment. Our cloud and AI services and datacenters help businesses cut energy consumption, reduce physical footprints, and design sustainable products. We also pledged a $50 million investment in AI for Earth to accelerate innovation by putting AI in the hands of those working to directly address sustainabilit...",Office Commercial <em>revenue</em> <em>is</em> mainly affected by a combination of continued installed base growth and
7,1%3A0%3AuSwoKZMBZ23FzQVe5o1A,3.237124,ISSUER DIRECT CORP,"ITEM 1. DESCRIPTION OF BUSINESS.\nCompany Overview\nIssuer Direct Corporation and its subsidiaries are hereinafter collectively referred to as “Issuer Direct”, the “Company”, “We” or “Our” unless otherwise noted. Our corporate offices are located at One Glenwood Ave., Suite 1001, Raleigh, North Carolina, 27603.\nWe announce material financial information to our investors using our investor relations website, SEC filings, investor events, news and earnings releases, public conference calls, w...",In the past we have disclosed revenues in two <em>main</em> categories: (i) Platform and Technology and (ii) Services
8,1%3A0%3AqCwoKZMBZ23FzQVe241Y,3.192009,"Rapid7, Inc.","Item 1. Business\nOverview\nRapid7 is advancing security with visibility, analytics, and automation delivered through our Insight Platform. Our solutions simplify the complex, allowing security teams to work more effectively with IT and development to reduce vulnerabilities, monitor for misconfigurations and malicious behavior, investigate and shut down attacks, and automate routine tasks.\nIn the 20 years that Rapid7 has been in business, security companies and trends have come and gone, wh...","remediation projects, and pre-built automation workflows, the Insight Platform provides a granular view of <em>what</em>"
9,1%3A0%3AgdYoKZMB-aPiNMN84ITa,3.060556,GIVEMEPOWER CORP,ITEM 1.\nBUSINESS\nBusiness Overview\nGiveMePower Corporation operates and manages a portfolio of real estate and financial services assets and operations to empower black persons in the United States through financial tools and resources. Givemepower is primarily focused on: (1) creating and empowering local black businesses in urban America; and (2) creating real estate properties and businesses in opportunity zones and other distressed neighborhood across America. This Offering represents...,The Company’s <em>main</em> telephone number <em>is</em> (310) 895-1839.


Retrieval based on term statistics over a large text corpus produced these results.
![Keyword Search](./static/keyword-search-flow.png)

This is a good segue into the world of vector search, also known as semantic search.

### Initialize embedding model to vectorize text data

### 2.2 Semantic/Vector search

---
In vector search, documents and queries are represented as high-dimensional numerical vectors, rather than just as strings of text.

The key idea behind vector search is that semantically similar documents or queries can be mapped to vectors that are close to each other in the vector space, even if the textual content doesn't have much lexical overlap.

Here's a bit more detail on how vector search works:

- Documents are converted into numerical vector representations using machine learning models like word embeddings or document embeddings. These models learn to map words, phrases, or entire documents into a high-dimensional vector space.
- Queries are also converted into vector form, allowing them to be compared to the document vectors in this mathematical vector space.
- Rather than doing a simple keyword match, the search engine calculates the similarity between the query vector and each document vector, often using a metric like cosine similarity.
- The most similar document vectors are then returned as the search results, even if they don't contain the exact words from the query.
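
The steps above can be sketched in a few lines of plain Python. This is only an illustration of ranking by cosine similarity: the three-dimensional vectors and the names `toy_documents` and `toy_query` are made up for the example, and a real search engine would use an approximate nearest neighbor index rather than comparing the query against every document.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vector, document_vectors, top_k=2):
    """Return (doc_id, score) pairs for the top_k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vector, vec))
              for doc_id, vec in document_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional "embeddings"; real models emit far more dimensions.
toy_documents = {
    "operating_expenses": [0.9, 0.1, 0.0],
    "office_locations": [0.1, 0.9, 0.1],
    "executive_officers": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

top_matches = rank_by_similarity(toy_query, toy_documents)
print(top_matches)
```

The query vector points in nearly the same direction as the `operating_expenses` vector, so that document ranks first even though no words are compared at any point.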

This allows vector search to uncover semantically relevant content that would be missed by traditional text-based searches. It's especially useful for tasks like question answering, e-commerce search, and retrieval of similar documents or images.

The underlying machine learning models need to be trained on large datasets, but vector search can significantly improve the quality and relevance of search results compared to purely lexical approaches.


![Semantic Search](./static/semantic-search-flow.png)

---


![Semantic Search Architecture](./static/semantic-search-architecture.png)



**Embeddings** are numerical representations of data, typically used to convert complex, high-dimensional data into a lower-dimensional space where similar data points are closer together. In the context of natural language processing (NLP), embeddings are used to represent words, phrases, or sentences as vectors of real numbers. These vectors capture semantic relationships, meaning that words with similar meanings are represented by vectors that are close together in the embedding space.

**Embedding models** are machine learning models that are trained to create these numerical representations. They learn to encode various types of data into embeddings that capture the essential characteristics and relationships within the data. For example, in NLP, embedding models like Word2Vec, GloVe, and BERT are trained on large text corpora to produce word embeddings. These embeddings can then be used for various downstream tasks, such as text classification, sentiment analysis, or machine translation. In this notebook, we use an embedding model for semantic similarity search.

We use an embedding model to convert questions into vectors and use vector similarity to search for semantically similar 10-K data. The following diagram shows the flow:

<!-- ![Convert Text to Vector](./static/text2vector.png) -->

---

![opensearch vector store](./static/opensearch-vector-store.png)


In [24]:
# Specify the path to the folder containing the JSON files
folder_path = "extracted"

# Initialize an empty list to store list of company 10-K filing file names
company_filing_file_name_list = []

# For this session, we ingest the filings of only a few companies.
company_list=["Zoom Video Communications, Inc.",
              "MICROSTRATEGY Inc", 
              "PagerDuty, Inc", 
              "Unity Software Inc.", 
              "Autodesk, Inc.",
              "ADOBE INC.",
              "DOCUSIGN, INC.",
              "Okta, Inc.",
              "Datadog, Inc.",
              "INTUIT INC",
              "AUTOMATIC DATA PROCESSING INC",
              "SALESFORCE.COM, INC.", 
              "BOX INC",
              "Asana, Inc", 
              "Palantir Technologies Inc."
             ]

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        df = pd.DataFrame([pd.read_json(file_path,typ='series')])
#         print(df.iloc[0]['company'])
        if df.iloc[0]['company'] in company_list:
            company_filing_file_name_list.append(file_path)

company_filing_file_name_list

['extracted/8670_10K_2021_0000008670-21-000027.json',
 'extracted/769397_10K_2021_0000769397-21-000014.json',
 'extracted/1585521_10K_2021_0001585521-21-000048.json',
 'extracted/896878_10K_2021_0000896878-21-000233.json',
 'extracted/1050446_10K_2020_0001564590-21-005783.json',
 'extracted/1810806_10K_2020_0001810806-21-000052.json',
 'extracted/1660134_10K_2021_0001660134-21-000007.json',
 'extracted/1561550_10K_2020_0001564590-21-009770.json',
 'extracted/1108524_10K_2021_0001108524-22-000008.json',
 'extracted/796343_10K_2020_0000796343-21-000004.json',
 'extracted/1372612_10K_2021_0001564590-21-014377.json',
 'extracted/1321655_10K_2020_0001193125-21-060650.json',
 'extracted/1261333_10K_2021_0001261333-21-000059.json']

In [25]:
#from langchain.embeddings import BedrockEmbeddings
from langchain_aws import BedrockEmbeddings

aws_region = boto3.Session().region_name

boto3_bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=f"https://bedrock-runtime.{aws_region}.amazonaws.com")
bedrock_embeddings = BedrockEmbeddings(model_id='amazon.titan-embed-text-v2:0',client=boto3_bedrock)
#bedrock_embeddings = BedrockEmbeddings(model_id='cohere.embed-multilingual-v3',client=boto3_bedrock)

#### Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store this in "outputs" to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

### Create an index in Amazon OpenSearch Service collection

The OpenSearch k-NN plugin introduces a custom data type, the knn_vector, that allows users to ingest their k-NN vectors into an OpenSearch index and perform different kinds of k-NN search. 
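
For orientation, a `knn_vector` field can also name its approximate nearest neighbor method explicitly instead of relying on defaults. The sketch below is a hypothetical variant, not the definition this notebook uses: the name `explicit_knn_index`, the choice of the `faiss` engine with `l2` distance, and the `m` and `ef_construction` values are all assumptions picked for illustration, and which engines and space types are available depends on the OpenSearch version and deployment type.

```python
# Hypothetical variant of a k-NN index definition that spells out the ANN method.
# Parameter values are illustrative assumptions, not tuned recommendations.
EMBEDDING_DIMENSION = 1024  # must equal the output size of the embedding model

explicit_knn_index = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "item_vector": {
                "type": "knn_vector",
                "dimension": EMBEDDING_DIMENSION,
                "method": {
                    "name": "hnsw",       # graph-based approximate nearest neighbor algorithm
                    "engine": "faiss",    # assumed engine; check what your deployment supports
                    "space_type": "l2",   # Euclidean distance
                    "parameters": {
                        "m": 16,                 # number of links kept per graph node
                        "ef_construction": 128,  # candidate list size while building the graph
                    },
                },
            },
            "item_content": {"type": "text", "store": True},
            "company_name": {"type": "text", "store": True},
        }
    },
}
```

Larger `m` and `ef_construction` values generally trade indexing time and memory for better recall, so they are worth revisiting only once the default behavior has been measured.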

<!-- ---
#### OpenSearch Approximate Nearest Neighbor Algorithms and Engines
![ANN algorithm](./static/ann-algorithm.png)

--- -->

<!-- #### HNSW parameter tuning
![hnsw parameter tuning](./static/hnsw-parameter-tuning.png) -->

<!-- ---

#### IVF parameter tuning
![ivf parameter tuning](./static/ivf-parameter-tuning.png)

#### How to select the engine and algorithms
![opensearch ann comparison](./static/opensearch-ann-selection.png)  -->

In [26]:
knn_index = {
    "settings": {
        "index.knn": True
    },
    "mappings": {
        "properties": {
            "item_vector": {
                "type": "knn_vector",
                "dimension": 1024
            },
            "item_content": {
                "type": "text",
                "store": True
            },
            "company_name": {
                "type": "text",
                "store": True
            }
        }
    }
}


Using the index definition above, we now create the index in Amazon OpenSearch Service.

In [27]:
vector_index_name="10k_financial_semantic"

exist=False
try:
    aos_client.indices.get(index=vector_index_name)
    exist=True
except Exception as e:
    exist=False

if exist:
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=vector_index_name)
else:
    print("index does not exist.")
    
aos_client.indices.create(index=vector_index_name,body=knn_index,ignore=400)

delete existing index before creating new one


{'acknowledged': True,
 'shards_acknowledged': True,
 'index': '10k_financial_semantic'}

In [34]:
vector_index_name="10k_financial_semantic"
aos_client.indices.get(index=vector_index_name)
# you can verify the mappings

{'10k_financial_semantic': {'aliases': {},
  'mappings': {'properties': {'company_name': {'type': 'text', 'store': True},
    'item_content': {'type': 'text', 'store': True},
    'item_vector': {'type': 'knn_vector', 'dimension': 1024}}},
  'settings': {'index': {'number_of_shards': '2',
    'provided_name': '10k_financial_semantic',
    'knn': 'true',
    'creation_date': '1731562569482',
    'number_of_replicas': '0',
    'uuid': 'No8rKZMBgvaQIXWG_m7U',
    'version': {'created': '136327827'}}}}}

### Load the raw data into the Index
Next, let's load the 10-K filing data into the index you've just created.

In [29]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
import pandas

from typing import Any, Dict, List, Optional, Sequence

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class PandasDataFrameLoader(BaseLoader):
    
    def __init__(self,dataframe:pandas.DataFrame):
        self.dataframe=dataframe
        
    def load(self) -> List[Document]:
        docs = []
        items=["item_1","item_1A","item_1B","item_2","item_3","item_4","item_5","item_6","item_7","item_7A","item_8","item_9","item_9A", "item_9B", "item_10", "item_11", "item_12", "item_13", "item_14", "item_15"]
        
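        for index, row in self.dataframe.iterrows():
            # Metadata shared by every item of this filing; add more fields as needed
            base_metadata={}
            base_metadata["cik"]=row['cik']
            base_metadata["company_name"]=row['company']
            base_metadata["filing_date"]=row['filing_date']
            for item in items:
                content=row[item]
                # Copy the metadata per item so each Document keeps its own 'item' label
                # instead of sharing one dict that is overwritten on every iteration
                metadata=dict(base_metadata, item=item)
                doc = Document(page_content=content,metadata=metadata)
                #print(doc.metadata)
                docs.append(doc)
        return docs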

Use the Bedrock embedding model to convert item content into vectors, then use the OpenSearch bulk helper to store the data in the OpenSearch index.

In [96]:
import time
from opensearchpy import helpers

def ingest_downloaded_10k_into_opensearch(file_name, index_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    item_contents=[]
    company_name=splitted_documents[0].metadata['company_name']
    for doc in splitted_documents:
        item_contents.append(doc.page_content)
    
    print("\ncompany:" + company_name + ", item count:" + str(len(item_contents)))
    start = time.time()
    embedding_results = bedrock_embeddings.embed_documents(item_contents)
    end = time.time()
    elapsed = end - start
    #print(f"total time elapsed for Bedrock embedding: {elapsed:.2f} seconds")
        
    # Pair each chunk with its embedding to build the bulk actions
    data = []
    for content, vector in zip(item_contents, embedding_results):
        data.append({"_index": index_name,  "company_name": company_name, "item_content":content, "item_vector":vector})
    aos_response= helpers.bulk(aos_client, data)
    print(f"Bulk-inserted {aos_response[0]} items.")
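
To make the `chunk_size` and `chunk_overlap` arguments used above more concrete, here is a simplified sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` is implemented (that splitter prefers to break at separators such as paragraph and line boundaries); the hypothetical `split_with_overlap` function just cuts fixed-size character windows so the effect of the overlap is easy to see.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one. In the ingestion function the same idea applies at a larger scale, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.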

In [31]:
# load the data in to OpenSearch Serverless collection. Note: This would take some time to complete
for file in company_filing_file_name_list:
    ingest_downloaded_10k_into_opensearch(file, vector_index_name)
    print("Ingested :" + file)


company:AUTOMATIC DATA PROCESSING INC, item count:51
Bulk-inserted 51 items.
Ingested :extracted/8670_10K_2021_0000008670-21-000027.json

company:Autodesk, Inc., item count:68
Bulk-inserted 68 items.
Ingested :extracted/769397_10K_2021_0000769397-21-000014.json

company:Zoom Video Communications, Inc., item count:68
Bulk-inserted 68 items.
Ingested :extracted/1585521_10K_2021_0001585521-21-000048.json

company:INTUIT INC, item count:67
Bulk-inserted 67 items.
Ingested :extracted/896878_10K_2021_0000896878-21-000233.json

company:MICROSTRATEGY Inc, item count:65
Bulk-inserted 65 items.
Ingested :extracted/1050446_10K_2020_0001564590-21-005783.json

company:Unity Software Inc., item count:74
Bulk-inserted 74 items.
Ingested :extracted/1810806_10K_2020_0001810806-21-000052.json

company:Okta, Inc., item count:96
Bulk-inserted 96 items.
Ingested :extracted/1660134_10K_2021_0001660134-21-000007.json

company:Datadog, Inc., item count:67
Bulk-inserted 67 items.
Ingested :extracted/1561550_1

To validate the load, you can query the number of documents in the index.

In [36]:
res = aos_client.search(index=vector_index_name, body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])

Records found: 917.


In [37]:
print(res['hits']['hits'][1]['_source'])

{'company_name': 'AUTOMATIC DATA PROCESSING INC', 'item_content': 'Our Return to Workplace dashboard powered by ADP DataCloud uses data analytics and employee surveys to allow clients to monitor workforce trends including availability, health attestation results, and worker readiness and sentiment toward returning to the workplace; identify and schedule workers based on availability, location, job title and other attributes; track vaccination status; and facilitate contact tracing, in order to help transition workers back to workplaces with more clarity and confidence.\nWith ADP® Compliance on Demand, clients can easily tap into a knowledge base for compliance - from new leave laws and time tracking requirements, to record-keeping and more.\nThe new ADP Time Kiosk helps employers manage safe levels of occupancy by equipping workers with time & attendance tracking without touching a device. The Time Kiosk uses optional facial recognition to log workers in compliantly and voice activatio

In [38]:
# Now you can reuse the queries for which full-text search did not return relevant results.
query_text="What are the operating expenses of ADP?"

Run the query and check the search result. 

In [40]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=vector_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

Unnamed: 0,_id,_score,company_name,item_content
0,1%3A0%3A8SwsKZMBZ23FzQVenY2y,0.523385,AUTOMATIC DATA PROCESSING INC,"Item 7. Management's Discussion and Analysis of Financial Condition and Results of Operations\nTabular dollars are presented in millions, except per share amounts\nThe following section discusses our year ended June 30, 2021 (“fiscal 2021”), as compared to year ended June 30, 2020 (“fiscal 2020”). A detailed review of our fiscal 2020 performance compared to our fiscal 2019 performance is set forth in Part II, Item 7 of our Form 10-K for the fiscal year ended June 30, 2020.\nFORWARD-LOOKING S..."
1,1%3A0%3A7CwsKZMBZ23FzQVenY2y,0.512342,AUTOMATIC DATA PROCESSING INC,"Item 2. Properties\nADP owns 7 of its processing/print centers, and 12 other operational offices, sales offices, and its corporate headquarters in Roseland, New Jersey, which aggregate approximately 3,070,644 square feet. None of ADP's owned facilities is subject to any material encumbrances. ADP leases space for some of its processing centers, other operational offices, and sales offices. All of these leases, which aggregate approximately 5,366,245 square feet worldwide, expire at various t..."
2,1%3A0%3A4iwsKZMBZ23FzQVenY2y,0.505817,AUTOMATIC DATA PROCESSING INC,"Our Return to Workplace dashboard powered by ADP DataCloud uses data analytics and employee surveys to allow clients to monitor workforce trends including availability, health attestation results, and worker readiness and sentiment toward returning to the workplace; identify and schedule workers based on availability, location, job title and other attributes; track vaccination status; and facilitate contact tracing, in order to help transition workers back to workplaces with more clarity and..."
3,1%3A0%3A4ywsKZMBZ23FzQVenY2y,0.505309,AUTOMATIC DATA PROCESSING INC,"insights to drive employee engagement and leadership development, which in turn help drive employee performance.\nWorkforce Management. ADP’s Workforce Management offers a range of solutions to over 100,000 employers of all sizes, including time and attendance, absence management and scheduling tools. Time and attendance solutions include time capture via online timesheets, timeclocks with badge readers, biometrics and touch-screens, telephone/interactive voice response, and mobile smartphon..."
4,1%3A0%3ADSwsKZMBZ23FzQVenY6y,0.493537,AUTOMATIC DATA PROCESSING INC,"Item 10. Directors, Executive Officers and Corporate Governance\nThe executive officers of the Company, their ages, positions, and the period during which they have been employed by ADP are as follows:\nEmployed by\nName Age Position ADP Since\nBrock Albinson 46 Corporate Controller and Principal Accounting Officer 2007\nJohn Ayala 54 President, Employer Services North America 2002\nMaria Black 47 President, Worldwide Sales and Marketing 1996\nMichael A. Bonarti 55 Chief Administrative Offic..."
5,1%3A0%3A4CwsKZMBZ23FzQVenY2y,0.490687,AUTOMATIC DATA PROCESSING INC,"Item 1. Business\nCORPORATE BACKGROUND\nGeneral\nIn 1949, our founders established ADP to shape the world of work with a simple, innovative idea: help clients focus on their business by freeing them up from certain non-core tasks such as payroll. Today, we are one of the world’s leading providers of cloud-based human capital management (HCM) solutions to employers, offering solutions to businesses of all sizes, whether they have simple or complex needs. We serve over 920,000 clients and pay ..."
6,1%3A0%3ACiwsKZMBZ23FzQVenY6y,0.475884,AUTOMATIC DATA PROCESSING INC,"Item 9A. Controls and Procedures\nAttached as Exhibits 31.1 and 31.2 to this Annual Report on Form 10-K are certifications of ADP's Chief Executive Officer and Chief Financial Officer, which are required by Rule 13a-14(a) of the Securities Exchange Act of 1934, as amended (the “Exchange Act”). This “Controls and Procedures” section should be read in conjunction with the report of Deloitte & Touche LLP that appears in this Annual Report on Form 10-K and is hereby incorporated herein by refere..."
7,1%3A0%3Av9YsKZMB-aPiNMN8wIQF,0.475778,"Autodesk, Inc.","Marketing and sales expenses include salaries, bonuses, benefits, and stock-based compensation expense for our marketing and sales employees, the expense of travel, entertainment, and training for such personnel, sales and dealer commissions, and the costs of programs aimed at increasing revenue, such as advertising, trade shows and expositions, and various sales and promotional programs. Marketing and sales expenses also include SaaS vendor costs and allocated IT costs, payment processing f..."
8,1%3A0%3A8iwsKZMBZ23FzQVenY2y,0.469598,AUTOMATIC DATA PROCESSING INC,"We have a strong business model, a highly cash generative business with low capital intensity, and offer a suite of products that provide critical support to our clients’ HCM functions. We generate sufficient free cash flow to satisfy our cash dividend and our modest debt obligations, which enables us to absorb the impact of downturns and remain steadfast in our reinvestments, our longer term strategy, and our commitments to shareholder friendly actions. We are committed to building upon our..."
9,1%3A0%3A5SwsKZMBZ23FzQVenY2y,0.469186,AUTOMATIC DATA PROCESSING INC,"As one of the world’s largest providers of HCM solutions, our systems contain a significant amount of sensitive data related to clients, employees of our clients, vendors and our employees. We are, therefore, subject to compliance obligations under federal, state and foreign privacy, data protection and cybersecurity-related laws, including federal, state and foreign security breach notification laws with respect to both client employee data and our own employee data. The changing nature of ..."
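
Because each question in this section sends the same k-NN request body with a different vector, the body can be factored into a small helper. This is an optional sketch: `build_knn_query` is a hypothetical name that the notebook does not define, and the three-element vector is a placeholder for a real embedding.

```python
def build_knn_query(search_vector, k=10, vector_field="item_vector"):
    """Build a k-NN search body for the given query embedding."""
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": search_vector,
                    "k": k,
                }
            }
        },
    }

# Placeholder vector for illustration; a real one comes from the embedding model.
example_query = build_knn_query([0.1, 0.2, 0.3], k=5)
print(example_query)
```

With the objects already defined in this notebook, a search would then read `aos_client.search(index=vector_index_name, body=build_knn_query(bedrock_embeddings.embed_query(query_text)))`.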


In [41]:
query_text="What is ADP's main revenue?"

Run the query and check the search result. 

In [42]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=vector_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

Unnamed: 0,_id,_score,company_name,item_content
0,1%3A0%3A8SwsKZMBZ23FzQVenY2y,0.566075,AUTOMATIC DATA PROCESSING INC,"Item 7. Management's Discussion and Analysis of Financial Condition and Results of Operations\nTabular dollars are presented in millions, except per share amounts\nThe following section discusses our year ended June 30, 2021 (“fiscal 2021”), as compared to year ended June 30, 2020 (“fiscal 2020”). A detailed review of our fiscal 2020 performance compared to our fiscal 2019 performance is set forth in Part II, Item 7 of our Form 10-K for the fiscal year ended June 30, 2020.\nFORWARD-LOOKING S..."
1,1%3A0%3A4CwsKZMBZ23FzQVenY2y,0.554931,AUTOMATIC DATA PROCESSING INC,"Item 1. Business\nCORPORATE BACKGROUND\nGeneral\nIn 1949, our founders established ADP to shape the world of work with a simple, innovative idea: help clients focus on their business by freeing them up from certain non-core tasks such as payroll. Today, we are one of the world’s leading providers of cloud-based human capital management (HCM) solutions to employers, offering solutions to businesses of all sizes, whether they have simple or complex needs. We serve over 920,000 clients and pay ..."
2,1%3A0%3A4iwsKZMBZ23FzQVenY2y,0.544394,AUTOMATIC DATA PROCESSING INC,"Our Return to Workplace dashboard powered by ADP DataCloud uses data analytics and employee surveys to allow clients to monitor workforce trends including availability, health attestation results, and worker readiness and sentiment toward returning to the workplace; identify and schedule workers based on availability, location, job title and other attributes; track vaccination status; and facilitate contact tracing, in order to help transition workers back to workplaces with more clarity and..."
3,1%3A0%3A4ywsKZMBZ23FzQVenY2y,0.541147,AUTOMATIC DATA PROCESSING INC,"insights to drive employee engagement and leadership development, which in turn help drive employee performance.\nWorkforce Management. ADP’s Workforce Management offers a range of solutions to over 100,000 employers of all sizes, including time and attendance, absence management and scheduling tools. Time and attendance solutions include time capture via online timesheets, timeclocks with badge readers, biometrics and touch-screens, telephone/interactive voice response, and mobile smartphon..."
4,1%3A0%3ADSwsKZMBZ23FzQVenY6y,0.540261,AUTOMATIC DATA PROCESSING INC,"Item 10. Directors, Executive Officers and Corporate Governance\nThe executive officers of the Company, their ages, positions, and the period during which they have been employed by ADP are as follows:\nEmployed by\nName Age Position ADP Since\nBrock Albinson 46 Corporate Controller and Principal Accounting Officer 2007\nJohn Ayala 54 President, Employer Services North America 2002\nMaria Black 47 President, Worldwide Sales and Marketing 1996\nMichael A. Bonarti 55 Chief Administrative Offic..."
5,1%3A0%3A7CwsKZMBZ23FzQVenY2y,0.538768,AUTOMATIC DATA PROCESSING INC,"Item 2. Properties\nADP owns 7 of its processing/print centers, and 12 other operational offices, sales offices, and its corporate headquarters in Roseland, New Jersey, which aggregate approximately 3,070,644 square feet. None of ADP's owned facilities is subject to any material encumbrances. ADP leases space for some of its processing centers, other operational offices, and sales offices. All of these leases, which aggregate approximately 5,366,245 square feet worldwide, expire at various t..."
6,1%3A0%3ADiwsKZMBZ23FzQVenY6y,0.510612,AUTOMATIC DATA PROCESSING INC,"Stuart Sackman joined ADP in 1992. Prior to his appointment as Corporate Vice President, Global Shared Services in July 2018, he served as Corporate Vice President, Global Product and Technology from March 2015 to June 2018, as Corporate Vice President and General Manager of Multinational Corporations Services from June 2012 to February 2015, and as Division Vice President and General Manager of the National Account Services’ East National Service Center from February 2008 to May 2012.\nDona..."
7,1%3A0%3A4SwsKZMBZ23FzQVenY2y,0.505891,AUTOMATIC DATA PROCESSING INC,"clients to confidently handle compliance matters like tax filing and deposits.\nToday, big data provides a real competitive advantage. That is why we have accelerated the deployment of machine learning (ML) against our unmatched HCM dataset - the same HCM dataset that drives our renowned ADP National Employment Report®. We are leading this innovation effort with ADP® DataCloud, our award-winning ML and workforce analytics platform which is by far one of the largest repositories of payroll in..."
8,1%3A0%3A5SwsKZMBZ23FzQVenY2y,0.497488,AUTOMATIC DATA PROCESSING INC,"As one of the world’s largest providers of HCM solutions, our systems contain a significant amount of sensitive data related to clients, employees of our clients, vendors and our employees. We are, therefore, subject to compliance obligations under federal, state and foreign privacy, data protection and cybersecurity-related laws, including federal, state and foreign security breach notification laws with respect to both client employee data and our own employee data. The changing nature of ..."
9,1%3A0%3A5CwsKZMBZ23FzQVenY2y,0.484117,AUTOMATIC DATA PROCESSING INC,"• Protection and Compliance: ADP TotalSource HR experts help clients manage the risks of being an employer by advising how to handle properly a range of issues - from HR and safety compliance to employee-relations. This includes access to workers' compensation coverage and expertise designed to help them handle both routine and unexpected incidents, including discrimination and harassment claims.\n• Talent Engagement: Featuring a talent blueprint, ADP TotalSource HR experts work with clients..."


### 2.3 Retrieval Augmented Generation (RAG)

You can use large language models to generate answers rather than returning the documents to the user. Providing the retrieved documents as context for generation reduces hallucination. This method is called Retrieval Augmented Generation, or simply RAG. In RAG, external data can be sourced from various data sources, such as document repositories, databases, or APIs. The first step is to convert the documents and the user query into a format that enables comparison and relevancy search. To achieve this, a document collection (knowledge library) and the user-submitted query are transformed into numerical representations using embedding language models. These embeddings are essentially numerical representations of concepts in text.

Next, based on the embedding of the user query, relevant text is identified in the document collection through similarity search in the embedding space. The prompt provided by the user is then combined with the retrieved text, which is added as context. This updated prompt, which includes relevant external data along with the original prompt, is sent to the LLM (large language model) for processing. As a result, the model output tends to be more relevant and accurate, because the context contains the relevant external data.
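
The augmentation step described above can be sketched as plain string assembly. The function name `build_augmented_prompt`, the instruction wording, and the sample passages are assumptions made for illustration; prompt templates vary by model and use case.

```python
def build_augmented_prompt(question, passages, max_passages=3):
    """Combine the user question with retrieved passages into a single prompt."""
    selected = passages[:max_passages]
    context = "\n\n".join(
        f"[{number}] {passage}" for number, passage in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical passages standing in for search hits.
retrieved_passages = [
    "Marketing and sales expenses include salaries, bonuses, and benefits.",
    "The company leases space for some of its offices.",
]
prompt = build_augmented_prompt(
    "What do marketing and sales expenses include?", retrieved_passages
)
print(prompt)
```

Numbering the passages makes it possible to ask the model to cite which passage supports its answer, and capping them with `max_passages` keeps the prompt within the model's context window.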

The major components of RAG include embedding, vector databases, augmentation, and generation:


- **Embedding**: Purpose: Embeddings transform text data into numerical vectors in a high-dimensional space. These vectors represent the semantic meaning of the text. Process: The embedding process typically uses pre-trained models (like BERT or a variant) to convert both the input queries and the documents in the database into dense vectors. Role in RAG: Embeddings are crucial for the retrieval component as they allow the model to compute the similarity between the query and the documents in the database efficiently.
- **Vector Database**: Function: A vector database stores the embeddings of a large collection of documents or passages. Construction: It is created by processing a vast corpus (like Wikipedia or other specialized datasets) through an embedding model. Usage in RAG: When a query comes in, the model searches this database to find the documents whose embeddings are most similar to the embedding of the query.
- **Retrieval (Augmentation)**: Mechanism: The retrieval part of RAG functions by taking the input query, converting it into a vector using embeddings, and then searching the vector database to retrieve relevant documents. Result: It augments the original query with additional context by selecting documents or passages that are semantically related to the query. This augmented information is essential for generating more informed responses.
- **Generation**: Integration with a Language Model: The generative component, often a large language model like Amazon Titan Text, receives both the original query and the retrieved documents. Response Generation: It synthesizes information from these inputs to produce a coherent and contextually appropriate response. 
- **Training and Fine-Tuning**: The generative model is generally pre-trained on vast amounts of text and may be further fine-tuned to optimize its performance for specific tasks or datasets.
- **End-to-End Training (Optional)**: Joint Optimization: In RAG, both retrieval and generation components can be fine-tuned together, allowing the system to optimize the selection of documents and the generation of responses simultaneously. Feedback Loop: The model learns not only to generate relevant responses but also to retrieve the most useful documents for any given query.
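
The retrieval and augmentation steps listed above can be sketched with toy data. This is a minimal illustration, not the notebook's implementation: the three-dimensional vectors, the passages, and the helper names `cosine_similarity`, `retrieve`, and `build_augmented_prompt` are all made up for this example. A real system would get its vectors from an embedding model and search a vector database, and it may use a different similarity measure than cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vector, store, k=2):
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]

def build_augmented_prompt(question, passages):
    """Combine the retrieved passages and the original question into one prompt."""
    context = "\n".join(passages)
    return f"Use the following context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

# Toy "vector database": hand-written vectors stand in for real embeddings.
toy_store = [
    {"text": "Subscription revenue grew year over year.", "vector": [0.9, 0.1, 0.0]},
    {"text": "The company leases office space.", "vector": [0.0, 0.2, 0.9]},
    {"text": "Digital Media is the largest segment.", "vector": [0.8, 0.3, 0.1]},
]
toy_query_vector = [1.0, 0.2, 0.0]  # pretend embedding of the user question

top_passages = retrieve(toy_query_vector, toy_store, k=2)
augmented_prompt = build_augmented_prompt("What drives revenue?", top_passages)
print(augmented_prompt)
```

The augmented prompt is what would be sent to the LLM in the generation step.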

---
### Architecture

![RAG](./static/RAG_Architecture.png)

---

In [43]:
langchain_index_name="10k_financial_embedding"

In [44]:
exist=False
try:
    aos_client.indices.get(index=langchain_index_name)
    exist=True
except Exception as e:
    exist=False

if exist:
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=langchain_index_name)
else:
    print("index does not exist.")
    

delete existing index before creating new one


In [45]:
from langchain.vectorstores import OpenSearchVectorSearch
from typing import Callable
from requests_aws4auth import AWS4Auth

os_domain_ep = 'https://'+aoss_host

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, aws_region, service, session_token=credentials.token)


def ingest_10k_into_opensearch_with_langchain(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    OpenSearchVectorSearch.from_documents(
                index_name = langchain_index_name,
                documents=splitted_documents,
                embedding=bedrock_embeddings,
                opensearch_url=os_domain_ep,
                http_auth=awsauth,
                timeout=600,
                use_ssl=True,
                verify_certs=True,
                connection_class=RequestsHttpConnection,
    )
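
The ingestion function above splits each document with `chunk_size = 8000` and `chunk_overlap = 200`. The effect of the overlap can be shown with a simplified, purely character-based splitter. This is only a sketch: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this example ignores.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a chunk boundary is still fully contained in at least one chunk as long as it is shorter than the overlap.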

In [46]:
## In this step, you index the selected companies' 10-K files into Amazon OpenSearch Serverless with LangChain
## Loading all the chunks takes a while

for file in company_filing_file_name_list:
    ingest_10k_into_opensearch_with_langchain(file)
    print("Ingested :" + file)


Ingested :extracted/8670_10K_2021_0000008670-21-000027.json
Ingested :extracted/769397_10K_2021_0000769397-21-000014.json
Ingested :extracted/1585521_10K_2021_0001585521-21-000048.json
Ingested :extracted/896878_10K_2021_0000896878-21-000233.json
Ingested :extracted/1050446_10K_2020_0001564590-21-005783.json
Ingested :extracted/1810806_10K_2020_0001810806-21-000052.json
Ingested :extracted/1660134_10K_2021_0001660134-21-000007.json
Ingested :extracted/1561550_10K_2020_0001564590-21-009770.json
Ingested :extracted/1108524_10K_2021_0001108524-22-000008.json
Ingested :extracted/796343_10K_2020_0000796343-21-000004.json
Ingested :extracted/1372612_10K_2021_0001564590-21-014377.json
Ingested :extracted/1321655_10K_2020_0001193125-21-060650.json
Ingested :extracted/1261333_10K_2021_0001261333-21-000059.json


In [47]:
aos_client.indices.get(index=langchain_index_name)

{'10k_financial_embedding': {'aliases': {},
  'mappings': {'properties': {'id': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'metadata': {'properties': {'cik': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'company_name': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'filing_date': {'type': 'date'},
      'item': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}},
    'text': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'vector_field': {'type': 'knn_vector',
     'dimension': 1024,
     'method': {'engine': 'nmslib',
      'space_type': 'l2',
      'name': 'hnsw',
      'parameters': {'ef_construction': 512, 'm': 16}}}}},
  'settings': {'index': {'number_of_shards': '2',
    'knn.algo_param': {'ef_search': '512'},
    'provided_name': '10k_financial_embedd

In [48]:

class SimiliarOpenSearchVectorSearch(OpenSearchVectorSearch):
    
    def relevance_score(self, distance: float) -> float:
        return distance
    
    def _select_relevance_score_fn(self) -> Callable[[float], float]:
        return self.relevance_score


open_search_vector_store = SimiliarOpenSearchVectorSearch(
                                    index_name=langchain_index_name,
                                    embedding_function=bedrock_embeddings,
                                    opensearch_url=os_domain_ep,
                                    http_auth=awsauth,
                                    timeout=600,
                                    use_ssl=True,
                                    verify_certs=True,
                                    connection_class=RequestsHttpConnection,
                                    ) 

Initialize Bedrock LLM model with Claude

In [49]:
from langchain_aws import BedrockLLM, ChatBedrock

#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0", client=boto3_bedrock)
bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=boto3_bedrock)
#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-opus-20240229-v1:0", client=boto3_bedrock)

bedrock_llm.model_kwargs = {"temperature":0.001,"top_k":300,"top_p":1}


#### Note: The prompts in this section are designed for Claude 3. Results may differ with other LLMs, for example because of guardrails.

In [50]:
from langchain.chains import RetrievalQA
import langchain

bedrock_retriever = open_search_vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        'k': 5,
        'score_threshold': 0.005
    }
)
rag_qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm,
    retriever=bedrock_retriever,
    chain_type="stuff" #stuff, refine, map_reduce, and map_rerank
)
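
The retriever above keeps only hits whose score is at least `score_threshold`, and the custom `relevance_score` method passes the OpenSearch score through unchanged. The sketch below illustrates why a threshold as small as `0.005` can still be meaningful. It assumes the conversion described in the OpenSearch k-NN documentation for the `l2` space, where the score is `1 / (1 + d)` and `d` is the squared Euclidean distance. The helper names and the two toy vectors are hypothetical.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_l2_score(a, b):
    """Assumed k-NN score for the l2 space: 1 / (1 + squared distance)."""
    return 1.0 / (1.0 + l2_squared(a, b))

def filter_by_threshold(scored_docs, score_threshold):
    """Keep (doc, score) pairs whose score reaches the threshold."""
    return [(doc, score) for doc, score in scored_docs if score >= score_threshold]

query = [0.0, 0.0]
candidates = {"near": [1.0, 0.0], "far": [20.0, 0.0]}  # toy vectors

scored = [(name, knn_l2_score(query, vector)) for name, vector in candidates.items()]
kept = filter_by_threshold(scored, score_threshold=0.005)
print(kept)
```

Under this assumption, a score of `0.005` corresponds to a squared distance of 199, so the threshold only removes hits that are very far from the query.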

In [51]:
question="What is Adobe's main revenue??"

langchain.debug=True
result = rag_qa({"query": question})




[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "What is Adobe's main revenue??"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "What is Adobe's main revenue??",
  "context": "ITEM 1. BUSINESS\nFounded in 1982, Adobe Inc. is one of the largest and most diversified software companies in the world. We offer a line of products and services used by creative professionals including photographers, video editors, designers and developers; communicators including content creators, students, marketers and knowledge workers; businesses of all sizes; and consumers for creating, managing, delivering, measuring, optimizing, engaging and transacting with compelling content and experiences across personal computers, d

In [52]:
print("Result:" + result["result"])

Result:Based on the information provided, Adobe's main revenue comes from its Digital Media segment, which includes subscription revenue from its Creative Cloud and Document Cloud offerings.

Some key points about Adobe's revenue:

- In fiscal 2020, Digital Media revenue was $9.23 billion, which was 72% of Adobe's total revenue of $12.87 billion.

- Subscription revenue was $11.63 billion in fiscal 2020, which was 90% of total revenue. The majority of this subscription revenue comes from the Digital Media segment's Creative Cloud and Document Cloud offerings.

- Within Digital Media, Creative revenue (including Creative Cloud) was $7.74 billion in fiscal 2020, while Document Cloud revenue was $1.50 billion.

- Adobe's other segments are Digital Experience (24% of total 2020 revenue) and Publishing and Advertising (4% of total revenue).

So in summary, the largest portion of Adobe's revenue comes from subscription sales of its Creative Cloud suite of software for creative professionals 

## Part 3: AI agent powered search




### Standard RAG limitation

#### Questions that standard RAG cannot answer:
* Comparison questions: `Compare Adobe and Asana company financial statements`
    * First, retrieve the financial statements of both companies with semantic search
    * Second, compare the retrieved financial statements
* Questions that need information outside the knowledge base: `Is Snowflake a good investment choice right now?`
    * The extra information is available in a relational database or data warehouse
* Questions that hit an out-of-date knowledge base: `Is Amazon a good investment choice right now?`
    * There are no Amazon financial statements in the knowledge base, so the 10-K filing must first be downloaded from the internet
    * Ingest the data into the knowledge base
    * Search the knowledge base and return the result

#### Comparison question to standard RAG

In [None]:
question="Compare Adobe and Asana company financial statements"

In [None]:
langchain.debug=True
result = rag_qa({"query": question})

In [None]:
print("Result:" + result["result"])

#### The same comparison question to the AI agent

In [None]:
session_id = str(uuid.uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": question})

In [None]:
print(response["output"])

![standard rag limitation](./static/rag-limitation.png)


![advanced rag ](./static/advanced-rag.png)

### What is an AI agent?
An agentic LLM assistant employs a chain-of-thought reasoning process, in which the LLM is prompted to think step by step through a question, interleaving its reasoning with the use of external tools such as search engines and APIs. This allows the LLM to retrieve relevant information that answers parts of the question, ultimately leading to a more comprehensive and accurate final response. This approach is inspired by the "Reason and Act" (ReAct) design introduced in the paper [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/pdf/2210.03629), which aims to combine the reasoning capabilities of language models with the ability to interact with external resources and take actions. By combining these two facets, an agentic LLM assistant can provide more informed and well-rounded answers to complex user queries.

![agent components](./static/agent-components.png)

### Why build an AI agent?
In today's digital landscape, enterprises are inundated with a vast array of data sources, ranging from traditional PDF documents to complex SQL and NoSQL databases, and everything in between. While this wealth of information holds immense potential for gaining valuable insights and driving operational efficiency, the sheer volume and diversity of data can often pose significant challenges in terms of accessibility and utilization.

This is where the power of agentic LLM assistants comes into play. By leveraging progress in LLM design patterns such as Reason and Act (ReAct) and other traditional or novel design patterns, these intelligent assistants are capable of integrating with an enterprise's diverse data sources. Through the development of specialized tools tailored to each data source and the ability of LLM agents to identify the right tool for a given question, agentic LLM assistants can simplify how you navigate and extract relevant information, regardless of its origin or structure.

This enables a rich, multi-source conversation that promises to unlock the full potential of the enterprise data, enabling data-driven decision-making, enhancing operational efficiency, and ultimately driving productivity and growth.

![agent powered search advantage](./static/agent-powered-search-advantage.png)


### AI Agent powered search reference architecture

![AI Agent powered search reference architecture](./static/reference-architecture.png)


### Lab Architecture

For demo purposes, we use a SageMaker notebook to run the code. The overall architecture of this lab is as follows:

![AI Agent powered search architecture](./static/architecture.png)

---


### Data Flow

The user submits a query. The AI agent first judges whether the query is related to financial statements. If so, the agent uses vector search to retrieve similar financial statements for the company from OpenSearch. If no financial statements exist for the company, the agent downloads the data from the internet by calling the SEC API, ingests it into OpenSearch, and runs the search again. Once related financial statements are available, the agent checks whether the query is also about stock prices. If so, the agent queries the Redshift database for the company's stock price data. Finally, the LLM generates the response from all the collected data. The overall data flow is as follows:

![AI Agent powered search data flow](./static/ai-agent-search-data-flow.png)
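
The routing described in this data flow can be sketched as plain Python. This is one possible reading of the flow, not the agent built later in the lab: `answer_query`, the `tools` dictionary, the keyword-based classifiers, and the in-memory `knowledge_base` are hypothetical stand-ins for the LLM checks, OpenSearch, the SEC download, and Redshift.

```python
def answer_query(query, tools):
    """Route a query through the tools in the order described in the data flow."""
    collected = {}
    if tools["is_financial_statement_related"](query):
        statements = tools["search_statements"](query)
        if not statements:
            # Nothing in the knowledge base yet: download, ingest, then search again.
            tools["download_and_ingest"](query)
            statements = tools["search_statements"](query)
        collected["financial_statements"] = statements
        if statements and tools["is_stock_related"](query):
            collected["stock_price"] = tools["query_stock_price"](query)
    return tools["generate"](query, collected)

# Hypothetical in-memory stand-ins for the real services.
knowledge_base = {"adobe": ["Adobe 10-K excerpt"]}
downloads = []

def search_statements(query):
    """Return stored excerpts for every known company mentioned in the query."""
    return [doc for company, docs in knowledge_base.items()
            if company in query.lower() for doc in docs]

def download_and_ingest(query):
    """Pretend to download a filing and add it to the knowledge base."""
    downloads.append(query)
    knowledge_base["amazon"] = ["Amazon 10-K excerpt"]

tools = {
    "is_financial_statement_related": lambda q: "investment" in q.lower(),
    "is_stock_related": lambda q: "stock" in q.lower(),
    "search_statements": search_statements,
    "download_and_ingest": download_and_ingest,
    "query_stock_price": lambda q: ["<stock price rows>"],
    "generate": lambda q, data: "answer based on: " + ", ".join(sorted(data)),
}

print(answer_query("Is Amazon a good investment at its current stock price?", tools))
```

In the lab itself, the LLM decides which tool to call at each step instead of following a fixed `if` chain, but the order of the checks is the same.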

### 3.1 Prepare other tools used by AI agent

#### 3.1.1 Ingest and query structured data in Redshift

Get Redshift Serverless username, password and endpoint

In [53]:
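# The client talks to AWS Secrets Manager (not KMS), so name it accordingly
secrets_manager = boto3.client('secretsmanager')

redshift_serverless_credentials = json.loads(secrets_manager.get_secret_value(SecretId=outputs['RedshiftServerlessSecret'])['SecretString'])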
redshift_serverless_username=redshift_serverless_credentials['username']
redshift_serverless_password=redshift_serverless_credentials['password']
redshift_serverless_endpoint =  outputs['RedshiftServerlessEndpoint']

Create `stock_symbol` table and populate the table from S3. We will use this table to query company stock ticker by company name.

In [86]:
import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import Session
%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = URL.create(
drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
host=redshift_serverless_endpoint, 
port=5439,
database='dev',
username=redshift_serverless_username,
password=redshift_serverless_password
)

%sql $connect_to_db
%sql select current_user, version();

%sql CREATE TABLE IF NOT EXISTS public.stock_symbol (stock_symbol text PRIMARY KEY, company_name text NOT NULL);

stock_price_bucket = outputs["s3BucketStock"]
s3_location = f's3://{stock_price_bucket}/stock-price/'
print(s3_location)
!aws s3 sync ./stock-price/ $s3_location

stock_symbol_s3_location = f's3://{stock_price_bucket}/stock-price/stock_symbol.csv'

quoted_stock_symbol_s3_location = "'" + stock_symbol_s3_location + "'"

%sql COPY STOCK_SYMBOL FROM $quoted_stock_symbol_s3_location iam_role default IGNOREHEADER 1 CSV;


url = URL.create(
    drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
    host=redshift_serverless_endpoint, 
    port=5439,
    database='dev',
    username=redshift_serverless_username,
    password=redshift_serverless_password
)

engine = sa.create_engine(url)
redshift_connection = engine.connect()
    
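def query_stock_ticker(company_name):
    # Bind the search pattern as a parameter instead of concatenating it into the SQL string,
    # so that names containing quotes (for example "Macy's") do not break the statement.
    # ILIKE is already case-insensitive, so lower() is not needed.
    stmt = sa.text("SELECT stock_symbol FROM stock_symbol WHERE company_name ILIKE :pattern")
    stock_ticker=''
    try:
        result = redshift_connection.execute(stmt, {"pattern": "%" + company_name + "%"})
        row = result.fetchone()
        if row is not None:
            stock_ticker=row[0]
    except Exception as e:
        print(e)
    return stock_ticker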


 * redshift+redshift_connector://awsuser:***@workgroup-8210c140.649735563824.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector://awsuser:***@workgroup-8210c140.649735563824.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
s3://generative-ai-powered-search-s3bucketstock-i34dvcmdobut/stock-price/
 * redshift+redshift_connector://awsuser:***@workgroup-8210c140.649735563824.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.


In [72]:
query_stock_ticker("adobe")

'ADBE'
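
Because the company name used for the ticker lookup usually comes from free text, it is safer to pass it to the database as a bound parameter than to build the SQL string by hand. The following sketch shows the idea with Python's built-in `sqlite3` module standing in for Redshift. The table rows and the `lookup_ticker` helper are made up for illustration, and SQLite's `LIKE` (case-insensitive for ASCII) takes the place of `ILIKE`.

```python
import sqlite3

# In-memory database with a table shaped like stock_symbol (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_symbol (stock_symbol TEXT PRIMARY KEY, company_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO stock_symbol VALUES (?, ?)",
    [("ADBE", "Adobe Inc."), ("ASAN", "Asana, Inc.")],
)

def lookup_ticker(connection, company_name):
    """Return the first ticker whose company name contains company_name, or ''."""
    row = connection.execute(
        "SELECT stock_symbol FROM stock_symbol WHERE company_name LIKE ?",
        ("%" + company_name + "%",),  # the driver handles quoting of the value
    ).fetchone()
    return row[0] if row else ""

print(lookup_ticker(conn, "adobe"))
print(repr(lookup_ticker(conn, "O'Reilly")))  # a quote in the input does not break the query
```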

In [131]:
%sql CREATE TABLE IF NOT EXISTS public.stock_price (stock_date DATE, stock_symbol text, open_price DECIMAL, high_price DECIMAL, low_price DECIMAL, close_price DECIMAL, adjusted_close_price DECIMAL, volume DECIMAL);

asan_s3_location = f's3://{stock_price_bucket}/stock-price/ASAN.csv'
quoted_asan_s3_location = "'" + asan_s3_location + "'"
print(quoted_asan_s3_location)
print("---------")

crm_s3_location = f's3://{stock_price_bucket}/stock-price/CRM.csv'
quoted_crm_s3_location = "'" + crm_s3_location + "'"
print(quoted_crm_s3_location)
print("---------")

adp_s3_location = f's3://{stock_price_bucket}/stock-price/ADP.csv'
quoted_adp_s3_location = "'" + adp_s3_location + "'"
print(quoted_adp_s3_location)
print("---------")

adsk_s3_location = f's3://{stock_price_bucket}/stock-price/ADSK.csv'
quoted_adsk_s3_location = "'" + adsk_s3_location + "'"
print(quoted_adsk_s3_location)
print("---------")

box_s3_location = f's3://{stock_price_bucket}/stock-price/BOX.csv'
quoted_box_s3_location = "'" + box_s3_location + "'"
print(quoted_box_s3_location)
print("---------")

adbe_s3_location = f's3://{stock_price_bucket}/stock-price/ADBE.csv'
quoted_adbe_s3_location = "'" + adbe_s3_location + "'"
print(quoted_adbe_s3_location)
print("---------")

%sql COPY STOCK_PRICE FROM $quoted_asan_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_crm_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_adp_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_adsk_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_box_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_adbe_s3_location iam_role default IGNOREHEADER 1 CSV;

 * redshift+redshift_connector://awsuser:***@workgroup-8210c140.649735563824.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
's3://generative-ai-powered-search-s3bucketstock-i34dvcmdobut/stock-price/ASAN.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-i34dvcmdobut/stock-price/CRM.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-i34dvcmdobut/stock-price/ADP.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-i34dvcmdobut/stock-price/ADSK.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-i34dvcmdobut/stock-price/BOX.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-i34dvcmdobut/stock-price/ADBE.csv'
---------
 * redshift+redshift_connector://awsuser:***@workgroup-8210c140.649735563824.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector://awsuser:***@workgroup-8210c140.649735563824.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector:

[]

In [90]:
%sql select * from public.stock_price 

 * redshift+redshift_connector://awsuser:***@workgroup-8210c140.649735563824.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.


stock_date,stock_symbol,open_price,high_price,low_price,close_price,adjusted_close_price,volume
1992-11-10,ADP,10,10,9,10,5,805988
1992-11-11,ADP,10,10,9,10,5,1426366
1992-11-12,ADP,9,9,9,9,5,662746
1992-11-13,ADP,9,9,9,9,5,922498
1992-11-16,ADP,9,9,9,9,5,1016816
1992-11-17,ADP,9,10,9,9,5,752020
1992-11-18,ADP,10,10,9,10,5,961839
1992-11-19,ADP,10,10,9,10,5,1202425
1992-11-20,ADP,10,10,9,10,5,1373407
1992-11-23,ADP,10,10,10,10,5,805483


In [91]:
def query_stock_price(stock_ticker):
    strSQL = "SELECT stock_date, stock_symbol, open_price, high_price, low_price, close_price FROM public.stock_price WHERE stock_symbol ='" + stock_ticker + "' limit 100"
    print(strSQL)
    # Start with an empty DataFrame so the function still returns a value if the query fails
    stock_price = pd.DataFrame()
    try:
        result = redshift_connection.execute(strSQL)
        stock_price = pd.DataFrame(result)
    except Exception as e:
        print(e)
    return stock_price

In [137]:
query_stock_price('ADBE')

SELECT stock_date, stock_symbol, open_price, high_price, low_price, close_price FROM public.stock_price WHERE stock_symbol ='ADBE' limit 100


Unnamed: 0,stock_date,stock_symbol,open_price,high_price,low_price,close_price
0,2000-06-19,ADBE,29,32,29,31
1,2000-06-20,ADBE,31,32,31,32
2,2000-06-21,ADBE,32,32,31,32
3,2000-06-22,ADBE,32,32,30,30
4,2000-06-23,ADBE,30,30,29,29
5,2000-06-26,ADBE,29,31,29,31
6,2000-06-27,ADBE,30,30,29,29
7,2000-06-28,ADBE,30,31,29,31
8,2000-06-29,ADBE,30,32,30,31
9,2000-06-30,ADBE,31,32,31,32


#### 3.1.2 Download 10-K filing from SEC
---
This step uses the API from https://sec-api.io.

Create a new account there and get a free API key.



In [93]:
!pip install sec-api



##### Replace the placeholder in the following line with your sec-api key

In [94]:
sec_api_key="{security_api_key}"

In [97]:
from sec_api import ExtractorApi, QueryApi
import json
import os

def get_filings(ticker):
    global sec_api_key

    # Finding Recent Filings with QueryAPI
    queryApi = QueryApi(api_key=sec_api_key)
    query = {
      "query": f"ticker:{ticker} AND formType:\"10-K\"",
      "from": "0",
      "size": "1",
      "sort": [{ "filedAt": { "order": "desc" } }]
    }
    response = queryApi.get_filings(query)

    # Getting 10-K URL
    filing_url = response["filings"][0]["linkToFilingDetails"]
    filing_type=response['filings'][0]['formType']
    cik=response['filings'][0]['cik']
    company=response['filings'][0]['companyName']
    filing_date=response['filings'][0]['filedAt']
    period_of_report=response['filings'][0]['periodOfReport']
    filing_html_index=response['filings'][0]['linkToFilingDetails']
    complete_text_filing_link=response['filings'][0]['linkToTxt']

    # Extracting Text with ExtractorAPI
    extractorApi = ExtractorApi(api_key=sec_api_key)
    
    one_text = extractorApi.get_section(filing_url, "1", "text")       #Section 1 - Business
    onea_text = extractorApi.get_section(filing_url, "1A", "text")     # Section 1A - Risk Factors
    oneb_text = extractorApi.get_section(filing_url, "1B", "text")     # Section 1B - Unresolved Staff Comments
    two_text = extractorApi.get_section(filing_url, "2", "text")       # Section 2 - Properties
    three_text = extractorApi.get_section(filing_url, "3", "text")     # Section 3 - Legal Proceedings
    four_text = extractorApi.get_section(filing_url, "4", "text")      # Section 4 - Mine Safety Disclosures
    five_text = extractorApi.get_section(filing_url, "5", "text")      # Section 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
    six_text = extractorApi.get_section(filing_url, "6", "text")       # Section 6 - Selected Financial Data (prior to February 2021)
    seven_text = extractorApi.get_section(filing_url, "7", "text")     # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
    sevena_text = extractorApi.get_section(filing_url, "7A", "text")   # Section 7A - Quantitative and Qualitative Disclosures about Market Risk
    eight_text = extractorApi.get_section(filing_url, "8", "text")     # Section 8 - Financial Statements and Supplementary Data
    nine_text = extractorApi.get_section(filing_url, "9", "text")      # Section 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
    ninea_text = extractorApi.get_section(filing_url, "9A", "text")    # Section 9A - Controls and Procedures
    nineb_text = extractorApi.get_section(filing_url, "9B", "text")    # Section 9B - Other Information
    ten_text = extractorApi.get_section(filing_url, "10", "text")      # Section 10 - Directors, Executive Officers and Corporate Governance
    eleven_text = extractorApi.get_section(filing_url, "11", "text")   # Section 11 - Executive Compensation
    twelve_text = extractorApi.get_section(filing_url, "12", "text")   # Section 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
    thirteen_text = extractorApi.get_section(filing_url, "13", "text") # Section 13 - Certain Relationships and Related Transactions, and Director Independence
    fourteen_text = extractorApi.get_section(filing_url, "14", "text") # Section 14 - Principal Accountant Fees and Services
    fifteen_text = extractorApi.get_section(filing_url, "15", "text")  # Section 15 - Exhibits and Financial Statement Schedules
    
    data = {}
    data['filing_url'] = filing_url
    data['filing_type'] = filing_type
    data['cik'] = cik
    data['company'] = company
    data['filing_date'] = filing_date
    data['period_of_report'] = period_of_report
    data['filing_html_index'] = filing_html_index
    data['complete_text_filing_link'] = complete_text_filing_link
    
    
    data['item_1'] = one_text
    data['item_1A'] = onea_text
    data['item_1B'] = oneb_text
    data['item_2'] = two_text
    data['item_3'] = three_text
    data['item_4'] = four_text
    data['item_5'] = five_text
    data['item_6'] = six_text
    data['item_7'] = seven_text
    data['item_7A'] = sevena_text
    data['item_8'] = eight_text
    data['item_9'] = nine_text
    data['item_9A'] = ninea_text
    data['item_9B'] = nineb_text
    data['item_10'] = ten_text
    data['item_11'] = eleven_text
    data['item_12'] = twelve_text
    data['item_13'] = thirteen_text
    data['item_14'] = fourteen_text
    data['item_15'] = fifteen_text
    
    # Create the target folder if it does not exist yet
    os.makedirs("./download_filings", exist_ok=True)
    
    # Build the file name from the last two path segments of the filing URL
    file_name = filing_url.split("/")[-2] + "-" + filing_url.split("/")[-1].split(".")[0]+".json"
    download_to = "./download_filings/" + file_name
    try:
        with open(download_to, "w") as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
    except Exception as e:
        print("Problem with {url}".format(url=filing_url))
        print(e)
    
    return file_name

In [99]:
#downloaded_file=get_filings("AMZN")
downloaded_file="000101872424000008-amzn-20231231.json"
ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file, langchain_index_name)


company:AMAZON COM INC, item count:52
Bulk-inserted 52 items.
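
The name of the downloaded file is derived from the last two path segments of the filing URL. The sketch below isolates that step in a hypothetical helper, `filing_file_name`. The URL is a made-up example whose last two segments were chosen to reproduce the file name used above. It is not a real filing link.

```python
def filing_file_name(filing_url):
    """Build '<folder>-<document stem>.json' from the last two URL path segments."""
    parts = filing_url.split("/")
    accession_folder = parts[-2]            # second-to-last path segment
    document_stem = parts[-1].split(".")[0]  # file name without its extension
    return accession_folder + "-" + document_stem + ".json"

# Hypothetical URL for illustration only.
example_url = "https://example.com/filings/000101872424000008/amzn-20231231.htm"
print(filing_file_name(example_url))
```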


### 3.2 Create AI agent

#### Define methods used by AI agent

One popular architecture for building agents is ReAct. ReAct combines reasoning and acting in an iterative process - in fact the name "ReAct" stands for "Reason" and "Act".

The general flow looks like this:

- The model will "think" about what step to take in response to an input and any previous observations.
- The model will then choose an action from available tools (or choose to respond to the user).
- The model will generate arguments to that tool.
- The agent runtime (executor) will parse out the chosen tool and call it with the generated arguments.
- The executor will return the results of the tool call back to the model as an observation.
- This process repeats until the agent chooses to respond.
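
The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the executor that LangChain provides: `run_react_loop` is a hypothetical function, the "model" is a scripted function that returns fixed decisions instead of calling an LLM, and the two tools return canned values.

```python
def run_react_loop(model_step, tools, user_input, max_steps=5):
    """Alternate between model decisions and tool calls until the model responds."""
    observations = []
    for _ in range(max_steps):
        decision = model_step(user_input, observations)
        if decision["action"] == "respond":
            return decision["answer"]
        tool = tools[decision["action"]]             # parse out the chosen tool
        observation = tool(**decision["arguments"])  # call it with the generated arguments
        observations.append((decision["action"], observation))
    return "Stopped: step limit reached."

def scripted_model(user_input, observations):
    """Stand-in for the LLM: picks the next action from the observations so far."""
    if not observations:
        return {"action": "get_stock_ticker", "arguments": {"company_name": "adobe"}}
    if len(observations) == 1:
        ticker = observations[0][1]
        return {"action": "get_stock_price", "arguments": {"stock_ticker": ticker}}
    return {
        "action": "respond",
        "answer": f"Collected {len(observations)} observations for {observations[0][1]}.",
    }

# Canned tools for illustration only.
toy_tools = {
    "get_stock_ticker": lambda company_name: "ADBE",
    "get_stock_price": lambda stock_ticker: f"<price rows for {stock_ticker}>",
}

final_answer = run_react_loop(scripted_model, toy_tools, "How is Adobe's stock doing?")
print(final_answer)
```

The `max_steps` limit is a design choice of this sketch: it keeps the loop from running forever if the model never chooses to respond.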

In [112]:
from langchain.prompts.chat import ChatPromptTemplate
from langchain.chains import LLMChain
import time

def is_financial_statement_related_query(human_input):
    template = """You are a helpful assistant to judge if the human input is trying to analyze company financial statement.
    If the human input is financial statement related question, answer \"yes\". Otherwise answer \"no\".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    financial_statement_related = llm_chain({"text":human_input})['text'].strip()
    return financial_statement_related

def is_stock_related_query(human_input):
    template = """
    You are a helpful assistant to judge if the human input is stock related question. 
    If the human input is stock related question, return "yes". Otherwise return "no".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related

def get_company_name(human_input):
    template = """You are a helpful assistant who extract company name from the human input.Please only output the company"""
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )

    company_name=llm_chain({"text":human_input})['text'].strip()
    return company_name
    
def semantic_search_and_check(human_input, k=10,with_post_filter=True):

    company_name=get_company_name(human_input)
    
    search_vector = bedrock_embeddings.embed_query(human_input)

    no_post_filter_search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        }
    }

    post_filter_search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        },
        "post_filter": {
           "match": { 
               "company_name":company_name
           }
        }
    }
    
    search_query=no_post_filter_search_query
    if with_post_filter:
        search_query=post_filter_search_query
    
    res = aos_client.search(index=vector_index_name, 
                       body=search_query,
                       stored_fields=["company_name","item_category","item_content"])
    
    query_result=[]
    for hit in res['hits']['hits']:
        hit_company=hit['fields']['company_name'][0]
        print("\nsemantic search hit company: " + hit_company)
        row=[hit['fields']['company_name'][0], hit['fields']['item_content'][0]]
        query_result.append(row)

    query_result_df = pd.DataFrame(data=query_result,columns=["company_name","company_financial_statements"])
    return query_result_df

def search_for_similiar_content_in_10k_filing(human_input):
    company_statements = semantic_search_and_check(human_input)
    return company_statements

def search_financial_statements_for_company(company_financial_statements_query):
    company_statements = semantic_search_and_check(company_financial_statements_query)
    return company_statements

def get_stock_ticker(human_input):
    company_name=get_company_name(human_input)
    company_ticker = query_stock_ticker(company_name)
    return company_ticker

def get_stock_price(stock_ticker):
    stock_price = query_stock_price(stock_ticker)
    return stock_price

def download_10k_filing_from_sec_and_ingest_into_opensearch(stock_ticker):
    result = "download failed."
    try:
        #downloaded_file=get_filings(stock_ticker)
        downloaded_file="000101872424000008-amzn-20231231.json"
        ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file, vector_index_name)
        time.sleep(60) # wait until the newly ingested data is searchable
        result="download succeeded."
    except Exception as e:
        print(e)
        result = "download failed."
    return result
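
One property of the `post_filter` query built in `semantic_search_and_check` is worth keeping in mind: the filter is applied after the k-NN search has already selected its `k` nearest neighbors, so the final result can contain fewer than `k` hits. The toy simulation below illustrates this with a hypothetical helper, `knn_then_post_filter`. The documents and vectors are made up, and exact equality on `company_name` is a simplification of the `match` query.

```python
def knn_then_post_filter(query_vector, docs, k, company_name):
    """Take the k nearest docs by squared distance, then filter them by company."""
    ranked = sorted(
        docs,
        key=lambda d: sum((q - v) ** 2 for q, v in zip(query_vector, d["vector"])),
    )
    top_k = ranked[:k]  # k-NN step: company is not considered here
    return [d for d in top_k if d["company_name"] == company_name]  # post-filter step

# Toy documents for illustration only.
docs = [
    {"company_name": "Adobe", "vector": [0.0, 0.1]},
    {"company_name": "Autodesk", "vector": [0.1, 0.1]},
    {"company_name": "Adobe", "vector": [0.2, 0.0]},
    {"company_name": "Autodesk", "vector": [0.9, 0.9]},
]

hits = knn_then_post_filter([0.0, 0.0], docs, k=3, company_name="Autodesk")
print(len(hits))
```

Although two Autodesk documents exist and `k` is 3, only one is returned, because the second Autodesk document never made it into the top 3 before filtering. Requesting a larger `k` is one way to reduce this effect.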

In [130]:
semantic_search_and_check("autodesk financial statements")

[32;1m[1;3m[chain/start][0m [1m[chain:LLMChain] Entering Chain run with input:
[0m{
  "text": "autodesk financial statements"
}
[32;1m[1;3m[llm/start][0m [1m[chain:LLMChain > llm:ChatBedrock] Entering LLM run with input:
[0m{
  "prompts": [
    "System: You are a helpful assistant who extract company name from the human input.Please only output the company\nHuman: autodesk financial statements"
  ]
}
[36;1m[1;3m[llm/end][0m [1m[chain:LLMChain > llm:ChatBedrock] [342ms] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "Autodesk",
        "generation_info": null,
        "type": "ChatGeneration",
        "message": {
          "lc": 1,
          "type": "constructor",
          "id": [
            "langchain",
            "schema",
            "messages",
            "AIMessage"
          ],
          "kwargs": {
            "content": "Autodesk",
            "additional_kwargs": {
              "usage": {
                "prompt_tokens": 

Unnamed: 0,company_name,company_financial_statements
0,"Autodesk, Inc.","Balances, January 31, 2019 219.4 2,071.5 (135.0) (2,147.4) (210.9)\nCommon stock issued under stock plans 2.7 (18.6) - - (18.6)\nStock-based compensation expense - 332.7 - - 332.7\nSettlement of liability-classified restricted stock units - 23.5 - - 23.5\nPre-combination expense related to equity awards assumed - 1.2 - - 1.2\nCumulative effect of adoption of accounting standards - - - (0.7) (0.7)\nNet income - - - 214.5 214.5\nOther comprehensive loss - - (25.3) - (25.3)\nRepurchase and reti..."
1,"Autodesk, Inc.","Net revenue by sales channel:\nIndirect $ 2,600.0 $ 2,282.2 $ 1,830.8\nDirect 1,190.4 992.1 739.0\nTotal net revenue $ 3,790.4 $ 3,274.3 $ 2,569.8\nNet revenue by product type:\nDesign $ 3,365.8 $ 2,920.1 $ 2,347.8\nMake 296.4 218.4 89.6\nOther 128.2 135.8 132.4\nTotal net revenue $ 3,790.4 $ 3,274.3 $ 2,569.8\nPayments for product subscriptions, industry collections, cloud subscriptions, and maintenance subscriptions are typically due up front with payment terms of 30 to 45 days. Payments o..."
2,"Autodesk, Inc.",Defined Benefit Pension Plans\nThe funded status of Autodesk’s defined benefit pension plans is recognized in the Consolidated Balance Sheets. The funded status is measured as the difference between the fair value of plan assets and the projected benefit obligation for the fiscal years presented. The projected benefit obligation represents the actuarial present value of benefits expected to be paid upon retirement based on employee services already rendered and estimated future compensation ...
3,"Autodesk, Inc.","(1)Included in “Interest and other expense, net” on the Company’s Consolidated Statements of Operations.\nAutodesk does not consider the remaining investments to be impaired at January 31, 2021.\nForeign currency contracts designated as cash flow hedges\nAutodesk uses foreign currency contracts to reduce the exchange rate impact on a portion of the net revenue or operating expense of certain anticipated transactions. These currency collars and forward contracts are designated and documented ..."
4,"Autodesk, Inc.","Furniture and equipment, at cost 88.4 69.0\nComputer software, hardware, leasehold improvements, furniture, and equipment, at cost 635.5 576.7\nLess: Accumulated depreciation (442.7) (415.0)\nComputer software, hardware, leasehold improvements, furniture, and equipment, net\n$ 192.8 $ 161.7\nCosts incurred for computer software developed or obtained for internal use are capitalized for application development activities, if material, and immediately expensed for preliminary project activitie..."
5,"Autodesk, Inc.","ITEM 7.MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS\nThe following discussion and analysis of our financial condition and results of operations should be read in conjunction with our consolidated financial statements and related notes appearing in Part II, Item 8 of this Annual Report on Form 10-K. This discussion contains forward-looking statements based upon current expectations that involve risks and uncertainties. Our actual results may differ mat..."
6,"Autodesk, Inc.","The acquisition-date fair value of the consideration transferred totaled $252.0 million, which consisted of $214.1 million of cash and 147,264 shares of Autodesk’s common stock at an aggregate fair value of $37.9 million. Of the total consideration transferred, $231.1 million is considered purchase consideration. Of the remaining amount, $18.9 million was recorded in “Prepaid expenses and other current assets” and “Long-term other assets” on our Consolidated Balance Sheets and will be amorti..."
7,"Autodesk, Inc.","Gains and losses realized from foreign currency transactions, those transactions denominated in currencies other than the foreign subsidiary’s functional currency, are included in “Interest and other expense, net.” Monetary assets and liabilities are\nremeasured using foreign currency exchange rates at the end of the period, and non-monetary assets are remeasured based on historical exchange rates.\nForeign Currency Contracts Designated as Cash Flow Hedges\nAutodesk uses foreign currency con..."
8,"Autodesk, Inc.","Stock-based compensation expense $ 75.2 $ 88.2 $ 94.0 $ 105.0 $ 362.4\nAmortization of acquisition related intangibles\n19.0 18.3 18.1 18.0 73.4\nAcquisition related costs 12.7 6.0 2.5 2.1 23.3\nRestructuring and other exit costs, net $ 0.2 $ 0.2 $ 0.1 $ - $ 0.5\n____________________\n(1) Net income (loss) per share were computed independently for each of the periods presented; therefore the sum of the net income (loss) per share amount for the quarters may not equal the total for the fiscal..."
9,"Autodesk, Inc.","In December 2018, Autodesk entered into a credit agreement by and among Autodesk, the lenders from time to time party thereto and Citibank, N.A., as agent, which provides for an unsecured revolving loan facility in the aggregate principal amount of $650.0 million with an option, subject to customary conditions, to request an increase in the amount of the credit facility by up to an additional $350.0 million, and is available for working capital or other business needs. The credit agreement c..."


### Note

Uncomment the line `downloaded_file=get_filings(company_stock_ticker)` if you have sec-api security key so that you can download 10-K from SEC. In the meanwhile, comment the line 'downloaded_file="000101872424000008-amzn-20231231.json"`


#### Define tools for financial statements analysis AI agent

In [114]:
from langchain.agents import Tool

annual_report_tools=[
    Tool(
        name="is_financial_statement_related_query",
        func=is_financial_statement_related_query,
        description="""
        Use this tool when you need to know whether user input query is financial statement analysis related query. Human orginal query is the input to this tool. This tool output is whether human input is financial statement analysis related or not. 
        If the query is not finance statement related, please answer \"I am finiancial statement ansysis assitant. I can not answer question which is not finance related.\" and terminate the dialog.
        """
    ),
    Tool(
        name="search_financial_statements_for_company",
        func=search_financial_statements_for_company,
        description="""
        Use this tool to get financial statement of the company. This tool output is company financial statements.
        """
    ),
    Tool(
        name="get_stock_ticker",
        func=get_stock_ticker,
        description="Use this tool when you need to get the company stock ticker. Human orginal query is the input to this tool. This tool will output company stock ticker."
    ),
    Tool(
        name="download_10k_filing_from_sec_and_ingest_into_opensearch",
        func=download_10k_filing_from_sec_and_ingest_into_opensearch,
        description="""
        Use this tool to download company financial statements from internet. Company stock ticker is the input to this tool. The tool output is download succeed or not.
        Use this tool if and only if "search_financial_statements_for_company" output result is empty. After downloading financial statements, you must use "search_financial_statements_for_company" tool to search financial statements again.
        """
    ),
    Tool(
        name="is_stock_related_query",
        func=is_stock_related_query,
        description="Use this tool when you need to know whether user input query is stock related query. Human orginal query is the input to this tool. This tool output is whether human input is stock related or not."
    ),
    Tool(
        name="get_stock_price",
        func=get_stock_price,
        description="""
        Use this tool to get company stock price data. Company stock ticker is the input to this tool. This tool will output company historic stock price. The output includes 'stock_date', 'stock_ticker', 'open_price', 'high_price', 'low_price', 'close_price' of the company in the latest 100 days.
        This tool is mandatory to use if the input query is both finance statement related and stock related. If the output of "get_stock_price" is empty, please answer \"I cannot provide stock analysis without stock price information.\" and terminate the dialog.
        """
    )
]

#### Define prompt for financial statements analysis AI agent 

In [105]:
from langchain_core.prompts import ChatPromptTemplate


system_message = f"""
You are finiancial analyst assistant and you will analyze company financial statements and stock data. 
Leverage the <conversation_history> to avoid duplicating work when answering questions.

Available tools:
<tools>
{{tools}}
</tools>


To answer, first review the <conversation_history>. If insufficient use tool(s) with the following format:
<thinking>Think about which tool(s) to use and why. "get_stock_price" tool is mandatory to use if the input query is both finance statements related and stock related.</thinking>
<tool>tool_name</tool>
<tool_input>input</tool_input>
<observation>response</observation>

When you are done, provide a final answer in markdown within <final_answer></final_answer>.
If <user_input> is stock related and the output of "get_stock_price" tool is empty, respond directly within <final_answer> with the exact content \"I cannot provide stock analysis without stock price information.\".
Otherwise, use the following format to organize your <final_answer>:

Summary:
...

Support points:
Support point 1: ...
Support point 2: ...
Support point 3: ...


"""

user_message = """
Begin!

Previous conversation history:
<conversation_history>
{chat_history}
</conversation_history>

User input message:
<user_input>
{input}
</user_input>

{agent_scratchpad}
"""

# Construct the prompt from the messages
messages = [
    ("system", system_message),
    ("human", user_message),
]

financial_statements_analysis_prompt = ChatPromptTemplate.from_messages(messages)

#### Define memory for financial statements analysis AI agent 

In [115]:
from langchain_community.chat_message_histories import DynamoDBChatMessageHistory
from uuid import uuid4

dynamo = boto3.client('dynamodb')

history_table_name = 'conversation-history-memory'

try:
    response = dynamo.describe_table(TableName=history_table_name)
    print("The table "+history_table_name+" exists")
except dynamo.exceptions.ResourceNotFoundException:
    print("The table "+history_table_name+" does not exist")
    
    dynamo.create_table(
    TableName=history_table_name,
    AttributeDefinitions=[
        {
            'AttributeName': 'SessionId',
            'AttributeType': 'S',
        }
    ],
    KeySchema=[
        {
            'AttributeName': 'SessionId',
            'KeyType': 'HASH',
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5,
    }
    )

    response = dynamo.describe_table(TableName=history_table_name) 
    
    while response["Table"]["TableStatus"] == 'CREATING':
        time.sleep(1)
        print('.', end='')
        response = dynamo.describe_table(TableName=history_table_name) 

    print("\ndynamo DB Table, '"+response['Table']['TableName']+"' is created")



The table conversation-history-memory exists


#### Create financial statements analysis AI agent AI using defined Memory,  LLM, tools and prompt

In [116]:
from langchain.agents import create_xml_agent
from langchain.memory import ConversationBufferMemory
from langchain.agents import AgentExecutor


def create_new_memory_with_session(session_id):
    chat_memory = DynamoDBChatMessageHistory(table_name=history_table_name,session_id=session_id)    
    return chat_memory

def get_agentic_chatbot_conversation_chain(session_id, verbose=True):
    chat_memory=create_new_memory_with_session(session_id)
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        # Change the human_prefix from Human to something else
        # to not conflict with Human keyword in Anthropic Claude model.
        human_prefix="Hu",
        chat_memory=chat_memory,
        return_messages=False)

    agent = create_xml_agent(
        bedrock_llm,
        annual_report_tools,
        financial_statements_analysis_prompt,
        stop_sequence=["</tool_input>", "</final_answer>"]
    )

    agent_chain = AgentExecutor(
        agent=agent,
        tools=annual_report_tools,
        return_intermediate_steps=False,
        verbose=True,
        memory=memory,
        handle_parsing_errors="Check your output and make sure it conforms!"
    )
    return agent_chain

### 3.3 Use financial statements analysis AI agent

#### Example 1

Query is "Compare Adobe and Autodesk company financial statements"

The data flow is like following:

![example 1](./static/example-1-data-flow.png)

In [138]:
question="Compare Adobe and Autodesk company financial statements"
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": question})

[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor] Entering Chain run with input:
[0m{
  "input": "Compare Adobe and Autodesk company financial statements",
  "chat_history": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableAssign<agent_scratchpad>] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableAssign<agent_scratchpad> > chain:RunnableParallel<agent_scratchpad>] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableAssign<agent_scratchpad> > chain:RunnableParallel<agent_scratchpad> > chain:RunnableLambda] Entering Chain run with input:
[0m{
  "input": ""
}
[36;1m[1;3m[chain/end][0m [1m[chain:A

In [139]:
print(response["output"])



Summary:
Adobe and Autodesk are both leading software companies, with Adobe focused on creative and digital media software while Autodesk specializes in design software for architecture, engineering, construction, manufacturing, and media and entertainment industries. Based on their financial statements, here is a comparison of some key metrics:

Revenue:
- Adobe's revenue for fiscal year 2020 was $12.87 billion, an increase of 15% year-over-year. Their revenue is driven by subscription revenue from their Creative Cloud, Document Cloud, and Digital Experience products and services.
- Autodesk's revenue for fiscal year 2021 was $3.79 billion, an increase of 16% year-over-year. Their revenue comes from subscription plans for their AutoCAD and AutoCAD LT, Architecture, Engineering and Construction Collections, Product Design and Manufacturing Collections, and Media and Entertainment Collections.

Profitability:
- Adobe's net income for fiscal 2020 was $4.87 billion, with an operating ma

#### Example 2

Query is "Is Snowfake a good investment choice right now?". 

The data flow is like following:

![example 2](./static/example-2-data-flow.png)

In [140]:
question="Is Snowflake a good investment choice right now?"
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": question})


[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor] Entering Chain run with input:
[0m{
  "input": "Is Snowflake a good investment choice right now?",
  "chat_history": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableAssign<agent_scratchpad>] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableAssign<agent_scratchpad> > chain:RunnableParallel<agent_scratchpad>] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableAssign<agent_scratchpad> > chain:RunnableParallel<agent_scratchpad> > chain:RunnableLambda] Entering Chain run with input:
[0m{
  "input": ""
}
[36;1m[1;3m[chain/end][0m [1m[chain:AgentExe

In [141]:
print(response["output"])



Summary:
Based on the analysis of Snowflake's financial statements and stock performance, Snowflake appears to be a promising investment opportunity, but with some risks to consider given its high valuation and growth expectations priced in.

Support points:

1. Revenue growth: Snowflake has demonstrated impressive revenue growth, with revenue increasing by 101% year-over-year in fiscal 2023. This rapid top-line expansion signals strong product demand and market share gains in the cloud data warehousing space. However, sustaining such high growth rates will become more challenging over time.

2. Profitability: Snowflake is not yet profitable on a GAAP basis, with a net loss of $838 million in fiscal 2023. However, the company is generating positive free cash flow and its operating margins are improving as it scales, indicating progress towards profitability as a public company. Investors will want to see continued margin expansion.

3. Valuation: Snowflake trades at a premium valuati

#### Example 3

Query is "Is Amazon a good investment choice right now?"

The data flow is like following:

![example 3](./static/example-3-data-flow.png)

In [142]:
question="Is Amazon a good investment choice right now?"
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": question})

[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor] Entering Chain run with input:
[0m{
  "input": "Is Amazon a good investment choice right now?",
  "chat_history": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableAssign<agent_scratchpad>] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableAssign<agent_scratchpad> > chain:RunnableParallel<agent_scratchpad>] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableAssign<agent_scratchpad> > chain:RunnableParallel<agent_scratchpad> > chain:RunnableLambda] Entering Chain run with input:
[0m{
  "input": ""
}
[36;1m[1;3m[chain/end][0m [1m[chain:AgentExecut

In [143]:
print(response["output"])



Summary:
Based on the available information, I cannot provide a comprehensive analysis on whether Amazon is a good investment choice right now. The key missing piece is Amazon's current stock price data, which is essential for evaluating the company's valuation and potential as an investment. Without stock price information, it is difficult to assess factors like the price-to-earnings ratio, market capitalization, and recent stock performance trends that would inform an investment decision.

Support points:

1. The query "Is Amazon a good investment choice right now?" requires analyzing both the company's financial statements and stock data to make an informed assessment. While I was able to retrieve Amazon's financial statements, the tool to get the company's stock price data returned an empty result.

2. Stock price data is crucial for investment analysis as it provides insights into the market's current valuation of the company, recent price trends, trading volume, and other metri