# Generative AI-powered search with Amazon OpenSearch Service

---
### Usage Scenario
[Form 10-K](https://www.sec.gov/files/form10-k.pdf) is an annual report that public companies in the United States are required to file with the Securities and Exchange Commission (SEC). This comprehensive report provides detailed information about the company's financial performance over the past year, including the company's history and organizational structure, its financial statements such as the balance sheet, income statement, and cash flow statement, metrics like earnings per share, information on the company's subsidiaries, details on executive compensation, and any other material data about the company's operations and financial condition.

The SEC requires all publicly traded companies to regularly file 10-K reports in order to keep investors informed about the company's financial condition. This allows investors to have sufficient information before making decisions to buy or sell the company's securities. While the 10-K may appear overly complex at first, with its many tables of data and figures, this level of comprehensive detail is critical for investors to properly understand the company's financial position and future prospects.

The Form 10-K consists of the following items:

| Item | Description |
| ---- | ----------- |
|1|Business (This describes the company's operations.)|
|1A| Risk Factors |
|1B| Unresolved Staff Comments |
|1C| Cybersecurity |
|2| Properties |
|3| Legal Proceedings |
|4| Mine Safety Disclosures (if appropriate) |
|5| Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities |
|6| Selected Financial Data (prior to February 2021) |
|7| Management’s Discussion and Analysis of Financial Condition and Results of Operations |
|7A| Quantitative and Qualitative Disclosures About Market Risk |
|8| Financial Statements and Supplementary Data |
|9| Changes in and Disagreements With Accountants on Accounting and Financial Disclosure |
|9A| Controls and Procedures |
|9B| Other Information |
|9C| Disclosure Regarding Foreign Jurisdictions that Prevent Inspections |
|10| Directors, Executive Officers and Corporate Governance |
|11| Executive Compensation |
|12| Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters |
|13| Certain Relationships and Related Transactions, and Director Independence |
|14| Principal Accountant Fees and Services |
|15| Exhibits and Financial Statement Schedules |
|16| Form 10–K Summary (optional) |


Many investors rely on SEC filings, such as the 10-K report, to analyze the financial health of a company. These filings can be a treasure trove of valuable information. However, searching for specific data within these comprehensive documents can be challenging. Keyword-based searches may return some irrelevant information, and even semantic search methods can still lead to an overwhelming amount of data. Can we leverage generative AI to help us interpret company financial statements?


In this Code Talk, we will demonstrate how to modernize your search application to improve the relevance of search results using Amazon OpenSearch Service, a managed search and analytics service. We will also explore how generative AI can make the search experience more efficient and effective for users. The code covers the following topics:
- Comparison of search relevance between keyword-based search and semantic search using Amazon OpenSearch
- Leveraging Retrieval Augmented Generation (RAG), a generative AI approach, to improve search productivity
- Building an intelligent agent that can orchestrate and execute multi-step tasks to automate the analysis of 10-K financial filings
- Best practices for utilizing the vector store capabilities within the OpenSearch platform to power advanced search and analysis solutions

---


### Code Structure

The code includes the following sections:
- [Initialize the Notebook](#Initialize-the-Notebook)
- [Part 1: Ingest unstructured data into OpenSearch](#Part-1:-Ingest-unstructured-data-into-OpenSearch)
- [Part 2: A different approach to search](#Part-2:-A-different-appoach-to-search)
    - [2.1 Keyword search](#2.1-Keyword-search)
    - [2.2 Vector/Semantic search](#2.2-Vector/Semantic-search)
    - [2.3 Retrieval Augmented Generation(RAG)](#2.3-Retrieval-Augmented-Generation(RAG))
- [Part 3: AI agent powered search](#Part-3:-AI-agent-powered-search)
    - [3.1 Prepare other tools used by AI agent](#3.1-Prepare-other-tools-used-by-AI-agent)
        - [3.1.1 Ingest and query structured data in Redshift](#3.1.1-Ingest-and-query-structured-data-in-Amazon-Redshift)
        - [3.1.2 Download SEC 10-K filing from SEC-API.io](#3.1.2-Download-SEC-10-K-filing-from-SEC-API.IO)
    - [3.2 Create AI agent](#3.2-Create-an-AI-agent)
    - [3.3 Use the financial statements analysis AI agent](#3.3-Use-the-financial-statements-analysis-AI-agent)


## Initialize the Notebook




### Install Python libraries (and dependencies) for OpenSearch, Redshift and LangChain

Install the following:
- [opensearch-py](https://docs.opensearch.org/docs/latest/clients/python-low-level/) - The OpenSearch low-level Python client, called opensearch-py, provides a set of wrapper methods that allow you to interact with your OpenSearch cluster more easily in Python. Instead of having to manually send raw HTTP requests to specific URLs, you can create an OpenSearch client instance for your cluster and then call the built-in functions provided by the client library. This makes working with the OpenSearch REST API much more natural and straightforward when using Python.
- [PyTorch](https://pytorch.org/) - a Python package that provides two high-level features: tensor computation similar to NumPy, with strong GPU acceleration; and deep neural networks built on a tape-based autograd system.
- [requests-aws4auth](https://github.com/tedder/requests-aws4auth) - AWS authentication for the Python Requests library
- [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) - AWS SDK for Python, which allows developers to write software that makes use of AWS Services
- [SQLAlchemy](https://www.sqlalchemy.org/) - an open-source SQL toolkit for Python that helps applications interact with relational databases. It makes it simpler for developers to create, update, and query database tables from within their Python applications.
- [Amazon Redshift Python connector](https://docs.aws.amazon.com/redshift/latest/mgmt/python-driver-install.html) - a connector for working with Amazon Redshift from Python that integrates with the AWS SDK for Python (Boto3), pandas, and Numerical Python (NumPy).
- [iPython-SQL](https://github.com/catherinedevlin/ipython-sql) - connect to a database, using SQLAlchemy URL connect strings, and issue standard SQL commands within a Jupyter Notebook.
- [LangChain](https://python.langchain.com) - a framework that helps developers build applications powered by large language models (LLMs). It provides a set of tools and abstractions that simplify the process of integrating LLMs into various types of applications and workflows. See [What is LangChain?](https://aws.amazon.com/what-is/langchain/)

In [298]:
%pip install -q opensearch-py
%pip install -q torch
%pip install -q requests-aws4auth
%pip install -q boto3
%pip install -q sqlalchemy
%pip install -q sqlalchemy-redshift
%pip install -q redshift_connector
%pip install -q ipython-sql==0.4.1
%pip install -q langchain==0.3.1
%pip install -q langchain-aws==0.2.1
%pip install -q langchain-community==0.3.1
%pip install -q sec-api
print("Done installing dependencies.")


Note: you may need to restart the kernel to use updated packages.
Done installing dependencies.


### Import modules and packages



In [311]:
# Optionally suppress various warnings and INFO messages during the demonstration.
# Uncomment the two lines below to silence them; leave them commented out to see full messaging.

import warnings
import logging

#warnings.filterwarnings('ignore')
#logging.getLogger().setLevel(logging.CRITICAL)

In [425]:
from IPython.display import HTML, JSON, Markdown, Latex
from langchain_aws import BedrockEmbeddings, BedrockLLM, ChatBedrock
from langchain_core.prompts import ChatPromptTemplate
from langchain.agents import AgentExecutor, create_xml_agent, Tool
from langchain.chains import LLMChain, RetrievalQA
from langchain_community.chat_message_histories import DynamoDBChatMessageHistory
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.document_loaders.base import BaseLoader
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import OpenSearchVectorSearch
from opensearchpy import helpers, OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
from requests_aws4auth import AWS4Auth
from sagemaker.session import Session
from sec_api import ExtractorApi, QueryApi
from sqlalchemy.engine.url import URL
# Note: this import shadows the sagemaker `Session` imported above
from sqlalchemy.orm import Session
from typing import Any, Dict, Callable, List, Optional, Sequence
from uuid import uuid4
import boto3
import json
import langchain
import os
import pandas as pd
import re
import sagemaker
import sqlalchemy as sa
import sys
import time
import uuid

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## Part 1: Ingest unstructured data into OpenSearch

### Download a sample of SEC 10-K filing documents

Let's get a collection of SEC 10-K financial reports from various companies. (We have already obtained a collection of these reports, put them into an Amazon S3 bucket and converted the HTML (hypertext markup) versions into JSON format, which makes them easier to work with.)

The demonstration files in the Amazon S3 bucket are from 2020 and 2021. However, you can retrieve more recent 10-K filings directly from the [SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system](https://www.sec.gov/search-filings).

> A Form 10-K is public information. Companies that are publicly traded in the United States are required by law to file a 10-K annual report with the Securities and Exchange Commission (SEC). These 10-K filings, containing detailed financial and operational information about the companies, are then made publicly available by the SEC.

In [None]:
!wget -q https://ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com/df655552-1e61-4a6b-9dc4-c03eb94c6f75/10k-financial-filing.zip
!unzip -o -q 10k-financial-filing.zip

### Load the documents into OpenSearch
Amazon OpenSearch Service can dynamically infer the data types of the fields in your documents, so you can index them without defining a schema first. Once the text is indexed, it supports flexible full-text search, including tolerance for misspellings and variations in the search terms (known as "fuzziness" or "typo tolerance").


We will obtain the Amazon OpenSearch Serverless endpoint from the "Outputs" section of the CloudFormation Stack that we previously deployed.

>If you do not see the OpenSearch Serverless endpoint listed in the Outputs, you may need to relaunch the CloudFormation Stack deployment from [here](https://github.com/aws-samples/semantic-search-with-amazon-opensearch/tree/generative-ai-powered-search).

In [301]:
cfn = boto3.client('cloudformation')
sec = boto3.client('secretsmanager')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

# Set up variables to use for the rest of the demo
from botocore.exceptions import ClientError

cloudformation_stack_name = "generative-ai-powered-search"

try:
    outputs = get_cfn_outputs(cloudformation_stack_name)
    aoss_endpoint = outputs['OpenSearchServerlessCollectionEndpoint']
    aoss_host = aoss_endpoint.split("//")[1]
    print(f"Amazon OpenSearch Serverless endpoint: {aoss_endpoint}.")
except (KeyError, ClientError):
    # KeyError: the output key is missing; ClientError: the stack could not be described
    print(f"Unable to locate the Amazon OpenSearch Serverless endpoint. It could be that the stackname {cloudformation_stack_name} was incorrect, or you may need to redeploy the solution from https://github.com/aws-samples/semantic-search-with-amazon-opensearch/tree/generative-ai-powered-search", file=sys.stderr)

Amazon OpenSearch Serverless endpoint: https://xprka3489ktlkf2hjari.us-east-1.aoss.amazonaws.com.


Next, let's create an OpenSearch client (a wrapper through opensearch-py) that will connect to our Amazon OpenSearch Serverless endpoint. We will use this client for the remainder of this workshop.

In [None]:
service = "aoss"
aws_region = boto3.Session().region_name
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, aws_region, service)

aos_client = OpenSearch(
    hosts = [{"host": aoss_host, "port": 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

Let's now create a new index in OpenSearch with the name `10k_financial_raw`. (If you have already created this index previously, go ahead and delete the existing index before creating a new one.)

By simply creating an empty index in OpenSearch, the index schema will be dynamically extended as the system encounters new attributes in the data being indexed. This allows for flexible handling of the varying content found in the 10-K financial reports. See [Dynamic Mapping in Mappings and fields, in the OpenSearch documentation](https://docs.opensearch.org/docs/latest/field-types/#dynamic-mapping).
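
To see what dynamic mapping produced once documents have been indexed, one option is to fetch the index mapping and flatten it into a field-to-type table. The sketch below is illustrative only: `flatten_mapping` is a hypothetical helper, and `sample_mapping` is a hand-written stand-in for a real response, not actual output from this dataset.

```python
def flatten_mapping(properties, prefix=""):
    """Return {field_path: type} for a mapping 'properties' dict, including sub-fields."""
    fields = {}
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "type" in spec:
            fields[path] = spec["type"]
        # Multi-fields such as `state_location.keyword` are listed under "fields".
        for sub_name, sub_spec in spec.get("fields", {}).items():
            fields[f"{path}.{sub_name}"] = sub_spec["type"]
        # Object fields nest further properties.
        if "properties" in spec:
            fields.update(flatten_mapping(spec["properties"], prefix=f"{path}."))
    return fields


# Hand-written example shaped like a get-mapping response (hypothetical values).
sample_mapping = {
    "10k_financial_raw": {
        "mappings": {
            "properties": {
                "company": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                },
                "state_location": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                },
            }
        }
    }
}

fields = flatten_mapping(sample_mapping["10k_financial_raw"]["mappings"]["properties"])
for path, field_type in sorted(fields.items()):
    print(f"{path}: {field_type}")
```

With a live client, `aos_client.indices.get_mapping(index=raw_index_name)` should return a dictionary of this shape that can be passed to the helper in the same way.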

In [None]:
raw_index_name="10k_financial_raw"

from opensearchpy.exceptions import NotFoundError

try:
    aos_client.indices.get(index=raw_index_name)
    # The index already exists, so delete it
    print("Deleting existing index before creating new one.")
    aos_client.indices.delete(index=raw_index_name)
except NotFoundError:
    # Only a missing index is expected here; other errors should surface
    print("Index does not currently exist.")

aos_client.indices.create(index=raw_index_name,ignore=400)
print("Index created.")

Now we will load the JSON versions of the 10-K financial reports into the index we just created. To add multiple documents to the index efficiently, we will use OpenSearch's Bulk API. This allows us to send a single request to OpenSearch containing batches of documents we want to index, rather than sending them one by one.
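
As a rough sketch of what the bulk helper consumes: each action is a dictionary holding the document fields plus metadata keys such as `_index`. The helper names `make_bulk_actions` and `batched` and the sample documents below are hypothetical and exist only for illustration.

```python
def make_bulk_actions(docs, index_name):
    """Yield one bulk action per document, tagging each with its target index."""
    for doc in docs:
        # Build a new dict so the caller's documents are not modified.
        yield {"_index": index_name, **doc}


def batched(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up documents standing in for the parsed 10-K JSON files.
docs = [{"company": f"Example Co {i}"} for i in range(5)]
actions = list(make_bulk_actions(docs, "10k_financial_raw"))
batches = list(batched(actions, 2))

for batch in batches:
    print(f"Would send {len(batch)} documents in one bulk request.")
```

Each batch could then be passed to `helpers.bulk(aos_client, batch)`. The helper also accepts a `chunk_size` argument if you prefer to let it split a long list of actions into requests itself.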

In [None]:
# Set the directory path to where we unzipped/extracted the 10-K filings
directory_path = "extracted"
batch_size = 50
# Initialize a list to store the documents
documents = []

# Iterate through the files in the directory
for filename in os.listdir(directory_path):
    file_path = os.path.join(directory_path, filename)

    # Read the file contents
    with open(file_path, 'r') as file:
        file_contents = file.read()
        docJSON = json.loads(file_contents)
        docJSON["_index"] = raw_index_name
        documents.append(docJSON)
    
    # If the batch size is reached, index the documents
    if len(documents) == batch_size:
        aos_response= helpers.bulk(aos_client, documents)
        print(f"Indexed {len(documents)} documents in a batch.")
        documents = []

# Index the remaining documents
if documents:
    aos_response= helpers.bulk(aos_client, documents)
    print(f"Indexed {len(documents)} additional documents.")

print("Done loading SEC 10-K filing documents.")

We can check the total number of documents that have been indexed into our OpenSearch index by performing a search across all the documents.

>We should wait 1-2 minutes before checking the document count to allow time for the OpenSearch shards to fully refresh and provide a consistent result. This ensures we get an accurate count of the documents that have been successfully indexed.

We will print out the number of hits, or matching documents, from the OpenSearch response:
```JSON
{
  "hits": {
    "total": {
      "value": 191,
      "relation": "eq"
    }
  }
}
```

In [None]:
res = aos_client.search(index=raw_index_name, body={"size": 0, "query": { "match_all": {}}})

print("Search matching all returned %d documents." % res['hits']['total']['value'])

## Part 2: A different appoach to search

### 2.1 Keyword search
---
Keyword search is a technique for finding information within a large body of textual data. The user provides a "query" - a term or set of words they are looking for. The search engine then uses various methods to analyze the text and identify the relevant information.

The process involves tokenization, where the text is split into individual words or "tokens". The search engine then scores the results based on factors like:

- How many times the search term appears in each document
- How common the search term is across the entire dataset
- The proximity of the search terms within the document
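
The first two factors are what the BM25 ranking function, OpenSearch's default similarity, combines. The sketch below is a textbook version of BM25 over a few made-up sentences, assuming a naive whitespace tokenizer; the production implementation differs in details such as text analysis and length encoding, and this sketch does not model term proximity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with the textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for t in tokenized for term in set(t))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Terms that are rare across the dataset get a higher weight (IDF).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            tf = term_freq[term]
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


# Made-up example documents.
docs = [
    "adobe reports subscription revenue growth",
    "the company leases office properties",
    "revenue revenue revenue from travel bookings",
]
scores = bm25_scores("adobe revenue", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

In this toy example the first document ranks highest because it matches both query terms, including the rarer one. Repeating "revenue" three times in the last document helps only up to a point, because the term-frequency component saturates.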

The search engine can also handle "fuzzy" matches. This allows users to find relevant information even when a search term is misspelled or differs from the indexed term by a few characters.

The keyword search performs well for exact matches as well. Overall, it provides a flexible way for users to find the information they are looking for within a large amount of textual data.
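
As an illustration of fuzzy matching, a `match` query can carry a `fuzziness` setting so that small spelling differences still match. The helper below is a hypothetical convenience function; the field name `item_1` is taken from the queries used later in this notebook, and the misspelling is deliberate.

```python
def fuzzy_match_query(field, text, fuzziness="AUTO"):
    """Build a match query body that tolerates small spelling differences."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,
                }
            }
        }
    }


# "travle" is a deliberate misspelling of "travel".
query = fuzzy_match_query("item_1", "travle")
print(query)
```

The resulting body could be passed to `aos_client.search(index=raw_index_name, body=query)`. Fuzzy matching is more expensive than an exact match, so it is best reserved for free-text fields.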

Let's search through our collection of business filing data to find companies that are located in the state of Illinois (`state_location.keyword` = "IL"), by constructing an OpenSearch query:

In [None]:
query = {
    "_source" : ["company", "filing_date", "state_location"],
    "query": {
        "bool" :{
            "filter" : [{
                    "match" :{
                        "state_location.keyword" : "IL"
                    }
                }]
        }
    }
}

#### Query breakdown

- `_source`: This specifies the fields that should be returned in the search results. In this case, it will return the "company", "filing_date", and "state_location" fields.
- `query`: This defines the actual search query.
- `bool`: This is a [compound query](https://docs.opensearch.org/docs/latest/query-dsl/compound/index/) that combines multiple individual queries. In this case, it's using a "filter" to narrow down the results.
- `filter`: This applies an additional [filter](https://docs.opensearch.org/docs/latest/search-plugins/filter-search/) to the results. It will only return documents that match the criteria defined in the filter.
- `match`: This is a type of query that looks for documents where the specified field matches the given value.
- `state_location.keyword`: This refers to the "keyword" version of the "state_location" field. OpenSearch has both "text" and "keyword" field types, and the "keyword" type is often used for exact matches. See [Keyword Search in the OpenSearch documentation](https://docs.opensearch.org/docs/latest/search-plugins/keyword-search/) for details.

>It is worth noting that we are using a relatively inexpensive query, avoiding more expensive types like `fuzzy`, `prefix`, `range`, `regexp`, `wildcard` or `query_string` because the two-letter state abbreviations should be well-constrained.

In [None]:
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df[["company","filing_date","state_location"]])

> We are using the Pandas DataFrame object to make the results from our OpenSearch query more easily viewable and manageable within the Jupyter Notebook environment. This technique is commonly used when working with large datasets retrieved from OpenSearch, as it allows for more convenient manipulation and analysis of the data using Python.
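
Because the same hit-to-row conversion recurs for each query, it could be factored into a small helper. This is only a sketch: `hits_to_rows` is a hypothetical function, and `sample_response` is a hand-written stand-in for a search response rather than real data.

```python
def hits_to_rows(response, source_fields, highlight_field=None):
    """Flatten search hits into a list of dictionaries, one per hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        row = {"_id": hit["_id"], "_score": hit["_score"]}
        for field in source_fields:
            # Use .get so a missing field yields None instead of raising KeyError.
            row[field] = hit["_source"].get(field)
        if highlight_field is not None:
            fragments = hit.get("highlight", {}).get(highlight_field, [])
            row[f"{highlight_field}_highlight"] = fragments[0] if fragments else None
        rows.append(row)
    return rows


# Hand-written response with made-up values, shaped like a search result.
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "1", "_score": 0.0,
             "_source": {"company": "Example Corp", "state_location": "IL"}},
            {"_id": "2", "_score": 0.0,
             "_source": {"company": "Sample Inc", "state_location": "IL"},
             "highlight": {"item_1": ["offers <em>travel</em> services"]}},
        ],
    }
}

rows = hits_to_rows(sample_response, ["company", "state_location"], highlight_field="item_1")
for row in rows:
    print(row)
```

The returned list can be passed directly to `pd.DataFrame(rows)`, which takes the column names from the dictionary keys.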

In addition, we can perform a search that spans multiple fields, and have the search engine highlight the matching parts within the results returned.

In [None]:
query = {
    "_source" : ["company", "filing_date", "state_location", "item_1"],
    "query": {
        "bool": {
            "must" :[
                {
                    "multi_match" :{
                        "query" : "travel",
                        "fields" :["item_1"]
                    }
                }
            ]
        }
    },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}

#### Query breakdown

- `bool`: For this compound query, we use a "must" clause, which means all the conditions inside it must be met.
- `multi_match`: This is a type of query that looks for matches across multiple fields. In this case, it searches for the term "travel" in the "item_1" field only. In the SEC 10-K filing, Item 1 is the "Business" section, which gives investors and regulators an overview of the company's core business operations, competitive landscape, and any significant regulatory or operational factors that could impact the business. Notice that "item_1" has been added to our `_source` as well.
- `highlight`: This section specifies how the search results should be highlighted.
  - `pre_tags` and `post_tags`: These define the HTML tags that will be used to wrap the highlighted text, in this case, `<em>` and `</em>`.
  - `fields`: This specifies which fields should be highlighted. In this case, it's just the "item_1" field.

In [None]:
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
print("Search returned %d documents." % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location'],hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location", "item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
HTML(query_result_df[["company","filing_date","state_location", "item_1_highlight"]].rename(columns={"company": "Company", "filing_date":"Filing Date","item_1_highlight":"10-K Form Item 1 with Highlighted term 'travel'"}).to_html(escape=False))

>Note: Some of the column names in the search results have been renamed to make the information easier to read.

We can also perform a phrase match, where the search terms must appear together in the specific order that we provide. In this case, we are looking for companies that mention the phrase "machine learning" in their business description or narrative.

We can add additional filters to further narrow down the search results. For example, we could filter the results to only include companies that are located in the state of California. An example of this type of filtered search is shown below:

In [None]:
query = {
    "_source" : ["company", "filing_date", "state_location", "item_1"],
    "query": {
        "bool": {
            "must" :[
                {
                    "match_phrase" :{
                        "item_1" : "machine learning"
                    }
                }
            ],
            "filter" :[
                {
                    "term": {
                        "state_location.keyword" : "CA"
                    }
                }
            ]
        }
    },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}

In [None]:
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
print("Search returned %d documents." % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['filing_date'],hit['_source']['state_location'],hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","filing_date","state_location", "item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
HTML(query_result_df[["company","filing_date","state_location", "item_1_highlight"]].rename(columns={"company": "Company Name","filing_date":"Filing Date","state_location":"State","item_1_highlight":"10-K Form Item 1 with Highlighted Terms"}).to_html(escape=False))

The full text search functionality works well with structured data, providing features like fuzzy matching, typo tolerance, and proximity searching. It can also highlight the matching text in the search results.

However, when searching unstructured natural language data, a pure keyword-based search approach may not always yield the most relevant results. It can sometimes return a long tail of less relevant documents.

Let's run a natural language query and examine the search results. We will likely see that some irrelevant documents get returned.

In [None]:
query_text="What are the operating expenses of Adobe?"
query={
  "size": 10,
  "query": {
    "match": {
      "item_1": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_*" : {}
    }
  }
}
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['item_1'], hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","item_1","item_1_highlight"])
pd.set_option('display.max_colwidth', 500)

HTML(query_result_df[["_score","company", "item_1_highlight"]].rename(columns={"_score": "Relevance Score","company":"Company Name","item_1_highlight":"10-K Form Item 1 with Highlighted Terms"}).to_html(escape=False))

Let's run a second natural language query and check the search results.

In [None]:
# Let's run another example query
query_text="What is Adobe's main revenue source?"
query={
  "size": 10,
  "query": {
    "match": {
      "item_1": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_1" : {}
    }
  }
}
res = aos_client.search(index=raw_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company'],hit['_source']['item_1'], hit['highlight']['item_1'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company","item_1","item_1_highlight"])
pd.set_option('display.max_colwidth', 500)
HTML(query_result_df[["_score","company", "item_1_highlight"]].rename(columns={"_score": "Relevance Score","company":"Company Name","item_1_highlight":"10-K Form Item 1 with Highlighted Terms"}).to_html(escape=False))

# You should notice that a few results are not relevant to the query.

The search results you see here are generated using term statistics and ranking functions. These techniques estimate the relevance of each document based on probabilistic retrieval frameworks applied across the large text dataset we are searching.

![Keyword Search](./static/keyword-search-flow.drawio.svg)

The irrelevant results we're seeing from our more natural language-based queries provide a good opportunity to explore the benefits of vector or semantic search approaches.

### 2.2 Vector/Semantic search

---
In a vector search approach, both the documents and the search queries are represented as high-dimensional numerical vectors, rather than just as raw text strings.

The underlying principle behind vector search is that semantically similar documents or queries can be mapped to vectors that are positioned close together within the vector space. This is the case even if the actual textual content of the documents and queries doesn't have much lexical (word-level) overlap.

Here's a bit more detail on how vector search works:

1. Document Indexing
  - The search engine takes the text content of each document and converts it into a high-dimensional vector.
  - This vector representation encodes the semantic meaning of the document's content, based on techniques like word embeddings.
  - The vector for each document is then stored in the search index.
  
2. Query Formulation
  - When a user submits a search query, the search engine converts that query text into a vector as well.
  - The query vector represents the semantic meaning that the user is searching for.

3. Relevance Ranking
  - The search engine compares the query vector to all the document vectors in the index.
  - It calculates the "distance" or similarity between the query vector and each document vector.
  - Documents with vectors that are closer (more similar) to the query vector are ranked as more relevant.
  
4. Result Retrieval
  - The search engine returns the most relevant documents, based on the vector similarity scores.
  - Even if the query terms don't exactly match the document text, semantically similar documents can still be surfaced.
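
The ranking step above can be sketched with cosine similarity over a handful of vectors. The three-dimensional vectors and document names below are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a vector engine typically uses an approximate nearest-neighbor index rather than the brute-force comparison shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, doc_vectors, top_k=2):
    """Return the `top_k` (doc_id, similarity) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up document vectors; in practice these come from an embedding model.
doc_vectors = {
    "doc_revenue": [0.9, 0.1, 0.0],
    "doc_expenses": [0.7, 0.6, 0.1],
    "doc_properties": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

results = rank_by_similarity(query_vector, doc_vectors)
for doc_id, similarity in results:
    print(f"{doc_id}: {similarity:.3f}")
```

Documents whose vectors point in nearly the same direction as the query vector rank first, regardless of whether they share any literal words with the query.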


The key advantage of vector search is its ability to uncover semantically relevant content that would be missed by traditional text-based keyword searches. This makes vector search particularly useful for tasks like question answering, e-commerce product search, and finding similar documents or images.

The vector search approach relies on underlying machine learning models that need to be trained on large datasets. However, this investment upfront can pay off by significantly improving the quality and relevance of the search results, compared to search methods that only look for exact lexical (word-level) matches.

![Semantic Search](./static/semantic-search-flow.drawio.svg)

---


![Semantic Search Architecture](./static/semantic-search-architecture.png)



**Embeddings** are numerical representations of data, typically used to convert complex, high-dimensional data into a lower-dimensional space where similar data points are closer together. In the context of natural language processing (NLP), embeddings are used to represent words, phrases, or sentences as vectors of real numbers. These vectors capture semantic relationships, meaning that words with similar meanings are represented by vectors that are close together in the embedding space.

**Embedding models** are machine learning models that are trained to create these numerical representations. They learn to encode various types of data into embeddings that capture the essential characteristics and relationships within the data. For example, in NLP, embedding models like Word2Vec, GloVe, and BERT are trained on large text corpora to produce word embeddings. These embeddings can then be used for various downstream tasks, such as text classification, sentiment analysis, or machine translation. In this case we'll be using it for semantic similarity.

We use an embedding model to convert the user's question into a vector. We then compare that question vector with the vectors representing the 10-K data, and treat the closest vectors as the semantically similar content.
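
To make the comparison concrete, here is a minimal sketch of ranking documents by cosine similarity. The three-dimensional vectors and document names are made-up toy values (real Titan embeddings have 1024 dimensions), and OpenSearch performs this comparison for us with approximate nearest-neighbor search rather than the exhaustive loop shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vector, doc_vectors):
    """Return (doc_id, score) pairs, most similar first."""
    scores = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy "embeddings" for two 10-K passages and one question
doc_vectors = {
    "operating_expenses": [0.9, 0.1, 0.0],
    "legal_proceedings": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(query_vector, doc_vectors)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

The passage whose vector points in nearly the same direction as the question vector is ranked first, even though no keywords are compared.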

<!-- ![Convert Text to Vector](./static/text2vector.png) -->

---

![opensearch vector store](./static/opensearch-vector-store.png)


In [None]:
# Specify the path to the folder containing the JSON files
folder_path = "extracted"

# Initialize an empty list to store list of company 10-K filing file names
company_filing_file_name_list = []

# For this session, we ingest filings for only a few companies.
company_list=["Zoom Video Communications, Inc.",
              "MICROSTRATEGY Inc", 
              "PagerDuty, Inc", 
              "Unity Software Inc.", 
              "Autodesk, Inc.",
              "ADOBE INC.",
              "DOCUSIGN, INC.",
              "Okta, Inc.",
              "Datadog, Inc.",
              "INTUIT INC",
              "AUTOMATIC DATA PROCESSING INC",
              "SALESFORCE.COM, INC.", 
              "BOX INC",
              "Asana, Inc", 
              "Palantir Technologies Inc."
             ]

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        df = pd.DataFrame([pd.read_json(file_path,typ='series')])
        if df.iloc[0]['company'] in company_list:
            company_filing_file_name_list.append(file_path)
            print(f"Using file {file_path} for company {df.iloc[0]['company']}")

#### Prepare to use the Amazon Titan embeddings model within Amazon Bedrock
Before proceeding, it's important to verify that you have requested and been granted access to the Titan Text Embeddings V2 model (or an alternative model, if you have chosen to use a different one). If you do not have the necessary access permissions, you will likely encounter an error when attempting to ingest the data later on.

In [None]:
aws_region = boto3.Session().region_name

boto3_bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=f"https://bedrock-runtime.{aws_region}.amazonaws.com")
bedrock_embeddings = BedrockEmbeddings(model_id='amazon.titan-embed-text-v2:0',client=boto3_bedrock)
#bedrock_embeddings = BedrockEmbeddings(model_id='cohere.embed-multilingual-v3',client=boto3_bedrock)

### Create an index in the Amazon OpenSearch Service collection

The [OpenSearch k-NN (k-nearest neighbors) plugin](https://docs.opensearch.org/docs/latest/field-types/supported-field-types/knn-vector/) introduces a custom data type called 'knn_vector'. This allows users to ingest their k-NN vector representations into an OpenSearch index. Once the vectors are indexed, the plugin enables users to perform various types of k-NN search operations on the data. 

<!-- ---
#### OpenSearch Approximate Nearest Neighbor Algorithms and Engines
![ANN algorithm](./static/ann-algorithm.png)

--- -->

<!-- #### HNSW parameter tuning
![hnsw parameter tuning](./static/hnsw-parameter-tuning.png) -->

<!-- ---

#### IVF parameter tuning
![ivf parameter tuning](./static/ivf-parameter-tuning.png)

#### How to select the engine and algorithms
![opensearch ann comparison](./static/opensearch-ann-selection.png)  -->

In [None]:
knn_index = {
    "settings": {
        "index.knn": True
    },
    "mappings": {
        "properties": {
            "item_vector": {
                "type": "knn_vector",
                "dimension": 1024
            },
            "item_content": {
                "type": "text",
                "store": True
            },
            "company_name": {
                "type": "text",
                "store": True
            }
        }
    }
}

#### Index mapping for k-NN breakdown

- `settings`: This section defines the index-level settings.
  - "index.knn": This setting is set to **True**, which enables the k-NN functionality for this index.
- `mappings`: This section defines the field mappings for the index.
  - `properties`: This defines the individual fields and their configurations.
    - "item_vector": This is a field that will store the k-NN vector representations.
      - `type`: The field type is set to `knn_vector`, which is a custom data type introduced by the OpenSearch k-NN plugin.
      - `dimension`: The vector dimension is set to **1024**, meaning each vector will have 1024 elements. This matches the default output size of the Titan Text Embeddings V2 model used here.
    - "item_content": This is a text field that will store the actual content of the items.
      - `type`: The field type is set to `text`.
      - `store`: This is set to **True**, which means the field content will be stored and can be retrieved.
    - "company_name": This is another text field that will store the company name.
      - `type`: The field type is set to `text`.
      - `store`: This is also set to **True**, allowing the company name to be retrieved.
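
The mapping above relies on the k-NN plugin's defaults for the search algorithm. As a hedged sketch, the helper below builds the same index body but spells out an HNSW `method` block. The helper name `build_knn_index_body` is hypothetical, and the engine, space type, and parameter values are illustrative assumptions, not the settings this workshop uses. Check the k-NN documentation for the combinations your OpenSearch version supports.

```python
def build_knn_index_body(dimension, space_type="l2", ef_construction=128, m=16):
    """Build an index body with an explicit HNSW method (illustrative values)."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "item_vector": {
                    "type": "knn_vector",
                    # Must equal the length of the vectors the embedding model returns
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",         # assumption: engine choice
                        "space_type": space_type,  # assumption: distance metric
                        "parameters": {"ef_construction": ef_construction, "m": m},
                    },
                },
                "item_content": {"type": "text", "store": True},
                "company_name": {"type": "text", "store": True},
            }
        },
    }

explicit_knn_index = build_knn_index_body(1024)
print(explicit_knn_index["mappings"]["properties"]["item_vector"]["method"])
```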

Using the above index definition, we can now create the index in Amazon OpenSearch:

In [None]:
vector_index_name="10k_financial_semantic"

try:
    aos_client.indices.get(index=vector_index_name)
    # The index already exists, delete it
    print("Deleting existing index before creating new one.")
    aos_client.indices.delete(index=vector_index_name)
except Exception as e:
    print("Index does not currently exist.")
    
aos_client.indices.create(index=vector_index_name,body=knn_index,ignore=400)
print("Index created.")

In [None]:
vector_index_name="10k_financial_semantic"
print(json.dumps(aos_client.indices.get(index=vector_index_name), indent=4))
# you can verify the mappings

###  Load the raw data into the Index
Over the next few cells we will load the raw data into the index we have just created.

We start by creating a pandas DataFrame loader helper class:

In [None]:
class PandasDataFrameLoader(BaseLoader):
    
    def __init__(self,dataframe:pd.DataFrame):
        self.dataframe=dataframe
        
    def load(self) -> List[Document]:
        docs = []
        items=["item_1","item_1A","item_1B","item_2","item_3","item_4","item_5","item_6","item_7","item_7A","item_8","item_9","item_9A", "item_9B", "item_10", "item_11", "item_12", "item_13", "item_14", "item_15"]
        
        for index, row in self.dataframe.iterrows():
            # Metadata shared by every item of this filing; add more fields as needed
            base_metadata = {
                "cik": row['cik'],
                "company_name": row['company'],
                "filing_date": row['filing_date'],
            }
            for item in items:
                content = row[item]
                # Build a separate dict per document so each keeps its own 'item' label
                metadata = {**base_metadata, "item": item}
                doc = Document(page_content=content, metadata=metadata)
                docs.append(doc)
        return docs

We will then create a function that does the following for each company of interest:
- uses `RecursiveCharacterTextSplitter` to split documents into 8,000 character chunks, with a 200 character overlap between each chunk. The `chunk_size` parameter sets the maximum size (in characters) for each text chunk; `chunk_overlap` specifies the number of characters that should overlap between adjacent chunks. This helps ensure that context is preserved across chunk boundaries.
- sends the split documents to Bedrock to be converted into embedding vectors
- adds the company name, the document text, and the vectors to the index using an OpenSearch bulk operation
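
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch that cuts text at fixed character offsets. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences); the function name `chunk_text` and the tiny sizes are for illustration only.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

# Tiny values so the overlap is easy to see
example_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

Each chunk repeats the final character of the previous one, which is the same idea as the 200-character overlap used in this lab, only at a much smaller scale.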

In [None]:
def ingest_downloaded_10k_into_opensearch(file_name, index_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    
    # Split each item into chunks of at most 8,000 characters, with a 200-character overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    item_contents=[]
    company_name=splitted_documents[0].metadata['company_name']
    for doc in splitted_documents:
        item_contents.append(doc.page_content)
    
    print(f"\nFor company {company_name}, sending {len(item_contents)} chunks to Bedrock for embedding.")
    start = time.time()
    embedding_results = bedrock_embeddings.embed_documents(item_contents)
    end = time.time()
    elapsed = end - start
    print(f"Time elapsed for Bedrock embedding: {elapsed:.2f} seconds")
        
    # Pair each chunk with its embedding to build the bulk actions
    data = [
        {"_index": index_name, "company_name": company_name, "item_content": content, "item_vector": vector}
        for content, vector in zip(item_contents, embedding_results)
    ]
    aos_response= helpers.bulk(aos_client, data)
    print(f"Bulk-inserted {aos_response[0]} items into the OpenSearch index with the Bedrock embeddings.")

Now we iterate through the files for our companies of interest (`company_filing_file_name_list`) and pass each one to the ingestion function we just created. This loads the index.

In [None]:
# Load the data into the OpenSearch Serverless collection. Note: this takes some time to complete.
for file in company_filing_file_name_list:
    ingest_downloaded_10k_into_opensearch(file, vector_index_name)
    print("Ingested file:" + file)
print("**Completed ingesting 10-K Form filing data.**")

To validate the load, you can query the number of documents in the index. 

In [None]:
res = aos_client.search(index=vector_index_name, body={"query": {"match_all": {}}})
print("Search matching all returned %d documents." % res['hits']['total']['value'])
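
One caveat when reading this count: the `total` object in a search response carries a `relation` field alongside `value`, and by default the engine stops counting exactly at 10,000 hits, reporting `"gte"` beyond that. The helper below is a small sketch for interpreting that field. The function name and the sample responses are hypothetical, hand-written stand-ins for a real response.

```python
def describe_total_hits(response):
    """Summarize hits.total, distinguishing exact counts from lower bounds."""
    total = response["hits"]["total"]
    value = total["value"]
    if total.get("relation") == "gte":
        return f"at least {value} documents"
    return f"exactly {value} documents"

# Hand-written sample responses (not real query output)
capped_response = {"hits": {"total": {"value": 10000, "relation": "gte"}, "hits": []}}
exact_response = {"hits": {"total": {"value": 1234, "relation": "eq"}, "hits": []}}

print(describe_total_hits(capped_response))
print(describe_total_hits(exact_response))
```

If an exact figure matters, adding `"track_total_hits": True` to the search body asks the engine to count every match.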

It may be of some interest to see an example of what a single 10-K Form document looks like within the index:

In [None]:
print(json.dumps(res['hits']['hits'][1]['_source'], indent=4))

Now that we have set up the vector search system, let's repeat the same queries we used previously. This will allow us to assess whether the new approach has improved the relevance of the search results, compared to the irrelevant results we encountered before.

In [None]:
# set up the same query as before...
query_text="What are the operating expenses of Adobe?"

Run the query and check the search result. 

In [None]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=vector_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df[["_score","company_name","item_content"]].rename(columns={"_score":"Relevance score", "company_name":"Company name", "item_content":"Operating expenses?"}))
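
Because the next query repeats the same request structure, the query body and the result parsing could be factored into helpers. This is a sketch only: `build_knn_query` and `hits_to_rows` are hypothetical names, and the sample response is hand-written to mirror the fields used above.

```python
def build_knn_query(search_vector, k=10, field="item_vector"):
    """Build a k-NN query body that returns the k nearest vectors."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(search_vector), "k": k}}},
    }

def hits_to_rows(response):
    """Flatten a search response into [_id, _score, company_name, item_content] rows."""
    return [
        [hit["_id"], hit["_score"], hit["_source"]["company_name"], hit["_source"]["item_content"]]
        for hit in response["hits"]["hits"]
    ]

# Hand-written sample standing in for a real embedding and response
sample_query = build_knn_query([0.1, 0.2, 0.3], k=5)
sample_response = {
    "hits": {"hits": [
        {"_id": "1", "_score": 0.42,
         "_source": {"company_name": "ADOBE INC.", "item_content": "..."}},
    ]}
}
print(sample_query)
print(hits_to_rows(sample_response))
```

With these in place, each query cell would reduce to embedding the question, calling `aos_client.search` with the built body, and passing the response to `hits_to_rows`.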

In [None]:
# and now the other low-performing query...
query_text="What is Adobe's main revenue?"

Run the next query and check the search result. 

In [None]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=vector_index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df[["_score","company_name","item_content"]].rename(columns={"_score":"Relevance score", "company_name":"Company name", "item_content":"Main revenue?"}))

### 2.3 Retrieval Augmented Generation (RAG)

You can leverage Large Language Models to directly generate answers for the user, rather than just returning the retrieved documents. By using the relevant documents as context when generating the answers, this approach can help minimize the risk of the model 'hallucinating' or fabricating information.

This method is known as Retrieval Augmented Generation, or RAG for short. In RAG, the external data used to generate the answers can come from a variety of sources, such as document repositories, databases, or APIs.

The first step in this process is to convert both the documents and the user's query into a format that allows for effective comparison and relevancy search. This is achieved by transforming the document collection (the 'knowledge library') and the user-submitted query into numerical vector representations using embedding language models. These vector embeddings numerically encode the semantic concepts present in the text.

Next, based on the embedding of the user query, relevant text is identified in the document collection through similarity search in the embedding space. The user's prompt is then combined with the retrieved text, which is added as context. This augmented prompt, which includes the relevant external data along with the original prompt, is sent to the large language model (LLM) for processing. Because the context contains relevant external data, the model's output is more likely to be relevant and accurate.

The major components of RAG include embedding, a vector database, retrieval (augmentation), and generation:

- **Embedding**: Purpose: Embeddings transform text data into numerical vectors in a high-dimensional space. These vectors represent the semantic meaning of the text. Process: The embedding process typically uses pre-trained models (like BERT or a variant) to convert both the input queries and the documents in the database into dense vectors. Role in RAG: Embeddings are crucial for the retrieval component as they allow the model to compute the similarity between the query and the documents in the database efficiently.
- **Vector Database**: Function: A vector database stores the embeddings of a large collection of documents or passages. Construction: It is created by processing a vast corpus (like Wikipedia or other specialized datasets) through an embedding model. Usage in RAG: When a query comes in, the model searches this database to find the documents whose embeddings are most similar to the embedding of the query.
- **Retrieval (Augmentation)**: Mechanism: The retrieval part of RAG functions by taking the input query, converting it into a vector using embeddings, and then searching the vector database to retrieve relevant documents. Result: It augments the original query with additional context by selecting documents or passages that are semantically related to the query. This augmented information is essential for generating more informed responses.
- **Generation**: Integration with a Language Model: The generative component, often a large language model like Amazon Titan Text, receives both the original query and the retrieved documents. Response Generation: It synthesizes information from these inputs to produce a coherent and contextually appropriate response. 
- **Training and Fine-Tuning**: The generative model is generally pre-trained on vast amounts of text and may be further fine-tuned to optimize its performance for specific tasks or datasets.
- **End-to-End Training (Optional)**: Joint Optimization: In RAG, both retrieval and generation components can be fine-tuned together, allowing the system to optimize the selection of documents and the generation of responses simultaneously. Feedback Loop: The model learns not only to generate relevant responses but also to retrieve the most useful documents for any given query.
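
The augmentation step described above amounts to assembling a prompt from the retrieved passages and the user's question. The sketch below shows one possible shape for that step. The function name, the prompt wording, and the `max_chars` budget are assumptions for illustration; LangChain builds its own prompt internally later in this lab.

```python
def build_rag_prompt(question, passages, max_chars=6000):
    """Combine retrieved passages and the question into one prompt string."""
    context_parts = []
    used = 0
    for number, passage in enumerate(passages, start=1):
        block = f"[{number}] {passage}"
        if used + len(block) > max_chars:
            break  # stay within the (assumed) context budget
        context_parts.append(block)
        used += len(block)
    context = "\n\n".join(context_parts)
    return (
        "Use only the context below to answer the question. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder passages standing in for retrieved 10-K chunks
prompt = build_rag_prompt(
    "What is Adobe's main revenue?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(prompt)
```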

---
### Architecture

![RAG](./static/RAG_Architecture.png)

---

In [None]:
langchain_index_name="10k_financial_embedding"

try:
    aos_client.indices.get(index=langchain_index_name)
    # The index already exists, delete it
    print("Deleting existing index before creating new one.")
    aos_client.indices.delete(index=langchain_index_name)
except Exception as e:
    print("Index does not currently exist.")

We will need a new [LangChain](https://python.langchain.com/api_reference/index.html) version of our ingestion function:
- as before, we will use [`RecursiveCharacterTextSplitter`](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html) with the same parameters as before
- this time, however, we will use LangChain's [`OpenSearchVectorSearch`](https://python.langchain.com/docs/integrations/vectorstores/opensearch/) class. This allows the LangChain application to perform advanced vector-based searches on the indexed data, rather than relying solely on keyword-based searches.

In [None]:
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, aws_region, service, session_token=credentials.token)

def ingest_10k_into_opensearch_with_langchain(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    OpenSearchVectorSearch.from_documents(
                index_name = langchain_index_name,
                documents=splitted_documents,
                embedding=bedrock_embeddings,
                opensearch_url=aoss_endpoint,
                http_auth=awsauth,
                timeout=600,
                use_ssl=True,
                verify_certs=True,
                connection_class=RequestsHttpConnection,
    )

In [None]:
## In this step, we index the selected companies' 10-K files into Amazon OpenSearch Serverless with LangChain
## This will take a little while to load all the chunks

for file in company_filing_file_name_list:
    ingest_10k_into_opensearch_with_langchain(file)
    print("Ingested file " + file)
print("Finished ingesting files.")

In [None]:
# Let's look at our index
print(json.dumps(aos_client.indices.get(index=langchain_index_name), indent=4))

We will create another `OpenSearchVectorSearch` instance, this time as the vector store that the retriever will query when answering questions with Bedrock.

In [None]:
open_search_vector_store = OpenSearchVectorSearch(
                                    index_name=langchain_index_name,
                                    embedding_function=bedrock_embeddings,
                                    opensearch_url=os_domain_ep,
                                    http_auth=awsauth,
                                    timeout=600,
                                    use_ssl=True,
                                    verify_certs=True,
                                    connection_class=RequestsHttpConnection,
                                    ) 

Initialize Bedrock LLM model with Claude, and set our parameters (see [Influence response generation with inference parameters in the Amazon Bedrock User's Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-parameters.html)):
- `temperature`: 0.001: The ***temperature*** parameter controls the randomness or diversity of the model's generated output. A lower temperature value (e.g., 0.001) results in the model generating more "safe" and conservative output, with less variation. This is often useful when you want the model to produce more focused and coherent responses.

- `top_k`: 300: The ***top-k*** parameter limits the number of most likely tokens that the model considers when generating the next token in the sequence. Setting this to a higher value (e.g., 300) allows the model to consider a wider range of possible next tokens, potentially leading to more diverse output.

- `top_p`: 1: The ***top-p*** (or "nucleus sampling") parameter is a complementary technique to **top-k** sampling. It sets a threshold for the cumulative probability of the most likely tokens, rather than a fixed number. A value of 1 means the model will consider all possible tokens, without any additional filtering.
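
To see how the three parameters interact, the sketch below applies temperature scaling, top-k truncation, and top-p (nucleus) truncation to a toy set of token scores. It is a conceptual illustration only: the tokens and scores are made up, the function name is hypothetical, and the exact order and details of these steps inside any particular model provider may differ.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return the renormalized probabilities of the tokens that survive filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature-scaled softmax (subtract the max for numerical stability)
    scaled = {token: score / temperature for token, score in logits.items()}
    highest = max(scaled.values())
    exps = {token: math.exp(value - highest) for token, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((token, e / total) for token, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:  # stop once the nucleus is covered
            break
    norm = sum(prob for _, prob in kept)
    return {token: prob / norm for token, prob in kept}

toy_logits = {"the": 2.0, "a": 1.0, "zebra": 0.0}  # made-up scores
print(next_token_distribution(toy_logits, temperature=1.0))
print(next_token_distribution(toy_logits, temperature=0.001, top_k=300, top_p=1))
```

With the settings used in this lab, the very low temperature concentrates almost all probability on the single most likely token, so the generous `top_k` and `top_p` values have little practical effect.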


In [None]:
#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0", client=boto3_bedrock)
bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=boto3_bedrock)
#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-opus-20240229-v1:0", client=boto3_bedrock)

bedrock_llm.model_kwargs = {"temperature":0.001,"top_k":300,"top_p":1}

>Note: This prompt has been designed specifically for the [Anthropic Claude 3](https://aws.amazon.com/bedrock/anthropic/) language model. If you use a different large language model, the output results may vary. For example, the guardrails or safeguards customized for your application's requirements and responsible AI policies may have an impact on the model's responses.

We will create a [Retriever](https://python.langchain.com/docs/concepts/retrievers/), which is a component in the LangChain library used for retrieving relevant documents (in this case, from OpenSearch). With `search_type="similarity_score_threshold"`, we direct the Retriever to return documents based on a minimum similarity score rather than a fixed number of top-k results. We also limit the number of returned documents to five (`'k': 5` in `search_kwargs`) and set the similarity threshold to 0.005.

The key purpose of this configuration is to ensure that the Retriever returns semantically relevant documents, rather than just returning the top-k results based on a simple ranking. By using a similarity score threshold, the Retriever will only return documents that meet a minimum relevance bar.
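
Conceptually, this configuration combines a relevance filter with a cap on the number of results. The sketch below shows that selection logic on hand-written scores. It illustrates the idea and is not LangChain's implementation; `select_documents` and the document names are hypothetical.

```python
def select_documents(scored_docs, k=5, score_threshold=0.005):
    """Keep documents scoring at or above the threshold, best first, at most k."""
    relevant = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return relevant[:k]

# Hand-written (document, relevance score) pairs
candidates = [
    ("chunk-a", 0.61),
    ("chunk-b", 0.004),  # below the threshold, so dropped even if k allows it
    ("chunk-c", 0.33),
]
print(select_documents(candidates, k=5, score_threshold=0.005))
```

Note that fewer than `k` documents may come back when few candidates clear the threshold, which is the intended difference from plain top-k retrieval.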

We'll also create a [RetrievalQA](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html), a component used for question-answering tasks that leverages a combination of a language model and a retriever from the LangChain library. It can then be used to perform question-answering tasks, where it will:
- Use the `bedrock_retriever` to fetch the most relevant documents based on the input question.
- Pass those retrieved documents, along with the question, to the `bedrock_llm` language model.
- Generate an answer based on the combination of the retrieved documents and the language model's capabilities.

>Note: we are using the "stuff" `chain_type`, which simply feeds the retrieved documents directly into the LLM for question-answering.  Other options are: 
> - "refine": the LLM generates an initial answer from the first retrieved document, then loops over the remaining documents one at a time, receiving the current answer together with the next document so that it can update the answer. This trades more LLM calls for the ability to handle more documents than fit in a single prompt.
> - "map_reduce": apply the LLM to each retrieved document on its own in a "map" step, then combine ("reduce") the per-document outputs into a single final answer.
> - "map_rerank": ask the LLM to answer from each retrieved document separately and to score how confident it is in each answer; the answer with the maximum (i.e., highest) score is returned.

In [None]:
bedrock_retriever = open_search_vector_store.as_retriever(
    search_type="similarity_score_threshold",  ## ensure we return semantically relevant / instead of top_K
    search_kwargs={
        'k': 5,
        'score_threshold': 0.005
    }
)
rag_qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm,
    retriever=bedrock_retriever,
    chain_type="stuff"
)
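
To contrast the "stuff" chain type used above with "map_reduce", the sketch below implements both patterns against a stand-in LLM. Everything here is hypothetical scaffolding: `fake_llm` just records prompts, and the prompt wording is invented, so this shows only the call pattern, not LangChain's actual prompts.

```python
def stuff_answer(llm, question, docs):
    """'stuff': put every document into one prompt and call the LLM once."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def map_reduce_answer(llm, question, docs):
    """'map_reduce': one LLM call per document, then one call to combine them."""
    partial_answers = [llm(f"Context:\n{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n\n".join(partial_answers)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}")

calls = []

def fake_llm(prompt):
    """Stand-in for a real model: records the prompt and returns a label."""
    calls.append(prompt)
    return f"answer#{len(calls)}"

docs = ["doc one", "doc two", "doc three"]
stuff_answer(fake_llm, "What is the main revenue?", docs)
print("stuff LLM calls:", len(calls))

calls.clear()
map_reduce_answer(fake_llm, "What is the main revenue?", docs)
print("map_reduce LLM calls:", len(calls))
```

"stuff" is the cheapest option when the retrieved chunks fit in the model's context window, which is why it suits the five-document limit configured here.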

Let's re-ask our revenue question, and see if we have improved the result:

In [None]:
question="What is Adobe's main revenue?"

langchain.debug=False #switch to see debug information tracing development of result
result = rag_qa({"query": question})

Let's examine the output from the LangChain processing to see if we have generated a coherent and human-readable result:

In [None]:
rag_result = result["result"].replace("$","\\\$")
display(Markdown("### Result\n" + rag_result))

### Standard RAG limitations

The standard Retrieval Augmented Generation (RAG) approach can be used for comparison questions between documents, but it has some notable limitations:
- Comparison-Specific Capabilities: RAG models are not specifically designed for comparison tasks and may lack the specialized capabilities required to generate high-quality comparative outputs. This includes the ability to identify and highlight the most relevant comparison points between the documents.
- Efficiency and Effectiveness: Due to the lack of targeted comparison capabilities, standard RAG models may not be the most efficient or effective approach for these types of tasks. Other techniques, such as multi-document summarization or question-answering systems, may be better suited.

Here are some example use cases that illustrate the limitations of standard RAG for comparison questions:
##### Comparison Questions:
- "Compare the financial statements of Adobe and Autodesk."
    - The RAG model would first need to retrieve the relevant financial statements for each company using semantic search.
    - It would then need to extract the key comparison points from the financial data and generate a coherent comparative analysis, which may be challenging without specialized comparison capabilities.

##### Out-of-Knowledge-Base Information:
- "Is Adobe a good investment choice right now?"
    - To answer this question, the RAG model would be limited to the information available in its knowledge base, which may not include the most up-to-date financial data or other relevant details needed to make an investment recommendation.
    - Additional information from external sources, such as a relational database or data warehouse, would be required to provide a more comprehensive and informed response.

##### Out-of-Date Knowledge Base:
- "Is Amazon a good investment choice right now?"
    - If the RAG model's knowledge base does not contain the latest financial statements or other relevant information about Amazon, it may not be able to provide a meaningful response to this question.
    - In this case, the model would need to be updated with the latest data, such as by ingesting the most recent 10-K filings from the internet, before it could generate a well-informed response.


#### Comparison question to standard RAG

Let's present our RAG implementation with a comparison question, and see what result we get:

In [None]:
question="Compare Adobe and Asana company financial statements"

In [None]:
langchain.debug=False #switch to see debug information tracing development of result
result = rag_qa({"query": question})

In [None]:
rag_result = result["result"].replace("$","\\\$")
display(Markdown("### Result\n" + rag_result))

![standard rag limitation](./static/rag-limitation.png)


![advanced rag ](./static/advanced-rag.png)

Yan, Shi-Qi, et al.
*Corrective Retrieval Augmented Generation*. 2024
https://arxiv.org/pdf/2401.15884

Jeong, Soyeong, et al. 
*Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity*. 2024
https://arxiv.org/pdf/2403.14403

## Part 3: AI agent powered search




### What is an AI agent?
An agentic model employs a chain-of-thought reasoning process. In this approach, the large language model (LLM) is prompted to think through a question step-by-step, interleaving its reasoning with the ability to use external tools like search engines and APIs. This allows the LLM to retrieve relevant information that can help answer different aspects of the question. Ultimately, this leads to a more comprehensive and accurate final response.

This approach is inspired by the "Reason and Act" (ReAct) design introduced in the paper [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/pdf/2210.03629). ReAct aims to combine the reasoning capabilities of language models with the ability to interact with external resources and take actions.

By combining these two facets - internal reasoning and external interaction - an agentic LLM assistant can provide more informed and well-rounded answers to complex user queries.
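
The reason-then-act loop can be sketched in a few lines. In this illustration a scripted `policy` function stands in for the LLM, and a single toy tool stands in for real search engines or APIs. All names here are hypothetical, and a real agent framework adds prompt formatting, output parsing, and error handling on top of this skeleton.

```python
def run_agent(question, policy, tools, max_steps=5):
    """Alternate between reasoning (policy) and acting (tools) until finished."""
    scratchpad = []  # (thought, action, action_input, observation) tuples
    for _ in range(max_steps):
        thought, action, action_input = policy(question, scratchpad)
        if action == "finish":
            return action_input, scratchpad
        observation = tools[action](action_input)  # act, then observe the result
        scratchpad.append((thought, action, action_input, observation))
    return None, scratchpad  # gave up after max_steps

def scripted_policy(question, scratchpad):
    """Stand-in for the LLM: decides the next step from what it has seen so far."""
    if not scratchpad:
        return ("I need the ticker symbol first.", "lookup_ticker", "Adobe")
    last_observation = scratchpad[-1][3]
    return ("I have what I need.", "finish", f"The ticker is {last_observation}")

tools = {"lookup_ticker": lambda name: {"Adobe": "ADBE"}.get(name, "unknown")}

answer, trace = run_agent("What is Adobe's ticker?", scripted_policy, tools)
print(answer)
for step in trace:
    print(step)
```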

![agent components](./static/agent-components.png)

### Why build an AI agent?
In today's digital landscape, enterprises are inundated with a vast array of data sources. These range from traditional PDF documents to complex SQL and NoSQL databases, and everything in between. While this wealth of information holds immense potential for gaining valuable insights and driving operational efficiency, the sheer volume and diversity of data can pose significant challenges in terms of accessibility and utilization.

This is where the power of agentic LLM assistants comes into play. By leveraging progress in LLM design patterns such as Reason and Act (ReAct) and other traditional or novel approaches, these intelligent assistants can integrate with an enterprise's diverse data sources. Through the development of specialized tools tailored to each data source, and the ability of LLM agents to identify the right tool for a given question, agentic LLM assistants can simplify how users navigate and extract relevant information, regardless of its origin or structure.

This enables a rich, multi-source conversation that promises to unlock the full potential of enterprise data. It can support data-driven decision-making, enhance operational efficiency, and ultimately drive productivity and growth.

![agent powered search advantage](./static/agent-powered-search-advantage.png)


### AI Agent powered search reference architecture

![AI Agent powered search reference architecture](./static/reference-architecture.png)


### Lab Architecture

For demonstration purposes, we will use an Amazon SageMaker notebook to run the code. The following diagram illustrates the overall architecture of this lab:

![AI Agent powered search architecture](./static/architecture.png)

---


### Data Flow

The user submits a query (**3** and **4**). The first AI agent will judge if the query is related to financial statements (**4.1**). If so, the AI agent will use vector search to retrieve similar financial statements for the company from an OpenSearch index (**4.2**).

If there are no financial statements for the company in the index, the AI agent will download the data from the internet by calling the SEC API, ingest the data into the OpenSearch index (**4.4**), and then perform the search again (**4.2**).

If there are related financial statements, the AI agent will check if the query is stock price-related. If so, the AI agent will query a Redshift database to retrieve the company's stock price data (**4.3**).

Finally, the large language model (LLM) will generate the response using all the collected data.
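
The branching described above can be summarized as plain control flow. The sketch below takes every capability as a function argument so it can run with stubs; all of the names and the stub data are hypothetical placeholders, and in the lab these decisions are made by the agent and its tools, not by hard-coded `if` statements.

```python
def answer_query(query, company, *, is_financial, is_stock_related,
                 search_index, download_and_ingest, query_stock_price, generate):
    """Follow the data flow: classify, search, ingest if missing, add prices, generate."""
    collected = {}
    if is_financial(query):                                  # step 4.1
        statements = search_index(company)                   # step 4.2
        if not statements:
            download_and_ingest(company)                     # step 4.4
            statements = search_index(company)               # step 4.2 again
        collected["statements"] = statements
        if statements and is_stock_related(query):
            collected["stock_prices"] = query_stock_price(company)  # step 4.3
    return generate(query, collected)

# Stubs standing in for OpenSearch, the SEC download, Redshift, and the LLM
index = {}

def search_index(company):
    return index.get(company, [])

def download_and_ingest(company):
    index[company] = [f"{company} 10-K (downloaded)"]

collected = answer_query(
    "What were Adobe's revenue and stock price?",
    "Adobe",
    is_financial=lambda q: "revenue" in q.lower(),
    is_stock_related=lambda q: "stock price" in q.lower(),
    search_index=search_index,
    download_and_ingest=download_and_ingest,
    query_stock_price=lambda company: "placeholder price rows",
    generate=lambda query, data: data,  # a real LLM would write the answer here
)
print(collected)
```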

The overall data flow is as follows:

![AI Agent powered search data flow](./static/ai-agent-search-data-flow.png)

### 3.1 Prepare other tools used by AI agent

#### 3.1.1 Ingest and query structured data in Amazon Redshift

To begin, we will retrieve the credentials needed to access the Amazon Redshift Serverless database that was previously created using the CloudFormation stack.

In [303]:
redshift_serverless_credentials = json.loads(sec.get_secret_value(SecretId=outputs['RedshiftServerlessSecret'])['SecretString'])
redshift_serverless_username    = redshift_serverless_credentials['username']
redshift_serverless_password    = redshift_serverless_credentials['password']
redshift_serverless_endpoint    = outputs['RedshiftServerlessEndpoint']

In this part of the workshop, we will:
- Connect to the Amazon Redshift Serverless database ("dev").
- Create a table within the database called `stock_symbol`.
- Copy local workshop files into the Amazon S3 bucket that was created earlier using the CloudFormation stack. This simulates a remote source of stock information.
- Ingest the data from the S3 bucket into the `stock_symbol` table.
- Create a convenience function that can return the stock symbol for a given company name. This will use the SQL case-insensitive LIKE operator to perform the lookup.

In [304]:
%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = URL.create(
drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
host=redshift_serverless_endpoint, 
port=5439,
database='dev',
username=redshift_serverless_username,
password=redshift_serverless_password
)

%sql $connect_to_db
%sql select current_user, version();

%sql CREATE TABLE IF NOT EXISTS public.stock_symbol (stock_symbol text PRIMARY KEY, company_name text NOT NULL);

stock_price_bucket = outputs["s3BucketStock"]
s3_location = f's3://{stock_price_bucket}/stock-price/'
print(s3_location)
!aws s3 sync ./stock-price/ $s3_location

stock_symbol_s3_location = f's3://{stock_price_bucket}/stock-price/stock_symbol.csv'

quoted_stock_symbol_s3_location = "'" + stock_symbol_s3_location + "'"

%sql COPY STOCK_SYMBOL FROM $quoted_stock_symbol_s3_location iam_role default IGNOREHEADER 1 CSV;


url = URL.create(
    drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
    host=redshift_serverless_endpoint, 
    port=5439,
    database='dev',
    username=redshift_serverless_username,
    password=redshift_serverless_password
)

engine = sa.create_engine(url)
redshift_connection = engine.connect()
    
def query_stock_ticker(company_name):
    strSQL = "SELECT stock_symbol FROM stock_symbol WHERE lower(company_name) ILIKE '%" + company_name + "%'"
    stock_ticker = ''
    try:
        result = redshift_connection.execute(strSQL)
        df = pd.DataFrame(result)
        stock_ticker=df['stock_symbol'][0]
    except Exception as e:
        print(e)
    return stock_ticker

 * redshift+redshift_connector://awsuser:***@workgroup-ecce4d60.273709946938.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector://awsuser:***@workgroup-ecce4d60.273709946938.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
s3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/
upload: stock-price/ASAN.csv to s3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/ASAN.csv
upload: stock-price/CRM.csv to s3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/CRM.csv
upload: stock-price/PD.csv to s3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/PD.csv
upload: stock-price/ADBE.csv to s3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/ADBE.csv
upload: stock-price/DOCU.csv to s3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/DOCU.csv
upload: stock-price/PANW.csv to s3://generative-ai-powered-search-s3bucketstock-m7oexk1f3n

Next, let's test the convenience function we created and verify that the `stock_symbol` table has been loaded correctly:

In [371]:
query_stock_ticker("adobe")

'ADBE'

In [306]:
%sql CREATE TABLE IF NOT EXISTS public.stock_price (stock_date DATE, stock_symbol text, open_price DECIMAL, high_price DECIMAL, low_price DECIMAL, close_price DECIMAL, adjusted_close_price DECIMAL, volume DECIMAL);

asan_s3_location = f's3://{stock_price_bucket}/stock-price/ASAN.csv'
quoted_asan_s3_location = "'" + asan_s3_location + "'"
print(quoted_asan_s3_location)
print("---------")

crm_s3_location = f's3://{stock_price_bucket}/stock-price/CRM.csv'
quoted_crm_s3_location = "'" + crm_s3_location + "'"
print(quoted_crm_s3_location)
print("---------")

adp_s3_location = f's3://{stock_price_bucket}/stock-price/ADP.csv'
quoted_adp_s3_location = "'" + adp_s3_location + "'"
print(quoted_adp_s3_location)
print("---------")

adsk_s3_location = f's3://{stock_price_bucket}/stock-price/ADSK.csv'
quoted_adsk_s3_location = "'" + adsk_s3_location + "'"
print(quoted_adsk_s3_location)
print("---------")

box_s3_location = f's3://{stock_price_bucket}/stock-price/BOX.csv'
quoted_box_s3_location = "'" + box_s3_location + "'"
print(quoted_box_s3_location)
print("---------")

adbe_s3_location = f's3://{stock_price_bucket}/stock-price/ADBE.csv'
quoted_adbe_s3_location = "'" + adbe_s3_location + "'"
print(quoted_adbe_s3_location)
print("---------")

%sql COPY STOCK_PRICE FROM $quoted_asan_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_crm_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_adp_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_adsk_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_box_s3_location iam_role default IGNOREHEADER 1 CSV;
%sql COPY STOCK_PRICE FROM $quoted_adbe_s3_location iam_role default IGNOREHEADER 1 CSV;

 * redshift+redshift_connector://awsuser:***@workgroup-ecce4d60.273709946938.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
's3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/ASAN.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/CRM.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/ADP.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/ADSK.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/BOX.csv'
---------
's3://generative-ai-powered-search-s3bucketstock-m7oexk1f3nju/stock-price/ADBE.csv'
---------
 * redshift+redshift_connector://awsuser:***@workgroup-ecce4d60.273709946938.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector://awsuser:***@workgroup-ecce4d60.273709946938.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector:

[]

Let's verify that the stock price information has been properly loaded into the `stock_price` table:

In [441]:
%sql select * from public.stock_price

 * redshift+redshift_connector://awsuser:***@workgroup-ecce4d60.273709946938.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.
 * redshift+redshift_connector://awsuser:***@workgroup-ecce4d60.273709946938.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.


stock_date,stock_symbol,open_price,high_price,low_price,close_price,adjusted_close_price,volume
2021-01-04,ASAN,30,30,28,29,29,1429100
2021-07-08,ASAN,65,67,63,66,66,1682000
2022-01-07,ASAN,61,64,59,61,61,3017600
2022-07-14,ASAN,17,17,16,16,16,3372700
2023-01-17,ASAN,14,14,13,14,14,3474500
2023-07-21,ASAN,22,22,21,21,21,2177600
2024-01-24,ASAN,19,19,18,18,18,1495400
2024-07-29,ASAN,15,15,15,15,15,1370100
2022-03-21,CRM,218,219,210,213,213,6448700
2022-09-22,CRM,149,152,149,150,150,12405100


Next, we will create another convenience function. This function will retrieve stock price information for a given ticker symbol:

In [442]:
# to assure we are still connected...
engine = sa.create_engine(url)
redshift_connection = engine.connect()

def query_stock_price(stock_ticker):
    strSQL = "SELECT stock_date, stock_symbol, open_price, high_price, low_price, close_price FROM stock_price WHERE stock_symbol='" + stock_ticker + "' limit 100"
    try:
        result = redshift_connection.execute(strSQL)
        stock_price = pd.DataFrame(result)
    except Exception as e:
        print(e)
    return stock_price


And we'll test it as well:

In [407]:
query_stock_price('ASAN')

Unnamed: 0,stock_date,stock_symbol,open_price,high_price,low_price,close_price
0,2020-10-26,ASAN,23,24,23,23
1,2021-04-30,ASAN,32,33,32,33
2,2021-11-01,ASAN,137,138,133,135
3,2022-05-05,ASAN,29,29,25,26
4,2022-11-07,ASAN,17,17,16,16
5,2023-05-12,ASAN,17,17,16,17
6,2023-11-14,ASAN,20,21,20,20
7,2024-05-20,ASAN,15,15,15,15
8,2020-09-30,ASAN,27,29,26,28
9,2021-04-06,ASAN,32,33,31,33


#### 3.1.2 Download SEC 10-K filing from SEC-API.IO
---
Data2value, a company based in Germany, develops and sells products in the areas of IT-assisted data analysis and business process optimization. Their most notable product is https://sec-api.io, which offers APIs to access various datasets from the U.S. Securities and Exchange Commission (SEC), including the EDGAR Filing Search for 10-K Forms.

For this workshop, you will need to create a free account with Data2value to acquire an API key. This API key will be valid for 100 requests at the time of writing.

Once you have acquired the API key, we will follow good practice and place it into AWS Secrets Manager:

1. In the AWS Management Console, navigate to the **AWS Secrets Manager** service.

2. Click on **Store a new secret** to create a new secret.

3. Select **Other type of secrets** as the secret type.

4. In the **Select the secret type** section, choose **Key/value**.

5. In the **Secret key/value** section, enter the name of your secret as the key (e.g., "API_KEY") and the actual API key as the value.  You can leave the **Encryption key** as "aws/secretsmanager".  Click **Next**.

6. Choose a **Secret name** (e.g., "sec-api.io") and enter any **Description** you would like. Optionally, you can add tags to the secret for better organization and access control.  We will not need to modify the default for **Resource permissions** or **Replicate secret**.  Click **Next**.

7. There will be no need to use the **Configure rotation** section for this workshop, but certainly recommended for security best practices in general (read "production"). Click **Next**.

8. Review the secret details and click **Store** to save the API Key as a secret.


>We've already installed the SEC Filing API for Python (`%pip -q install sec-api` in the [Install Python Libraries (and dependencies) for OpenSearch, Redshift and LangChain](#Install-Python-libraries-(and-dependencies)-for-OpenSearch,-Redshift-and-LangChain) section.  The API documentation is available [here](https://sec-api.io/docs).

Replace the `SecretId` below to the **Secret name** you chose above:

In [421]:
try:
    sec_api = json.loads(sec.get_secret_value(SecretId='sec-api.io')['SecretString'])
    sec_api_key=sec_api['API_KEY']
except sec.exceptions.ResourceNotFoundException:
    print("Unable to find the stored secret. Check the SecretId and keyname.",file=sys.stderr)

Next, we will create another convienence function, this time to retrieve the most recent 10-K filing for a given ticker symbol using the SEC API provided by the sec-api.io service.
- take a `ticker` parameter, the stock symbol for the company and using the QueryApi (from the sec-api.io API library), search for the most recent 10-K filing. 
- using the ExtractorApi (also from the sec-api.io) extract metadata about the company filing and store it in a dictionary, such as the filing URL, type, CIK (Central Index Key), name, filing date, period of report, and links to the HTML index and the complete filing text.
- convert the dictionary to a JSON string, and save to a file in the `download_filings` directory with a filename of the filing URL (if the `download_filings` directory does not exist, create one!).
- then return the filename of the filing for the `ticker`.

In [426]:
def get_filings(ticker):
    global sec_api_key

    # Finding Recent Filings with QueryAPI
    queryApi = QueryApi(api_key=sec_api_key)
    query = {
      "query": f"ticker:{ticker} AND formType:\"10-K\"",
      "from": "0",
      "size": "1",
      "sort": [{ "filedAt": { "order": "desc" } }]
    }
    response = queryApi.get_filings(query)

    # Getting 10-K URL
    filing_url = response["filings"][0]["linkToFilingDetails"]
    filing_type=response['filings'][0]['formType']
    cik=response['filings'][0]['cik']
    company=response['filings'][0]['companyName']
    filing_date=response['filings'][0]['filedAt']
    period_of_report=response['filings'][0]['periodOfReport']
    filing_html_index=response['filings'][0]['linkToFilingDetails']
    complete_text_filing_link=response['filings'][0]['linkToTxt']

    # Extracting Text with ExtractorAPI
    extractorApi = ExtractorApi(api_key=sec_api_key)
    
    one_text = extractorApi.get_section(filing_url, "1", "text")       #Section 1 - Business
    onea_text = extractorApi.get_section(filing_url, "1A", "text")     # Section 1A - Risk Factors
    oneb_text = extractorApi.get_section(filing_url, "1B", "text")     # Section 1B - Unresolved Staff Comments
    two_text = extractorApi.get_section(filing_url, "2", "text")       # Section 2 - Properties
    three_text = extractorApi.get_section(filing_url, "3", "text")     # Section 3 - Legal Proceedings
    four_text = extractorApi.get_section(filing_url, "4", "text")      # Section 4 - Mine Safety Disclosures
    five_text = extractorApi.get_section(filing_url, "5", "text")      # Section 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
    six_text = extractorApi.get_section(filing_url, "6", "text")       # Section 6 - Selected Financial Data (prior to February 2021)
    seven_text = extractorApi.get_section(filing_url, "7", "text")     # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
    sevena_text = extractorApi.get_section(filing_url, "7A", "text")   # Section 7A - Quantitative and Qualitative Disclosures about Market Risk
    eight_text = extractorApi.get_section(filing_url, "8", "text")     # Section 8 - Financial Statements and Supplementary Data
    nine_text = extractorApi.get_section(filing_url, "9", "text")      # Section 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
    ninea_text = extractorApi.get_section(filing_url, "9A", "text")    # Section 9A - Controls and Procedures
    nineb_text = extractorApi.get_section(filing_url, "9B", "text")    # Section 9B - Other Information
    ten_text = extractorApi.get_section(filing_url, "10", "text")      # Section 10 - Directors, Executive Officers and Corporate Governance
    eleven_text = extractorApi.get_section(filing_url, "11", "text")   # Section 11 - Executive Compensation
    twelve_text = extractorApi.get_section(filing_url, "12", "text")   # Section 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
    thirteen_text = extractorApi.get_section(filing_url, "13", "text") # Section 13 - Certain Relationships and Related Transactions, and Director Independence
    fourteen_text = extractorApi.get_section(filing_url, "14", "text") # Section 14 - Principal Accountant Fees and Services
    fifteen_text = extractorApi.get_section(filing_url, "15", "text")  # Section 15 - Exhibits and Financial Statement Schedules
    
    data = {}
    data['filing_url'] = filing_url
    data['filing_type'] = filing_type
    data['cik'] = cik
    data['company'] = company
    data['filing_date'] = filing_date
    data['period_of_report'] = period_of_report
    data['filing_html_index'] = filing_html_index
    data['complete_text_filing_link'] = complete_text_filing_link
    
    data['item_1'] = one_text
    data['item_1A'] = onea_text
    data['item_1B'] = oneb_text
    data['item_2'] = two_text
    data['item_3'] = three_text
    data['item_4'] = four_text
    data['item_5'] = five_text
    data['item_6'] = six_text
    data['item_7'] = seven_text
    data['item_7A'] = sevena_text
    data['item_8'] = eight_text
    data['item_9'] = nine_text
    data['item_9A'] = ninea_text
    data['item_9B'] = nineb_text
    data['item_10'] = ten_text
    data['item_11'] = eleven_text
    data['item_12'] = twelve_text
    data['item_13'] = thirteen_text
    data['item_14'] = fourteen_text
    data['item_15'] = fifteen_text
    
    json_data = json.dumps(data)
    
    if not os.path.exists("./download_filings"):
        os.makedirs("./download_filings")
    
    try:
        file_name = filing_url.split("/")[-2] + "-" + filing_url.split("/")[-1].split(".")[0]+".json"
        download_to = "./download_filings/" + file_name
        with open(download_to, "w") as f:
          json.dump(data, f, ensure_ascii=False, indent=4)
    except Exception as e:
        print("Problem with {url}".format(url=url))
        print(e)
    
    return file_name

Let's again test, and assure that we can get a 10-K Filing and ingest it into our Amazon OpenSearch Serverless vector index:

In [427]:
downloaded_file=get_filings("AMZN")
ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file, vector_index_name)


For company AMAZON COM INC ingested 51 items into Bedrock.
Time elapsed for Bedrock embedding: 5.45 seconds
Bulk-inserted 51 items into the OpenSearch index with the Bedrock embeddings.


### 3.2 Create an AI agent

#### Define methods used by AI agent
As mentioned previously, a popular architecture for building intelligent agents is called ReAct. The general flow of the ReAct process is as follows:
1. The model will "think" about what action to take in response to an input and any previous observations.
2. The model will then select an action from the available tools (or choose to respond directly to the user).
3. The model will generate arguments or parameters to be passed to the selected tool.
4. The agent runtime (executor) will parse out the chosen tool and call it with the generated arguments.
5. The executor will return the results of the tool call back to the model as an observation.
6. This iterative process of reasoning and acting repeats until the agent chooses to provide a final response to the user.

##### Our agent methods/tool pathways
- `is_financial_statement_related_query` *from human input*
  - Use an LLM to determine if the input provided by the user is related to a company's finacial statement. Returns "yes" (it is related to a financial statement) or "no".
- `is_stock_related_query` *from human input*
  - Use an LLM to determine if the input provided by the user is stock-related. Returns "yes" (it is related to stock) or "no".
- `get_company_name` *from human input*
  - Use an LLM to find a company name from within a user input
- `semantic_search_and_check`
  - Perform a sementic search for a company and their filing statements within our Amazon OpenSearch Serverless vector index
- `search_for_similiar_content_in_10k_filing` *from human input*
  - A wrapper function to call `semantic_search_and_check` with human input
- `search_financial_statements_for_company`
  - A wrapper function to call `semantic_search_and_check` with a financial statement query
- `get_stock_ticker` *from human input*
  - A convienence function to call both `get_company_name` and `query_stock_ticker` from human input
- `get_stock_price` *for a stock ticker*
  - A wrapper function for `query_stock_price`
- `download_10k_filing_from_sec_and_ingest_into_opensearch` *for a stock ticker*
  - A convenience function to download a given 10-K Filing from *sec-api.io* and ingest the result into our OpenSearch Serverless vector index. Returns if the "download" succeeded - essentially if the given stock ticker was found and a 10-K filing was successfully downloaded and ingested into Amazon OpenSearch Serverless.

In [428]:
def is_financial_statement_related_query(human_input):
    template = """You are a helpful assistant to judge if the human input is trying to analyze company financial statement.
    If the human input is financial statement related question, answer \"yes\". Otherwise answer \"no\".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related

def is_stock_related_query(human_input):
    template = """
    You are a helpful assistant to judge if the human input is stock related question. 
    If the human input is stock related question, return "yes".Otherwise return "no".
    """
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related

def get_company_name(human_input):
    template = """You are a helpful assistant who extract company name from the human input.Please only output the company"""
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )

    company_name=llm_chain({"text":human_input})['text'].strip()
    return company_name
    
def semantic_search_and_check(human_input, k=10,with_post_filter=True):

    company_name=get_company_name(human_input)
    
    search_vector = bedrock_embeddings.embed_query(human_input)

    no_post_filter_search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        }
    }

    post_filter_search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        },
        "post_filter": {
           "match": { 
               "company_name":company_name
           }
        }
    }
    
    search_query=no_post_filter_search_query
    if with_post_filter:
        search_query=post_filter_search_query
    
    res = aos_client.search(index=vector_index_name, 
                       body=search_query,
                       stored_fields=["company_name","item_category","item_content"])
    
    query_result=[]
    for hit in res['hits']['hits']:
        hit_company=hit['fields']['company_name'][0]
        print("\nsemantic search hit company: " + hit_company)
        row=[hit['fields']['company_name'][0], hit['fields']['item_content'][0]]
        query_result.append(row)

    query_result_df = pd.DataFrame(data=query_result,columns=["company_name","company_financial_statements"])
    return query_result_df

def search_for_similiar_content_in_10k_filing(human_input):
    company_statements = semantic_search_and_check(human_input)
    return company_statements

def search_financial_statements_for_company(company_financial_statements_query):
    company_statements = semantic_search_and_check(company_financial_statements_query)
    return company_statements

def get_stock_ticker(human_input):
    company_name=get_company_name(human_input)
    company_ticker = query_stock_ticker(company_name)
    return company_ticker

def get_stock_price(stock_ticker):
    stock_price = query_stock_price(stock_ticker)
    return stock_price

def download_10k_filing_from_sec_and_ingest_into_opensearch(stock_ticker):
    result = "download failed."
    try:
        downloaded_file=get_filings(stock_ticker)
        ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file, vector_index_name)
        time.sleep(60) #wait the data can be searchable
        result="download succeeded."
    except Exception as e:
        result = "download failed."
    return result

#### Define tools for financial statements analysis AI agent
Now we will connect our Python functions above as tools for our AI agent to use:

In [None]:
from langchain.agents import Tool

annual_report_tools=[
    Tool(
        name="is_financial_statement_related_query",
        func=is_financial_statement_related_query,
        description="""
        Use this tool when you need to know whether user input query is financial statement analysis related query. Human orginal query is the input to this tool. This tool output is whether human input is financial statement analysis related or not. 
        If the query is not finance statement related, please answer \"I am finiancial statement ansysis assitant. I can not answer question which is not finance related.\" and terminate the dialog.
        """
    ),
    Tool(
        name="search_financial_statements_for_company",
        func=search_financial_statements_for_company,
        description="""
        Use this tool to get financial statement of the company. This tool output is company financial statements.
        """
    ),
    Tool(
        name="get_stock_ticker",
        func=get_stock_ticker,
        description="Use this tool when you need to get the company stock ticker. Human orginal query is the input to this tool. This tool will output company stock ticker."
    ),
    Tool(
        name="download_10k_filing_from_sec_and_ingest_into_opensearch",
        func=download_10k_filing_from_sec_and_ingest_into_opensearch,
        description="""
        Use this tool to download company financial statements from internet. Company stock ticker is the input to this tool. The tool output is download succeed or not.
        Use this tool if and only if "search_financial_statements_for_company" output result is empty. After downloading financial statements, you must use "search_financial_statements_for_company" tool to search financial statements again.
        """
    ),
    Tool(
        name="is_stock_related_query",
        func=is_stock_related_query,
        description="Use this tool when you need to know whether user input query is stock related query. Human orginal query is the input to this tool. This tool output is whether human input is stock related or not."
    ),
    Tool(
        name="get_stock_price",
        func=get_stock_price,
        description="""
        Use this tool to get company stock price data. Company stock ticker is the input to this tool. This tool will output company historic stock price. The output includes 'stock_date', 'stock_ticker', 'open_price', 'high_price', 'low_price', 'close_price' of the company in the latest 100 days.
        This tool is mandatory to use if the input query is both finance statement related and stock related. If the output of "get_stock_price" is empty, please answer \"I cannot provide stock analysis without stock price information.\" and terminate the dialog.
        """
    )
]

#### Define prompt for financial statements analysis AI agent 
We'll provide a prompt template for our AI Agent:

In [429]:
system_message = f"""
You are finiancial analyst assistant and you will analyze company financial statements and stock data. 
Leverage the <conversation_history> to avoid duplicating work when answering questions.

Available tools:
<tools>
{{tools}}
</tools>


To answer, first review the <conversation_history>. If insufficient use tool(s) with the following format:
<thinking>Think about which tool(s) to use and why. "get_stock_price" tool is mandatory to use if the input query is both finance statements related and stock related.</thinking>
<tool>tool_name</tool>
<tool_input>input</tool_input>
<observation>response</observation>

When you are done, provide a final answer in markdown within <final_answer></final_answer>.
If <user_input> is stock related and the output of "get_stock_price" tool is empty, respond directly within <final_answer> with the exact content \"I cannot provide stock analysis without stock price information.\".
Otherwise, use the following format to organize your <final_answer>:

Summary:
...

Support points:
Support point 1: ...
Support point 2: ...
Support point 3: ...


"""

user_message = """
Begin!

Previous conversation history:
<conversation_history>
{chat_history}
</conversation_history>

User input message:
<user_input>
{input}
</user_input>

{agent_scratchpad}
"""

# Construct the prompt from the messages
messages = [
    ("system", system_message),
    ("human", user_message),
]

financial_statements_analysis_prompt = ChatPromptTemplate.from_messages(messages)

#### Define memory for financial statements analysis AI agent
As noted from the prompt template above, we will optimize by keeping a history table in Amazon DynamoDB:

In [433]:
dynamodb = boto3.client('dynamodb')
history_table_name = 'conversation-history-memory'

try:
    response = dynamodb.describe_table(TableName=history_table_name)
    print("The table "+history_table_name+" exists.")
except dynamodb.exceptions.ResourceNotFoundException:
    print("The table "+history_table_name+" does not exist.")
    
    dynamodb.create_table(
    TableName=history_table_name,
    AttributeDefinitions=[
        {
            'AttributeName': 'SessionId',
            'AttributeType': 'S',
        }
    ],
    KeySchema=[
        {
            'AttributeName': 'SessionId',
            'KeyType': 'HASH',
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5,
    }
    )

    response = dynamodb.describe_table(TableName=history_table_name) 
    while response["Table"]["TableStatus"] == 'CREATING':
        time.sleep(1)
        print('.', end='')
        response = dynamodb.describe_table(TableName=history_table_name) 

    print("\nAmazon Dynamo DB Table, '"+response['Table']['TableName']+"' is created")

The table conversation-history-memory exists


#### Create financial statements analysis AI agent AI using defined Memory (in Amazon DynamoDB), LLM, the tools we have created and the prompt above

>Note: to avoid potential conflicts with the "Human" keyword in the Anthropic Claude model (*"Claude enjoys helping humans and sees its role as an intelligent and kind assistant to the people, with depth and wisdom that makes it more than a mere tool."*), we use `Hu` as the human prefix.

We'll create a couple of additional functions to facilitate working with Memory (specifically, using the `DynamoDBChatMessageHistory` module of LangChain's Chat Message Histories):

In [435]:
def create_new_memory_with_session(session_id):
    chat_memory = DynamoDBChatMessageHistory(table_name=history_table_name,session_id=session_id)    
    return chat_memory

def get_agentic_chatbot_conversation_chain(session_id, verbose=True):
    chat_memory=create_new_memory_with_session(session_id)
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        human_prefix="Hu",
        chat_memory=chat_memory,
        return_messages=False
    )

    agent = create_xml_agent(
        bedrock_llm,
        annual_report_tools,
        financial_statements_analysis_prompt,
        stop_sequence=["</tool_input>", "</final_answer>"]
    )

    agent_chain = AgentExecutor(
        agent=agent,
        tools=annual_report_tools,
        return_intermediate_steps=False,
        verbose=True,
        memory=memory,
        handle_parsing_errors="Check your output and make sure it conforms!"
    )
    return agent_chain

### 3.3 Use the financial statements analysis AI agent

In [None]:
# Since we have left the Amazon Redshift Serverless configuration largely set to the default settings,
# it's possible that an hour has elapsed since our last connection. To ensure the connection is 
# available to run the examples, please execute the following:
%sql $connect_to_db
%sql select current_user, version();

#### Example 1

For our first example using an AI agent, let's revisit the comparison query. Specifically, we will compare the financial statements of two companies.

The data flow should be similar to the following:

![example 1](./static/example-1-data-flow.png)

In [443]:
question="Compare Adobe and Autodesk company financial statements"
session_id = str(uuid4())
langchain.debug=False
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": question})

 * redshift+redshift_connector://awsuser:***@workgroup-ecce4d60.273709946938.us-east-1.redshift-serverless.amazonaws.com:5439/dev
Done.


[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m<thinking>
To compare the financial statements of Adobe and Autodesk, I will need to:
1. Check if the query is related to financial statement analysis using the is_financial_statement_related_query tool.
2. If it is related, search for the financial statements of Adobe and Autodesk using the search_financial_statements_for_company tool.
3. If the financial statements are not found, use the get_stock_ticker tool to get the stock tickers for Adobe and Autodesk.
4. Then use the download_10k_filing_from_sec_and_ingest_into_opensearch tool to download and ingest the latest 10-K filings for both companies.
5. Search again for the financial statements using the search_financial_statements_for_company tool.
</thinking>

<tool>is_financial_statement_related_query</tool>
<tool_input>Compare Adobe and A

In [450]:
ai_agent_result1 = response["output"].replace("$","\\\$")
display(Markdown("### Result\n" + ai_agent_result1))

### Result

Summary:
Based on Adobe's financial statements and recent stock performance, Adobe appears to be a solid investment choice at the moment. The company has strong financials with growing revenue and profitability, a leading market position in its core creative software business, and its stock has shown steady gains over the past year.

Support points:

1. Adobe's revenue grew 15% year-over-year in fiscal 2020 to \\$12.87 billion, driven by strong growth in its Digital Media and Digital Experience segments. Net income also increased 25% to \\$4.79 billion.

2. Adobe has a robust balance sheet with \\$5.9 billion in cash and investments and manageable debt levels as of the end of fiscal 2020. This provides financial flexibility to invest in growth initiatives.

3. Adobe's stock price (ticker ADBE) has risen over 50% in the past year, outperforming the broader market. The stock hit new all-time highs above \\$500 in late 2020, indicating positive investor sentiment.

4. Adobe is the market leader in creative software tools like Photoshop and Illustrator. Its transition to a subscription-based cloud model provides recurring revenue stability.

5. Analysts are optimistic about Adobe's prospects, with an average price target around \\$550 implying further upside from current levels.

While no investment is risk-free, Adobe's strong fundamentals, market leadership, and positive stock momentum make it an attractive investment opportunity currently based on the available information.


#### Example 2

Let's now try asking a straightforward question: is a company's stock a good investment choice at the present time? 

The data flow should be like the following:

![example 2](./static/example-2-data-flow.png)

In [449]:
question="Is Adobe a good investment choice right now?"
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": question})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m<thinking>
To determine if Adobe is a good investment choice, I need to analyze Adobe's financial statements and stock data. I will first check if the query is related to financial statements and stocks.
</thinking>

<tool>is_financial_statement_related_query</tool>
<tool_input>Is Adobe a good investment choice right now?[0m[36;1m[1;3mYes, this question is related to analyzing a company's financial statements and performance, which would be relevant for evaluating Adobe as a potential investment choice.[0m[32;1m[1;3m<thinking>To determine if Adobe is a good investment choice, I will need to analyze Adobe's financial statements and stock data. I should first check if I have Adobe's financial statements available. If not, I will need to download them from the SEC website using the stock ticker. I will also need to get Adobe's stock price data using the "get_stock_price" tool since the query is related to evaluating Adobe 

In [454]:
ai_agent_result2 = response["output"].replace("$","\\\$")
display(Markdown("### Result\n" + ai_agent_result2))

### Result


Summary:
Based on the available information, I cannot provide a comprehensive analysis on whether Amazon is a good investment choice right now. While I was able to retrieve Amazon's financial statements, I do not have access to their recent stock price data, which is crucial for evaluating the company's performance and potential as an investment.

Support points:

1. Amazon's financial statements provide insights into the company's business operations, revenue streams, expenses, and overall financial health. However, this information alone is not sufficient to determine if Amazon is a good investment choice at the moment.

2. Stock price data, including historical prices, trading volume, and market trends, is essential for analyzing a company's investment potential. Without access to Amazon's recent stock price information, it is difficult to assess factors like valuation, growth prospects, and market sentiment towards the company.

3. A comprehensive investment analysis typically involves evaluating both financial statements and stock performance data, as well as considering broader market conditions, industry trends, and competitive landscape. Without the complete set of information, I cannot provide a well-informed recommendation on whether Amazon is a good investment choice right now.



#### Example 3

Since this is an Amazon presentation showcasing AWS services, it would be appropriate to ask a probing question: Is Amazon's stock a good investment choice at the present time?

The data flow should follow this process:

![example 3](./static/example-3-data-flow.png)

In [452]:
question="Is Amazon a good investment choice right now?"
session_id = str(uuid4())
conversation_chain = get_agentic_chatbot_conversation_chain(session_id=session_id)
response=conversation_chain.invoke({"input": question})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m<thinking>
To determine if Amazon is a good investment choice, I need to analyze Amazon's financial statements and stock data. I will first check if the query is related to financial statements and stocks.
</thinking>

<tool>is_financial_statement_related_query</tool>
<tool_input>Is Amazon a good investment choice right now?[0m[36;1m[1;3mYes, this question is related to analyzing a company's financial statements and performance, which would be relevant for evaluating Amazon as a potential investment choice.[0m[32;1m[1;3m<thinking>To evaluate if Amazon is a good investment choice, I need to analyze Amazon's financial statements and stock data. I will first check if I have Amazon's financial statements available. If not, I will need to download them from the SEC website using the stock ticker. I will also need to get Amazon's stock price data using the "get_stock_price" tool since the query is related to evaluating Amazon

In [453]:
ai_agent_result3 = response["output"].replace("$","\\\$")
display(Markdown("### Result\n" + ai_agent_result3))

### Result


Summary:
Based on the available information, I cannot provide a comprehensive analysis on whether Amazon is a good investment choice right now. While I was able to retrieve Amazon's financial statements, I do not have access to their recent stock price data, which is crucial for evaluating the company's performance and potential as an investment.

Support points:

1. Amazon's financial statements provide insights into the company's business operations, revenue streams, expenses, and overall financial health. However, this information alone is not sufficient to determine if Amazon is a good investment choice at the moment.

2. Stock price data, including historical prices, trading volume, and market trends, is essential for analyzing a company's investment potential. Without access to Amazon's recent stock price information, it is difficult to assess factors like valuation, growth prospects, and market sentiment towards the company.

3. A comprehensive investment analysis typically involves evaluating both financial statements and stock performance data, as well as considering broader market conditions, industry trends, and competitive landscape. Without the complete set of information, I cannot provide a well-informed recommendation on whether Amazon is a good investment choice right now.

