# Generative AI-powered search with Amazon OpenSearch Service

---
### Usage Scenario
Form 10-K is a comprehensive report filed annually by a publicly traded company about its financial performance and is required by the U.S. Securities and Exchange Commission (SEC). Some of the information a company is required to document in the 10-K includes its history, organizational structure, financial statements, earnings per share, subsidiaries, executive compensation, and any other relevant data.

The SEC mandates that all public companies file regular 10-Ks to keep investors aware of a company's financial condition and to give them enough information before they buy or sell securities issued by that company. The 10-K can appear overly complex at first glance, with tables full of data and figures. However, because it is so comprehensive, this filing is critical for investors who want to understand a company's financial position and prospects.

Form 10-K provides a comprehensive analysis of the company's financial condition and consists of the following items:

- 1 - Business (describes the company's operations)
- 1A - Risk Factors
- 1B - Unresolved Staff Comments
- 2 - Properties
- 3 - Legal Proceedings
- 4 - Mine Safety Disclosures
- 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
- 6 - Selected Financial Data (prior to February 2021)
- 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
- 7A - Quantitative and Qualitative Disclosures about Market Risk
- 8 - Financial Statements and Supplementary Data
- 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
- 9A - Controls and Procedures
- 9B - Other Information
- 10 - Directors, Executive Officers and Corporate Governance
- 11 - Executive Compensation
- 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
- 13 - Certain Relationships and Related Transactions, and Director Independence
- 14 - Principal Accountant Fees and Services
- 15 - Exhibits and Financial Statement Schedules

---

Many investors rely on SEC filings to analyze the financial health of a company, and the filings can be a treasure trove of valuable information. However, keyword-based search may return irrelevant results, and even with semantic search the amount of information can be overwhelming. Can we leverage generative AI to help interpret company financial statements?


In this code talk session, we will show you how to modernize your search application to improve search relevance with Amazon OpenSearch Service while leveraging generative AI to improve search productivity. The code covers the following topics:
- Comparing the search relevance of keyword search and semantic search with Amazon OpenSearch Service.
- How to use Retrieval Augmented Generation (RAG) to improve search productivity.
- How to build an intelligent agent that orchestrates and executes multistep tasks to automate 10-K filing analysis.
- OpenSearch vector store best practices.

---


### Code Structure


The code includes the following sections:
- [Initialize](#Initialize)
- [Part 1: Ingest unstructured data into OpenSearch](#Part-1:-Ingest-unstructured-data-into-OpenSearch)
- [Part 2: Different approach to search](#Part-2:-Different-appoach-to-search)
    - [2.1 Keyword search](#2.1-Keyword-search)
    - [2.2 Semantic/Vector search](#2.2-Semantic/Vector-search)
    - [2.3 Retrieval Augmented Generation(RAG)](#2.3-Retrieval-Augmented-Generation(RAG))
- [Part 3: AI agent powered search](#Part-3:-AI-agent-powered-search)
    - [3.1 Prepare other tools used by AI agent](#3.1-Prepare-other-tools-used-by-AI-agent)
        - [3.1.1 Ingest and query structured data in Redshift](#3.1.1-Ingest-and-query-structured-data-in-Redshift)
        - [3.1.2 Download 10-K filing from SEC](#3.1.2-Download-10-K-filing-from-SEC)
    - [3.2 Create AI agent](#3.2-Create-AI-agent)
    - [3.3 Use AI agent](#3.3-Use-AI-agent)


## Initialize




Make sure the PyTorch version is greater than or equal to 2.2.0.

In [None]:
import torch
print(torch.__version__)
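
The check above can also be done programmatically. The following is a minimal sketch, assuming version strings of the form `2.2.0` or `2.2.0+cu121`; the helper name `meets_minimum` is hypothetical and not part of PyTorch.

```python
import re

def meets_minimum(version: str, minimum=(2, 2, 0)) -> bool:
    """Return True if a version string such as '2.2.0+cu121' is >= minimum."""
    # Keep only the leading numeric release part, dropping local tags like '+cu121'
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component (e.g. '2.3') is treated as 0
    parts = tuple(int(p) if p is not None else 0 for p in match.groups())
    return parts >= minimum

print(meets_minimum("2.2.0+cu121"))
print(meets_minimum("2.1.2"))
```

In the notebook, this could be called as `meets_minimum(torch.__version__)`.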

### Install the Python dependencies for OpenSearch, Redshift, and LangChain

In [None]:
%pip install -q opensearch-py
%pip install boto3
%pip install sqlalchemy
%pip install sqlalchemy-redshift
%pip install redshift_connector
%pip install ipython-sql==0.4.1
%pip install langchain
%pip install -U langchain-aws
%pip install -U langchain-community
%pip install langchain_experimental

### Import library



In [None]:
import boto3
import re
import time
import sagemaker
import json
from sagemaker.session import Session
import pandas as pd
import os

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name

## Part 1: Ingest unstructured data into OpenSearch

### Get SEC 10-K form files

In [None]:
!wget https://ws-assets-prod-iad-r-sfo-f61fc67057535f1b.s3.us-west-1.amazonaws.com/df655552-1e61-4a6b-9dc4-c03eb94c6f75/10k-financial-filing.zip

Unzip the dataset

In [None]:
!unzip 10k-financial-filing.zip

Read the dataset in JSON format and construct a pandas DataFrame

In [None]:
# Specify the path to the folder containing the JSON files
folder_path = "extracted"

# Initialize an empty list to store list of company 10-K filing file names
company_filing_file_name_list = []

# For this session, we only ingest filings from a few companies.
company_list=["Alteryx, Inc.",
              "MICROSTRATEGY Inc", 
              "Elastic N.V.", 
              "MongoDB, Inc.", 
              "Palo Alto Networks Inc", 
              "Okta, Inc.",
              "Datadog, Inc.", 
              "Snowflake Inc.",
              "SALESFORCE.COM, INC.", 
              "ORACLE CORP",
              "MICROSOFT CORP", 
              "Palantir Technologies Inc."
             ]

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        df = pd.DataFrame([pd.read_json(file_path,typ='series')])
        if df.iloc[0]['company'] in company_list:
            company_filing_file_name_list.append(file_path)


In [None]:
company_filing_file_name_list

### Initialize embedding model to vectorize text data

**Embeddings** are numerical representations of data, typically used to convert complex, high-dimensional data into a lower-dimensional space where similar data points are closer together. In the context of natural language processing (NLP), embeddings are used to represent words, phrases, or sentences as vectors of real numbers. These vectors capture semantic relationships, meaning that words with similar meanings are represented by vectors that are close together in the embedding space.

**Embedding models** are machine learning models that are trained to create these numerical representations. They learn to encode various types of data into embeddings that capture the essential characteristics and relationships within the data. For example, in NLP, embedding models like Word2Vec, GloVe, and BERT are trained on large text corpora to produce word embeddings. These embeddings can then be used for various downstream tasks, such as text classification, sentiment analysis, or machine translation. In this case, we'll be using them for semantic similarity.

We use an embedding model to convert questions into vectors and use vector similarity to find semantically similar 10-K content. The following diagram shows the flow:

![Convert Text to Vector](./static/text2vector.png)

---

![opensearch vector store](./static/opensearch-vector-store.png)
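
To make "vectors that are close together" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made-up illustrative values, not output from any embedding model; real embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only)
revenue = [0.9, 0.1, 0.0]
sales = [0.8, 0.2, 0.1]
lawsuit = [0.0, 0.2, 0.9]

# Related concepts should score higher than unrelated ones
print(cosine_similarity(revenue, sales) > cosine_similarity(revenue, lawsuit))
```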


In [None]:
#from langchain.embeddings import BedrockEmbeddings
from langchain_aws import BedrockEmbeddings

boto3_bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=f"https://bedrock-runtime.{aws_region}.amazonaws.com")
bedrock_embeddings = BedrockEmbeddings(model_id='amazon.titan-embed-text-v1',client=boto3_bedrock)

In [None]:
result = bedrock_embeddings.embed_query("This is a content of the document")
result[0:20]

In [None]:
len(result)

### Create an OpenSearch cluster connection.
Next, we'll use the Python client to set up a connection to the OpenSearch cluster.

Note: if you're using a region other than us-east-1, please update the region in the code below.

#### Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store this in "outputs" to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

In [None]:
import json
region = aws_region

cfn = boto3.client('cloudformation')
kms = boto3.client('secretsmanager')  # note: despite the variable name, this is a Secrets Manager client

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "generative-ai-powered-search"

outputs = get_cfn_outputs(cloudformation_stack_name)
aos_host = outputs['OpenSearchDomainEndpoint']
aos_credentials = json.loads(kms.get_secret_value(SecretId=outputs['OpenSearchSecret'])['SecretString'])

outputs

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection

auth = (aos_credentials['username'], aos_credentials['password'])
aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

### Create an index in Amazon OpenSearch Service

The OpenSearch k-NN plugin introduces a custom data type, the knn_vector, that allows users to ingest their k-NN vectors into an OpenSearch index and perform different kinds of k-NN search. 

---
#### OpenSearch Approximate Nearest Neighbor Algorithms and Engines
![ANN algorithm](./static/ann-algorithm.png)

---

#### HNSW parameter tuning
![hnsw parameter tuning](./static/hnsw-parameter-tuning.png)

---

#### IVF parameter tuning
![ivf parameter tuning](./static/ivf-parameter-tuning.png)

#### How to select the engine and algorithms
![opensearch ann comparison](./static/opensearch-ann-selection.png)

In [None]:
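knn_index = {
    "settings": {
        "index.knn": True,
        # Index-level default; the "space_type" set in the field's method below is expected to take precedence
        "index.knn.space_type": "cosinesimil"
    },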
    "mappings": {
        "properties": {
            "item_vector": {
                "type": "knn_vector",
                "dimension": 1536,
                "store": True,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "nmslib",
                    "parameters": {
                      "ef_construction": 128,
                      "m": 24
                    }
                }
            },
            "item_content": {
                "type": "text",
                "store": True
            },
            "company_name": {
                "type": "text",
                "store": True
            }
        }
    }
}


Using the above index definition, we now create the index in Amazon OpenSearch Service.

In [None]:
index_name="10k_financial"

# Recreate the index from scratch so that repeated runs start clean
if aos_client.indices.exists(index=index_name):
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=index_name)
else:
    print("index does not exist.")

aos_client.indices.create(index=index_name,body=knn_index,ignore=400)

Let's verify the created index information

In [None]:
aos_client.indices.get(index=index_name)
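
Before loading data, it can help to confirm that each vector matches the `dimension` declared for the `knn_vector` field, since a mismatched vector is expected to be rejected at index time. The following is a minimal sketch; the helper names and the trimmed-down `sample_index` are hypothetical stand-ins for the real index definition.

```python
def vector_dimension(index_body: dict, field: str) -> int:
    """Read the declared dimension of a knn_vector field from an index body."""
    return index_body["mappings"]["properties"][field]["dimension"]

def check_vector(index_body: dict, field: str, vector) -> None:
    """Raise ValueError when the vector length differs from the mapping."""
    expected = vector_dimension(index_body, field)
    if len(vector) != expected:
        raise ValueError(f"{field} expects {expected} dimensions, got {len(vector)}")

# Trimmed-down stand-in for an index definition (assumption: 1536 dimensions)
sample_index = {
    "mappings": {"properties": {"item_vector": {"type": "knn_vector", "dimension": 1536}}}
}

check_vector(sample_index, "item_vector", [0.0] * 1536)  # passes silently
```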


###  Load the raw data into the Index
Next, let's load the financial filing data into the index we've just created.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
import pandas

from typing import Any, Dict, List, Optional, Sequence

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class PandasDataFrameLoader(BaseLoader):
    
    def __init__(self,dataframe:pandas.DataFrame):
        self.dataframe=dataframe
        
    def load(self) -> List[Document]:
        docs = []
        items=["item_1","item_1A","item_1B","item_2","item_3","item_4","item_5","item_6","item_7","item_7A","item_8","item_9","item_9A", "item_9B", "item_10", "item_11", "item_12", "item_13", "item_14", "item_15"]
        
        for index, row in self.dataframe.iterrows():
            # Metadata shared by every item of this filing
            base_metadata={
                "cik": row['cik'],
                "company_name": row['company'],
                "filing_date": row['filing_date'],
            }
            for item in items:
                content=row[item]
                # Copy per document so each one keeps its own 'item' label
                metadata={**base_metadata, "item": item}
                doc = Document(page_content=content,metadata=metadata)
                docs.append(doc)
        return docs

Use the Bedrock embedding model to convert item content into vectors, and use the OpenSearch bulk API to store the data in the OpenSearch index.

In [None]:
import time
from opensearchpy import helpers

def ingest_downloaded_10k_into_opensearch(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    item_contents=[]
    company_name=splitted_documents[0].metadata['company_name']
    for doc in splitted_documents:
        item_contents.append(doc.page_content)
    
    print("\ncompany:" + company_name + ", item count:" + str(len(item_contents)))
    start = time.time()
    embedding_results = bedrock_embeddings.embed_documents(item_contents)
    end = time.time()
    elapsed = end - start
    #print(f"total time elapsed for Bedrock embedding: {elapsed:.2f} seconds")
        
    # Pair each chunk with its embedding to build the bulk actions
    data = [
        {"_index": index_name, "company_name": company_name, "item_content": content, "item_vector": vector}
        for content, vector in zip(item_contents, embedding_results)
    ]
    aos_response= helpers.bulk(aos_client, data)
    print(f"Bulk-inserted {aos_response[0]} items.")

In [None]:
for file in company_filing_file_name_list:
    ingest_downloaded_10k_into_opensearch(file)
    print("Ingested :" + file)
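
The ingestion function above splits each item with `chunk_size = 8000` and `chunk_overlap = 200`. To illustrate what those two parameters mean, here is a simplified fixed-window sketch. It is not the `RecursiveCharacterTextSplitter` algorithm, which additionally prefers to break at separators such as paragraphs and sentences; the function name is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 8000, chunk_overlap: int = 200):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Small numbers make the overlap easy to see
sample = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding a little duplicated text.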

To validate the load, we'll query the number of documents in the index.

In [None]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])

## Part 2: Different appoach to search

### 2.1 Keyword search
---
Keyword search finds information by matching the terms in a query against a large body of text. It relies on literal term matches (often called "keyword match") and does not consider the meaning or context behind those words.


![Keyword Search](./static/keyword-search-flow.png)
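
The limitation of literal matching can be sketched in a few lines. This is a deliberate simplification: OpenSearch analyzes text and ranks `match` results with a relevance score rather than a raw term count, and the two sample documents below are invented for illustration.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase the text and return its distinct alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_score(query: str, document: str) -> int:
    """Count how many distinct query terms literally appear in the document."""
    return len(tokenize(query) & tokenize(document))

# Hypothetical documents: the first answers the question without sharing its words
docs = {
    "relevant": "Total sales grew, driven by cloud services income.",
    "irrelevant": "What is the main risk? Revenue recognition is discussed later.",
}
query = "What is the main revenue?"
for name, text in docs.items():
    print(name, keyword_score(query, text))
```

The document that merely repeats the query's words outranks the one that actually discusses the topic, which is the behavior the queries below demonstrate on real filings.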



In [None]:
query_text="What Microsoft's research and development organization is responsible for?"

Run the query and check the search result. Some irrelevant documents are returned.

In [None]:
query={
  "size": 10,
  "query": {
    "match": {
      "item_content": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_content" : {}
    }
  }
}
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content'], hit['highlight']['item_content'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content","item_content_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

In [None]:
query_text="What is Microsoft's main revenue?"

Run the query and check the search result. Some irrelevant documents are returned.

In [None]:
query={
  "size": 10,
  "query": {
    "match": {
      "item_content": query_text
    }
  },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "item_content" : {}
    }
  }
}
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content'], hit['highlight']['item_content'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content","item_content_highlight"])
pd.set_option('display.max_colwidth', 500)
display(query_result_df)

### 2.2 Semantic/Vector search

---
Semantic search uses machine learning to understand the meaning of a query. By taking the intent and context of the terms into account, it aims to return results that are more relevant than those of plain keyword matching.

![Semantic Search](./static/semantic-search-flow.png)

---


![Semantic Search Architecture](./static/semantic-search-architecture.png)
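
Conceptually, a k-NN query ranks stored vectors by their similarity to the query vector and returns the top `k`. The sketch below does this exhaustively with cosine similarity over made-up three-dimensional vectors. It is an exact, brute-force stand-in: the HNSW index used in this notebook finds approximate neighbors without scanning every vector.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_knn(query_vector, vectors: dict, k: int):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = ((doc_id, cosine(query_vector, vec)) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical document vectors, for illustration only
corpus = {
    "doc_rnd": [0.9, 0.1, 0.1],
    "doc_revenue": [0.1, 0.9, 0.2],
    "doc_legal": [0.1, 0.1, 0.9],
}
print(exact_knn([0.8, 0.2, 0.1], corpus, k=2))
```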



In [None]:
query_text="What Microsoft's research and development organization is responsible for?"

Run the query and check the search result. 

In [None]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

In [None]:
query_text="What is Microsoft's main revenue?"

Run the query and check the search result. 

In [None]:
result = bedrock_embeddings.embed_query(query_text)
search_vector = result

query={
    "size": 10,
    "query": {
        "knn": {
            "item_vector":{
                "vector":search_vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['company_name'],hit['_source']['item_content']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","company_name","item_content"])
display(query_result_df)

### 2.3 Retrieval Augmented Generation(RAG)

In RAG, external data can be sourced from various data sources, such as document repositories, databases, or APIs. The first step is to convert the documents and the user query into a format that enables comparison and allows for performing relevancy search. To achieve comparability for relevancy search, a document collection (knowledge library) and the user-submitted query are transformed into numerical representations using embedding language models. These embeddings are essentially numerical representations of concepts in text.

Next, based on the embedding of the user query, relevant text is identified in the document collection through similarity search in the embedding space. The retrieved text is then added to the user's prompt as context. This augmented prompt, which includes the relevant external data along with the original prompt, is sent to the large language model (LLM) for processing. Because the context contains relevant external data, the model output tends to be more relevant and accurate.

The major components of RAG are embedding, the vector database, retrieval (augmentation), and generation:


- Embedding
    - Purpose: Embeddings transform text data into numerical vectors in a high-dimensional space. These vectors represent the semantic meaning of the text.
    - Process: The embedding process typically uses pre-trained models (like BERT or a variant) to convert both the input queries and the documents in the database into dense vectors.
    - Role in RAG: Embeddings are crucial for the retrieval component as they allow the model to compute the similarity between the query and the documents in the database efficiently.
- Vector Database
    - Function: A vector database stores the embeddings of a large collection of documents or passages.
    - Construction: It is created by processing a vast corpus (like Wikipedia or other specialized datasets) through an embedding model.
    - Usage in RAG: When a query comes in, the model searches this database to find the documents whose embeddings are most similar to the embedding of the query.
- Retrieval (Augmentation)
    - Mechanism: The retrieval part of RAG functions by taking the input query, converting it into a vector using embeddings, and then searching the vector database to retrieve relevant documents.
    - Result: It augments the original query with additional context by selecting documents or passages that are semantically related to the query. This augmented information is essential for generating more informed responses.
- Generation
    - Integration with a Language Model: The generative component, often a large language model like Amazon Titan Text, receives both the original query and the retrieved documents.
    - Response Generation: It synthesizes information from these inputs to produce a coherent and contextually appropriate response.
    - Training and Fine-Tuning: This component is generally pre-trained on vast amounts of text and may be further fine-tuned to optimize its performance for specific tasks or datasets.
- End-to-End Training (Optional)
    - Joint Optimization: In RAG, both retrieval and generation components can be fine-tuned together, allowing the system to optimize the selection of documents and the generation of responses simultaneously.
    - Feedback Loop: The model learns not only to generate relevant responses but also to retrieve the most useful documents for any given query.
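
The augmentation step, where retrieved text is combined with the user's question, can be sketched as plain string assembly. The prompt wording and the function name `build_rag_prompt` are assumptions made for illustration; they are not the template that LangChain's chains use internally.

```python
def build_rag_prompt(question: str, passages: list) -> str:
    """Combine retrieved passages and the user question into one prompt."""
    # Number the passages so the model (and the reader) can tell them apart
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Use only the context below to answer the question. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder passages standing in for chunks returned by the vector search
passages = [
    "Placeholder passage retrieved from a 10-K filing.",
    "Another placeholder passage.",
]
print(build_rag_prompt("What is the company's main revenue source?", passages))
```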

---
### Architecture

![RAG](./static/RAG_Architecture.png)

---

In [None]:
langchain_index_name="10k_financial_embedding"

In [None]:
# Delete any previous copy of the index so that repeated runs start clean
if aos_client.indices.exists(index=langchain_index_name):
    print("delete existing index before creating new one")
    aos_client.indices.delete(index=langchain_index_name)
else:
    print("index does not exist.")
    

In [None]:
from langchain_community.vectorstores import OpenSearchVectorSearch
from typing import Callable

os_domain_ep = 'https://'+aos_host


def ingest_10k_into_opensearch_with_langchain(file_name):
    df = pd.DataFrame([pd.read_json(file_name,typ='series')])
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 8000, chunk_overlap = 200)
    pd_loader = PandasDataFrameLoader(df)
    documents = pd_loader.load()
    splitted_documents = text_splitter.split_documents(documents)
    
    OpenSearchVectorSearch.from_documents(
                index_name = langchain_index_name,
                documents=splitted_documents,
                embedding=bedrock_embeddings,
                opensearch_url=os_domain_ep,
                http_auth=auth
    )

In [None]:
for file in company_filing_file_name_list:
    ingest_10k_into_opensearch_with_langchain(file)
    print("Ingested :" + file)


In [None]:
aos_client.indices.get(index=langchain_index_name)

In [None]:

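class SimiliarOpenSearchVectorSearch(OpenSearchVectorSearch):
    # Subclass that passes the raw OpenSearch score through as the relevance score,
    # so the retriever's score_threshold is compared against that raw score

    def relevance_score(self, distance: float) -> float:
        # Return the score unchanged (no normalization)
        return distance

    def _select_relevance_score_fn(self) -> Callable[[float], float]:
        return self.relevance_score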


open_search_vector_store = SimiliarOpenSearchVectorSearch(
                                    index_name=langchain_index_name,
                                    embedding_function=bedrock_embeddings,
                                    opensearch_url=os_domain_ep,
                                    http_auth=auth
                                    ) 


Initialize the Bedrock LLM with Claude

In [None]:
from langchain_aws import BedrockLLM, ChatBedrock

#bedrock_llm = BedrockLLM(model_id="anthropic.claude-v2", client=boto3_bedrock)
bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=boto3_bedrock)
#bedrock_llm = ChatBedrock(model_id="anthropic.claude-3-5-sonnet-20240620-v1:0", client=boto3_bedrock)


bedrock_llm.model_kwargs = {"temperature":0.01,"top_k":250,"top_p":1}


In [None]:
from langchain.chains import RetrievalQA
import langchain

bedrock_retriever = open_search_vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        'k': 5,
        'score_threshold': 0.005
    }
)
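
The retriever above uses `search_type="similarity_score_threshold"` with `k` set to 5 and `score_threshold` set to 0.005. Conceptually, this keeps at most `k` results and drops any whose score falls below the threshold. The sketch below illustrates that idea on made-up scores; it is not LangChain's implementation, and the function name is hypothetical.

```python
def filter_by_score(scored_docs, k: int = 5, score_threshold: float = 0.005):
    """Keep the k best (doc, score) pairs whose score reaches the threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    # Highest score first, then truncate to k results
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical (chunk, score) pairs, for illustration only
candidates = [("chunk-a", 0.012), ("chunk-b", 0.004), ("chunk-c", 0.009)]
print(filter_by_score(candidates))
```

A threshold that is too high can leave the LLM with no context at all, so it is worth inspecting real scores from the index before tuning it.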

In [None]:
rag_qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm,
    retriever=bedrock_retriever,
    chain_type="stuff" #stuff, refine, map_reduce, and map_rerank
)

In [None]:
question="What Microsoft's research and development organization is responsible for?"

langchain.debug=True
result = rag_qa.invoke({"query": question})


In [None]:
print("Result:" + result["result"])

In [None]:
question="What is Microsoft's main revenue?"

langchain.debug=False
result = rag_qa.invoke({"query": question})


In [None]:
print("Result:" + result["result"])

## Part 3: AI agent powered search

![standard rag limitation](./static/rag-limitation.png)

### What is an AI agent?
An agentic LLM assistant employs a chain-of-thought reasoning process, in which the LLM is prompted to think step by step through a question, interleaving its reasoning with the use of external tools such as search engines and APIs. This allows the LLM to retrieve relevant information that helps answer parts of the question, ultimately leading to a more comprehensive and accurate final response. This approach is inspired by the "Reason and Act" (ReAct) design introduced in the paper [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/pdf/2210.03629), which aims to combine the reasoning capabilities of language models with the ability to interact with external resources and take actions. By combining these two facets, an agentic LLM assistant can provide more informed and well-rounded answers to complex user queries.
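
The reason-then-act loop can be sketched without any LLM at all. In the minimal example below, `scripted_llm` is a hard-coded stand-in for a model, and `lookup_ticker` is a hypothetical tool with a one-entry lookup table; a real agent would replace both and would parse the model's text output into thoughts and actions.

```python
def lookup_ticker(company: str) -> str:
    """Hypothetical tool: map a company name to a ticker symbol."""
    return {"amazon": "AMZN"}.get(company.lower(), "UNKNOWN")

TOOLS = {"lookup_ticker": lookup_ticker}

def scripted_llm(transcript: list) -> dict:
    """Stand-in for an LLM: decides the next step from the transcript so far."""
    observations = [step for step in transcript if step.startswith("Observation:")]
    if not observations:
        return {"thought": "I need the ticker first.",
                "action": "lookup_ticker", "input": "Amazon"}
    ticker = observations[-1].split(": ", 1)[1]
    return {"thought": "I have what I need.", "final": f"The ticker is {ticker}."}

def run_agent(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    """Alternate between reasoning and tool calls until a final answer appears."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm(transcript)
        transcript.append(f"Thought: {step['thought']}")
        if "final" in step:
            return step["final"]
        # Act: call the chosen tool, then feed the observation back to the model
        observation = tools[step["action"]](step["input"])
        transcript.append(f"Action: {step['action']}[{step['input']}]")
        transcript.append(f"Observation: {observation}")
    return "Stopped: step limit reached."

print(run_agent("What is Amazon's stock ticker?", scripted_llm, TOOLS))
```

The `max_steps` limit is a simple safeguard so that a model that never produces a final answer cannot loop forever.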

### Why build an AI agent?
In today's digital landscape, enterprises are inundated with a vast array of data sources, ranging from traditional PDF documents to complex SQL and NoSQL databases, and everything in between. While this wealth of information holds immense potential for gaining valuable insights and driving operational efficiency, the sheer volume and diversity of data can often pose significant challenges in terms of accessibility and utilization.

This is where the power of agentic LLM assistants comes into play. By leveraging progress in LLM design patterns such as Reason and Act (ReAct) and other traditional or novel design patterns, these intelligent assistants are capable of integrating with an enterprise's diverse data sources. Through the development of specialized tools tailored to each data source and the ability of LLM agents to identify the right tool for a given question, agentic LLM assistants can simplify how you navigate and extract relevant information, regardless of its origin or structure.

This enables rich, multi-source conversations that can unlock the full potential of enterprise data, supporting data-driven decision-making, improving operational efficiency, and ultimately driving productivity and growth.


### Architecture

The following is the overall architecture for agent-based financial filings analysis:

![generative ai powered search](./static/architecture.png)

---

### 3.1 Prepare other tools used by AI agent

#### 3.1.1 Ingest and query structured data in Redshift

Get the Redshift Serverless username, password, and endpoint.

In [None]:
redshift_serverless_credentials = json.loads(kms.get_secret_value(SecretId=outputs['RedshiftServerlessSecret'])['SecretString'])
redshift_serverless_username=redshift_serverless_credentials['username']
redshift_serverless_password=redshift_serverless_credentials['password']
redshift_serverless_endpoint =  outputs['RedshiftServerlessEndpoint']

Create the `stock_symbol` table and populate it from S3. We will use this table to look up a company's stock ticker by company name.

In [None]:
import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import Session
%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = URL.create(
    drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
    host=redshift_serverless_endpoint,
    port=5439,
    database='dev',
    username=redshift_serverless_username,
    password=redshift_serverless_password
)

%sql $connect_to_db
%sql select current_user, version();

%sql CREATE TABLE IF NOT EXISTS public.stock_symbol (stock_symbol text PRIMARY KEY, company_name text NOT NULL);

stock_price_bucket = outputs["s3BucketStock"]
s3_location = f's3://{stock_price_bucket}/stock-price/'
print(s3_location)
!aws s3 sync ./stock-price/ $s3_location

stock_symbol_s3_location = f's3://{stock_price_bucket}/stock-price/stock_symbol.csv'
quoted_stock_symbol_s3_location = "'" + stock_symbol_s3_location + "'"

%sql COPY STOCK_SYMBOL FROM $quoted_stock_symbol_s3_location iam_role default IGNOREHEADER 1 CSV;

def query_stock_ticker(company_name):
    url = URL.create(
        drivername='redshift+redshift_connector', # indicate redshift_connector driver and dialect will be used
        host=redshift_serverless_endpoint,
        port=5439,
        database='dev',
        username=redshift_serverless_username,
        password=redshift_serverless_password
    )

    engine = sa.create_engine(url)

    # Bind the search pattern as a parameter instead of concatenating it into the SQL text
    stmt = sa.text("SELECT stock_symbol FROM stock_symbol WHERE company_name ILIKE :pattern")
    stock_ticker=''
    try:
        # The context manager closes the connection when the block exits
        with engine.connect() as cnn:
            row = cnn.execute(stmt, {"pattern": f"%{company_name}%"}).fetchone()
        if row is not None:
            stock_ticker=row[0]
    except Exception as e:
        print(e)
    return stock_ticker


In [None]:
query_stock_ticker("Amazon")
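
Because a lookup like `query_stock_ticker` may receive text chosen by a user or by the LLM agent, it is safer to pass the search pattern as a bound parameter than to splice it into the SQL string. The sketch below shows the idea using the standard-library `sqlite3` module as a local stand-in for Redshift. The sample row is illustrative, and SQLite uses `LIKE` with `?` placeholders where Redshift would use `ILIKE` and the driver's own parameter style.

```python
import sqlite3

def find_ticker(conn, company_name: str) -> str:
    """Return the first matching ticker, or '' when nothing matches."""
    row = conn.execute(
        "SELECT stock_symbol FROM stock_symbol WHERE lower(company_name) LIKE ?",
        (f"%{company_name.lower()}%",),  # value is bound, never parsed as SQL
    ).fetchone()
    return row[0] if row else ""

# In-memory table with one illustrative row (assumption, not the workshop data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_symbol (stock_symbol TEXT PRIMARY KEY, company_name TEXT NOT NULL)")
conn.execute("INSERT INTO stock_symbol VALUES ('AMZN', 'Amazon.com, Inc.')")

print(find_ticker(conn, "Amazon"))
# An injection attempt is treated as an ordinary search string and matches nothing
print(repr(find_ticker(conn, "x' OR '1'='1")))
```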

#### 3.1.2 Download 10-K filing from SEC
---
Create a new account at https://sec-api.io and get a free API key.



In [None]:
%pip install sec-api

##### Replace the placeholder in the following line with your sec-api key

In [None]:
sec_api_key="{security_api_key}"
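
As an alternative to pasting the key into the notebook, it could be read from an environment variable so that it is not saved with the file. This is a minimal sketch; the variable name `SEC_API_KEY` and the helper `load_api_key` are assumptions, not something the sec-api library defines.

```python
import os

def load_api_key(env_var: str = "SEC_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With the variable set, the assignment above would become `sec_api_key = load_api_key()`.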

In [None]:
from sec_api import ExtractorApi, QueryApi
import json
import os

def get_filings(ticker):
    global sec_api_key

    # Finding Recent Filings with QueryAPI
    queryApi = QueryApi(api_key=sec_api_key)
    query = {
      "query": f"ticker:{ticker} AND formType:\"10-K\"",
      "from": "0",
      "size": "1",
      "sort": [{ "filedAt": { "order": "desc" } }]
    }
    response = queryApi.get_filings(query)

    # Getting 10-K URL
    filing_url = response["filings"][0]["linkToFilingDetails"]
    filing_type=response['filings'][0]['formType']
    cik=response['filings'][0]['cik']
    company=response['filings'][0]['companyName']
    filing_date=response['filings'][0]['filedAt']
    period_of_report=response['filings'][0]['periodOfReport']
    filing_html_index=response['filings'][0]['linkToFilingDetails']
    complete_text_filing_link=response['filings'][0]['linkToTxt']

    # Extracting Text with ExtractorAPI
    extractorApi = ExtractorApi(api_key=sec_api_key)
    
    one_text = extractorApi.get_section(filing_url, "1", "text")       # Section 1 - Business
    onea_text = extractorApi.get_section(filing_url, "1A", "text")     # Section 1A - Risk Factors
    oneb_text = extractorApi.get_section(filing_url, "1B", "text")     # Section 1B - Unresolved Staff Comments
    two_text = extractorApi.get_section(filing_url, "2", "text")       # Section 2 - Properties
    three_text = extractorApi.get_section(filing_url, "3", "text")     # Section 3 - Legal Proceedings
    four_text = extractorApi.get_section(filing_url, "4", "text")      # Section 4 - Mine Safety Disclosures
    five_text = extractorApi.get_section(filing_url, "5", "text")      # Section 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
    six_text = extractorApi.get_section(filing_url, "6", "text")       # Section 6 - Selected Financial Data (prior to February 2021)
    seven_text = extractorApi.get_section(filing_url, "7", "text")     # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
    sevena_text = extractorApi.get_section(filing_url, "7A", "text")   # Section 7A - Quantitative and Qualitative Disclosures about Market Risk
    eight_text = extractorApi.get_section(filing_url, "8", "text")     # Section 8 - Financial Statements and Supplementary Data
    nine_text = extractorApi.get_section(filing_url, "9", "text")      # Section 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
    ninea_text = extractorApi.get_section(filing_url, "9A", "text")    # Section 9A - Controls and Procedures
    nineb_text = extractorApi.get_section(filing_url, "9B", "text")    # Section 9B - Other Information
    ten_text = extractorApi.get_section(filing_url, "10", "text")      # Section 10 - Directors, Executive Officers and Corporate Governance
    eleven_text = extractorApi.get_section(filing_url, "11", "text")   # Section 11 - Executive Compensation
    twelve_text = extractorApi.get_section(filing_url, "12", "text")   # Section 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
    thirteen_text = extractorApi.get_section(filing_url, "13", "text") # Section 13 - Certain Relationships and Related Transactions, and Director Independence
    fourteen_text = extractorApi.get_section(filing_url, "14", "text") # Section 14 - Principal Accountant Fees and Services
    fifteen_text = extractorApi.get_section(filing_url, "15", "text")  # Section 15 - Exhibits and Financial Statement Schedules
    
    data = {}
    data['filing_url'] = filing_url
    data['filing_type'] = filing_type
    data['cik'] = cik
    data['company'] = company
    data['filing_date'] = filing_date
    data['period_of_report'] = period_of_report
    data['filing_html_index'] = filing_html_index
    data['complete_text_filing_link'] = complete_text_filing_link
    
    
    data['item_1'] = one_text
    data['item_1A'] = onea_text
    data['item_1B'] = oneb_text
    data['item_2'] = two_text
    data['item_3'] = three_text
    data['item_4'] = four_text
    data['item_5'] = five_text
    data['item_6'] = six_text
    data['item_7'] = seven_text
    data['item_7A'] = sevena_text
    data['item_8'] = eight_text
    data['item_9'] = nine_text
    data['item_9A'] = ninea_text
    data['item_9B'] = nineb_text
    data['item_10'] = ten_text
    data['item_11'] = eleven_text
    data['item_12'] = twelve_text
    data['item_13'] = thirteen_text
    data['item_14'] = fourteen_text
    data['item_15'] = fifteen_text
    
    # Make sure the output directory exists
    os.makedirs("./download_filings", exist_ok=True)

    # Name the file after the last two path segments of the filing URL
    url_parts = filing_url.split("/")
    file_name = url_parts[-2] + "-" + url_parts[-1].split(".")[0] + ".json"
    download_to = "./download_filings/" + file_name
    try:
        # ensure_ascii=False writes non-ASCII characters as-is, so open the file as UTF-8
        with open(download_to, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
    except Exception as e:
        print("Problem with {url}".format(url=filing_url))
        print(e)

    return file_name
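
The file name written by `get_filings` is derived from the last two path segments of the filing URL. The sketch below isolates that step. The helper name `filing_file_name` is hypothetical, and the example URL is an assumption, shaped to be consistent with the sample file name `000101872424000008-amzn-20231231.json` that appears later in this notebook.

```python
def filing_file_name(filing_url):
    """Build '<folder>-<document>.json' from the last two path segments of a filing URL."""
    parts = filing_url.split("/")
    folder = parts[-2]                   # second-to-last path segment
    document = parts[-1].split(".")[0]   # last segment without its extension
    return f"{folder}-{document}.json"

# Hypothetical URL, shaped to match the sample file name used later in this notebook
example_url = "https://www.sec.gov/Archives/edgar/data/1018724/000101872424000008/amzn-20231231.htm"
print(filing_file_name(example_url))
```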

In [None]:
#downloaded_file=get_filings("AMZN")
#ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file)

### 3.2 Create AI agent

#### Define methods used by AI agent

One popular architecture for building agents is ReAct. ReAct combines reasoning and acting in an iterative process - in fact the name "ReAct" stands for "Reason" and "Act".

The general flow looks like this:

- The model will "think" about what step to take in response to an input and any previous observations.
- The model will then choose an action from available tools (or choose to respond to the user).
- The model will generate arguments to that tool.
- The agent runtime (executor) will parse out the chosen tool and call it with the generated arguments.
- The executor will return the results of the tool call back to the model as an observation.
- This process repeats until the agent chooses to respond.
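
The flow above can be sketched as a small loop in plain Python. This is only an illustration under stated assumptions, not how LangChain implements its executor: `scripted_model` is a hard-coded stand-in for the LLM, and the `get_ticker` tool and its data are hypothetical.

```python
import re

def scripted_model(transcript):
    """Stand-in for an LLM: picks the next step from what is already in the transcript."""
    if "Observation:" not in transcript:
        return "Thought: I need the ticker first.\nAction: get_ticker\nAction Input: Amazon"
    return "Thought: I now know the final answer\nFinal Answer: The ticker is AMZN."

# Hypothetical tool registry: tool name -> callable taking one string argument
tools = {"get_ticker": lambda name: {"Amazon": "AMZN"}.get(name, "unknown")}

def run_react(question, model, tools, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += reply + "\n"
        # The model chose to respond: stop and return the answer
        final = re.search(r"Final Answer:\s*(.*)", reply, re.DOTALL)
        if final:
            return final.group(1).strip()
        # Otherwise parse the chosen tool and its argument, call it, and feed back the result
        action = re.search(r"Action:\s*(.+)", reply).group(1).strip()
        action_input = re.search(r"Action Input:\s*(.+)", reply).group(1).strip()
        observation = tools[action](action_input)
        transcript += f"Observation: {observation}\n"
    return "Stopped: step limit reached"

print(run_react("What is Amazon's ticker?", scripted_model, tools))
```

The step limit guards against a model that never produces a final answer.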

In [None]:
from langchain.prompts.chat import ChatPromptTemplate
from langchain.chains import LLMChain

def is_stock_related_query(human_input):
    template = """You are a helpful assistant that judges whether the human input is a stock-related question.
    If it is stock related, answer \"yes\". Otherwise, answer \"no\"."""
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )
    stock_related = llm_chain({"text":human_input})['text'].strip()
    return stock_related, human_input

def get_company_name(human_input):
    template = """You are a helpful assistant who extracts the company name from the human input. Output only the company name."""
    human_template = "{text}"

    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", template),
        ("human", human_template),
    ])

    llm_chain = LLMChain(
        llm=bedrock_llm,
        prompt=chat_prompt
    )

    company_name=llm_chain({"text":human_input})['text'].strip()
    return company_name
    
def semantic_search_and_check(human_input, k=10,with_post_filter=True):

    company_name=get_company_name(human_input)
    
    search_vector = bedrock_embeddings.embed_query(human_input)

    # Base k-NN query: the k nearest neighbours of the query embedding
    search_query={
        "size": k,
        "query": {
            "knn": {
                "item_vector":{
                    "vector":search_vector,
                    "k":k
                }
            }
        }
    }

    # Optionally keep only the hits whose company_name matches the extracted company.
    # post_filter runs after the k-NN search, so fewer than k hits may be returned.
    if with_post_filter:
        search_query["post_filter"] = {
            "match": {
                "company_name":company_name
            }
        }
    
    res = aos_client.search(index=index_name, 
                       body=search_query,
                       stored_fields=["company_name","item_category","item_content"])
    
    query_result=[]
    for hit in res['hits']['hits']:
        hit_company=hit['fields']['company_name'][0]
        print("semantic search hit company: " + hit_company)
        row=[hit['fields']['item_content'][0]]
        query_result.append(row)

    query_result_df = pd.DataFrame(data=query_result,columns=["item_content"])
    return query_result_df

def search_for_similiar_content_in_10k_filing(human_input):
    company_statements = semantic_search_and_check(human_input)
    return company_statements

def get_stock_ticker(human_input):
    company_name=get_company_name(human_input)
    company_ticker = query_stock_ticker(company_name)
    return company_name, company_ticker



---
![OpenSearch KNN Filter](./static/opensearch-knn-filter.png)
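
The search function in this notebook filters by company with `post_filter`, which is applied after the nearest neighbours have been selected. As a sketch for comparison, the helper below builds either that request body or one that places the filter inside the `knn` clause. The second form assumes an OpenSearch version and k-NN engine that support filtering during the vector search, which should be checked against the cluster in use. The helper name `knn_query` and its `mode` argument are hypothetical; the field names `item_vector` and `company_name` come from the notebook.

```python
def knn_query(vector, k, company_name=None, mode="post_filter"):
    """Build a k-NN search body, optionally restricted to one company."""
    knn_clause = {"vector": vector, "k": k}
    body = {"size": k, "query": {"knn": {"item_vector": knn_clause}}}
    if company_name is None:
        return body
    company_filter = {"match": {"company_name": company_name}}
    if mode == "post_filter":
        # Filter applied after the k nearest neighbours are found: may return fewer than k hits
        body["post_filter"] = company_filter
    elif mode == "efficient":
        # Filter placed inside the knn clause, so it is applied during the vector search
        knn_clause["filter"] = company_filter
    else:
        raise ValueError(f"unknown mode: {mode}")
    return body

print(knn_query([0.1, 0.2], 3, "Amazon", mode="post_filter"))
print(knn_query([0.1, 0.2], 3, "Amazon", mode="efficient"))
```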

### Note

If you have a sec-api key, uncomment the line `#downloaded_file=get_filings(company_stock_ticker)` so that the 10-K is downloaded from the SEC, and comment out the line `downloaded_file="000101872424000008-amzn-20231231.json"`.


In [None]:
def download_10k_filing_from_sec_and_ingest_into_opensearch(company_stock_ticker):
    #downloaded_file=get_filings(company_stock_ticker)
    downloaded_file="000101872424000008-amzn-20231231.json"
    ingest_downloaded_10k_into_opensearch("./download_filings/" + downloaded_file)

#### Define tools used by agent

In [None]:
from langchain.agents import Tool

annual_report_tools=[
    Tool(
        name="is_stock_related_query",
        func=is_stock_related_query,
        description="Use this tool to determine whether the query is stock related. It outputs whether the human input is stock related, together with the human input. The human's original query is the input to this tool."
    ),
    Tool(
        name="search_for_similiar_content_in_10k_filing",
        func=search_for_similiar_content_in_10k_filing,
        description="Use this tool to get the financial statement of the company. With the help of this data, the company's historic performance can be evaluated. The human's original query is the input to this tool."
    ),
    Tool(
        name="download_10k_filing_from_sec_and_ingest_into_opensearch",
        func=download_10k_filing_from_sec_and_ingest_into_opensearch,
        description="Only use this tool when \"search_for_similiar_content_in_10k_filing\" tool does not return any result. Company stock ticker is the input to this tool."
    ),
    Tool(
        name="get_stock_ticker",
        func=get_stock_ticker,
        description="Only use this tool when you need to call \"download_10k_filing_from_sec_and_ingest_into_opensearch\" tool. This tool will output company name and stock ticker."
    )
]


#### Create the ReAct Agent using LLM, tools and prompt

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain.agents import AgentExecutor, create_react_agent


annual_report_prompt="""
System: You are a financial advisor. Give stock recommendations for the given query based on the following instructions.

<instructions>
Answer the following questions as best you can. You have access to the following tools:

{tools}

<steps>
Note - if you fail to satisfy any of the steps below, just move on to the next one.
1) Use the "is_stock_related_query" tool to judge whether the input query is stock related. If the input query is not stock related, answer \"I cannot answer this question\" and terminate the dialog. Output - whether the query is stock related, and the input query
2) Use the "search_for_similiar_content_in_10k_filing" tool to get the company's historic financial statement. Output - Financial statement
3) Use the "download_10k_filing_from_sec_and_ingest_into_opensearch" tool to download the company's 10-K filing from the SEC and ingest the data into OpenSearch if "search_for_similiar_content_in_10k_filing" does not return any result.
The input to this tool is the company stock ticker; use the "get_stock_ticker" tool to get it.
After downloading the company's 10-K filing from the SEC, use the "search_for_similiar_content_in_10k_filing" tool to search for similar content in OpenSearch.
4) Analyze the stock based on the gathered data and give a detailed analysis of the investment choice. Provide numbers and reasons to justify your answer. Output - Detailed stock analysis
</steps>

</instructions>


Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do, and try to follow the steps mentioned above
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Human: {input}

Assistant:
{agent_scratchpad}

"""

annual_report_prompt = PromptTemplate.from_template(annual_report_prompt)
annual_report_agent = create_react_agent(bedrock_llm, annual_report_tools, annual_report_prompt)
annual_report_agent_executor = AgentExecutor(agent=annual_report_agent, tools=annual_report_tools, verbose=True)
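
The prompt above has four placeholders: `{tools}`, `{tool_names}`, `{input}` and `{agent_scratchpad}`. The sketch below shows roughly how they might be filled on the first step, using plain string formatting. It is an approximation for illustration: the exact text LangChain renders for each tool may differ between versions, and the shortened template, the dictionary stand-ins for `Tool` objects, and their descriptions are assumptions.

```python
# Plain-Python stand-ins for the Tool objects; names and descriptions are illustrative
tools = [
    {"name": "is_stock_related_query", "description": "Judge whether the query is stock related."},
    {"name": "get_stock_ticker", "description": "Return the company name and stock ticker."},
]

# Shortened version of the agent prompt, keeping only the four placeholders
template = (
    "You have access to the following tools:\n\n{tools}\n\n"
    "Action: the action to take, should be one of [{tool_names}]\n\n"
    "Human: {input}\n\nAssistant:\n{agent_scratchpad}"
)

rendered = template.format(
    tools="\n".join(f"{t['name']}: {t['description']}" for t in tools),
    tool_names=", ".join(t["name"] for t in tools),
    input="Is Amazon a good investment choice right now?",
    agent_scratchpad="",  # empty on the first step; later holds prior Thought/Action/Observation text
)
print(rendered)
```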


### 3.3 Use AI agent

#### Example 1

Ask the question "Is Microsoft a good investment choice right now?". The agent runs the following steps:

1. Check whether the query is stock related
2. Get the company name
3. Use semantic search to retrieve related information from the 10-K filing data

It then combines the information above to generate the answer.

In [None]:
import warnings

warnings.filterwarnings("ignore")

response = annual_report_agent_executor.invoke({"input": "Is Microsoft a good investment choice right now?"})

In [None]:
print(response["output"])

#### Example 2

"Is Amazon a good investment choice right now?" This is a stock-related question, but there is no Amazon 10-K filing data in OpenSearch. The agent downloads the Amazon 10-K from the SEC by calling the SEC API and ingests the data into OpenSearch. The agent runs the following steps:

1. Check whether the query is stock related
2. Get the company name
3. Use semantic search to retrieve related information from the 10-K filing data
4. Get the stock ticker
5. Download the 10-K data from the SEC through the API
6. Use semantic search again to retrieve related information from the 10-K filing data

In [None]:
response = annual_report_agent_executor.invoke({"input": "Is Amazon a good investment choice right now?"})

In [None]:
print(response["output"])

#### Example 3

"What is SageMaker?" is not a stock-related query. The agent runs the following step:

1. Check whether the query is stock related

In [None]:
response = annual_report_agent_executor.invoke({"input": "What is SageMaker?"})

In [None]:
print(response["output"])

# Test

In [None]:
delete_query={
    "query":{
           "match": { 
               "company_name":"Amazon"
           }
    }
}

aos_client.delete_by_query(index=index_name, body=delete_query)

In [None]:
search_query={
    "query":{
           "match": { 
               "company_name":"Amazon"
           }
    }
}

aos_client.search(index=index_name, body=search_query)

In [None]:
download_10k_filing_from_sec_and_ingest_into_opensearch('AMZN')