# Building Text-to-SQL capability for Amazon Athena using Amazon Bedrock

- **Use of amazon.titan-embed-text-v1 for creating embeddings**
- **Use of Amazon OpenSearch as a vector database**
- **Use of anthropic.claude-v2:1 as the base LLM**



## Contents

1. [Objective](#Objective)
1. [Background](#Background-(Problem-Description-and-Approach))
1. [Overall Workflow](#Overall-Workflow)
1. [Conclusion](#Conclusion)


## Objective

This notebook shows how to leverage Bedrock to invoke an LLM that can convert natural language inputs to SQL queries. The LLM-generated SQL is then executed using Athena.

## Background (Problem Description and Approach)

- **Problem statement**: 

Using LLMs for information retrieval tasks (such as question answering) requires converting the knowledge corpus as well as user questions into vector embeddings. We want to generate these vector embeddings using an LLM.

Here for small metadata we have used  For converting large amounts of data (TBs or PBs), we need a scalable system that can convert the documents into embeddings, store them in a vector database, and provide low-latency similarity search.

- **Our approach**: 

[`RAG`] The RAG approach offers several advantages. First, it gives up-to-date, precise responses. Rather than relying only on fixed, outdated training data, RAG utilizes current external sources to formulate its answers.

[`Vector Store`] Amazon OpenSearch offers three vector engines to choose from, each catering to different use cases. Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. This code base uses Faiss for similarity search.

[`Bedrock`] Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

- *[The langchain OpenSearch documentation](https://python.langchain.com/en/latest/ecosystem/opensearch.html)*
- *[Amazon OpenSearch service documentation](https://docs.aws.amazon.com/opensearch-service/index.html)*
- *[Amazon OpenSearch supports efficient vector](https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-opensearch-service-vector-query-filters-faiss/)*
- *[Amazon Bedrock](https://aws.amazon.com/bedrock/)*
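
To make the similarity search idea above concrete, here is a minimal sketch of a brute-force cosine-similarity lookup. It is not how Faiss or OpenSearch work internally (they use optimized index structures); the function names, table names, and 3-dimensional toy vectors are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" standing in for table metadata (hypothetical names and values).
docs = {
    "title_table": [0.9, 0.1, 0.0],
    "ratings_table": [0.1, 0.9, 0.0],
    "crew_table": [0.0, 0.2, 0.9],
}

# A query vector that points mostly in the same direction as "title_table".
results = top_k([1.0, 0.2, 0.0], docs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In the notebook, the query vector would come from the embedding model and the search itself would be delegated to the vector store.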


---

## Overall Workflow

**Prerequisite**

The following prerequisites need to be completed before executing this notebook.
- A SageMaker instance with a role that has access to Bedrock, Glue, Athena, S3, and Lake Formation.
- A Glue database and tables. A Spark notebook is provided to create them.
- An Amazon OpenSearch cluster for storing embeddings. Here the OpenSearch credentials are kept in the notebooks. However, the OpenSearch cluster's access credentials (username and password) can be stored in AWS Secrets Manager by following the steps described [here](https://docs.aws.amazon.com/secretsmanager/latest/userguide/managing-secrets.html).
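
If the OpenSearch credentials are moved to AWS Secrets Manager, reading them back could look roughly like the sketch below. The secret name `opensearch/credentials` and the JSON keys `username` and `password` are assumptions, not values from this project; `get_secret_value` is the standard boto3 Secrets Manager call. A stub client is used so the sketch runs without AWS access.

```python
import json


def get_opensearch_credentials(secrets_client, secret_id):
    """Fetch a JSON secret and return (username, password)."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]


# In the notebook this would be a real client:
#   import boto3
#   secrets_client = boto3.client("secretsmanager")
# The stub below only mimics the shape of the real response.
class StubSecretsClient:
    def get_secret_value(self, SecretId):
        payload = {"username": "admin", "password": "example-password"}
        return {"SecretString": json.dumps(payload)}


# Hypothetical secret name, used for illustration only.
username, password = get_opensearch_credentials(
    StubSecretsClient(), "opensearch/credentials"
)
```

Passing the client in as a parameter keeps the helper easy to test and avoids creating a new client on every call.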

**The overall workflow for this notebook is as follows:**
1. Download the data from https://developer.imdb.com/non-commercial-datasets/#titleakastsvgz and upload it to S3.
1. Create the database and load the datasets in Glue. Make sure you are able to query them through Athena.
1. Install the required Python packages (in particular, the boto3 version mentioned).
1. Create the embeddings and the vector store. Run a similarity search for an input query against the embeddings stored in the OpenSearch index.
1. Execute this notebook to generate SQL.

## Step 1: Setup
Install the required packages.

In [None]:
# !pip3 install boto3==1.34.8
# !pip3 install jq

## Step 2: Import all modules (some modules are located in another folder)

In [1]:
import boto3
from botocore.config import Config
from langchain.llms.bedrock import Bedrock
from langchain.embeddings import BedrockEmbeddings

In [2]:
import logging 
import json
import os,sys
import re
sys.path.append("/home/ec2-user/SageMaker/llm_bedrock_v0/")
import time
import pandas as pd
import io

In [3]:
from boto_client import Clientmodules
from llm_basemodel import LanguageModel
from athena_execution import AthenaQueryExecute
from vector_embedding import EmbeddingBedrock
from openSearchVCEmbedding import EmbeddingBedrockOpenSearch

In [4]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

## Step 3: Checking access to Bedrock

In [5]:
session = boto3.session.Session()
bedrock_client = session.client('bedrock')
print(bedrock_client.list_foundation_models()['modelSummaries'][0])

2024-01-15 00:10:25,506,credentials,MainProcess,INFO,Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-tg1-large', 'modelId': 'amazon.titan-tg1-large', 'modelName': 'Titan Text Large', 'providerName': 'Amazon', 'inputModalities': ['TEXT'], 'outputModalities': ['TEXT'], 'responseStreamingSupported': True, 'customizationsSupported': [], 'inferenceTypesSupported': ['ON_DEMAND'], 'modelLifecycle': {'status': 'ACTIVE'}}
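
Printing only the first model summary confirms access but not that the models used in this notebook are listed. A small helper such as the following sketch could check for a specific model ID; `find_model` is a hypothetical name, and the sample list below is a trimmed stand-in for the real `modelSummaries` response.

```python
def find_model(model_summaries, model_id):
    """Return the summary whose modelId matches, or None if it is absent."""
    for summary in model_summaries:
        if summary["modelId"] == model_id:
            return summary
    return None


# Trimmed sample in the same shape as list_foundation_models()["modelSummaries"].
sample_summaries = [
    {"modelId": "amazon.titan-tg1-large", "providerName": "Amazon"},
    {"modelId": "amazon.titan-embed-text-v1", "providerName": "Amazon"},
]

embedding_model = find_model(sample_summaries, "amazon.titan-embed-text-v1")
missing_model = find_model(sample_summaries, "not-a-real-model")

# With a real client:
#   summaries = bedrock_client.list_foundation_models()["modelSummaries"]
#   find_model(summaries, "anthropic.claude-v2:1")
```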


## Step 4: Invoking the Athena and Bedrock embedding utilities

In [6]:
rqstath=AthenaQueryExecute()

2024-01-15 00:10:25,658,credentials,MainProcess,INFO,Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
athena client created 
2024-01-15 00:10:25,696,boto_client,MainProcess,INFO,athena client created 
2024-01-15 00:10:25,711,credentials,MainProcess,INFO,Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
s3 client created !!
2024-01-15 00:10:25,775,boto_client,MainProcess,INFO,s3 client created !!


In [7]:
ebr=EmbeddingBedrock()

2024-01-15 00:10:25,875,credentials,MainProcess,INFO,Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
bedrock runtime client created 
2024-01-15 00:10:26,027,boto_client,MainProcess,INFO,bedrock runtime client created 


In [8]:
ebropen=EmbeddingBedrockOpenSearch()

2024-01-15 00:10:26,594,credentials,MainProcess,INFO,Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
bedrock runtime client created 
2024-01-15 00:10:26,628,boto_client,MainProcess,INFO,bedrock runtime client created 


## Step 5: Core logic
1. `getEmbeddding`: Takes the user query and runs a vector search against the vector store to find the matching schema metadata.
2. `generate_sql`: Generates SQL from the input prompt. `syntax_checker` checks the SQL syntax, and the query is regenerated if the check fails.
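
The generate-then-validate idea behind `generate_sql` can be sketched independently of Bedrock and Athena. In this simplified version the model call and the syntax check are passed in as plain functions; `generate_with_retries` and its parameters are illustrative names, not part of the project modules.

```python
def generate_with_retries(prompt, generate, check_syntax, max_attempts=4):
    """Call `generate` until `check_syntax` returns 'Passed' or attempts run out.

    Returns the last generated SQL and the number of attempts used.
    """
    last_sql = None
    for attempt in range(1, max_attempts + 1):
        last_sql = generate(prompt)
        message = check_syntax(last_sql)
        if message == "Passed":
            return last_sql, attempt
        # Feed the error and the failed query back so the model can repair it.
        prompt = (
            f"{prompt}\nThis is the syntax error: {message}\n"
            f"Update the SQL query below to resolve it:\n{last_sql}"
        )
    return last_sql, max_attempts


# Fake model: the first answer has a typo, the second one is valid.
answers = iter(["SELEC 1", "SELECT 1"])


def fake_generate(_prompt):
    return next(answers)


def fake_check(sql):
    return "Passed" if sql.startswith("SELECT ") else "mismatched input 'SELEC'"


sql, attempts = generate_with_retries("question", fake_generate, fake_check)
```

Injecting the two callables makes the retry behaviour testable without any AWS calls.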


In [9]:
class RequestQueryBedrock:
    def __init__(self):
        # self.model_id = "anthropic.claude-v2"
        self.bedrock_client = Clientmodules.createBedrockRuntimeClient()
        self.language_model = LanguageModel(self.bedrock_client)
        self.llm = self.language_model.llm
    def getEmbeddding(self,user_question):
        vector_store_path=os.getcwd()+'/'+'vectorstore/'+ '03012024225156.vs'
        print("vector_store_path :",vector_store_path)
        vs=ebr.load_local_vector_store(vector_store_path)
        required_metadata = vs.similarity_search_with_score(user_question)
        docs, scores = zip(*required_metadata)
        return ebr.format_metadata(docs)
    def getOpenSearchEmbedding(self,index_name,user_query):
        vcindxdoc=ebropen.getDocumentfromIndex(index_name=index_name)
        documnet=ebropen.getSimilaritySearch(user_query,vcindxdoc)
        return ebropen.format_metadata(documnet)
        
    def generate_sql(self,prompt, max_attempt=4) ->str:
            """
            Generate and Validate SQL query.

            Args:
            - prompt (str): User input combined with the metadata retrieved via RAG, used to generate SQL.
            - max_attempt (int): Maximum number of attempts to correct the SQL syntax.

            Returns:
            - str: The generated SQL query.
            """
            attempt = 0
            sql_query = ""  # fallback so the final return is defined even if every attempt raises
            error_messages = []
            prompts = [prompt]

            while attempt < max_attempt:
                logger.info(f'Sql Generation attempt Count: {attempt+1}')
                try:
                    logger.info(f'we are in Try block to generate the sql and count is :{attempt+1}')
                    generated_sql = self.llm.predict(prompt)
                    query_str = generated_sql.split("```")[1]
                    query_str = " ".join(query_str.split("\n")).strip()
                    sql_query = query_str[3:] if query_str.startswith("sql") else query_str
                    # return sql_query
                    syntaxcheckmsg=rqstath.syntax_checker(sql_query)
                    if syntaxcheckmsg=='Passed':
                        logger.info(f'syntax checked for query passed in attempt number :{attempt+1}')
                        return sql_query
                    else:
                        prompt = f"""{prompt}
                        This is syntax error: {syntaxcheckmsg}. 
                        To correct this, please generate an alternative SQL query which will correct the syntax error.
                        The updated query should take care of all the syntax issues encountered.
                        Follow the instructions mentioned above to remediate the error. 
                        Update the below SQL query to resolve the issue:
                        {sql_query}
                        Make sure the updated SQL query aligns with the requirements provided in the initial question."""
                        prompts.append(prompt)
                        attempt += 1
                except Exception as e:
                    logger.error('FAILED')
                    msg = str(e)
                    error_messages.append(msg)
                    attempt += 1
            return sql_query
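
The cell above extracts the query by splitting the model response on code fences, which raises an `IndexError` when the model answers without a fence. A slightly more defensive extraction could look like the sketch below; `extract_sql` is a hypothetical helper, not a function from the project modules.

```python
import re

# Three backticks, built programmatically to keep this example easy to embed.
TICKS = "`" * 3

# Matches a fenced block with an optional "sql" language tag.
FENCE = re.compile(
    TICKS + r"(?:sql)?\s*(.*?)" + TICKS, re.DOTALL | re.IGNORECASE
)


def extract_sql(response):
    """Return the first fenced block as a single line, or the whole response."""
    match = FENCE.search(response)
    body = match.group(1) if match else response
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(body.split())


fenced = "Here is the query:\n" + TICKS + "sql\nSELECT title\nFROM t\n" + TICKS
print(extract_sql(fenced))
print(extract_sql("SELECT 1"))
```

Falling back to the raw response is a design choice; an alternative would be to treat a missing fence as a failed attempt and retry.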

In [10]:
rqst=RequestQueryBedrock()

2024-01-15 00:10:28,158,credentials,MainProcess,INFO,Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
bedrock runtime client created 
2024-01-15 00:10:28,191,boto_client,MainProcess,INFO,bedrock runtime client created 


In [11]:
index_name = 'llm_vector_db_metadata_indx2'

In [12]:
def userinput(user_query):
    logger.info(f'Searching metadata from vector store')
    # vector_search_match=rqst.getEmbeddding(user_query)
    vector_search_match=rqst.getOpenSearchEmbedding(index_name,user_query)
    # print(vector_search_match)
    details="It is important that the SQL query complies with Athena syntax. During join if column name are same please use alias ex llm.customer_id in select statement. It is also important to respect the type of columns: if a column is string, the value should be enclosed in quotes. If you are writing CTEs then include all the required columns. While concatenating a non string column, make sure cast the column to string. For date columns comparing to string , please cast the string input."
    final_question = "\n\nHuman:"+details + vector_search_match + user_query+ "\n\nAssistant:"
    answer = rqst.generate_sql(final_question)
    return answer
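
Claude v2 text completions expect the prompt to start with `\n\nHuman:` and end with `\n\nAssistant:`. The sketch below assembles such a prompt with explicit separators between the instructions, the retrieved schema, and the question; `build_claude_prompt` and the sample strings are assumptions made for illustration.

```python
def build_claude_prompt(instructions, schema_context, question):
    """Assemble a Claude v2 style prompt with clear section separators."""
    return (
        f"\n\nHuman: {instructions}\n\n"
        f"{schema_context}\n\n"
        f"{question}"
        "\n\nAssistant:"
    )


prompt = build_claude_prompt(
    "The SQL query must comply with Athena syntax.",
    "Table imdb_stg.title has columns: title (string), region (string).",
    "show me all the title in US region",
)
print(prompt)
```

Separating the three parts with blank lines avoids the schema text running directly into the user question.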

## Step 6: User input in Natural Language

In [16]:
user_query='show me all the title in US region'

In [17]:
querygenerated=userinput(user_query)

Searching metadata from vector store
2024-01-15 00:17:48,805,1460622309,MainProcess,INFO,Searching metadata from vector store
2024-01-15 00:17:49,716,base,MainProcess,INFO,POST https://search-llm-vectordb-1-jsdrnnhchl6rsh7s4biqregpuq.us-east-1.es.amazonaws.com:443/llm_vector_db_metadata_indx2/_search [status:200 request:0.536s]
Sql Generation attempt Count: 1
2024-01-15 00:17:49,718,1749336078,MainProcess,INFO,Sql Generation attempt Count: 1
we are in Try block to generate the sql and count is :1
2024-01-15 00:17:49,719,1749336078,MainProcess,INFO,we are in Try block to generate the sql and count is :1


Executing: Explain   WITH titles AS (   SELECT title, region   FROM imdb_stg.title ) SELECT title  FROM titles WHERE region = 'US'
 I am checking the syntax here
execution_id: d545d657-cc67-45e1-8b5a-f5c98a4a2108


syntax checked for query passed in attempt number :1
2024-01-15 00:18:00,317,1749336078,MainProcess,INFO,syntax checked for query passed in attempt number :1


Status : {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2024, 1, 15, 0, 17, 57, 260000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2024, 1, 15, 0, 17, 57, 873000, tzinfo=tzlocal())}
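
The output above suggests that the syntax check runs the query through `EXPLAIN` and inspects the execution state. The actual `syntax_checker` lives in the `athena_execution` module and is not shown here, so the following is only a sketch of how such a check could be written with the boto3 Athena calls `start_query_execution` and `get_query_execution`. A stub client replaces the real one so the sketch runs without AWS access.

```python
import time


def check_syntax(athena_client, sql, output_location, poll_seconds=1.0):
    """Run EXPLAIN on the query and return 'Passed' or the failure reason."""
    execution_id = athena_client.start_query_execution(
        QueryString=f"EXPLAIN {sql}",
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    # Poll until Athena reports a terminal state.
    while True:
        response = athena_client.get_query_execution(QueryExecutionId=execution_id)
        status = response["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] == "SUCCEEDED":
        return "Passed"
    return status.get("StateChangeReason", status["State"])


class StubAthena:
    """Mimics the response shapes of the two Athena calls used above."""

    def __init__(self, states):
        self.states = list(states)

    def start_query_execution(self, QueryString, ResultConfiguration):
        return {"QueryExecutionId": "stub-id"}

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": self.states.pop(0)}}


location = "s3://llm-athena-output/athena_output"
result_ok = check_syntax(
    StubAthena([{"State": "RUNNING"}, {"State": "SUCCEEDED"}]),
    "SELECT title FROM imdb_stg.title",
    location,
    poll_seconds=0,
)
result_bad = check_syntax(
    StubAthena([{"State": "FAILED", "StateChangeReason": "line 1:1: mismatched input"}]),
    "SELEC title",
    location,
    poll_seconds=0,
)
```

With a real client (`boto3.client("athena")`), the failure reason returned by Athena is what gets appended to the prompt for the next attempt.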


## Step 7: SQL query and query execution output

In [18]:
import pprint
my_printer = pprint.PrettyPrinter()
my_printer.pprint(querygenerated)

(' WITH titles AS (   SELECT title, region   FROM imdb_stg.title ) SELECT '
 "title  FROM titles WHERE region = 'US'")


In [19]:
QueryOutput=rqstath.execute_query(querygenerated)

checking for file :athena_output/f593ee48-788f-4efa-a689-c4b02f82ed3d.csv
2024-01-15 00:20:34,510,athena_execution,MainProcess,INFO,checking for file :athena_output/f593ee48-788f-4efa-a689-c4b02f82ed3d.csv


Calling download fine with params /tmp/athena_output/f593ee48-788f-4efa-a689-c4b02f82ed3d.csv, {'OutputLocation': 's3://llm-athena-output/athena_output'}


In [20]:
print(QueryOutput)

                                                     title
0                                    This Is Parris Island
1                                        Genius obsessions
2                                             Carpe Duorum
3                                         Blatantly Bianka
4                                              FilmFrights
...                                                    ...
1490529                                   The White Orchid
1490530                                        The Oficina
1490531  State of Origin Australian Rules Football 1992...
1490532  Riot Games Calls League of Legends Fans 'Manba...
1490533                                         Endangered

[1490534 rows x 1 columns]



## Cleanup

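To avoid incurring future charges, delete the resources created for this notebook, such as the OpenSearch cluster, the Glue database and tables, the data uploaded to S3, and the SageMaker instance.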


## Conclusion
In this notebook we saw how to use Bedrock to invoke a model that generates embeddings, ingest those embeddings into OpenSearch, and finally run a similarity search for the user input against the documents (embeddings) stored in OpenSearch. We used LangChain as an abstraction layer for workflow management. The Claude model is used to generate the SQL, and Athena is used to execute the query.