# Using LangChain and LLMs to Analyze Data in Amazon RDS

A demonstration of [LangChain SQL Chain](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html) (`SQLDatabaseChain` and `SQLDatabaseSequentialChain`) and [SQL Database Agent](https://python.langchain.com/en/latest/modules/agents/toolkits/examples/sql_database.html) for analyzing the data in an [Amazon RDS for PostgreSQL](https://aws.amazon.com/rds/postgresql/) database. The demonstration uses OpenAI's LLMs via the OpenAI API.

Author: Gary A. Stafford  
Date: 2023-05-29  
License: MIT  
Kernel: `conda_python3`  
References:
- [LangChain Documentation: SQL Chain example](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html#sql-chain-example)
- [LangChain Blog: LLMs and SQL](https://blog.langchain.dev/llms-and-sql/)
- [How do davinci and text-davinci-003 differ?](https://help.openai.com/en/articles/6643408-how-do-davinci-and-text-davinci-003-differ)
- [How do text-davinci-002 and text-davinci-003 differ?](https://help.openai.com/en/articles/6779149-how-do-text-davinci-002-and-text-davinci-003-differ)

## Prerequisites

1. Import [The Museum of Modern Art (MoMA) Collection database](https://github.com/MuseumofModernArt/collection), found on GitHub, into an [Amazon RDS for PostgreSQL](https://aws.amazon.com/rds/postgresql/) database.

2. Create a new [Amazon SageMaker notebook instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html) for this demonstration. Make sure your RDS instance is accessible to your SageMaker Notebook environment.

3. `git clone` this post's GitHub project to your Amazon SageMaker notebook instance.

4. Create or update the `.env` file, used by `dotenv`, using the terminal in your SageMaker Notebook environment. A sample `env.txt` file is included in the project.

5. Add your RDS database credentials to the file: `RDS_ENDPOINT`, `RDS_PORT`, `RDS_USERNAME`, `RDS_PASSWORD`, `RDS_DB_NAME`. See this post's GitHub project for an example.

6. Create an OpenAI account and update the `.env` file to include your OpenAI API Key.

__NOTE__: When using `dotenv`, credentials will be stored in plain text. The recommended and more secure method is to use [AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html).
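
As a rough sketch of the Secrets Manager alternative mentioned in the note, the helper below reads a JSON secret and returns it as a dictionary. It assumes the secret was stored as a JSON string; the secret name `moma/rds` and the key names in the usage comment are hypothetical and depend on how the secret was created.

```python
import json


def load_rds_credentials(secrets_client, secret_id):
    """Fetch a JSON secret and return it as a dict.

    `secrets_client` is expected to behave like boto3.client("secretsmanager").
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


# Usage on the notebook instance (requires boto3 and IAM permission to read the secret):
# import boto3
# creds = load_rds_credentials(boto3.client("secretsmanager"), "moma/rds")  # hypothetical secret name
# RDS_PASSWORD = creds["password"]  # key names are an assumption
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real secret.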

## Required for ChromaDB in Amazon SageMaker JumpStart Environment

In [None]:
!apt-get update -qq && apt-get install -y build-essential -qq

## Install Required Packages

In [None]:
# Optional: update pip
%pip install pip -Uq

# Install latest versions of required packages
%pip install ipywidgets langchain openai python-dotenv SQLAlchemy psycopg2-binary chromadb -Uq
%pip install pyyaml -q

# Avoid issues with install
# https://github.com/aws/amazon-sagemaker-examples/issues/1890#issuecomment-758871546
%pip install sentence-transformers -Uq --no-cache-dir #--force-reinstall

In [None]:
# Optional: restart kernel to update packages
import os

os._exit(00)

In [None]:
# Check versions of LangChain, OpenAI, and related packages
%pip list | grep "langchain\|openai\|sentence-transformers\|SQLAlchemy"
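
If `grep` is not available in your environment, the same check can be approximated in pure Python. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


for name, ver in installed_versions(
    ["langchain", "openai", "sentence-transformers", "SQLAlchemy"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```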

## Set Up Environment Variables

Use `dotenv` to load the OpenAI and RDS environment variables. __NOTE__: credentials will be stored in plain text. The recommended, more secure method is to use [AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html).

In [None]:
import os

# Avoid huggingface/tokenizers parallelism error
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# Load env vars from .env file
%load_ext dotenv

# %reload_ext dotenv

%dotenv
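
Before building the connection string, it can help to confirm that every expected variable was actually loaded. The sketch below is one way to do that; the RDS names come from the prerequisites, while `OPENAI_API_KEY` is an assumption about how the key is named in your `.env` file (it is the variable the OpenAI client reads by default).

```python
import os

REQUIRED_ENV_VARS = [
    "RDS_ENDPOINT",
    "RDS_PORT",
    "RDS_USERNAME",
    "RDS_PASSWORD",
    "RDS_DB_NAME",
    "OPENAI_API_KEY",  # assumed variable name for the OpenAI API key
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```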

In [None]:
# SQLAlchemy 2.0 reference: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
# URI format: postgresql+psycopg2://user:pwd@hostname:port/dbname

RDS_DB_NAME = os.environ.get("RDS_DB_NAME")
RDS_ENDPOINT = os.environ.get("RDS_ENDPOINT")
RDS_PASSWORD = os.environ.get("RDS_PASSWORD")
RDS_PORT = os.environ.get("RDS_PORT")
RDS_USERNAME = os.environ.get("RDS_USERNAME")
RDS_URI = f"postgresql+psycopg2://{RDS_USERNAME}:{RDS_PASSWORD}@{RDS_ENDPOINT}:{RDS_PORT}/{RDS_DB_NAME}"

# print URI
RDS_URI_PRINT = RDS_URI.replace(
    RDS_ENDPOINT, "******.******.us-east-1.rds.amazonaws.com"
)
RDS_URI_PRINT = RDS_URI_PRINT.replace(RDS_PASSWORD, "******")
print(RDS_URI_PRINT)
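
A password containing characters such as `@`, `:`, or `/` will break a URI assembled with a plain f-string. The following sketch percent-encodes the credentials with the standard library's `quote_plus` before building the URI. The helper names and the sample values are hypothetical, not part of the original project.

```python
from urllib.parse import quote_plus


def build_postgres_uri(username, password, endpoint, port, db_name):
    """Build a SQLAlchemy PostgreSQL URI, percent-encoding the credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(username)}:{quote_plus(password)}"
        f"@{endpoint}:{port}/{db_name}"
    )


def mask_uri(uri, password):
    """Hide the encoded password before printing the URI."""
    return uri.replace(quote_plus(password), "******")


# Sample values for illustration only
uri = build_postgres_uri("moma_user", "p@ss:w/rd", "example-host", 5432, "moma")
print(mask_uri(uri, "p@ss:w/rd"))
```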

## LangChain with OpenAI's LLMs

Use OpenAI's `text-davinci-003` or `gpt-3.5-turbo` LLMs. See OpenAI's [Models Overview](https://platform.openai.com/docs/models/overview) for model information.

In [None]:
from langchain import SQLDatabase, SQLDatabaseChain, OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import SQLDatabaseSequentialChain

In [None]:
# llm = OpenAI(model_name="text-davinci-003", temperature=0, verbose=True)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, verbose=True)

## Using LangChain's SQL Chain

Next, we will use LangChain's [SQLDatabaseChain](https://python.langchain.com/en/latest/modules/chains/examples) and [SQLDatabaseSequentialChain](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html#sqldatabasesequentialchain) to answer questions about the MoMA database.

In [None]:
# A few sample questions
QUESTION_01 = "How many artists are there?"
QUESTION_02 = "How many artworks are there?"
QUESTION_03 = "How many rows are in the artists table?"
QUESTION_04 = "How many rows are in the artworks table?"
QUESTION_05 = "How many artists are there whose nationality is French?"
QUESTION_06 = "How many artworks were created by artists whose nationality is Spanish?"
QUESTION_07 = "How many artist names start with 'M'?"
QUESTION_08 = "What nationality produced the most number of artworks?"
QUESTION_09 = "How many artworks are by Claude Monet?"
QUESTION_10 = "What is the oldest artwork in the collection?"

In [None]:
from sqlalchemy.exc import ProgrammingError

db = SQLDatabase.from_uri(RDS_URI)

db_chain = SQLDatabaseSequentialChain.from_llm(
    llm, db, verbose=True, use_query_checker=True
)

try:
    db_chain(QUESTION_05)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")
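
The cell above runs a single question. To try several of the sample questions in one pass, a small loop like the one sketched here may be convenient. `run_questions` is a hypothetical helper; it accepts any callable, so it can be tried with a stub before spending API calls.

```python
def run_questions(chain, questions, errors=(ValueError,)):
    """Run each question through `chain`, collecting outputs and error messages.

    `chain` is any callable that takes a question string, such as `db_chain` above.
    """
    results = {}
    for question in questions:
        try:
            results[question] = {"ok": True, "output": chain(question)}
        except errors as exc:
            results[question] = {"ok": False, "output": str(exc)}
    return results


# In the notebook, something like:
# run_questions(db_chain, [QUESTION_01, QUESTION_05], errors=(ProgrammingError, ValueError))
```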

## More Options: Custom Table Info and Query Checker

According to LangChain's [documentation](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html#custom-table-info), "_In some cases, it can be useful to provide custom table information instead of using the automatically generated table definitions and the first sample_rows_in_table_info sample rows._" Of course, writing table information by hand is impractical when dealing with a large number of tables.

According to LangChain's [documentation](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html#adding-example-rows-from-each-table), "_Sometimes the Language Model generates invalid SQL with small mistakes that can be self-corrected using the same technique used by the SQL Database Agent to try and fix the SQL using the LLM._" You can enable this behavior by passing `use_query_checker=True` when creating the chain.

In [None]:
custom_table_info = {
    "artists": """CREATE TABLE artists (
        artist_id integer NOT NULL,
        name character varying(200),
        nationality character varying(50),
        gender character varying(25),
        birth_year integer,
        death_year integer,
        CONSTRAINT artists_pk PRIMARY KEY (artist_id))

/*
3 rows from artists table:
"artist_id"	"name"	"nationality"	"gender"	"birth_year"	"death_year"
12	"Jüri Arrak"	"Estonian"	"Male"	1936	
19	"Richard Artschwager"	"American"	"Male"	1923	2013
22	"Isidora Aschheim"	"Israeli"	"Female"		
*/""",
    "artworks": """CREATE TABLE artworks (
        artwork_id integer NOT NULL,
        title character varying(500),
        artist_id integer NOT NULL,
        name character varying(500),
        date integer,
        medium character varying(250),
        dimensions text,
        acquisition_date text,
        credit text,
        catalogue character varying(250),
        department character varying(250),
        classification character varying(250),
        object_number text,
        diameter_cm text,
        circumference_cm text,
        height_cm text,
        length_cm text,
        width_cm text,
        depth_cm text,
        weight_kg text,
        durations integer,
        CONSTRAINT artworks_pk PRIMARY KEY (artwork_id))

/*
3 rows from artworks table:
"artwork_id"	"title"	"artist_id"	"name"	"date"	"medium"	"dimensions"	"acquisition_date"	"credit"	"catalogue"	"department"	"classification"	"object_number"	"diameter_cm"	"circumference_cm"	"height_cm"	"length_cm"	"width_cm"	"depth_cm"	"weight_kg"	"durations"
102312	"Watching the Game"	2422	"John Gutmann"	1934	"Gelatin silver print"	"9 3/4 x 6 7/16' (24.8 x 16.4 cm)"	"2006-05-11"	"Purchase"	"N"	"Photography"	"Photograph"	"397.2006"			"24.8"		"16.4"			
103321	"Untitled (page from Sump)"	25520	"Jerome Neuner"	1994	"Page with chromogenic color print and text"	"12 x 9 1/2' (30.5 x 24.1 cm)"	"2006-05-11"	"E.T. Harmax Foundation Fund"	"N"	"Photography"	"Photograph"	"415.2006.12"			"30.4801"		"24.13"			
10	"The Manhattan Transcripts Project, New York, New York, Episode 1: The Park"	7056	"Bernard Tschumi"		"Gelatin silver photograph"	"14 x 18' (35.6 x 45.7 cm)"	"1995-01-17"	"Purchase and partial gift of the architect in honor of Lily Auchincloss"	"Y"	"Architecture & Design"	"Architecture"	"3.1995.11"			"35.6"		"45.7"			
*/""",
}
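
Writing these entries by hand becomes tedious as the number of tables grows. As a rough sketch, a helper along the following lines could assemble the same layout (DDL followed by a commented block of sample rows) from data you already have. The function name and its arguments are hypothetical, and unlike the hand-written entries it does not quote string values.

```python
def format_table_info(table_name, ddl, columns, rows):
    """Combine a CREATE TABLE statement with tab-separated sample rows in a SQL comment."""
    header = "\t".join(columns)
    body = "\n".join(
        "\t".join("" if value is None else str(value) for value in row)
        for row in rows
    )
    return f"{ddl}\n\n/*\n{len(rows)} rows from {table_name} table:\n{header}\n{body}\n*/"


info = format_table_info(
    "artists",
    "CREATE TABLE artists (artist_id integer NOT NULL, name character varying(200))",
    ["artist_id", "name"],
    [(19, "Richard Artschwager"), (22, None)],
)
print(info)
```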

In [None]:
db = SQLDatabase.from_uri(
    RDS_URI,
    include_tables=["artists", "artworks"],
    sample_rows_in_table_info=3,
    custom_table_info=custom_table_info,
)

db_chain = SQLDatabaseSequentialChain.from_llm(
    llm, db, verbose=True, use_query_checker=True, top_k=3
)

try:
    db_chain(QUESTION_05)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")

## Customize Prompt and Return Intermediate Steps

For this part of the demonstration, we will also use a `PromptTemplate`, one of LangChain's [Prompt Templates](https://python.langchain.com/en/latest/modules/prompts/prompt_templates.html). According to LangChain, "_A prompt template refers to a reproducible way to generate a prompt. It contains a text string (“the template”), that can take in a set of parameters from the end user and generate a prompt._"

According to LangChain's [documentation](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html#return-intermediate-steps), "_You can also return the intermediate steps of the `SQLDatabaseChain`. This allows you to access the SQL statement that was generated, as well as the result of running that against the SQL Database._"
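
To make the prompt-template idea concrete, here is a minimal sketch that uses plain Python string formatting rather than LangChain. It is not how `PromptTemplate` is implemented; it only illustrates a text string with named parameters that are filled in at run time. The shortened template and the `render_prompt` helper are made up for illustration.

```python
TEMPLATE = (
    "Given an input question, create a syntactically correct {dialect} query.\n"
    "Only use the following tables:\n{table_info}\n\n"
    "Question: {input}"
)


def render_prompt(template, **params):
    """Fill the named placeholders in `template`; raises KeyError if one is missing."""
    return template.format(**params)


prompt = render_prompt(
    TEMPLATE,
    dialect="postgresql",
    table_info="CREATE TABLE artists (artist_id integer, name text)",
    input="How many artists are there?",
)
print(prompt)
```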

In [None]:
from langchain.prompts.prompt import PromptTemplate

_DEFAULT_TEMPLATE = """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Use the following format:

Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"

Only use the following tables:

{table_info}

If someone asks for the art table, they really mean the artworks table.

Only use single quotes in the SQLQuery.

Question: {input}"""

PROMPT = PromptTemplate(
    input_variables=["input", "table_info", "dialect"], template=_DEFAULT_TEMPLATE
)

# Revert to a db without custom_table_info, since the larger prompt could
# overflow the model's context window (max prompt + completion length) of 4,097 tokens
db = SQLDatabase.from_uri(RDS_URI)

db_chain = SQLDatabaseChain.from_llm(
    llm,
    db,
    prompt=PROMPT,
    verbose=True,
    use_query_checker=True,
    return_intermediate_steps=True,
)

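try:
    result = db_chain(QUESTION_05)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")
else:
    # Only inspect the intermediate steps when the chain succeeded;
    # otherwise `result` would be undefined
    print(result["intermediate_steps"])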

## Using Few-shot Learning

To improve the accuracy of the SQL query, LangChain allows us to use few-shot learning (aka few-shot prompting). According to [Wikipedia](https://en.wikipedia.org/wiki/In-context_learning_(natural_language_processing)), "_In natural language processing, in-context learning, few-shot learning or few-shot prompting is a prompting technique that allows a model to process examples before attempting a task. The method was popularized after the advent of GPT-3 and is considered to be an emergent property of large language models._"

In [None]:
from typing import Dict
import yaml

chain = SQLDatabaseChain.from_llm(
    llm, db, verbose=True, return_intermediate_steps=True, use_query_checker=True
)


def _parse_example(result: Dict) -> Dict:
    sql_cmd_key = "sql_cmd"
    sql_result_key = "sql_result"
    table_info_key = "table_info"
    input_key = "input"
    final_answer_key = "answer"

    _example = {
        "input": result.get("query"),
    }

    steps = result.get("intermediate_steps")
    answer_key = sql_cmd_key  # the first one
    for step in steps:
        # The steps are in pairs, a dict (input) followed by a string (output).
        # Unfortunately there is no schema but you can look at the input key of the
        # dict to see what the output is supposed to be
        if isinstance(step, dict):
            # Grab the table info from input dicts in the intermediate steps once
            if table_info_key not in _example:
                _example[table_info_key] = step.get(table_info_key)

            if input_key in step:
                if step[input_key].endswith("SQLQuery:"):
                    answer_key = sql_cmd_key  # this is the SQL generation input
                if step[input_key].endswith("Answer:"):
                    answer_key = final_answer_key  # this is the final answer input
            elif sql_cmd_key in step:
                _example[sql_cmd_key] = step[sql_cmd_key]
                answer_key = sql_result_key  # this is SQL execution input
        elif isinstance(step, str):
            # The preceding element should have set the answer_key
            _example[answer_key] = step
    return _example


example: Dict
try:
    result = chain(QUESTION_05)
    print("\n*** Query succeeded")
    example = _parse_example(result)
except Exception as exc:
    print("\n*** Query failed")
    # Not every exception carries intermediate steps, so fall back to an empty list
    steps = getattr(exc, "intermediate_steps", [])
    result = {"query": QUESTION_05, "intermediate_steps": steps}
    example = _parse_example(result)


# print for now, in reality you may want to write this out to a YAML file or database for manual fix-ups offline
yaml_example = yaml.dump(example, allow_unicode=True)
print("\n" + yaml_example)

In [None]:
# Use the corrected examples for few shot prompt examples
SQL_SAMPLES = None

with open("../few_shot_examples/sql_examples_postgresql.yaml", "r") as stream:
    SQL_SAMPLES = yaml.safe_load(stream)

print(yaml.dump(SQL_SAMPLES[0], allow_unicode=True))
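
Since the examples are edited by hand, it may be worth checking that each one has every field the example prompt expects before using them. The sketch below assumes the five keys listed in the `input_variables` of `example_prompt` in the next cell; the helper name is hypothetical.

```python
REQUIRED_KEYS = {"table_info", "input", "sql_cmd", "sql_result", "answer"}


def find_incomplete_examples(examples, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for examples that lack a required key."""
    problems = []
    for index, example in enumerate(examples):
        missing = sorted(required_keys - set(example))
        if missing:
            problems.append((index, missing))
    return problems


# In the notebook, something like: print(find_incomplete_examples(SQL_SAMPLES))
```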

In [None]:
from langchain import FewShotPromptTemplate, PromptTemplate
from langchain.chains.sql_database.prompt import _postgres_prompt, PROMPT_SUFFIX
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.prompts.example_selector.semantic_similarity import (
    SemanticSimilarityExampleSelector,
)
from langchain.vectorstores import Chroma

example_prompt = PromptTemplate(
    input_variables=["table_info", "input", "sql_cmd", "sql_result", "answer"],
    template="{table_info}\n\nQuestion: {input}\nSQLQuery: {sql_cmd}\nSQLResult: {sql_result}\nAnswer: {answer}",
)

examples_dict = SQL_SAMPLES

local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

example_selector = SemanticSimilarityExampleSelector.from_examples(
    # This is the list of examples available to select from.
    examples_dict,
    # This is the embedding class used to produce embeddings which are used to measure semantic similarity.
    local_embeddings,
    # This is the VectorStore class that is used to store the embeddings and do a similarity search over.
    Chroma,  # type: ignore
    # This is the number of examples to produce and include per prompt
    k=min(3, len(examples_dict)),
)

few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix=_postgres_prompt + "Here are some examples:",
    suffix=PROMPT_SUFFIX,
    input_variables=["table_info", "input", "top_k"],
)

In [None]:
db_chain = SQLDatabaseChain.from_llm(
    llm,
    db,
    prompt=few_shot_prompt,
    use_query_checker=True,
    verbose=True,
    return_intermediate_steps=True,
)

try:
    result = db_chain(QUESTION_05)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")
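
The example-selection step can be illustrated without embeddings or a vector store. The sketch below swaps semantic similarity for simple word overlap (Jaccard similarity), so it is a stand-in for the idea rather than what `SemanticSimilarityExampleSelector` does internally. The sample examples are illustrative and not taken from the project's YAML file.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def select_examples(question, examples, k=2):
    """Return the k examples whose `input` is most similar to the question."""
    ranked = sorted(
        examples, key=lambda ex: jaccard(question, ex["input"]), reverse=True
    )
    return ranked[:k]


def build_few_shot_prompt(question, examples, k=2):
    """Prepend the selected examples to the new question."""
    shots = [
        f"Question: {ex['input']}\nSQLQuery: {ex['sql_cmd']}"
        for ex in select_examples(question, examples, k)
    ]
    return "\n\n".join(shots + [f"Question: {question}\nSQLQuery:"])


EXAMPLES = [  # illustrative examples only
    {"input": "How many artists are there?", "sql_cmd": "SELECT COUNT(*) FROM artists;"},
    {"input": "How many artworks are there?", "sql_cmd": "SELECT COUNT(*) FROM artworks;"},
    {"input": "What is the oldest artwork?", "sql_cmd": "SELECT title FROM artworks ORDER BY date LIMIT 1;"},
]

print(build_few_shot_prompt("How many artists are French?", EXAMPLES, k=1))
```

Word overlap misses paraphrases that share no vocabulary, which is the gap that embedding-based selection is meant to close.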

## LangChain SQL Database Agent

According to LangChain's [documentation](https://python.langchain.com/en/latest/modules/agents/toolkits/examples/sql_database.html#sql-database-agent), the SQL Database Agent "_builds off of `SQLDatabaseChain` and is designed to answer more general questions about a database, as well as recover from errors._" __NOTE__: "_It is not guaranteed that the agent won’t perform DML statements on your database given certain questions. Be careful running it on sensitive data!_"

In [None]:
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase

In [None]:
# Example of describing a table using the agent
toolkit = SQLDatabaseToolkit(db=db, llm=llm)

agent_executor = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

try:
    agent_executor.run("Describe the artists table.")
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")

In [None]:
# Example of running queries using the agent
try:
    agent_executor.run(QUESTION_05)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")
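
Because the agent is not guaranteed to avoid DML, connecting with a database user that has only `SELECT` privileges is the more dependable safeguard. As a supplementary illustration, the naive check sketched below flags statements that do not look read-only. The keyword list is an assumption and not exhaustive, and the check is deliberately conservative: it also rejects harmless queries that contain a write keyword or a semicolon inside a string literal.

```python
import re

# Statement types treated as writes in this sketch; an assumed, non-exhaustive list.
_WRITE_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|MERGE|DROP|ALTER|CREATE|TRUNCATE|GRANT|REVOKE)\b",
    re.IGNORECASE,
)


def is_read_only(sql):
    """Return True if the statement starts with SELECT or WITH and has no write keyword."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # reject anything that looks like multiple statements
        return False
    if not re.match(r"(SELECT|WITH)\b", statement, re.IGNORECASE):
        return False
    return _WRITE_KEYWORDS.search(statement) is None


print(is_read_only("SELECT COUNT(*) FROM artists WHERE nationality = 'French';"))
print(is_read_only("DELETE FROM artists"))
```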