# Build a Question and Answer system over SQL data
Adapted from [Build a Question/Answering system over SQL data](https://python.langchain.com/v0.2/docs/tutorials/sql_qa/)

Enabling an LLM system to query structured data can be qualitatively different from querying unstructured text. Whereas in the latter it is common to generate text that can be searched against a vector database, the approach for structured data is often for the LLM to write and execute queries in a DSL, such as SQL. In this guide we'll go over the basic ways to create a Q&A system over tabular data in databases. We will cover implementations using both chains and agents. These systems will allow us to ask a question about the data in a database and get back a natural language answer. The main difference between the two is that our agent can query the database in a loop as many times as it needs to answer the question.

## Security note
Building Q&A systems over SQL databases requires executing model-generated SQL queries. There are inherent risks in doing this. Make sure that your database connection permissions are always scoped as narrowly as possible for your chain/agent's needs. This will mitigate, though not eliminate, the risks of building a model-driven system. For more on general security best practices, see [here](https://python.langchain.com/v0.2/docs/security/).

## Architecture
At a high level, the steps of these systems are:

1. Convert question to DSL query: Model converts user input to a SQL query.
2. Execute SQL query: Execute the query.
3. Answer the question: Model responds to user input using the query results.
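
The three steps above can be sketched in plain Python. This is an illustrative outline only: `fake_write_query` and `fake_answer` are hypothetical stand-ins for the two model calls, and the small in-memory table is made up for the example.

```python
import sqlite3


def fake_write_query(question: str) -> str:
    """Hypothetical stand-in for the model call that turns a question into SQL."""
    return 'SELECT COUNT("EmployeeId") FROM "Employee"'


def fake_answer(question: str, result: list) -> str:
    """Hypothetical stand-in for the model call that phrases the final answer."""
    return f"There are {result[0][0]} employees."


def answer_question(conn, question):
    query = fake_write_query(question)       # 1. Convert question to SQL
    result = conn.execute(query).fetchall()  # 2. Execute the SQL query
    return fake_answer(question, result)     # 3. Answer using the result


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Employee" ("EmployeeId" INTEGER PRIMARY KEY, "LastName" TEXT)')
conn.executemany(
    'INSERT INTO "Employee" ("LastName") VALUES (?)',
    [("Adams",), ("Edwards",), ("Peacock",)],
)
answer = answer_question(conn, "How many employees are there?")
print(answer)
```

In the rest of this guide, the two stand-in functions are replaced by real LLM calls.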

Note that querying data in CSVs can follow a similar approach. See our [how-to guide on question-answering over CSV data](https://python.langchain.com/v0.2/docs/how_to/sql_csv/) for more detail.

## Setup
First, install the required packages and set environment variables:

A non-comprehensive list of the dependencies used in this example:
```bash
bs4==0.0.2
langchain==0.2.1
langchain-chroma==0.1.1
langchain-community==0.2.1
langchain-core==0.2.1
langchain-openai==0.1.7
langchain-text-splitters==0.2.0
langchainhub==0.1.20
langgraph==0.0.60
langserve==0.2.1
langsmith==0.1.63
python-dotenv==1.0.1
faiss-cpu==1.8.0.post1
```

In [1]:
%%capture --no-stderr
%pip install --upgrade --quiet  langchain langchain-community langchain-openai

We will use an OpenAI model in this guide.

LangSmith is also used for tracing.

### Setting credentials with python-dotenv
Load credentials from a `.env` file using the [python-dotenv package](https://pypi.org/project/python-dotenv/):

In [2]:
import os
from dotenv import load_dotenv

os.environ["LANGCHAIN_TRACING_V2"] = "true"

load_dotenv()
assert os.environ["LANGCHAIN_API_KEY"]
assert os.environ["OPENAI_API_KEY"]
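
A bare `assert` on `os.environ[...]` fails with a `KeyError` when a variable is absent. As a hedged alternative, the sketch below reports which variables are missing; the helper name `missing_env_vars` is hypothetical, and it is run here against a plain dict so that no real credentials are needed.

```python
import os

REQUIRED_VARS = ("LANGCHAIN_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demonstrated on a plain dict with a placeholder value instead of real credentials.
example_env = {"OPENAI_API_KEY": "sk-placeholder"}
missing = missing_env_vars(REQUIRED_VARS, example_env)
print(missing)
```

Calling `missing_env_vars(REQUIRED_VARS)` without the second argument checks the real process environment.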

### Install the Chinook database and SQLite3

You will first need to install sqlite3:
```bash
sudo apt-get install sqlite3
```

The below example will use a SQLite connection with Chinook database. Follow these installation steps to create Chinook.db in the same directory as this notebook:

* Save [this file](https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql) as Chinook.sql
* Run `sqlite3 Chinook.db`
* Run `.read Chinook.sql`
* Test `SELECT * FROM Artist LIMIT 10;`
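
The same steps can also be performed from Python with the standard `sqlite3` module, which does not require the `sqlite3` command-line tool. The sketch below uses a tiny inline script as a stand-in for the contents of `Chinook.sql` so that it runs on its own; reading the real file (and handling its text encoding) is left as an assumption in the comments.

```python
import sqlite3

# Tiny stand-in for the contents of Chinook.sql, so the example is self-contained.
# With the downloaded file you would read it instead, for example:
#     script = pathlib.Path("Chinook.sql").read_text()
script = """
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
INSERT INTO Artist (Name) VALUES ('AC/DC');
INSERT INTO Artist (Name) VALUES ('Accept');
"""

conn = sqlite3.connect(":memory:")  # pass "Chinook.db" to create a file instead
conn.executescript(script)          # equivalent of `.read Chinook.sql`
rows = conn.execute("SELECT * FROM Artist LIMIT 10;").fetchall()
print(rows)
```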

Now, `Chinook.db` is in our directory and we can interface with it using the SQLAlchemy-driven `SQLDatabase` class:

In [4]:
from langchain_community.utilities import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///Chinook.db") # Reads from file Chinook.db from the same directory

print(db.dialect)
print(db.get_usable_table_names())
db.run("SELECT * FROM Artist LIMIT 10;")

sqlite
['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']


"[(1, 'AC/DC'), (2, 'Accept'), (3, 'Aerosmith'), (4, 'Alanis Morissette'), (5, 'Alice In Chains'), (6, 'Antônio Carlos Jobim'), (7, 'Apocalyptica'), (8, 'Audioslave'), (9, 'BackBeat'), (10, 'Billy Cobham')]"

__API Reference__: [SQLDatabase](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.sql_database.SQLDatabase.html)

Great! We've got a SQL database that we can query. Now let's try hooking it up to an LLM.

## Chains
Chains (i.e., compositions of LangChain [Runnables](https://python.langchain.com/v0.2/docs/concepts/#langchain-expression-language-lcel)) support applications whose steps are predictable. We can create a simple chain that takes a question and does the following:

* convert the question into a SQL query;
* execute the query;
* use the result to answer the original question.

There are scenarios not supported by this arrangement. For example, this system will execute a SQL query for any user input, even "hello". Importantly, as we'll see below, some questions require more than one query to answer. We will address these scenarios in the Agents section.

### Convert question to SQL query
The first step in a SQL chain or agent is to take the user input and convert it to a SQL query. LangChain comes with a built-in chain for this: [create_sql_query_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.sql_database.query.create_sql_query_chain.html).

In [5]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

In [6]:
from langchain.chains import create_sql_query_chain

chain = create_sql_query_chain(llm, db)
response = chain.invoke({"question": "How many employees are there"})
response

'SELECT COUNT("EmployeeId") AS "TotalEmployees" FROM "Employee"\nLIMIT 1;'

__API Reference__: [create_sql_query_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.sql_database.query.create_sql_query_chain.html)

We can execute the query to ensure it is valid.

In [7]:
db.run(response)

'[(8,)]'

We can look at the [LangSmith trace](https://smith.langchain.com/public/c8fa52ea-be46-4829-bde2-52894970b830/r) to get a better understanding of what this chain is doing. We can also inspect the chain directly for its prompts. Looking at the prompt (below), we can see that it:

* Is dialect-specific. In this case it references SQLite explicitly.
* Has definitions for all the available tables.
* Has three example rows for each table.

This technique is inspired by papers like [this](https://arxiv.org/pdf/2204.00498.pdf), which suggest that showing example rows and being explicit about tables improves performance. We can also inspect the full prompt like so:

In [8]:
chain.get_prompts()[0].pretty_print()

You are a SQLite expert. Given an input question, first create a syntactically correct SQLite query to run, then look at the results of the query and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most 5 results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use date('now') function to get the current date, if the question involves "today".

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result
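
To make the idea of "table definitions plus a few example rows" concrete, here is a minimal sketch that assembles such a description with the standard `sqlite3` module. It is not LangChain's implementation; the function name `describe_tables` and the sample `Genre` table are assumptions for illustration.

```python
import sqlite3


def describe_tables(conn, sample_rows=3):
    """Build a text block with each table's CREATE statement and a few sample rows."""
    parts = []
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for name, create_sql in tables:
        rows = conn.execute(
            f'SELECT * FROM "{name}" LIMIT ?', (sample_rows,)
        ).fetchall()
        sample = "\n".join(str(row) for row in rows)
        parts.append(f"{create_sql}\n/* {len(rows)} sample rows from {name}: */\n{sample}")
    return "\n\n".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Genre (GenreId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Genre (Name) VALUES (?)",
    [("Rock",), ("Jazz",), ("Metal",), ("Blues",)],
)
table_info = describe_tables(conn)
print(table_info)
```

Only the first three rows appear in the output, which keeps the prompt short while still showing the model what the values look like.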

### Execute SQL query
Now that we've generated a SQL query, we'll want to execute it. This is __the most dangerous part of creating a SQL chain.__ Consider carefully if it is OK to run automated queries over your data. Minimize the database connection permissions as much as possible. Consider adding a human approval step to your chains before query execution.
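
A human approval step can be as simple as a callback that must return `True` before the query runs. The sketch below shows the shape of such a gate with the standard `sqlite3` module; `run_with_approval` and `select_only` are hypothetical names, and the prefix check merely stands in for a person reviewing the query. It is not a safeguard on its own.

```python
import sqlite3


def run_with_approval(conn, query, approve):
    """Execute `query` only if the `approve` callback returns True."""
    if not approve(query):
        raise PermissionError(f"Query rejected by reviewer: {query}")
    return conn.execute(query).fetchall()


def select_only(query):
    # Hypothetical reviewer policy: approve plain SELECT statements only.
    return query.lstrip().upper().startswith("SELECT")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO Employee (EmployeeId) VALUES (?)", [(i,) for i in range(1, 9)]
)

approved = run_with_approval(conn, "SELECT COUNT(*) FROM Employee", select_only)
try:
    run_with_approval(conn, "DROP TABLE Employee", select_only)
    rejected = False
except PermissionError:
    rejected = True
print(approved, rejected)
```

In an interactive setting, `approve` could print the query and wait for a reviewer to confirm it.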

We can use the `QuerySQLDataBaseTool` to easily add query execution to our chain:

In [11]:
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool

execute_query = QuerySQLDataBaseTool(db=db)    # Tool that runs a SQL query against the DB
write_query = create_sql_query_chain(llm, db)  # Chain in which the LLM writes a query from the DB schema
chain = write_query | execute_query            # Chain that writes a query and then executes it
chain.invoke({"question": "How many employees are there"})

'[(8,)]'

__API Reference__: [QuerySQLDataBaseTool](https://api.python.langchain.com/en/latest/tools/langchain_community.tools.sql_database.tool.QuerySQLDataBaseTool.html)

### Answer the question
Now that we've got a way to automatically generate and execute queries, we just need to combine the original question and SQL query result to generate a final answer. We can do this by passing question and result to the LLM once more:

In [16]:
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

answer_prompt = PromptTemplate.from_template(
    """Given the following user question, corresponding SQL query, and SQL result, answer the user question.

Question: {question}
SQL Query: {query}
SQL Result: {result}
Answer: """
)  # Prompt that combines the question, the generated query, and its result

# Build the full chain:
#   1. write_query generates the SQL query and stores it under "query"
#   2. itemgetter("query") extracts that query and execute_query runs it
#   3. the prompt, LLM, and output parser turn the result into an answer
chain = (
    RunnablePassthrough.assign(query=write_query).assign(
        result=itemgetter("query") | execute_query
    )
    | answer_prompt      # Fill the prompt with question, query, and result
    | llm                # Send the rendered prompt to the LLM
    | StrOutputParser()  # Extract the string content of the LLM's message
)

chain.invoke({"question": "How many employees are there"})

'There are a total of 8 employees in the database.'

Let's review what happens in the above LCEL when this chain is invoked.

* After the first `RunnablePassthrough.assign`, we have a dict with two keys: `{"question": question, "query": write_query.invoke(question)}`, where `write_query` generates a SQL query in service of answering the question.
* After the second `RunnablePassthrough.assign`, we have added a third key, `"result"`, that contains `execute_query.invoke(query)`, where `query` was computed in the previous step.
* These three inputs are formatted into the prompt and passed into the LLM. The `StrOutputParser()` plucks out the string content of the output message.
* Note that we are composing LLMs, tools, prompts, and other chains together, but because each implements the Runnable interface, their inputs and outputs can be tied together in a reasonable way.
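
The data flow through the two `assign` calls can be mimicked with ordinary dictionaries. This is a conceptual sketch only, not how LangChain implements `RunnablePassthrough.assign`; the `assign` helper and the two stand-in functions are hypothetical.

```python
def assign(state, **steps):
    """Return a copy of `state` with one new key per step, each computed from `state`."""
    return {**state, **{key: step(state) for key, step in steps.items()}}


# Hypothetical stand-ins for write_query and execute_query.
def write_query(state):
    return 'SELECT COUNT(*) FROM "Employee"'


def execute_query(query):
    return "[(8,)]"


state = {"question": "How many employees are there"}
state = assign(state, query=write_query)                           # adds "query"
state = assign(state, result=lambda s: execute_query(s["query"]))  # adds "result"
print(state)
```

Each step receives everything computed so far and adds one key, which is why the final prompt can reference `question`, `query`, and `result` together.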

### Next steps
For more complex query generation, we may want to create few-shot prompts or add query-checking steps. For advanced techniques like these and more, check out:

* [Prompting strategies](https://python.langchain.com/v0.2/docs/how_to/sql_prompting/): Advanced prompt engineering techniques.
* [Query checking](https://python.langchain.com/v0.2/docs/how_to/sql_query_checking/): Add query validation and error handling.
* [Large databases](https://python.langchain.com/v0.2/docs/how_to/sql_large_db/): Techniques for working with large databases.

## Agents
LangChain has a SQL Agent which provides a more flexible way of interacting with SQL databases than a chain. The main advantages of using the SQL Agent are:

* It can answer questions based on the database's schema as well as on the database's content (like describing a specific table).
* It can recover from errors by running a generated query, catching the traceback and regenerating it correctly.
* It can query the database as many times as needed to answer the user question.
* It will save tokens by only retrieving the schema from relevant tables.

To initialize the agent we'll use the `SQLDatabaseToolkit` to create a bunch of tools:

* Create and execute queries
* Check query syntax
* Retrieve table descriptions
* ... and more

In [17]:
from langchain_community.agent_toolkits import SQLDatabaseToolkit

toolkit = SQLDatabaseToolkit(db=db, llm=llm)

tools = toolkit.get_tools()

tools

[QuerySQLDataBaseTool(description="Input to this tool is a detailed and correct SQL query, output is a result from the database. If the query is not correct, an error message will be returned. If an error is returned, rewrite the query, check the query, and try again. If you encounter an issue with Unknown column 'xxxx' in 'field list', use sql_db_schema to query the correct table fields.", db=<langchain_community.utilities.sql_database.SQLDatabase object at 0x7fc7265f8b50>),
 InfoSQLDatabaseTool(description='Input to this tool is a comma-separated list of tables, output is the schema and sample rows for those tables. Be sure that the tables actually exist by calling sql_db_list_tables first! Example Input: table1, table2, table3', db=<langchain_community.utilities.sql_database.SQLDatabase object at 0x7fc7265f8b50>),
 ListSQLDatabaseTool(db=<langchain_community.utilities.sql_database.SQLDatabase object at 0x7fc7265f8b50>),
 QuerySQLCheckerTool(description='Use this tool to double check

__API Reference__: [SQLDatabaseToolkit](https://api.python.langchain.com/en/latest/agent_toolkits/langchain_community.agent_toolkits.sql.toolkit.SQLDatabaseToolkit.html)

### System Prompt
We will also want to create a system prompt for our agent. This will consist of instructions for how to behave.

In [18]:
from langchain_core.messages import SystemMessage

SQL_PREFIX = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct SQLite query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most 5 results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the below tools. Only use the information returned by the below tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

To start you should ALWAYS look at the tables in the database to see what you can query.
Do NOT skip this step.
Then you should query the schema of the most relevant tables."""

system_message = SystemMessage(content=SQL_PREFIX)

__API Reference__: [SystemMessage](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.system.SystemMessage.html)

### Initializing agent

This step requires the __LangGraph__ package.

We will use a prebuilt [LangGraph](https://python.langchain.com/v0.2/docs/concepts/#langgraph) agent:

In [19]:
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent

# Construct the Agent
agent_executor = create_react_agent(llm,    # Can access LLM
                                    tools,  # Can access tools
                                    messages_modifier=system_message) # Messages will always be modified with System Message

__API Reference__: [HumanMessage](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.human.HumanMessage.html)

Consider how the agent responds to the below question:

In [20]:
for s in agent_executor.stream(                                                          # Streaming output
    {"messages": [HumanMessage(content="Which country's customers spent the most?")]}    # dict with messages, one of which is HumanMessage
):
    print(s)
    print("----")

{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_VoSFC89YoaOo242G3cRDGCKk', 'function': {'arguments': '{"table_names":"customers"}', 'name': 'sql_db_schema'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 16, 'prompt_tokens': 557, 'total_tokens': 573}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-3ddbb967-cb28-4bb4-a738-11001cbf8773-0', tool_calls=[{'name': 'sql_db_schema', 'args': {'table_names': 'customers'}, 'id': 'call_VoSFC89YoaOo242G3cRDGCKk'}], usage_metadata={'input_tokens': 557, 'output_tokens': 16, 'total_tokens': 573})]}}
----
{'tools': {'messages': [ToolMessage(content="Error: table_names {'customers'} not found in database", name='sql_db_schema', tool_call_id='call_VoSFC89YoaOo242G3cRDGCKk')]}}
----
{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_sJlxEfkyhjRPyETZa8cZjX8p', 'fun

Note that the agent executes multiple queries until it has the information it needs:

1. Lists the available tables.
2. Retrieves the schema for three tables.
3. Queries several of the tables via a join operation.

The agent is then able to use the result of the final query to generate an answer to the original question.
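
The loop behind this behavior can be sketched without an LLM. In the example below, `scripted_policy` is a hypothetical stand-in for the model's tool choice, and the two tools are simplified imitations written with the standard `sqlite3` module. The point is that a tool reports errors as text, so the loop can recover from a wrong table name instead of crashing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Country TEXT)")


def list_tables(_arg):
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    return ", ".join(name for (name,) in rows)


def table_schema(table):
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    # Report failures as text so the loop can continue, instead of raising.
    return row[0] if row else f"Error: table {table!r} not found in database"


TOOLS = {"list_tables": list_tables, "table_schema": table_schema}


def scripted_policy(history):
    """Hypothetical stand-in for the model: pick the next tool call from the history."""
    if not history:
        return ("table_schema", "customers")  # first guess uses a wrong table name
    last_tool, _, last_output = history[-1]
    if last_output.startswith("Error"):
        return ("list_tables", "")            # recover by listing the real tables
    if last_tool == "list_tables":
        return ("table_schema", last_output.split(", ")[0])
    return None                               # enough information gathered, stop


history = []
# Cap the number of steps so a misbehaving policy cannot loop forever.
while (call := scripted_policy(history)) is not None and len(history) < 10:
    tool, arg = call
    history.append((tool, arg, TOOLS[tool](arg)))

for step in history:
    print(step)
```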

The agent can similarly handle qualitative questions:

In [21]:
for s in agent_executor.stream(
    {"messages": [HumanMessage(content="Describe the playlisttrack table")]}
):
    print(s)
    print("----")

{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_bHaXSqC6J6IdnTjlxiq3ZeN0', 'function': {'arguments': '{"table_names":"playlisttrack"}', 'name': 'sql_db_schema'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 554, 'total_tokens': 571}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-20c5cfc4-cf13-4137-82bc-b58309aaed0e-0', tool_calls=[{'name': 'sql_db_schema', 'args': {'table_names': 'playlisttrack'}, 'id': 'call_bHaXSqC6J6IdnTjlxiq3ZeN0'}], usage_metadata={'input_tokens': 554, 'output_tokens': 17, 'total_tokens': 571})]}}
----
{'tools': {'messages': [ToolMessage(content="Error: table_names {'playlisttrack'} not found in database", name='sql_db_schema', tool_call_id='call_bHaXSqC6J6IdnTjlxiq3ZeN0')]}}
----
{'agent': {'messages': [AIMessage(content='I apologize, it seems there was an error in retrieving the schema for

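### Dealing with high-cardinality columns

To filter on columns that contain proper nouns, such as artist names or album titles, the agent needs the spelling that is actually stored in the database. One approach is to collect the distinct values of those columns and let the agent look up the closest match before it writes a filter. First, we define a function that runs a query and parses the result into a list of unique values: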

In [26]:
import ast
import re


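def query_as_list(db, query):
    """Run a query and return the unique, cleaned string values it yields."""
    res = db.run(query)  # db.run returns the rows as a string
    # Parse the string into tuples, flatten them, and drop empty values
    res = [el for sub in ast.literal_eval(res) for el in sub if el]
    # Remove standalone numbers and surrounding whitespace
    res = [re.sub(r"\b\d+\b", "", string).strip() for string in res]
    return list(set(res))  # De-duplicate (order is not preserved)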


artists = query_as_list(db, "SELECT Name FROM Artist")  # A list of Artist names, these are proper nouns
print(artists[:5])

albums = query_as_list(db, "SELECT Title FROM Album")  # A list of Album names, these are proper nouns
print(albums[:5])

['Green Day', 'Ozzy Osbourne', 'Emanuel Ax, Eugene Ormandy & Philadelphia Orchestra', 'Pedro Luís & A Parede', 'Simply Red']
['Minha História', 'Santana Live', 'Battlestar Galactica, Season', 'Allegri: Miserere', 'Bach: Goldberg Variations']
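
To see what each line of `query_as_list` does, the sketch below applies the same steps to a hand-written string shaped like the output of `db.run`. The sample values are made up for illustration.

```python
import ast
import re

# Shaped like the string that db.run returns: a printed list of row tuples.
raw = "[('AC/DC',), ('Battlestar Galactica, Season 1',), ('AC/DC',), (None,)]"

rows = ast.literal_eval(raw)                                 # string -> list of tuples
flat = [el for row in rows for el in row if el]              # flatten, drop empty values
cleaned = [re.sub(r"\b\d+\b", "", s).strip() for s in flat]  # drop standalone numbers
unique = sorted(set(cleaned))                                # de-duplicate, sort for display
print(unique)
```

Note that the regular expression removes standalone numbers, which is why a title such as "Battlestar Galactica, Season 1" appears without its season number.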


Using this function, we can create a retriever tool that the agent can execute at its discretion.

In [30]:
from langchain.agents.agent_toolkits import create_retriever_tool
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vector_db = FAISS.from_texts(artists + albums,
                             OpenAIEmbeddings()) # A VectorDB built from the list of artists and albums, with OpenAI embeddings

retriever = vector_db.as_retriever(search_kwargs={"k": 5}) # A retriever that gets the top 5 values from the vector DB


description = """Use to look up values to filter on. Input is an approximate spelling of the proper noun, output is \
valid proper nouns. Use the noun most similar to the search."""
retriever_tool = create_retriever_tool(
    retriever,
    name="search_proper_nouns",
    description=description,
) # A tool built from the vector_db retriever

__API Reference__: [create_retriever_tool](https://api.python.langchain.com/en/latest/tools/langchain_core.tools.create_retriever_tool.html) | [FAISS](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.faiss.FAISS.html) | [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html)

Let's try it out:

In [32]:
print(retriever_tool.invoke("Alice Chains")) # Return the top 5 nouns which match "Alice Chains"

Alice In Chains

Alanis Morissette

Pearl Jam

Pearl Jam

Audioslave


This way, if the agent determines it needs to write a filter based on an artist along the lines of "Alice Chains", it can first use the retriever tool to observe relevant values of a column.
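
The lookup idea can be illustrated without embeddings. The sketch below swaps in character-based matching from the standard library's `difflib` as a stand-in for vector similarity; it is a different technique from the FAISS retriever above, and `closest_proper_noun` is a hypothetical helper.

```python
import difflib

proper_nouns = ["AC/DC", "Alice In Chains", "Alanis Morissette", "Audioslave", "Pearl Jam"]


def closest_proper_noun(text, candidates):
    """Return the candidate most similar to `text`, or None if nothing is close."""
    matches = difflib.get_close_matches(text, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None


match = closest_proper_noun("Alice Chains", proper_nouns)
print(match)
```

The corrected value can then be used in a `WHERE` clause instead of the user's approximate spelling.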

Putting this together:

In [34]:
system = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct SQLite query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most 5 results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the given tools. Only use the information returned by the tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

You have access to the following tables: {table_names}

If you need to filter on a proper noun, you must ALWAYS first look up the filter value using the "search_proper_nouns" tool!
Do not try to guess at the proper name - use this function to find similar ones.""".format(
    table_names=db.get_usable_table_names() # Retrieve table names from the DB
)

system_message = SystemMessage(content=system) # Construct the System Message

tools.append(retriever_tool)                   # Add the retriever tool to the tools list

agent = create_react_agent(llm,
                           tools,
                           messages_modifier=system_message) # Construct the agent

In [35]:
for s in agent.stream(
    {"messages": [HumanMessage(content="How many albums does alis in chain have?")]} # Note the typo
):
    print(s)
    print("----")

{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_CojheqSBN0OOCBUrlxXzx22U', 'function': {'arguments': '{"query":"alis in chain"}', 'name': 'search_proper_nouns'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 736, 'total_tokens': 755}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-21ea5d58-34f4-4b67-a9b0-602e3f30b2cb-0', tool_calls=[{'name': 'search_proper_nouns', 'args': {'query': 'alis in chain'}, 'id': 'call_CojheqSBN0OOCBUrlxXzx22U'}], usage_metadata={'input_tokens': 736, 'output_tokens': 19, 'total_tokens': 755})]}}
----
{'tools': {'messages': [ToolMessage(content='Alice In Chains\n\nAisha Duo\n\nXis\n\nDa Lama Ao Caos\n\nA-Sides', name='search_proper_nouns', tool_call_id='call_CojheqSBN0OOCBUrlxXzx22U')]}}
----
{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_KMpcqO

As we can see, the agent used the `search_proper_nouns` tool in order to check how to correctly query the database for this specific artist.