<a href="https://colab.research.google.com/github/duyvm/funny_stuff_with_llm/blob/main/learning-rag/Langchain_dealing_with_large_db_in_SQL_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install --quiet --upgrade langchain-chroma langchain[openai] langchain langchain-community langgraph langchain-core langchain-text-splitters> /dev/null

In [None]:
from google.colab import userdata
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGSMITH_PROJECT"] = "langchain-learning-rag"
os.environ["LANGSMITH_API_KEY"] = userdata.get('LANGSMITH_API_KEY')
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Overview

Guide: [How to deal with large databases when doing SQL question-answering](https://python.langchain.com/docs/how_to/sql_large_db/)

## Objectives

- Including the full database schema (every table, column, and sample row) in every LLM call is wasteful.

- Instead, we need a way to dynamically identify and fetch only the subset of tables, columns, and sample rows relevant to the query, and pass just that subset to the LLM.

- Context windows are now large enough to fit a lot of context in a single prompt, but several reports suggest that long contexts often degrade model reasoning and overall performance:
  - [Here](https://community.openai.com/t/reasoning-degradation-in-llms-with-long-context-windows-new-benchmarks/906891)

  - [Here](https://osf.io/cf8v2)

  - [And here](https://arxiv.org/html/2410.18745v1)
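
To make the cost concrete, the sketch below compares the size of a full schema dump with the size of a two-table subset. It uses a toy in-memory SQLite database rather than Chinook; the table definitions and the `schema_text` helper are hypothetical, and character count is only a rough proxy for token count.

```python
import sqlite3

# Hypothetical toy schema; it stands in for a much larger real database.
DDL = """
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
CREATE TABLE Invoice (InvoiceId INTEGER PRIMARY KEY, CustomerId INTEGER, Total REAL);
CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT, Title TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)


def schema_text(conn, tables=None):
    """Return the CREATE TABLE statements, optionally limited to `tables`."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return "\n".join(sql for name, sql in rows if tables is None or name in tables)


full_schema = schema_text(conn)
subset_schema = schema_text(conn, tables={"Artist", "Album"})

# The subset is what we would send for a question about artists and albums.
print(len(full_schema), len(subset_schema))
```

The gap grows with the number of tables and with the number of sample rows included per table.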

# Load data

Download the Chinook SQL script and build the SQLite database from it.

In [None]:
!sudo apt-get install sqlite3
!curl -s https://raw.githubusercontent.com/duyvm/funny_stuff_with_llm/refs/heads/main/learning-rag/db/Chinook_Sqlite.sql | sqlite3 Chinook.db


In [None]:
from langchain_community.utilities import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///Chinook.db", sample_rows_in_table_info=3)
print(f"dialect: {db.dialect}")
print(f"table_names: {db.get_usable_table_names()}")
print(f"Test 'SELECT * FROM Artist LIMIT 10;': {db.run('SELECT * FROM Artist LIMIT 10;')}")

dialect: sqlite
table_names: ['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']
Test 'SELECT * FROM Artist LIMIT 10;': [(1, 'AC/DC'), (2, 'Accept'), (3, 'Aerosmith'), (4, 'Alanis Morissette'), (5, 'Alice In Chains'), (6, 'Antônio Carlos Jobim'), (7, 'Apocalyptica'), (8, 'Audioslave'), (9, 'BackBeat'), (10, 'Billy Cobham')]


In [None]:
from langchain.chat_models import init_chat_model

llm = init_chat_model("openai:gpt-4o-mini")

## Using only relevant tables

1. LLM call to get the relevant table names (works best when given table names, descriptions, and column names)

2. LLM call to generate the query, given only the schemas of the relevant tables

In [None]:
from langchain_core.output_parsers.openai_tools import PydanticToolsParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class Table(BaseModel):
    """Table in SQL database"""
    name: str = Field(description="Name of table in SQL database.")

table_names = "\n".join(db.get_usable_table_names())

print(table_names)

Album
Artist
Customer
Employee
Genre
Invoice
InvoiceLine
MediaType
Playlist
PlaylistTrack
Track


In [None]:
system_prompt = f"""
Return the names of ALL the SQL tables that MIGHT be relevant to answer the question.
The tables are:

{table_names}

Remember to include ALL POTENTIALLY RELEVANT tables, even if you are not sure that they are needed.
"""

prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt), ("human", "{input}")]
)

# bind tools to llm
llm_with_tool = llm.bind_tools([Table])
output_parser = PydanticToolsParser(tools=[Table])

chain = prompt | llm_with_tool | output_parser

In [None]:
chain.invoke("What are all the genres of Alanis Morissette songs?")

[Table(name='Artist'),
 Table(name='Album'),
 Table(name='Genre'),
 Table(name='Track')]
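
The model is free to return any string as a table name, including one that does not exist. A minimal sketch of a guard that keeps only names present in the database; `keep_known_tables` is a hypothetical helper, and `known_tables` stands in for the result of `db.get_usable_table_names()`:

```python
def keep_known_tables(candidates, known_tables):
    """Keep candidates that match a known table (case-insensitive), without duplicates."""
    canonical = {name.lower(): name for name in known_tables}
    kept = []
    for candidate in candidates:
        name = canonical.get(candidate.strip().lower())
        if name is not None and name not in kept:
            kept.append(name)
    return kept


known_tables = ["Album", "Artist", "Genre", "Track"]
filtered = keep_known_tables(["artist", "Track", "Songs", "Track"], known_tables)
print(filtered)  # ['Artist', 'Track']
```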

Once the relevant tables have been identified dynamically, we add the query synthesis step.

In [None]:
from operator import itemgetter
from langchain.chains import create_sql_query_chain
from langchain_core.runnables import RunnablePassthrough

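# Step 2: generate SQL from the question and the schemas of the selected tables
query_chain = create_sql_query_chain(llm, db)

# Step 1: map {"question": ...} to the list of relevant table names
table_chain = {"input": itemgetter("question")} | chain | (lambda tables: [table.name for table in tables])

# Inject the selected tables as "table_names_to_use", then keep only the text after "SQLQuery: "
full_chain = RunnablePassthrough.assign(
    table_names_to_use=table_chain
) | query_chain | (lambda x: x.split("SQLQuery: ")[-1])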

In [None]:
query = full_chain.invoke(
    {"question": "What are all the genres of Alanis Morissette songs?"}
)
print(query)


```sql
SELECT DISTINCT "Genre"."Name" 
FROM "Track" 
JOIN "Album" ON "Track"."AlbumId" = "Album"."AlbumId" 
JOIN "Artist" ON "Album"."ArtistId" = "Artist"."ArtistId" 
JOIN "Genre" ON "Track"."GenreId" = "Genre"."GenreId" 
WHERE "Artist"."Name" = 'Alanis Morissette' 
LIMIT 5;
```
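
The query printed above is wrapped in a Markdown code fence, so it would need cleaning before being executed against the database. A minimal sketch of such a cleanup step, assuming the model returns either a bare query or a single fenced block; `strip_sql_fence` is a hypothetical helper, not part of LangChain:

```python
import re

FENCE = "`" * 3  # built indirectly so this example contains no literal fence

# Optional "sql" tag after the opening fence, then the query, then the closing fence.
_FENCED = re.compile(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def strip_sql_fence(text: str) -> str:
    """Return the query inside a Markdown code fence, or the stripped text if unfenced."""
    match = _FENCED.search(text)
    return (match.group(1) if match else text).strip()


raw = f"{FENCE}sql\nSELECT 1;\n{FENCE}"
print(strip_sql_fence(raw))  # SELECT 1;
```

A function like this could be piped onto the end of `full_chain` in the same way as the existing `split` lambda.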


## High-cardinality columns

- High-cardinality columns often store proper nouns (names, addresses).

- To filter on these columns reliably, we must check the spelling of the proper nouns in the user input before generating the query.

- Naive strategy: create a vector store with all the distinct proper nouns that exist in the database, query it with each user input, and inject the most relevant proper nouns into the prompt.
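
The naive strategy above can be sketched without an embedding model. The example below swaps the vector store for `difflib.get_close_matches` from the standard library, purely to stay self-contained; a real setup would embed the nouns and run a similarity search instead. The toy `Artist` rows and the helpers `build_noun_index` and `closest_nouns` are hypothetical.

```python
import difflib
import sqlite3

# Toy stand-in for the Artist table of the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Artist (Name) VALUES (?)",
    [("AC/DC",), ("Aerosmith",), ("Alanis Morissette",), ("Alice In Chains",)],
)


def build_noun_index(conn, table, column):
    """Collect the distinct non-empty values of one high-cardinality column."""
    rows = conn.execute(f'SELECT DISTINCT "{column}" FROM "{table}"').fetchall()
    return [value for (value,) in rows if value]


def closest_nouns(user_text, nouns, n=3, cutoff=0.6):
    """Return the stored spellings that most resemble the user's spelling."""
    lowered = {noun.lower(): noun for noun in nouns}
    hits = difflib.get_close_matches(user_text.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


nouns = build_noun_index(conn, "Artist", "Name")
candidates = closest_nouns("alanis morisette", nouns)  # misspelled on purpose
print(candidates)
```

The returned candidates would then be injected into the query-generation prompt so the model filters on the stored spelling rather than the user's.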