<a href="https://colab.research.google.com/github/cooolbabu/GoogleGemini101/blob/main/OpenAI/11_LangChain_SQL_Queries_Postgres_Claude_Gotcha.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Playing with LangChain and Claude, I found an issue. This notebook highlights it to get a better understanding of the underlying problem

# Justin Grammens LetsLearnLangChain presentation.
- [Course content on Github](https://github.com/justingrammens/LetsLearnLangChain)
- [Youtube link](https://www.youtube.com/watch?v=QT3wALFDZBo)

# Data for Chinook
- Chinook is a sample database containing sales of music tracks across different genres
- [Luis Rocha Chinook-Database](https://github.com/lerocha/chinook-database)


## Issue
- The question is "Get total sales by quarter for years 2023 and 2024 for each genre. Analyse the results and give me a summary."
  - LangChain's SQLDatabaseChain runs the query and retrieves the results. These results are then passed to Anthropic's Claude to analyse and summarize.
  - Claude does a fantastic job of summarizing the results.
  - However, it misses the fact that the Rock genre had its highest sales of 2023 in Q1.


## Installation


In [1]:
!pip install langchain langchain_experimental langchain-anthropic -q
!pip install supabase -q
!pip install pyodbc -q


## Helper functions

In [2]:
# Retrieve a file from a GitHub repository
import requests

def get_ChinookDB_ER_from_github(file_url):
    """Download a text file (the ChinookDB ER diagram) and return its content."""
    # Send a GET request to the file URL; the timeout avoids hanging indefinitely
    response = requests.get(file_url, timeout=30)

    # Fail loudly on HTTP errors so an error message never ends up in the prompt
    response.raise_for_status()

    # Return the content of the file as text
    return response.text

## Get API Keys and setup

In [3]:
from langchain import PromptTemplate, LLMChain, ConversationChain
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain.agents import create_sql_agent
from langchain.agents.agent_types import AgentType
from langchain.sql_database import SQLDatabase

from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.agent_toolkits import SQLDatabaseToolkit

from langchain_anthropic import ChatAnthropic

from supabase import create_client, Client

from sqlalchemy import create_engine

import requests
import json
import pandas as pd
import pyodbc

from google.colab import userdata

# setup supabase
supabase_url = userdata.get('SUPABASE_URL')
supabase_key = userdata.get('SUPABASE_KEY')
supabase_client = create_client(supabase_url, supabase_key)

# Anthropic (Claude) API key
claude_api_key = userdata.get('Claude_api_key')

In [4]:
# Connecting to Supabase using SQL Alchemy

# Make sure the connection string is a proper string within quotes
supabase_conn_str = userdata.get('supabase_conn_str')
#supabase_conn_str = "postgresql+psycopg2://[USERID:PASSWORD]@[HOSTNAME:PORT_NUMBER]/[DATABASE_NAME]"


# Use the connection string with SQLAlchemy's create_engine function
db_engine = create_engine(supabase_conn_str)
db = SQLDatabase(db_engine)

# test the connection
print(db.dialect)
print(db.get_usable_table_names())


postgresql
['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track', 'countries']
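
Before handing a connection string to `create_engine`, it can help to confirm that it has the shape shown in the commented template above. The following is a minimal sketch using only the standard library; the function name `describe_conn_str` and the credentials in the example are hypothetical placeholders, not values from this notebook.

```python
from urllib.parse import urlsplit

def describe_conn_str(conn_str):
    """Split a SQLAlchemy-style URL into its parts without exposing the password."""
    parts = urlsplit(conn_str)
    return {
        "driver": parts.scheme,                  # e.g. postgresql+psycopg2
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "has_password": parts.password is not None,
    }

# Hypothetical placeholder values, not real credentials.
info = describe_conn_str(
    "postgresql+psycopg2://app_user:s3cret@db.example.com:5432/postgres"
)
print(info)
```

If any field comes back as `None` or empty, the string is probably missing a component. Passwords containing characters such as `@` or `/` generally need to be percent-encoded first.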


## Setup context and build the prompt
- The ER diagram for ChinookDB is a Mermaid.JS script.
- `dba_instructions` and the ER diagram are combined into the system prompt.
- The user prompt asks questions about the data.

In [5]:
# Build the prompt template

chinookDB_ER_from_github = get_ChinookDB_ER_from_github("https://raw.githubusercontent.com/cooolbabu/GoogleGemini101/main/OpenAI/ChinookDB_ER.txt")
# Markdown(chinookDB_ER_from_github) ## uncomment to view the file from GitHub

## Build the prompt
dba_instructions="""
1. You are a Data Architect for ChinookDB.
2. An entity-relationship diagram for ChinookDB, in the form of a Mermaid.js script, is provided to you.
3. You will use this ER diagram to deliver SQL queries.
4. Database dialect is PostgreSQL
5. Use quotes for database object names in SQL statements

"""

dba_system_message = SystemMessage(content=dba_instructions + chinookDB_ER_from_github)

prompt = ChatPromptTemplate.from_messages([
    dba_system_message,
    ("user", "{input}")
])


## Let's run some chains
- Setup the chain
- Ask questions

In [6]:
# Setup the chain

llm = ChatAnthropic(model="claude-3-sonnet-20240229", temperature=0, max_tokens=1024, anthropic_api_key=claude_api_key)
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True, top_k=200)


In [8]:
result = db_chain.invoke("For the Genre Rock get total sales by quarter in the year 2023")



> Entering new SQLDatabaseChain chain...
For the Genre Rock get total sales by quarter in the year 2023
SQLQuery:SQLQuery:
SELECT
    EXTRACT(QUARTER FROM "InvoiceDate") AS "Quarter",
    SUM("InvoiceLine"."UnitPrice" * "InvoiceLine"."Quantity") AS "Total Sales"
FROM
    "Invoice"
    JOIN "InvoiceLine" ON "Invoice"."InvoiceId" = "InvoiceLine"."InvoiceId"
    JOIN "Track" ON "InvoiceLine"."TrackId" = "Track"."TrackId"
    JOIN "Genre" ON "Track"."GenreId" = "Genre"."GenreId"
WHERE
    "Genre"."Name" = 'Rock'
    AND DATE_PART('year', "InvoiceDate") = 2023
GROUP BY
    EXTRACT(QUARTER FROM "InvoiceDate")
ORDER BY
    "Quarter";
SQLResult: [(Decimal('1'), Decimal('58.41')), (Decimal('2'), Decimal('35.64')), (Decimal('3'), Decimal('32.67')), (Decimal('4'), Decimal('29.70'))]
Answer:The total sales for the Rock genre by quarter in the year 2023 are:

Quarter 1: $58.41
Quarter 2: $35.64
Quarter 3: $32.67
Quarter 4: $29.70
> F

### In the above result, notice that the Rock genre had its highest sales in Q1 2023
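
The observation can be confirmed without a model by ranking the rows directly. This is a minimal sketch that copies the `(quarter, total)` tuples from the SQLResult above; the helper name `rank_quarters` is made up for illustration.

```python
from decimal import Decimal

# Rows copied from the SQLResult above: (quarter, total_sales) for Rock in 2023.
rock_2023 = [
    (Decimal("1"), Decimal("58.41")),
    (Decimal("2"), Decimal("35.64")),
    (Decimal("3"), Decimal("32.67")),
    (Decimal("4"), Decimal("29.70")),
]

def rank_quarters(rows):
    """Return (quarter, total) pairs sorted from highest to lowest total."""
    pairs = ((int(quarter), total) for quarter, total in rows)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

ranking = rank_quarters(rock_2023)
peak_quarter, peak_total = ranking[0]
print(f"Peak quarter: Q{peak_quarter} (${peak_total})")
```

For these rows the ranking puts Q1 first, followed by Q2, Q3, and Q4.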

In [9]:
result = db_chain.invoke("Get total sales by quarter for years 2023 and 2024 for each genre. Analyse the results and give me a summary.")



> Entering new SQLDatabaseChain chain...
Get total sales by quarter for years 2023 and 2024 for each genre. Analyse the results and give me a summary.
SQLQuery:SQLQuery:
SELECT
    g."Name" AS "Genre",
    EXTRACT(YEAR FROM i."InvoiceDate") AS "Year",
    EXTRACT(QUARTER FROM i."InvoiceDate") AS "Quarter",
    SUM(il."UnitPrice" * il."Quantity") AS "Total Sales"
FROM
    "Genre" g
    JOIN "Track" t ON g."GenreId" = t."GenreId"
    JOIN "InvoiceLine" il ON t."TrackId" = il."TrackId"
    JOIN "Invoice" i ON il."InvoiceId" = i."InvoiceId"
WHERE
    EXTRACT(YEAR FROM i."InvoiceDate") IN (2023, 2024)
GROUP BY
    g."Name", EXTRACT(YEAR FROM i."InvoiceDate"), EXTRACT(QUARTER FROM i."InvoiceDate")
ORDER BY
    "Year", "Quarter", "Genre"
LIMIT 200;
SQLResult: [('Alternative & Punk', Decimal('2023'), Decimal('1'), Decimal('20.79')), ('Blues', Decimal('2023'), Decimal('1'), Decimal('7.92')), ('Hip Hop/Rap', Decimal('2023'), Decimal('1'), Decimal('0.99')),

In [None]:
print(result)

{'query': 'Get total sales by quarter for years 2023 and 2024 for each genre. Analyse the results and give me a summary', 'result': "The query provides a breakdown of total sales by genre, year, and quarter for the years 2023 and 2024. Here's a summary of the results:\n\nIn 2023:\n- The Rock genre had the highest sales in all four quarters, with the highest sales in Q2 ($35.64) and Q3 ($32.67).\n- The Latin genre also had consistently high sales across all quarters, ranging from $11.88 in Q1 to $24.75 in Q3.\n- Other genres with notable sales include Alternative & Punk, Metal, and TV Shows.\n\nIn 2024:\n- The Rock genre again had the highest sales in Q1 ($43.56), Q2 ($48.51), and Q3 ($51.48).\n- The Metal genre had high sales in Q1 ($43.56), while the Latin genre had consistent sales across all quarters.\n- Genres like TV Shows, Comedy, Drama, and Sci Fi & Fantasy also had notable sales in certain quarters.\n\nOverall, the results show that the Rock and Latin genres were the top-sellin
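
The summary says Rock's highest 2023 sales were in Q2 and Q3, while the earlier single-genre query shows Q1 at $58.41. One way to catch this kind of slip is to compute the maxima deterministically from the rows and compare them with the model's text. The sketch below is an illustration, not part of the original chain: the function names are hypothetical, and the rows are limited to the three visible in the truncated SQLResult plus the Rock rows from the earlier query, assuming both queries return the same Rock totals.

```python
from decimal import Decimal

# Rows shaped like the SQLResult: (genre, year, quarter, total_sales).
# Only the rows visible in this notebook are used; the rest are truncated.
rows = [
    ("Alternative & Punk", Decimal("2023"), Decimal("1"), Decimal("20.79")),
    ("Blues", Decimal("2023"), Decimal("1"), Decimal("7.92")),
    ("Hip Hop/Rap", Decimal("2023"), Decimal("1"), Decimal("0.99")),
    ("Rock", Decimal("2023"), Decimal("1"), Decimal("58.41")),
    ("Rock", Decimal("2023"), Decimal("2"), Decimal("35.64")),
    ("Rock", Decimal("2023"), Decimal("3"), Decimal("32.67")),
    ("Rock", Decimal("2023"), Decimal("4"), Decimal("29.70")),
]

def peak_quarter_per_genre(rows):
    """Map (genre, year) to the (quarter, total) with the highest total."""
    best = {}
    for genre, year, quarter, total in rows:
        key = (genre, int(year))
        if key not in best or total > best[key][1]:
            best[key] = (int(quarter), total)
    return best

def top_genre_per_quarter(rows):
    """Map (year, quarter) to the (genre, total) with the highest total."""
    best = {}
    for genre, year, quarter, total in rows:
        key = (int(year), int(quarter))
        if key not in best or total > best[key][1]:
            best[key] = (genre, total)
    return best

peaks = peak_quarter_per_genre(rows)
leaders = top_genre_per_quarter(rows)
print("Rock 2023 peak:", peaks[("Rock", 2023)])
print("2023 Q1 leader:", leaders[(2023, 1)])
```

On these rows, Rock leads 2023 Q1, which agrees with the summary, but its peak quarter is Q1 rather than Q2 or Q3. Facts computed this way could be checked against the summary or supplied to the model alongside the raw rows.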

# So what do you think? What is the toughest query you can come up with? Post it on the Applied AI Slack channel.