# LangChain Demo - Querying SQL data


The most common type of data in the world sits in tabular form (ok, ok, besides unstructured data). It is super powerful to be able to query this data with LangChain and pass it through to an LLM.


For further reading, check out "Agents + Tabular Data" (Pandas, SQL, CSV).

**Resources**
> - https://python.langchain.com/en/latest/use_cases/tabular.html
> - https://python.langchain.com/docs/modules/chains/popular/sqlite.html
 
<div class="alert alert-block alert-warning"> TODO 
    
SQL Agent, 
Pandas Agent, 
CSV Agent

</div>


## This notebook

This notebook collects Python examples using SQL.


This notebook was tested in June 2023 on AWS SageMaker using the Data Science 3.0 image.

Test environment:
> - AWS SageMaker Studio's notebook 
>> - Kernel image Data Science 3.0
>> - t3.medium 2CPU - 4GB
>> - Python 3.9.15
>> - Linux default 4.14.304-226.531.amzn2.x86_64

More information about LangChain and dedicated examples can be found in the other notebooks of the same folder.

---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">NOTEBOOK SETUP</div>



**Instructions**

Every setup step sits at the top of the notebook, so running this section initializes the whole notebook.

Notebook chapters do not depend on each other and may be run in isolation.

Before running the setup, you may need to create the following resources:
- Request an OpenAI API key. OpenAI APIs are not free.

Additional requirements for some examples:
- Request a Kaggle API key.

Refer to the setup sections for instructions on how to create these resources.

---
## API keys and environment

LangChain reads the API keys from environment variables or from function parameters.

**Instructions**

- Never show the keys in shared notebooks, whether as part of the code or in a log. A simple way to avoid leaking a key is to use environment variables. If you set the environment variable in the terminal or in some local configuration, you do not have to set the key here.

- If it is easier for you to set the key here by assigning the value, do not forget to empty the string right after you run this block. The environment variable stays in memory as long as the kernel runs.

- Be careful when printing the keys. Make sure you remove the outputs.

- Before sharing, check that the keys are not printed by some feature of the libraries. Avoid printing library objects: they often hold the API key as a property and may disclose its value.
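
A minimal sketch of the "be careful when printing" advice, assuming a hypothetical helper named `mask_secret` (it is not part of LangChain or the OpenAI package). It shows only the last few characters of a key, so a forgotten cell output does not disclose the full value.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return the value with all but the last `visible` characters hidden."""
    if len(value) <= visible:
        # Too short to reveal anything safely: hide it entirely.
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Hypothetical usage: confirm that a key is set without displaying it.
api_key = os.environ.get("OPENAI_API_KEY", "")
print("OPENAI_API_KEY:", mask_secret(api_key) if api_key else "<not set>")
```

This only protects the values you print yourself; library objects printed as a whole can still expose the key.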


I store API keys and configuration information in AWS Secrets Manager. The code below retrieves the secret holding the keys. The secret is a JSON string consisting of key/value pairs. It is used later to set various environment variables.

When using notebooks on SageMaker, do not forget to grant the SageMaker execution role permission to read this secret.

In [2]:
!apt-get update && apt-get install -y jq 1>/dev/null

Get:1 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Hit:2 http://deb.debian.org/debian bullseye InRelease     
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [252 kB]
Fetched 344 kB in 0s (910 kB/s)   
Reading package lists... Done


In [3]:
%%bash --out secrets 
# Use AWS Secrets Manager to store the keys.
# Grab the keys and store them in a Python variable (via --out secrets).
export RESPONSE=$(aws secretsmanager get-secret-value --secret-id 'salvia/labbench/tests')
export SECRETS=$(echo "$RESPONSE" | jq '.SecretString | fromjson')

echo $SECRETS
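
The `secrets` variable captured by the cell above holds a JSON string. A minimal sketch of copying selected entries into environment variables follows. The helper name `export_secrets` and the `DEMO_*` names are hypothetical, and the payload is a placeholder rather than a real secret.

```python
import json
import os


def export_secrets(secret_json: str, names: list) -> list:
    """Copy the named entries of a JSON secret into os.environ.

    Returns the names that were missing from the secret.
    """
    values = json.loads(secret_json)  # parse as data; never execute the text
    missing = []
    for name in names:
        if name in values:
            os.environ[name] = str(values[name])
        else:
            missing.append(name)
    return missing


# Placeholder payload standing in for the captured `secrets` string.
example = '{"DEMO_API_KEY": "placeholder-value"}'
print(export_secrets(example, ["DEMO_API_KEY", "DEMO_OTHER_KEY"]))
```

Returning the missing names, rather than raising on the first one, makes it easy to see which entries still have to be added to the secret.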

---
## pip upgrade and C++ compiler

In [4]:
!pip install --upgrade pip  1>/dev/null

In [5]:
!apt-get update && apt-get install -y build-essential 1>/dev/null

Hit:1 http://security.debian.org/debian-security bullseye-security InRelease
Hit:2 http://deb.debian.org/debian bullseye InRelease
Hit:3 http://deb.debian.org/debian bullseye-updates InRelease
Reading package lists... Done


---
## LangChain Setup

**Resources**
> - [LangChain GetStarted](https://python.langchain.com/docs/get_started/quickstart)

In [6]:
!pip install langchain==0.0.230 1>/dev/null

---
## OpenAI Setup

**Resources**
> - [OpenAI tutorial on API keys](https://platform.openai.com/docs/quickstart)
> - [OpenAI package on Pypi](https://pypi.org/project/openai/)

In [7]:
import json
import os

# Parse the JSON secret as data instead of evaluating it as Python code
os.environ["OPENAI_API_KEY"] = json.loads(secrets)["OPENAI_API_KEY"]


In [8]:
!pip install openai==0.27.8 1>/dev/null

---
## SQL database setup

- sqlite3: db engine
- sqlalchemy: ORM for databases
- ipython-sql: SQL magic function
- pandas:  data science/data analysis

In [9]:
!pip install pysqlite3==0.5.1 1>/dev/null

In [10]:
!pip install pandas==1.4.4 1>/dev/null

In [11]:
!pip install sqlalchemy==2.0.18 1>/dev/null

In [12]:
!pip install ipython-sql==0.5.0 1>/dev/null

---
## Setup additional datasets tools
<div class="alert alert-block alert-warning"> 
    TODO <br>
</div>

Kaggle is used to get some datasets.

Set up the following API keys:
- os.environ['KAGGLE_USERNAME'] = 'YOUR_USERNAME'
- os.environ['KAGGLE_KEY'] = 'YOUR_KEY'

<br/>

**Resources**

> - https://lindevs.com/set-up-kaggle-api



In [13]:
import json

# Read the Kaggle credentials from the secret retrieved earlier
os.environ["KAGGLE_USERNAME"] = json.loads(secrets)["KAGGLE_USERNAME"]
os.environ["KAGGLE_KEY"] = json.loads(secrets)["KAGGLE_KEY"]

In [14]:
!pip install kaggle==1.5.15 1>/dev/null

---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">MOVIES DATABASE SETUP</div>


In [15]:
!mkdir -p work/sql/{data,sqlite}

In [17]:
notebook_folder = "work/sql"
data_folder = f"{notebook_folder}/data"
sqlite_folder = f"{notebook_folder}/sqlite"

print(data_folder)

work/sql/data


Sample datasets

https://scikit-learn.org/stable/datasets/toy_dataset.html

https://github.com/mwaskom/seaborn-data

https://www.kaggle.com/datasets/


## The movies dataset

IMDB Movies dataset from Kaggle
> - https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

**Resources**
> - https://www.kaggle.com/docs/datasets

In [18]:
import os

db_name = "work/sql/sqlite/movies.db"
database_url = f"sqlite:///{db_name}"
os.environ["DATABASE_URL"] = database_url

table_name = "movies"

## Download the dataset

Requires that kaggle is installed and the API keys are set up. Check the first part of the notebook if need be.

**Resources**
> - https://lindevs.com/set-up-kaggle-api

In [26]:
!kaggle datasets download -d harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows -p work/sql/data

Downloading imdb-dataset-of-top-1000-movies-and-tv-shows.zip to work/sql/data
100%|█████████████████████████████████████████| 175k/175k [00:00<00:00, 753kB/s]
100%|█████████████████████████████████████████| 175k/175k [00:00<00:00, 659kB/s]


In [27]:
!unzip work/sql/data/imdb-dataset-of-top-1000-movies-and-tv-shows.zip -d work/sql/data

Archive:  work/sql/data/imdb-dataset-of-top-1000-movies-and-tv-shows.zip
  inflating: work/sql/data/imdb_top_1000.csv  


## Create the movies database

Create the database with the `%sql` magic, which reads the `DATABASE_URL` environment variable.

Do not forget to close the `%sql` connection. Otherwise the database will not be readable by pandas.

You may have to restart the kernel for the `%sql` magic to work.

Valid SQLite URL forms are:
 sqlite:///:memory: (or, sqlite://)
 sqlite:///relative/path/to/file.db
 sqlite:////absolute/path/to/file.db
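
A minimal sketch of how these forms relate to a file path, assuming POSIX-style paths. The helper name `sqlite_url` is hypothetical and uses only the standard library, not SQLAlchemy.

```python
from pathlib import PurePosixPath
from typing import Optional


def sqlite_url(path: Optional[str] = None) -> str:
    """Build a SQLite URL of the forms listed above for a POSIX path."""
    if path is None:
        return "sqlite:///:memory:"
    # "sqlite:///" + "relative/path" yields three slashes;
    # "sqlite:///" + "/absolute/path" yields four.
    return "sqlite:///" + str(PurePosixPath(path))


print(sqlite_url("work/sql/sqlite/movies.db"))
print(sqlite_url("/absolute/path/to/file.db"))
print(sqlite_url())
```

The fourth slash in the absolute form is simply the leading `/` of the path itself.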

`%sql` requires sqlite, sqlalchemy and ipython-sql, which are installed at the beginning of the notebook.

In [31]:
from sqlalchemy.engine import create_engine
import sqlite3
from pandas.io import sql
import subprocess
import os
%load_ext sql


The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [32]:
%sql

In [33]:
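# List the open %sql connections and close the one pointing at the movies database
conn = %sql --connections

print(conn)

if database_url in conn:
    conn[database_url].close(database_url)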

{'sqlite:///work/sql/sqlite/movies.db': <sql.connection.Connection object at 0x7f22cf990e20>}


## Load data from the csv file

Load the CSV file into the `movies` table. `to_sql` returns the number of rows inserted.

In [34]:
import sqlite3

import pandas

csv_file_name = "work/sql/data/imdb_top_1000.csv"

# Open the SQLite file directly; db_name is defined in the movies dataset section
conn = sqlite3.connect(db_name)
df_input = pandas.read_csv(csv_file_name)
df_input.to_sql(table_name, con=conn, if_exists='append', index=False)
conn.close()
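
Because the load above appends rows, running the cell twice duplicates the table content. Below is a minimal sketch of an idempotent load that uses only the standard library. It assumes an in-memory database and a two-row inline CSV as stand-ins, with an illustrative subset of the columns; the helper name `load_movies` is hypothetical.

```python
import csv
import io
import sqlite3

# Two inline rows standing in for imdb_top_1000.csv (illustrative column subset).
sample_csv = io.StringIO(
    "Series_Title,Released_Year,Director\n"
    "The Shawshank Redemption,1994,Frank Darabont\n"
    "The Godfather,1972,Francis Ford Coppola\n"
)


def load_movies(conn, csv_file):
    """Recreate the movies table from a CSV file object and return the row count."""
    rows = list(csv.DictReader(csv_file))
    conn.execute("DROP TABLE IF EXISTS movies")  # makes re-runs idempotent
    conn.execute(
        "CREATE TABLE movies (Series_Title TEXT, Released_Year TEXT, Director TEXT)"
    )
    conn.executemany(
        "INSERT INTO movies VALUES (:Series_Title, :Released_Year, :Director)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]


demo_conn = sqlite3.connect(":memory:")
print(load_movies(demo_conn, sample_csv))
demo_conn.close()
```

With pandas, passing `if_exists='replace'` to `to_sql` has a similar effect: the table is dropped and rebuilt on each run.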

## Check the data and close the connection

In [35]:
import pandas
import sqlite3

conn = sqlite3.connect(db_name)
df = pandas.read_sql('select * from movies', conn)  
display(df.head(2))
conn.close()

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |


In [36]:
print(df['Series_Title'])

0      The Shawshank Redemption
1                 The Godfather
2               The Dark Knight
3        The Godfather: Part II
4                  12 Angry Men
                 ...           
995      Breakfast at Tiffany's
996                       Giant
997       From Here to Eternity
998                    Lifeboat
999                The 39 Steps
Name: Series_Title, Length: 1000, dtype: object


In [None]:
conn.close()

---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">LANGCHAIN QUERYING MOVIES</div>



## Run SQLDatabaseChain against the database

How does it work?

The chain asks the LLM to generate a SQL query that corresponds to the question, then runs that query against the database.

Steps:

- Find which table to use
- Find which column to use
- Construct the correct sql query
- Execute that query
- Get the result
- Return a natural language response
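
The steps above can be sketched without LangChain or an API call. In this sketch, `generate_sql` and `phrase_answer` are hypothetical stand-ins for the two LLM calls and return canned text. The flow is illustrative and is not LangChain's actual implementation or prompt.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the movies database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (Series_Title TEXT, Director TEXT)")
    conn.executemany(
        "INSERT INTO movies VALUES (?, ?)",
        [("The Godfather", "Francis Ford Coppola"), ("12 Angry Men", "Sidney Lumet")],
    )
    return conn


def generate_sql(question: str, schema: str) -> str:
    """Hypothetical stand-in for the LLM call that writes the query."""
    return "SELECT COUNT(*) FROM movies;"


def phrase_answer(question: str, result: list) -> str:
    """Hypothetical stand-in for the LLM call that words the answer."""
    return f"There are {result[0][0]} movies."


def ask(conn: sqlite3.Connection, question: str) -> str:
    # Find tables and columns: read the schema that the model would be shown.
    rows = conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
    schema = "\n".join(row[0] for row in rows)
    # Construct the query, execute it, and get the result.
    sql = generate_sql(question, schema)
    result = conn.execute(sql).fetchall()
    # Return a natural language response.
    return phrase_answer(question, result)


demo_db = make_demo_db()
print(ask(demo_db, "How many movies are there?"))  # There are 2 movies.
demo_db.close()
```

Because the generated SQL is executed as-is, a database user with read-only permissions is a sensible precaution when the query comes from a model.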


In [37]:
from langchain.llms import OpenAI
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

# temperature=0 keeps the generated SQL as deterministic as possible
llm = OpenAI(temperature=0)

# Connect to the movies database created above
# Note: an in-memory database (":memory:") is not supported here
db = SQLDatabase.from_uri(database_url)

# Set up the chain; verbose=True prints the generated SQL and its result
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)





In [38]:
# query the chain
db_chain.run("How many movies are there?")



> Entering new  chain...
How many movies are there?
SQLQuery: SELECT COUNT(*) FROM movies;
SQLResult: [(1000,)]
Answer: There are 1000 movies.
> Finished chain.


'There are 1000 movies.'

In [39]:
# query the chain
# note that Director is a column of the movies table

db_chain.run("How many distinct directors are there?")



> Entering new  chain...
How many distinct directors are there?
SQLQuery:

Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 8.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 10.0 seconds as it raised RateLimitError: You exceeded your current quota, please ch

RateLimitError: You exceeded your current quota, please check your plan and billing details.

In [None]:
# query the chain
# note that Director is a column of the movies table

db_chain.run("Who is the director of The GodFather")

In [None]:
# query the chain

db_chain.run("What is the schema of the database")

In [None]:
# query the chain

db_chain.run("List 50 movies ranked by score")

## Check responses

In [None]:
# Close the %sql connection, if any, so that sqlite3 can open the database
conn = %sql --connections

print(conn)

if database_url in conn:
    conn[database_url].close(database_url)

Response evaluation: compute the expected answers directly in SQL and compare them with the chain's answers.

In [None]:
import sqlite3
import pandas as pd

# Connect to the SQLite database
connection = sqlite3.connect(db_name)

In [None]:
# Define your SQL query
query = "SELECT count(distinct Series_Title) FROM movies"

# Read the SQL query into a Pandas DataFrame
df = pd.read_sql_query(query, connection)


# Display the result in the first column first cell
print(df.iloc[0,0])

In [None]:
# Define your SQL query
query = "SELECT count(distinct Director) FROM movies"

# Read the SQL query into a Pandas DataFrame
df = pd.read_sql_query(query, connection)


# Display the result in the first column first cell
print(df.iloc[0,0])

In [None]:
# Close the connection
connection.close()
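
A minimal sketch of automating the comparison, assuming the chain's answer contains the figure as digits, as in 'There are 1000 movies.' The helper name `answer_matches` is hypothetical.

```python
import re


def answer_matches(answer: str, expected: int) -> bool:
    """Return True when `expected` appears as a whole number in the answer text."""
    # Drop thousands separators so "1,000" compares equal to 1000.
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", answer)]
    return expected in numbers


print(answer_matches("There are 1000 movies.", 1000))
print(answer_matches("There are 999 movies.", 1000))
```

In the notebook, `expected` would come from the `pd.read_sql_query` results above, and `answer` from `db_chain.run(...)`.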


---

In [None]:
print('DONE')