# Best-practices for prompt engineering Text-to-SQL on Llama3.1
---

## Introduction

In this notebook, we will set up two different database products:
 - RDS for MySQL
 - RDS for PostgreSQL

In each of the above, we will create a single database with two tables and some sample data.

## Contents

1. [Getting Started](#Getting-Started)
    + [Install Dependencies](#Step-0-Install-Dependencies)
    + [Setup Database](#Step-1-Set-up-database)
    + [Build Database](#Step-2-Build-Database)
    + [Cleanup Resources](#Step-3-Cleanup-Resources)

---

## Pre-requisites:

1. Use one of the following kernels: `conda_python3`, `conda_pytorch_p310`, or `conda_tensorflow2_p310`.
2. Install the required packages.

## Getting Started

### Step 0 Install Dependencies

Here, we will install all the required dependencies to run this notebook.

In [1]:
!pip install boto3==1.34.127 -qU --force --quiet --no-warn-conflicts
!pip install mysql-connector-python==8.4.0 -qU --force --quiet --no-warn-conflicts
!pip install psycopg2==2.9.9 -qU --force --quiet --no-warn-conflicts

**Note:** *When installing libraries with pip, you may see errors or warnings during installation. These are generally not critical and can be safely ignored. After installation, restart the kernel so that the newly installed libraries are loaded and available to your code.*

<div class='alert alert-block alert-info'><b>NOTE:</b> Restart the kernel with the updated packages that are installed through the dependencies above</div>

In [None]:
# Restart the kernel
import os
os._exit(00)
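
After the restart, it can be useful to confirm that the pinned versions are the ones actually loaded. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook, and the expected versions are copied from the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the install cell above.
EXPECTED = {
    "boto3": "1.34.127",
    "mysql-connector-python": "8.4.0",
    "psycopg2": "2.9.9",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report each package next to the version the notebook expects.
for name, actual in installed_versions(EXPECTED).items():
    status = "ok" if actual == EXPECTED[name] else "MISMATCH"
    print(f"{name}: expected {EXPECTED[name]}, found {actual} [{status}]")
```

A `MISMATCH` line usually means the kernel was not restarted or the install ran in a different environment.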

#### Import the required modules to run the notebook

In [1]:
import boto3
import json
import mysql.connector as MySQLdb
from typing import Dict, List, Any
import yaml
import psycopg2 as PGdb
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT


### Step 1 Set up database

Here, we retrieve the resources already deployed by the CloudFormation template so they can be used to build the application. These include:

+ Secret ARN with RDS for MySQL and RDS for PostgreSQL Database credentials
+ Database Endpoints

In [2]:
stackname = "text2sql"  # If your stack name differs from "text2sql", please modify.

In [3]:
cfn = boto3.client('cloudformation')

response = cfn.describe_stack_resources(
    StackName=stackname
)
cfn_outputs = cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']

# Get rds secret arn and database endpoint from cloudformation outputs
for output in cfn_outputs:
    if 'SecretArnMySQL' in output['OutputKey']:
        mySQL_secret_id = output['OutputValue']

    if 'DatabaseEndpointMySQL' in output['OutputKey']:
        mySQL_db_host = output['OutputValue']

    if 'SecretArnPG' in output['OutputKey']:
        pg_secret_id = output['OutputValue']

    if 'DatabaseEndpointPG' in output['OutputKey']:
        pg_db_host = output['OutputValue']
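
If an expected output is missing from the stack, the loop above leaves the corresponding variable undefined and the failure only appears later as a `NameError`. One possible alternative is a small lookup helper that fails immediately. This is a sketch under assumptions: `find_output` is a hypothetical helper, and the sample values are made up to mimic the shape of `describe_stacks(...)['Stacks'][0]['Outputs']`.

```python
def find_output(outputs, key_fragment):
    """Return the OutputValue of the first output whose OutputKey contains key_fragment."""
    for output in outputs:
        if key_fragment in output["OutputKey"]:
            return output["OutputValue"]
    raise KeyError(f"No CloudFormation output key contains '{key_fragment}'")


# Made-up sample in the shape of the CloudFormation Outputs list.
sample_outputs = [
    {"OutputKey": "SecretArnMySQL", "OutputValue": "arn:example:mysql-secret"},
    {"OutputKey": "DatabaseEndpointMySQL", "OutputValue": "mysql.example.internal"},
]

print(find_output(sample_outputs, "DatabaseEndpointMySQL"))  # mysql.example.internal
```

In the notebook this would be called as, for example, `mySQL_db_host = find_output(cfn_outputs, 'DatabaseEndpointMySQL')`.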

In [4]:
secrets_client = boto3.client('secretsmanager')

# Get MySQL credentials from Secrets Manager
credentials = json.loads(secrets_client.get_secret_value(SecretId=mySQL_secret_id)['SecretString'])

# Get password and username from secrets
mySQL_db_password = credentials['password']
mySQL_db_user = credentials['username']
mySQL_db_name = "airline_db"

# Get PostgreSQL credentials from Secrets Manager
credentials = json.loads(secrets_client.get_secret_value(SecretId=pg_secret_id)['SecretString'])

# Get password and username from secrets
pg_db_password = credentials['password']
pg_db_user = credentials['username']
pg_db_name = "airline_db"


##### Establish the database connection (MySQL DB)

In [5]:
mySQL_db_conn = MySQLdb.connect(
    host=mySQL_db_host,
    user=mySQL_db_user,
    password=mySQL_db_password
)


##### Check connection (MySQL DB)

In [6]:
mySQL_db_cursor = mySQL_db_conn.cursor()

mySQL_db_cursor.execute("SHOW DATABASES")

for tmp_db_name in mySQL_db_cursor:
    print(tmp_db_name)
    

('information_schema',)
('mysql',)
('performance_schema',)
('sys',)


##### Establish the database connection (PostgreSQL DB)

In [7]:
# PostgreSQL DB Setup
pg_db_conn = PGdb.connect(
    host=pg_db_host,
    user=pg_db_user,
    password=pg_db_password
)

# In this experiment we create the PostgreSQL database programmatically.
# CREATE DATABASE cannot run inside a transaction block, so autocommit
# is enabled here to avoid transaction errors.
pg_db_conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)


##### Check connection (PostgreSQL DB)

In [8]:
pg_db_cursor = pg_db_conn.cursor()

pg_db_cursor.execute("SELECT datname FROM pg_database;")

for tmp_db_name in pg_db_cursor:
    print(tmp_db_name)
    

('template0',)
('template1',)
('postgres',)
('rdsadmin',)


### Step 2 Build Database
The notebook first drops the test tables and the test database if they exist, then recreates the database and its tables.
Finally, it inserts test data for use in our prompting examples.

#### Load table schema settings

In [9]:
def load_settings(file_path):
    """
    Reads a YAML file and returns its contents as a Python object.

    Args:
        file_path (str): The path to the YAML file.

    Returns:
        obj: The contents of the YAML file as a Python object.
    """
    try:
        with open(file_path, 'r') as file:
            data = yaml.safe_load(file)
        return data
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' does not exist.")
    except yaml.YAMLError as exc:
        print(f"Error: Failed to parse the YAML file '{file_path}': {exc}")
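
Note that `load_settings` prints a message and implicitly returns `None` when the file is missing or cannot be parsed, so a later lookup such as `settings['table_name']` would fail with a less obvious `TypeError`. A possible guard is sketched below; `require_keys` is a hypothetical helper, and the example dictionary uses a placeholder schema string rather than the real contents of the YAML files.

```python
def require_keys(settings, required, source):
    """Raise a clear error if settings is not a mapping or lacks a required key."""
    if not isinstance(settings, dict):
        raise ValueError(f"{source}: expected a mapping, got {type(settings).__name__}")
    missing = [key for key in required if key not in settings]
    if missing:
        raise KeyError(f"{source}: missing keys {missing}")
    return settings


# Placeholder values; the real ones come from the YAML schema files.
example = {"table_name": "airplanes", "table_schema": "<CREATE TABLE statement>"}
checked = require_keys(example, ["table_name", "table_schema"], "schemas/airplanes.yml")
print(checked["table_name"])  # airplanes
```

Wrapping each `load_settings(...)` call with this check surfaces a missing file or key at load time rather than during table creation.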
        

In [10]:
# MySQL Table Setup

# Load table settings
settings_airplanes = load_settings('schemas/airplanes.yml')
table_airplanes = settings_airplanes['table_name']
table_schema_airplanes = settings_airplanes['table_schema']

# Load table settings
settings_flights = load_settings('schemas/flights.yml')
table_flights = settings_flights['table_name']
table_schema_flights = settings_flights['table_schema']

# Load table settings
settings_airplane_flights = load_settings('schemas/airplanes-flights.yml')


In [11]:
# PostgreSQL Table Setup

# Load table settings
pg_settings_airplanes = load_settings('schemas/airplanes-pg.yml')
pg_table_airplanes = pg_settings_airplanes['table_name']
pg_table_schema_airplanes = pg_settings_airplanes['table_schema']

# Load table settings
pg_settings_flights = load_settings('schemas/flights-pg.yml')
pg_table_flights = pg_settings_flights['table_name']
pg_table_schema_flights = pg_settings_flights['table_schema']

# Load table settings
pg_settings_airplane_flights = load_settings('schemas/airplanes-flights-pg.yml')


#### Clean up database

In [12]:
# Delete flights' table
mySQL_db_cursor.execute(f"DROP TABLE IF EXISTS {mySQL_db_name}.{table_flights}")
# Delete airplanes' table
mySQL_db_cursor.execute(f"DROP TABLE IF EXISTS {mySQL_db_name}.{table_airplanes}")
# Delete database
mySQL_db_cursor.execute(f"DROP DATABASE IF EXISTS {mySQL_db_name}")


In [13]:
# PostgreSQL DB cleanup
# In PostgreSQL a `name.table` prefix refers to a schema, not a database,
# so the tables are dropped without a database qualifier.
# The connection above was opened without selecting `airline_db`;
# dropping the database below removes any tables inside it.
# Delete flights' table
pg_db_cursor.execute(f"DROP TABLE IF EXISTS {table_flights}")
# Delete airplanes' table
pg_db_cursor.execute(f"DROP TABLE IF EXISTS {table_airplanes}")
# Delete database
pg_db_cursor.execute(f"DROP DATABASE IF EXISTS {pg_db_name}")


#### Create database and tables 

##### MySQL DB

In [14]:
# Create database `airline_db` - MySQL DB
mySQL_db_cursor.execute(f"CREATE DATABASE {mySQL_db_name}")

# Create table to hold data on fictitious airplanes information called `airplanes`
mySQL_db_cursor.execute(table_schema_airplanes)

# Create table to hold data on fictitious flights information called `flights`
mySQL_db_cursor.execute(table_schema_flights)


##### PostgreSQL DB

In [15]:
# Create database `airline_db` - PostgreSQL DB
pg_db_cursor.execute(f"CREATE DATABASE {pg_db_name};")
pg_db_cursor.close()
pg_db_conn.close()

# Reconnect to the database.
pg_db_conn = PGdb.connect(
    host=pg_db_host,
    database=pg_db_name,
    user=pg_db_user,
    password=pg_db_password
)

# PostgreSQL DB Setup
pg_db_conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
pg_db_cursor = pg_db_conn.cursor()

# Create table to hold data on fictitious airplanes information called `airplanes`
pg_db_cursor.execute(pg_table_schema_airplanes)

# Create table to hold data on fictitious flights information called `flights`
pg_db_cursor.execute(pg_table_schema_flights)


#### Read sample data

In [16]:
# Read sample data for the airplanes' table
with open('sample_data/airplanes.json', 'r') as f:
    data_airplanes = json.load(f)
    

In [17]:
# Read sample data for the flights' table
with open('sample_data/flights.json', 'r') as f:
    data_flights = json.load(f)
    

#### Ingest sample data into database

##### MySQL DB

In [18]:
# Insert airplanes' data into MySQL database
for data in data_airplanes:
    sql = f"""
        INSERT INTO {mySQL_db_name}.{table_airplanes} 
        (Airplane_id, Producer, Type) 
        VALUES (
        {data['Airplane_id']},
        '{data['Producer']}',
        '{data['Type']}'
        )
        """
    mySQL_db_cursor.execute(sql)
mySQL_db_conn.commit()


In [19]:
# Insert flights' data into MySQL database
for data in data_flights:
    sql = f"""
        INSERT INTO {mySQL_db_name}.{table_flights}
        (Flight_number, Arrival_time, Arrival_date, Departure_time, Departure_date, Destination, Airplane_id) 
        VALUES (
        '{data['Flight_number']}',
        '{data['Arrival_time']}',
        '{data['Arrival_date']}',
        '{data['Departure_time']}',
        '{data['Departure_date']}',
        '{data['Destination']}',
        {data['Airplane_id']}
        )
        """
    mySQL_db_cursor.execute(sql)
mySQL_db_conn.commit()
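
The insert cells above build each statement by interpolating values into the SQL string, which works for this sample data but would break on a value containing an apostrophe. A common alternative is to pass the values as parameters. The sketch below assumes the `%s` placeholder style, which both `mysql.connector` and `psycopg2` accept; `build_airplane_insert` is a hypothetical helper and the sample record is made up.

```python
def build_airplane_insert(table, records):
    """Return (sql, params) for a parameterized multi-row insert."""
    # The table name is an identifier and cannot be bound as a parameter,
    # so it is still interpolated and must come from trusted configuration.
    sql = f"INSERT INTO {table} (Airplane_id, Producer, Type) VALUES (%s, %s, %s)"
    params = [(r["Airplane_id"], r["Producer"], r["Type"]) for r in records]
    return sql, params


# Made-up record with an apostrophe, which would break the quoted f-string version.
sample = [{"Airplane_id": 99, "Producer": "O'Neil Aero", "Type": "X1"}]
sql, params = build_airplane_insert("airline_db.airplanes", sample)
print(sql)
print(params)

# With a live connection (not executed here):
# mySQL_db_cursor.executemany(sql, params)
# mySQL_db_conn.commit()
```

The same `(sql, params)` pair can be passed to the PostgreSQL cursor, using the unqualified table name as in the PostgreSQL cells.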


##### PostgreSQL DB

In [20]:
# Insert airplanes' data into PG database
for data in data_airplanes:
    sql = f"""
        INSERT INTO {table_airplanes} 
        (Airplane_id, Producer, Type) 
        VALUES (
        {data['Airplane_id']},
        '{data['Producer']}',
        '{data['Type']}'
        )
        """

    pg_db_cursor.execute(sql)
pg_db_conn.commit()


In [21]:
# Insert flights' data into PG database
for data in data_flights:
    sql = f"""
        INSERT INTO {table_flights}
        (Flight_number, Arrival_time, Arrival_date, Departure_time, Departure_date, Destination, Airplane_id) 
        VALUES (
        '{data['Flight_number']}',
        '{data['Arrival_time']}',
        '{data['Arrival_date']}',
        '{data['Departure_time']}',
        '{data['Departure_date']}',
        '{data['Destination']}',
        {data['Airplane_id']}
        )
        """
    pg_db_cursor.execute(sql)
pg_db_conn.commit()
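
Before printing the rows, a quick count can confirm that every sample record was inserted. The sketch below relies only on the standard DB-API cursor methods `execute` and `fetchone`, so it should work with either cursor in this notebook; `count_rows` is a hypothetical helper, and an in-memory SQLite database stands in for RDS so the example runs without a live connection.

```python
import sqlite3


def count_rows(cursor, table):
    """Return the number of rows in `table` using a DB-API 2.0 cursor."""
    # The table name is interpolated, so it must come from trusted configuration.
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]


# Stand-in demo: SQLite replaces the RDS databases for illustration only.
demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute("CREATE TABLE airplanes (Airplane_id INTEGER, Producer TEXT, Type TEXT)")
demo_cursor.executemany(
    "INSERT INTO airplanes VALUES (?, ?, ?)",
    [(1, "Boeing", "737"), (2, "Airbus", "A320")],
)
print(count_rows(demo_cursor, "airplanes"))  # 2
demo_conn.close()
```

In the notebook, the check could read `assert count_rows(pg_db_cursor, table_airplanes) == len(data_airplanes)`.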


#### Verify our database connection works and we can retrieve records from our table.

##### MySQL DB

In [22]:
# MySQL Database
mySQL_db_cursor.execute(f"SELECT * FROM {mySQL_db_name}.{table_airplanes}")
sql_data = mySQL_db_cursor.fetchall()

for record in sql_data:
    print(record)
    

(1, 'Boeing', '737')
(2, 'Airbus', 'A320')
(3, 'Embraer', 'E195')
(4, 'Bombardier', 'CRJ900')
(5, 'Boeing', '777')
(6, 'Airbus', 'A330')
(7, 'Embraer', 'E175')
(8, 'Bombardier', 'Q400')
(9, 'Boeing', '787')
(10, 'Airbus', 'A350')
(11, 'Embraer', 'E190')
(12, 'Bombardier', 'CRJ700')
(13, 'Boeing', '757')
(14, 'Airbus', 'A380')
(15, 'Embraer', 'E170')
(16, 'Bombardier', 'CRJ200')
(17, 'Boeing', '747')
(18, 'Airbus', 'A321')
(19, 'Embraer', 'E145')
(20, 'Bombardier', 'CRJ1000')


In [23]:
# MySQL Database
mySQL_db_cursor.execute(f"SELECT * FROM {mySQL_db_name}.{table_flights}")
sql_data = mySQL_db_cursor.fetchall()

for record in sql_data:
    print(record)
    

('AA123', '2023-06-15T10:30:00', '2023-06-15', '2023-06-15T08:00:00', '2023-06-15', 'Los Angeles', 1)
('AA234', '2023-07-02T21:15:00', '2023-07-02', '2023-07-02T18:30:00', '2023-07-02', 'Tampa', 20)
('AA890', '2023-06-24T18:40:00', '2023-06-24', '2023-06-24T16:10:00', '2023-06-24', 'Atlanta', 5)
('AS345', '2023-06-19T21:00:00', '2023-06-19', '2023-06-19T18:30:00', '2023-06-19', 'Seattle', 7)
('AS789', '2023-06-27T15:50:00', '2023-06-27', '2023-06-27T13:20:00', '2023-06-27', 'Phoenix', 7)
('DL123', '2023-06-25T22:00:00', '2023-06-25', '2023-06-25T19:30:00', '2023-06-25', 'Las Vegas', 10)
('DL345', '2023-06-29T07:30:00', '2023-06-29', '2023-06-29T05:00:00', '2023-06-29', 'Philadelphia', 6)
('DL567', '2023-07-03T09:40:00', '2023-07-03', '2023-07-03T07:10:00', '2023-07-03', 'San Diego', 19)
('DL789', '2023-06-17T18:20:00', '2023-06-17', '2023-06-17T16:00:00', '2023-06-17', 'Miami', 10)
('DL901', '2023-06-21T13:20:00', '2023-06-21', '2023-06-21T10:50:00', '2023-06-21', 'Boston', 6)
('JB012'

##### PostgreSQL DB

In [24]:
# PostgreSQL Database
pg_db_cursor.execute(f"SELECT * FROM {table_airplanes}")
sql_data = pg_db_cursor.fetchall()

for record in sql_data:
    print(record)
    

(1, 'Boeing', '737')
(2, 'Airbus', 'A320')
(3, 'Embraer', 'E195')
(4, 'Bombardier', 'CRJ900')
(5, 'Boeing', '777')
(6, 'Airbus', 'A330')
(7, 'Embraer', 'E175')
(8, 'Bombardier', 'Q400')
(9, 'Boeing', '787')
(10, 'Airbus', 'A350')
(11, 'Embraer', 'E190')
(12, 'Bombardier', 'CRJ700')
(13, 'Boeing', '757')
(14, 'Airbus', 'A380')
(15, 'Embraer', 'E170')
(16, 'Bombardier', 'CRJ200')
(17, 'Boeing', '747')
(18, 'Airbus', 'A321')
(19, 'Embraer', 'E145')
(20, 'Bombardier', 'CRJ1000')


In [25]:
# PostgreSQL Database
pg_db_cursor.execute(f"SELECT * FROM {table_flights}")
sql_data = pg_db_cursor.fetchall()

for record in sql_data:
    print(record)
    

('AA123', '2023-06-15T10:30:00', '2023-06-15', '2023-06-15T08:00:00', '2023-06-15', 'Los Angeles', 1)
('UA456', '2023-06-16T14:45:00', '2023-06-16', '2023-06-16T12:15:00', '2023-06-16', 'New York', 6)
('DL789', '2023-06-17T18:20:00', '2023-06-17', '2023-06-17T16:00:00', '2023-06-17', 'Miami', 10)
('WN012', '2023-06-18T11:10:00', '2023-06-18', '2023-06-18T09:30:00', '2023-06-18', 'Chicago', 2)
('AS345', '2023-06-19T21:00:00', '2023-06-19', '2023-06-19T18:30:00', '2023-06-19', 'Seattle', 7)
('JB678', '2023-06-20T07:45:00', '2023-06-20', '2023-06-20T05:15:00', '2023-06-20', 'San Francisco', 8)
('DL901', '2023-06-21T13:20:00', '2023-06-21', '2023-06-21T10:50:00', '2023-06-21', 'Boston', 6)
('SW234', '2023-06-22T16:35:00', '2023-06-22', '2023-06-22T14:05:00', '2023-06-22', 'Dallas', 3)
('UA567', '2023-06-23T09:15:00', '2023-06-23', '2023-06-23T06:45:00', '2023-06-23', 'Houston', 9)
('AA890', '2023-06-24T18:40:00', '2023-06-24', '2023-06-24T16:10:00', '2023-06-24', 'Atlanta', 5)
('DL123', '2

### Conclusion

Both RDS instances (MySQL and PostgreSQL) now contain an `airline_db` database with the `airplanes` and `flights` tables populated from the sample data. The queries above confirm that the records can be read back from each database.

### Step 3 Cleanup Resources

In [26]:
%%time
# Cleanup Cursor and connection objects.
mySQL_db_cursor.close()
mySQL_db_conn.close()

pg_db_cursor.close()
pg_db_conn.close()


CPU times: user 2.77 ms, sys: 0 ns, total: 2.77 ms
Wall time: 2.24 ms
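
If an earlier cell raises an exception, the explicit `close()` calls above are never reached. One way to guarantee cleanup is `contextlib.closing`, which calls `close()` on exit for any object that has that method. This is a sketch only: SQLite stands in for the RDS connections so it runs anywhere. Note that with `psycopg2`, a plain `with connection:` block manages the transaction but does not close the connection, which is why `closing` is used here.

```python
import sqlite3
from contextlib import closing

# SQLite stands in for the RDS connections so the sketch runs without AWS access.
with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cursor:
        cursor.execute("SELECT 1")
        result = cursor.fetchone()

# Both the cursor and the connection are closed at this point.
print(result)  # (1,)
```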


#### Thank you!
In this part, we set up the MySQL and PostgreSQL databases.