This assignment focuses on improving the latency and accuracy of a text-to-SQL engine when dealing with a large number of tables in a database schema. The tasks involve building a flow to streamline the process and comparing the accuracy of two approaches: directly querying the engine and querying after extracting relevant tables.
To create the database, follow these steps:
- Execute the SQL script `HRTech_db_schema_config.sql` to set up the PostgreSQL database.
- Make sure you have PostgreSQL installed and configured on your system.
Before running the code, ensure the following environment variables are set:
- `OPENAI_API_KEY`: API key required for using OpenAI services.
- PostgreSQL authentication details: `username`, `password`, `host`, `port`.
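As a minimal sketch, this configuration could be read in Python as shown below. The PostgreSQL environment variable names (`PG_USERNAME`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT`) are assumptions for illustration; the original code does not specify them.

```python
import os

def load_config():
    # OPENAI_API_KEY is required; the PostgreSQL variable names below
    # (PG_USERNAME, PG_PASSWORD, PG_HOST, PG_PORT) are assumed, not
    # taken from the original project.
    return {
        "openai_api_key": os.environ["OPENAI_API_KEY"],
        "pg": {
            "username": os.environ.get("PG_USERNAME", "postgres"),
            "password": os.environ.get("PG_PASSWORD", ""),
            "host": os.environ.get("PG_HOST", "localhost"),
            "port": int(os.environ.get("PG_PORT", "5432")),
        },
    }
```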
The `ideal20ques.csv` file contains a dataset of ideal question-query pairs. It includes three columns (`question`, `query`, and `db_name`) and 20 entries.
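A file with that layout can be loaded with the standard library; the helper below is a sketch (the function name is illustrative, not from the original code):

```python
import csv

def load_questions(path, num_questions=20):
    # Read up to num_questions rows from the question/query/db_name CSV.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return rows[:num_questions]
```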
To run the code, use the following command-line arguments:
```shell
python main.py --output_file <output_file_name> --questions_file <path_to_ideal20ques.csv> --num_questions <number_of_questions> --db_type <database_type>
```

- `output_file`: Name of the output file to store the results.
- `questions_file`: Path to the `ideal20ques.csv` file.
- `num_questions` (optional, default = 20): Number of questions to evaluate.
- `db_type` (optional, default = postgres): Type of database being used (e.g., postgres, mysql).
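A parser matching these flags could look like the following sketch, with defaults mirroring the documented ones (20 questions, postgres); the original `main.py` may differ in details:

```python
import argparse

def build_parser():
    # Flags and defaults follow the command-line interface described above.
    p = argparse.ArgumentParser(description="Evaluate text-to-SQL generation.")
    p.add_argument("--output_file", required=True,
                   help="Name of the output file to store the results.")
    p.add_argument("--questions_file", required=True,
                   help="Path to the ideal20ques.csv file.")
    p.add_argument("--num_questions", type=int, default=20,
                   help="Number of questions to evaluate.")
    p.add_argument("--db_type", default="postgres",
                   help="Type of database being used (e.g., postgres, mysql).")
    return p
```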
I utilized a script called `extract_tables.py` to extract relevant table schemas based on the user's question. This script employs an AI agent that identifies and extracts the table schemas relevant to the user's input from the full database schema.
The `extract_tables.py` script includes a class called `SchemaExtractorAgent`, which encapsulates the functionality for extracting table schemas. Here's a breakdown of its key components:
- `sys_prompt` method: Generates a system prompt instructing the AI agent on its task. It provides guidance on how to identify and extract relevant table schemas based on the user's question. The database schema configuration is included so the agent understands the available tables and columns.
- `tools` method: Defines a tool called `extract_relevant_schemas`, which specifies the function for extracting relevant table schemas. It describes the function's input parameters, such as the schemas extracted from the database.
- `extract_schemas` method: Initiates the extraction process based on the user's question. It uses OpenAI's GPT-3.5 model to generate responses: the system prompt and user question are provided as input messages, and the tool defined in the `tools` method is used to extract the relevant schemas from the database for the user's query.
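Based on that description, the `tools` method plausibly returns a definition in OpenAI's function-calling format. The sketch below is an assumption about its shape; the exact parameter schema in `extract_tables.py` may differ:

```python
def tools():
    # Hypothetical sketch of the tool definition, in OpenAI's
    # function-calling format. Parameter names are assumptions.
    return [
        {
            "type": "function",
            "function": {
                "name": "extract_relevant_schemas",
                "description": "Return the table schemas relevant to the user's question.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "schemas": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Schemas extracted from the database.",
                        }
                    },
                    "required": ["schemas"],
                },
            },
        }
    ]
```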
To extract relevant table schemas, the `extract_schemas` method is called with the user's question as input. The method interacts with the GPT-3.5 model to generate a response containing the extracted schemas. Additionally, it measures the latency of the extraction process.
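The latency measurement can be done with a simple wall-clock wrapper like the sketch below; `timed_call` is an illustrative helper (it could wrap `extract_schemas`), not part of the original script:

```python
import time

def timed_call(fn, *args, **kwargs):
    # Run fn and return its result together with elapsed wall-clock seconds.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency = time.perf_counter() - start
    return result, latency
```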
This approach streamlines the extraction of relevant table schemas, enhancing the efficiency and accuracy of the text-to-SQL engine.
The performance of both approaches was evaluated using the sql-eval framework. The results are summarized in the following image:
The detailed output, including the generated SQL queries and their respective evaluations, can be found in the `output.csv` file:
This CSV file contains the results for the 20 questions, providing insights into the accuracy and efficiency of the query generation process.
