
DTS-SQL with CodeQwen1.5: Text2SQL Implementation

This project fine-tunes CodeQwen1.5 following the two-stage DTS-SQL approach and runs inference with vLLM to implement a text-to-SQL generation pipeline. Here's how to set up and use it.

Installation

First, clone the repository from GitHub:

git clone https://github.com/MohammadrezaPourreza/DTS-SQL.git
cd DTS-SQL

Install the necessary dependencies:

pip install -U accelerate bitsandbytes peft transformers==4.39 datasets trl git-lfs wandb flash-attn sql-metadata scipy sqlglot

Note: Avoid `transformers==4.40`, as it leads to errors with CodeQwen; the command above pins version 4.39 for this reason.
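As a quick sanity check after installation, a snippet like the one below verifies the pinned transformers version (the exact 4.39.x patch release does not matter):

import transformers

# CodeQwen fine-tuning reportedly breaks with transformers 4.40,
# so fail fast if the wrong version is installed.
assert transformers.__version__.startswith("4.39"), transformers.__version__
print("transformers", transformers.__version__, "OK")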

Data Preparation

To set up the dataset for this project, follow these steps:

  1. Download the dataset from Google Drive.

  2. Unzip the downloaded file if it is compressed.

  3. Place the unzipped dataset folder into the spider directory at the root of this project.

This project utilizes the Spider 1.0 dataset, which can be found on the Spider dataset website.
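Once the data is in place, a quick check like the following can confirm the layout. The file names are those of the standard Spider 1.0 release; the exact paths the scripts expect may differ, so treat them as assumptions:

from pathlib import Path

# Standard Spider 1.0 release contents (assumed layout).
spider = Path("spider")
for name in ["train_spider.json", "dev.json", "tables.json", "database"]:
    print(name, "OK" if (spider / name).exists() else "MISSING")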

Alternatively, you can skip the script execution described below and directly download the pre-built datasets:

These links contain all the necessary files to get started with the project.

If you don't want to download them, you can prepare the training and validation datasets with the provided scripts. Please note that you may need to adjust the dataset paths in the scripts accordingly.

python finetuning_dataset_creator.py

The file with 'filtered' in its name contains the ideal results (table names extracted directly from the gold SQL answers).
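For illustration, here is a minimal sketch of how such ideal schema links can be derived with sql-metadata (one of the dependencies installed above); the actual extraction logic lives in the dataset-creation script and may differ:

from sql_metadata import Parser

# Parse the table names straight out of a gold SQL answer.
gold_sql = "SELECT name FROM singer WHERE age > 30"
print(Parser(gold_sql).tables)  # ['singer']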

Fine-tuning and Inference

Stage 1: Schema Linking

  1. Execute the Jupyter notebook for schema linking fine-tuning:
    jupyter notebook schema_linking_generation_finetuning.ipynb
  2. Run the inference notebook:
    jupyter notebook schema_linking_generation_inference.ipynb
    The output of this step is a trained model capable of extracting potentially relevant table names from a given question and its related database schema.
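For reference, below is a minimal sketch of loading the resulting adapter for schema-linking inference. The base-model name, adapter path, and prompt template are assumptions; the notebooks define the actual values:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/CodeQwen1.5-7B"          # assumed base checkpoint
adapter = "./final_checkpoint_part1"  # assumed LoRA adapter path

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter)

# Prompt = question plus its serialized database schema (see the notebook).
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))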

Stage 2: SQL Generation

  1. Train the SQL generation model:
    jupyter notebook sql_generation_finetuning.ipynb
  2. Perform inference to generate SQL queries:
    jupyter notebook sql_generation_inference.ipynb
    For an upper-bound result, replace the file `generated_test_schema_links.csv` with `./validation/filtered_spider_syn_dataset.csv` in the inference notebook.

The outputs of this step are the final checkpoint of the stage-1 LoRA layer (final_checkpoint_part1) and the stage-1 model's inference results on the test set. Additionally, data_processing_scheme2inference.py is integrated to generate the data file (useful_val_dataset1.csv) used for the final SQL-generation inference.
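As a sketch of how the stage-1 predictions feed stage 2, the snippet below builds SQL-generation prompts from the schema-link CSV. The column names ('question', 'tables') and the prompt template are assumptions; data_processing_scheme2inference.py defines the real logic:

import pandas as pd

links = pd.read_csv("generated_test_schema_links.csv")  # stage-1 predictions
for _, row in links.head(3).iterrows():
    # Hypothetical prompt template; see the inference notebook for the real one.
    prompt = f"-- Tables: {row['tables']}\n-- Question: {row['question']}\nSELECT"
    print(prompt)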

Batch Execution

Run the batch script to execute all notebooks automatically:

sh run.sh
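For reference, a rough Python equivalent of such batch execution, using nbconvert's API, might look like this (the notebook order is an assumption based on the stages above):

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Assumed execution order, following the two stages described above.
NOTEBOOKS = [
    "schema_linking_generation_finetuning.ipynb",
    "schema_linking_generation_inference.ipynb",
    "sql_generation_finetuning.ipynb",
    "sql_generation_inference.ipynb",
]

for path in NOTEBOOKS:
    nb = nbformat.read(path, as_version=4)
    ExecutePreprocessor(timeout=None).preprocess(nb, {"metadata": {"path": "."}})
    nbformat.write(nb, path)
    print("executed", path)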

Using vLLM for Faster Inference

For faster inference, you can serve the model with vLLM by running:

jupyter notebook inference_vllm.ipynb

Note: The results with vLLM may not be as accurate as with the fully fine-tuned models.
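A minimal vLLM sketch is shown below; the merged-model path is an assumption (vLLM loads full weights, so the LoRA adapters would first be merged into the base model):

from vllm import LLM, SamplingParams

llm = LLM(model="./merged_sql_generation_model")  # assumed merged checkpoint
params = SamplingParams(temperature=0.0, max_tokens=256)

# Prompts are built exactly as in the notebooks; a placeholder is used here.
outputs = llm.generate(["<prompt built as in the notebooks>"], params)
print(outputs[0].outputs[0].text)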

Models

Download the required models from the following links:

Results

Evaluation on Spider (https://yale-lily.github.io/spider)

Expected results are:

  • Exact Match (EM): 0.736
  • Execution Accuracy (EX): 0.813
  • Upper-bound result: 0.876 with the ideal (filtered) schema-link setup
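For intuition, execution accuracy counts a prediction as correct when it returns the same results as the gold query on the corresponding database. The sketch below is illustrative only; the official Spider evaluator (linked above) is the authoritative implementation:

import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Compare result sets of predicted and gold SQL on a Spider SQLite DB."""
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(pred_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # predicted query failed to execute
    finally:
        conn.close()
    # Order-insensitive comparison; the official evaluator is stricter.
    return set(pred) == set(gold)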

About

This is a mirror repository maintained by one of the authors.
