Merge pull request #28 from eosphoros-ai/lora
update: Updates the readme document and optimizes the code structure
csunny committed Jul 30, 2023
2 parents a06a55a + be4e9f6 commit 0761bfd
Showing 18 changed files with 726 additions and 852 deletions.
33 changes: 19 additions & 14 deletions README.md
@@ -1,6 +1,6 @@
# DB-GPT-Hub: Text-to-SQL parsing with LLMs

[**简体中文**](README.zh.md) |[**Discord**](https://discord.gg/rBgtJW8U)|[**Wechat**](https://github.com/csunny/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC)
[**简体中文**](README.zh.md) |[**Discord**](https://discord.gg/c2xxQ8Rq)|[**Wechat**](https://github.com/csunny/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC)

## 1. What is DB-GPT-Hub

@@ -45,12 +45,13 @@ The approximate hardware resources required to quantize and fine-tune the model

### 2.3. Fine-tuning methods

#### Spider+QLoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa)
#### Spider+QLoRA/LoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa)

This experimental project builds a dataset by adding table structure information, adjusting the parameters of the language model and then fine-tuning the LLM with QLoRA, aiming to reduce the cost of fine-tuning while increasing the accuracy and speed of SQL generation. This can be executed with the following command:
This experimental project builds a dataset by adding table structure information, adjusting the parameters of the language model and then fine-tuning the LLM with QLoRA/LoRA, aiming to reduce the cost of fine-tuning while increasing the accuracy and speed of SQL generation. This can be executed with the following command:

```shell
sh ./scripts/spider_qlora_finetune.sh
sh scripts/qlora/qlora.sh
sh scripts/lora/lora.sh
```
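
Under the hood, the QLoRA path loads the base model in 4-bit precision and trains low-rank adapters on top of the frozen quantized weights. Below is a minimal setup sketch, assuming the standard transformers/peft/bitsandbytes stack; the model name and hyperparameters are illustrative, not this project's exact configuration:

```python
# Minimal QLoRA-style setup (illustrative; not the repo's exact training code)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads
model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.0, task_type="CAUSAL_LM"
))
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```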

## 3. Usage
@@ -71,7 +72,7 @@ Put the model files under the new Model folder here

DB-GPT-HUB uses the information-matching generation method for data preparation, i.e. the SQL + Repository generation method that combines table information. Incorporating the table information helps the model better understand the structure and relationships of the data tables, making it suitable for generating SQL statements that meet the requirements.

Before running, you need to create a new data directory, download the dataset and place it in that directory. Here is an example of a spider dataset. The spider dataset contains three main parts:
Before running, you need to download the SQL dataset and put it in this directory. Here, the Spider dataset is taken as an example; it consists of three main parts:

* train_spider.json: each text-to-SQL QA pair and its related database data are stored as a JSON file
* db_id: the name of the database
@@ -115,7 +116,7 @@ This data is then expressed in natural language, e.g.:
The code implementation of the above data pre-processing section is as follows:

```bash
python src/sql_data_process.py
python dbgpt_hub/utils/sql_data_process.py
```

When fine-tuning the model, we also customize the prompt dict to optimize the input:
@@ -138,32 +139,36 @@ SQL_PROMPT_DICT = {
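
The body of the dict is collapsed in this diff view. As a rough, hedged reconstruction of its likely alpaca-style shape (only the name SQL_PROMPT_DICT comes from the diff; the keys and prompt strings below are illustrative):

```python
# Hedged sketch: field names follow the common alpaca-style convention;
# the exact wording in the repo may differ.
SQL_PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides the database context. Write a SQL response that "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. Write a SQL response "
        "that completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}
```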

### 3.3. Model fine-tuning

Model fine-tuning uses the QLoRA method, where we can run the following command to fine-tune the model:
Model fine-tuning uses the QLoRA/LoRA method, where we can run the following command to fine-tune the model:

```bash
python src/train/train_qlora.py --model_name_or_path <path_or_name>
python train_qlora.py --model_name_or_path <path_or_name>
```
The fine-tuned model weights are saved under the adapter folder by default. The full training script is in scripts/qlora/qlora.sh. For multi-GPU runs, since scripts/spider_qlora_finetune.sh is based on QLoRA by default, it is recommended to specify the GPU numbers up front, e.g. change `python src/train/train_qlora.py` to `CUDA_VISIBLE_DEVICES=0,1,2,3 python src/train/train_qlora.py`.

The fine-tuned model weights will be saved to the output folder by default
```bash
python train_lora.py --model_name_or_path <path_or_name>
```
The full training script is in scripts/lora/.

### 3.4. Merge weights

Run the following command to generate the final merged model:

```bash
python src/utils/merge_peft_adapters.py --base_model_name_or_path <path_or_name>
python dbgpt_hub/utils/merge_peft_adapters.py --base_model_name_or_path <path_or_name>
```
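
Merging folds the trained adapter back into the base weights so the result can be loaded without peft. A minimal sketch of what such a step involves (paths are placeholders; the adapter path echoes the default mentioned in section 3.4 of README.zh.md):

```python
# Sketch of a PEFT adapter merge (assumed flow; the repo's script may expose more options)
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder
model = PeftModel.from_pretrained(base, "./adapter/checkpoint-10/adapter_model")
model = model.merge_and_unload()         # fold LoRA weights into the base weights
model.save_pretrained("./merged-model")  # result loads as a plain transformers model
AutoTokenizer.from_pretrained("huggyllama/llama-7b").save_pretrained("./merged-model")
```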

## 4. Roadmap

We will divide the whole process into three phases:

* Stage 1:
- [ ] LLaMa/LLaMa2
  - [ ] LoRA
- [x] LLaMa/LLaMa2
  - [x] LoRA
  - [x] QLoRA
- [ ] Falcon
  - [ ] LoRA
- [x] Falcon
  - [x] LoRA
  - [x] QLoRA
- [ ] ChatGLM
- [ ] BLOOM
40 changes: 24 additions & 16 deletions README.zh.md
@@ -1,6 +1,6 @@
# DB-GPT-Hub: Text-to-SQL with LLMs

[**English**](README.md) |[**Discord**](https://discord.gg/rBgtJW8U)|[**Wechat**](https://github.com/csunny/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC)
[**English**](README.md) |[**Discord**](https://discord.gg/c2xxQ8Rq)|[**Wechat**](https://github.com/csunny/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC)

## 1. What is DB-GPT-Hub

@@ -45,12 +45,13 @@ The base models currently supported by DB-GPT-HUB are:

### 2.3. Fine-tuning methods

#### Spider+QLoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa)
#### Spider+QLoRA/LoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa)

This experimental project builds its dataset by adding table-structure information and adjusting the language model's parameters, and then fine-tunes the LLM with QLoRA, aiming to reduce the cost of fine-tuning while improving the accuracy and speed of SQL generation. It can be executed with the following command:
This experimental project builds its dataset by adding table-structure information and adjusting the language model's parameters, and then fine-tunes the LLM with QLoRA/LoRA, aiming to reduce the cost of fine-tuning while improving the accuracy and speed of SQL generation. It can be executed with the following command:

```shell
sh ./scripts/spider_qlora_finetune.sh
sh scripts/qlora/qlora.sh
sh scripts/lora/lora.sh
```

## 3. Usage
@@ -65,13 +66,13 @@ conda activate dbgpt_hub
pip install -r requirements.txt
mkdir model
```
Put the downloaded large-model files under the newly created model folder here
You can put the downloaded large-model files under the newly created model folder

### 3.2. Data preparation

DB-GPT-HUB uses the information-matching generation method for data preparation, i.e. the SQL + Repository generation method that combines table information. Incorporating the table information helps the model better understand the structure and relationships of the data tables, making it suitable for generating SQL statements that meet the requirements.

Before running, you need to create a new data directory and download the dataset into it. Here the Spider dataset is taken as an example; it consists of three main parts:
Before running, you need to download the SQL dataset and put it in this directory. Here the Spider dataset is taken as an example; it consists of three main parts:

* train_spider.json: each text-to-SQL QA pair and its related database data are stored as a JSON file
* db_id: the name of the database
@@ -115,7 +116,7 @@ DB-GPT-HUB uses the information-matching generation method for data preparation, i.e. combining table inf…
The code implementing the above data pre-processing is as follows:

```bash
python src/sql_data_process.py
python dbgpt_hub/utils/sql_data_process.py
```

When fine-tuning the model, we also customize the prompt dict to optimize the input:
@@ -137,34 +138,41 @@ SQL_PROMPT_DICT = {

### 3.3. Model fine-tuning

Model fine-tuning uses the QLoRA method; we can run the following command to fine-tune the model:
Model fine-tuning uses the QLoRA and LoRA methods; we can run the following command to fine-tune the model:

```bash
python src/train/train_qlora.py --model_name_or_path <path_or_name>
python train_qlora.py --model_name_or_path <path_or_name>
```

The fine-tuned model weights are saved under the output folder by default.
The corresponding script is scripts/spider_qlora_finetune.sh; you can add a parameter such as "--output_dir ./adapter \" to specify the output path.
The fine-tuned model weights are saved under the adapter folder by default. The full training script is in scripts/qlora/qlora.sh.
For multi-GPU runs, since scripts/spider_qlora_finetune.sh is based on QLoRA by default, it is recommended to specify the GPU numbers up front, e.g. change `python src/train/train_qlora.py` to `CUDA_VISIBLE_DEVICES=0,1,2,3 python src/train/train_qlora.py`.

When fine-tuning with LoRA, we can use the following command:

```bash
python train_lora.py --model_name_or_path <path_or_name>
```
The full training script is in scripts/lora/.

### 3.4. Merge weights

Run the following command to generate the final merged model:

```bash
python src/utils/merge_peft_adapters.py --base_model_name_or_path <path_or_name>
python dbgpt_hub/utils/merge_peft_adapters.py --base_model_name_or_path <path_or_name>
```
The output path produced in 3.3 corresponds to the "--peft_model_path" parameter here in 3.4; its default value is "./adapter/checkpoint-10/adapter_model", and the default values of the other related parameters are set in the get_arg function in merge_peft_adapters.py.

## 4. Roadmap

We will divide the whole process into three phases:

* Phase 1:
- [ ] LLaMa/LLaMa2
  - [ ] LoRA
- [x] LLaMa/LLaMa2
  - [x] LoRA
  - [x] QLoRA
- [ ] Falcon
  - [ ] LoRA
- [x] Falcon
  - [x] LoRA
  - [x] QLoRA
- [ ] ChatGLM
- [ ] BLOOM
6 changes: 0 additions & 6 deletions data/data_info.yaml
@@ -18,12 +18,6 @@ self-instruct:
  dataset_format: self-instruct
  multi_turn: False

guanaco:
  hf_hub_url: JosephusCheung/GuanacoDataset
  local_path: ''
  dataset_format: guanaco
  multi_turn: False


openassistant-guanaco:
  hf_hub_url: timdettmers/openassistant-guanaco
8 changes: 7 additions & 1 deletion dbgpt_hub/configs/__init__.py
@@ -1,5 +1,11 @@
from .data_args import DataArguments
from .gen_args import GenerationArguments
from .lora_args import LoraArguments
from .model_args import ModelArguments
from .quant_args import QuantArguments
from .train_args import TrainingArguments

__all__ = ['DataArguments', 'ModelArguments','TrainingArguments']
__all__ = [
    'DataArguments', 'GenerationArguments', 'ModelArguments',
    'TrainingArguments', 'LoraArguments', 'QuantArguments'
]
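
These argument groups are typically consumed together; a hypothetical wiring with transformers' HfArgumentParser (the parser call is an assumption — only the dataclass names come from this package):

```python
# Hypothetical usage sketch: parse all argument groups from the command line
from transformers import HfArgumentParser
from dbgpt_hub.configs import (DataArguments, GenerationArguments, LoraArguments,
                               ModelArguments, QuantArguments, TrainingArguments)

parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments,
                           LoraArguments, QuantArguments, GenerationArguments))
(model_args, data_args, training_args,
 lora_args, quant_args, gen_args) = parser.parse_args_into_dataclasses()
```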
20 changes: 18 additions & 2 deletions dbgpt_hub/configs/data_args.py
@@ -39,6 +39,7 @@ class DataArguments:
        metadata={
            'help': 'Which dataset to finetune on. See datamodule for options.'
        })

    dataset_dir: str = field(
        default=None,
        metadata={
@@ -57,10 +58,9 @@
            'help':
            'Which template to use for constructing prompts in multi-turn dataset training and inference.'
        })

    eval_dataset_size: Optional[float] = field(
        default=0.1, metadata={'help': 'Size of validation dataset.'})

    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
@@ -69,6 +69,22 @@
            'value if set.'
        },
    )
    source_max_len: int = field(
        default=1024,
        metadata={"help": "Maximum source sequence length. Sequences will be right padded (and possibly truncated)."},
    )
    target_max_len: int = field(
        default=256,
        metadata={"help": "Maximum target sequence length. Sequences will be right padded (and possibly truncated)."},
    )
    dataset: str = field(
        default='spider',
        metadata={"help": "Which dataset to finetune on. See datamodule for options."}
    )
    dataset_format: Optional[str] = field(
        default="spider",
        metadata={"help": "Which dataset format is used. [alpaca|chip2|self-instruct|hh-rlhf]"}
    )

    max_eval_samples: Optional[int] = field(
        default=None,
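
The new source_max_len and target_max_len fields cap the tokenized prompt and target. A sketch of how such caps are commonly applied (assumed usage, not necessarily this repo's exact collator; the model name and texts are placeholders):

```python
# Assumed usage of the length caps when building model inputs
from transformers import AutoTokenizer

data_args = DataArguments()
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder
prompt = "###Instruction:\nShow all singers.\n\n###Response:"     # toy example
answer = "SELECT name FROM singer"
source_ids = tokenizer(prompt, max_length=data_args.source_max_len,
                       truncation=True)["input_ids"]  # truncate over-long prompts
target_ids = tokenizer(answer, max_length=data_args.target_max_len,
                       truncation=True)["input_ids"]  # truncate over-long targets
```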
35 changes: 35 additions & 0 deletions dbgpt_hub/configs/gen_args.py
@@ -0,0 +1,35 @@
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class GenerationArguments:
    # For more hyperparameters check:
    # https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig
    # Length arguments
    max_new_tokens: Optional[int] = field(
        default=256,
        metadata={"help": "Maximum number of new tokens to be generated in evaluation or prediction loops"
                          " if predict_with_generate is set."}
    )
    min_new_tokens: Optional[int] = field(
        default=None,
        metadata={"help": "Minimum number of new tokens to generate."}
    )

    # Generation strategy
    do_sample: Optional[bool] = field(default=False)
    num_beams: Optional[int] = field(default=1)
    num_beam_groups: Optional[int] = field(default=1)
    penalty_alpha: Optional[float] = field(default=None)
    use_cache: Optional[bool] = field(default=False)

    # Hyperparameters for logit manipulation
    temperature: Optional[float] = field(default=1.0)
    top_k: Optional[int] = field(default=50)
    top_p: Optional[float] = field(default=1.0)
    typical_p: Optional[float] = field(default=1.0)
    diversity_penalty: Optional[float] = field(default=0.0)
    repetition_penalty: Optional[float] = field(default=1.0)
    length_penalty: Optional[float] = field(default=1.0)
    no_repeat_ngram_size: Optional[int] = field(default=0)
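
Presumably these fields are handed to model.generate; a hypothetical usage sketch (the wiring is an assumption — only GenerationArguments and the imported asdict come from this file):

```python
# Assumed consumption: GenerationConfig accepts the same field names,
# so a dataclass dump maps onto it directly.
from transformers import GenerationConfig

gen_args = GenerationArguments(max_new_tokens=128, do_sample=True, top_p=0.9)
generation_config = GenerationConfig(**asdict(gen_args))
```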
16 changes: 16 additions & 0 deletions dbgpt_hub/configs/lora_args.py
@@ -0,0 +1,16 @@
from dataclasses import dataclass, field


@dataclass
class LoraArguments:
    # Number of columns of matrix A and number of rows of matrix B in LoRA (the rank)
    lora_r: int = field(default=64, metadata={'help': 'Lora R dimension.'})
    # Scaling factor
    lora_alpha: float = field(default=16, metadata={'help': 'Lora alpha.'})
    lora_dropout: float = field(default=0.0,
                                metadata={'help': 'Lora dropout.'})
    # Size of memory available on each GPU, in MB (e.g. 80GB for the high-end A100)
    max_memory_MB: int = field(default=8000,
                               metadata={'help': 'Free memory per gpu.'})
    lora_weight_path: str = ''
    bias: str = 'none'
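
A sketch of how these fields would map onto a peft LoraConfig (assumed wiring; the target_modules list is an illustrative choice for LLaMA-family models and does not appear in this file):

```python
# Assumed mapping from LoraArguments to a peft LoraConfig
from peft import LoraConfig

args = LoraArguments()
peft_config = LoraConfig(
    r=args.lora_r,
    lora_alpha=args.lora_alpha,
    lora_dropout=args.lora_dropout,
    bias=args.bias,                       # 'none': keep bias terms frozen
    target_modules=["q_proj", "v_proj"],  # illustrative attention projections
    task_type="CAUSAL_LM",
)
```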
34 changes: 34 additions & 0 deletions dbgpt_hub/configs/quant_args.py
@@ -0,0 +1,34 @@
from dataclasses import dataclass, field


@dataclass
class QuantArguments:
    # Besides 8-bit adam you can switch to LION or Sophia; DeepSpeed even offers several 1-bit optimizer options
    adam8bit: bool = field(default=False, metadata={'help': 'Use 8-bit adam.'})
    # Whether to use double quantization
    double_quant: bool = field(
        default=True,
        metadata={
            'help':
            'Compress the quantization statistics through double quantization.'
        })
    # Quantization data type: either fp4 or nf4
    quant_type: str = field(
        default='nf4',
        metadata={
            'help':
            'Quantization data type to use. Should be one of `fp4` or `nf4`.'
        })
    # Bit width to use; the default is 4
    bits: int = field(default=4, metadata={'help': 'How many bits to use.'})

    def __post_init__(self):
        if self.bits is not None:
            assert self.bits in [
                4, 8
            ], 'We only accept 4-bit or 8-bit quantization.'

        if self.quant_type is not None:
            assert self.quant_type in [
                'nf4', 'fp4'
            ], 'We only accept `nf4` or `fp4` quantization type.'
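
For reference, a sketch of how these fields typically translate into a bitsandbytes quantization config (assumed wiring; the compute dtype is an illustrative choice):

```python
# Assumed mapping from QuantArguments to a bitsandbytes config
import torch
from transformers import BitsAndBytesConfig

qargs = QuantArguments()
bnb_config = BitsAndBytesConfig(
    load_in_4bit=qargs.bits == 4,
    load_in_8bit=qargs.bits == 8,
    bnb_4bit_quant_type=qargs.quant_type,          # 'nf4' or 'fp4'
    bnb_4bit_use_double_quant=qargs.double_quant,  # double quantization on/off
    bnb_4bit_compute_dtype=torch.bfloat16,         # illustrative compute dtype
)
```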
