Merge pull request #28 from eosphoros-ai/lora
update: Updates the readme document and optimizes the code structure
csunny committed Jul 30, 2023
2 parents a06a55a + be4e9f6 commit 0761bfd
Showing 18 changed files with 726 additions and 852 deletions.
33 changes: 19 additions & 14 deletions README.md
@@ -1,6 +1,6 @@
# DB-GPT-Hub: Text-to-SQL parsing with LLMs

[**简体中文**](README.zh.md) |[**Discord**](https://discord.gg/rBgtJW8U)|[**Wechat**](https://github.com/csunny/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC)
[**简体中文**](README.zh.md) |[**Discord**](https://discord.gg/c2xxQ8Rq)|[**Wechat**](https://github.com/csunny/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC)

## 1. What is DB-GPT-Hub

@@ -45,12 +45,13 @@ The approximate hardware resources required to quantize and fine-tune the model

### 2.3. Fine-tuning methods

#### Spider+QLoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa)
#### Spider+QLoRA/LoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa)

This experimental project builds a dataset by adding table structure information, adjusting the parameters of the language model and then fine-tuning the LLM with QLoRA, aiming to reduce the cost of fine-tuning while increasing the accuracy and speed of SQL generation. This can be executed with the following command:
This experimental project builds a dataset by adding table structure information, adjusting the parameters of the language model and then fine-tuning the LLM with QLoRA/LoRA, aiming to reduce the cost of fine-tuning while increasing the accuracy and speed of SQL generation. This can be executed with the following command:

```shell
sh ./scripts/spider_qlora_finetune.sh
sh scripts/qlora/qlora.sh
sh scripts/lora/lora.sh
```
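
Under the hood, the QLoRA path loads the base model in 4-bit precision and trains low-rank adapters on top of the frozen quantized weights. Below is a minimal setup sketch, assuming the standard transformers/peft/bitsandbytes stack; the model name and hyperparameters are illustrative, not this project's exact configuration:

```python
# Minimal QLoRA-style setup (illustrative; not the repo's exact training code)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads
model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.0, task_type="CAUSAL_LM"
))
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```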

## 3. Usage
@@ -71,7 +72,7 @@ Put the model files under the new Model folder here

DB-GPT-HUB uses the information-matching generation method for data preparation, i.e. the SQL + Repository generation method that combines table information. Incorporating the table information helps the model better understand the structure and relationships of the data tables, making it suitable for generating SQL statements that meet the requirements.

Before running, you need to create a new data directory, download the dataset and place it in that directory. Here is an example of a spider dataset. The spider dataset contains three main parts:
Before running, you need to download the SQL dataset and put it in this directory. Here, the Spider dataset is taken as an example; it consists of three main parts:

* train_spider.json: each text-to-SQL QA pair and its related database data are stored as a JSON file
* db_id: the name of the database
@@ -115,7 +116,7 @@ This data is then expressed in natural language, e.g.:
The code implementation of the above data pre-processing section is as follows:

```bash
python src/sql_data_process.py
python dbgpt_hub/utils/sql_data_process.py
```

When fine-tuning the model, we also customize the prompt dict to optimize the input:
@@ -138,32 +139,36 @@ SQL_PROMPT_DICT = {
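
The body of the dict is collapsed in this diff view. As a rough, hedged reconstruction of its likely alpaca-style shape (only the name SQL_PROMPT_DICT comes from the diff; the keys and prompt strings below are illustrative):

```python
# Hedged sketch: field names follow the common alpaca-style convention;
# the exact wording in the repo may differ.
SQL_PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides the database context. Write a SQL response that "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. Write a SQL response "
        "that completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}
```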

### 3.3. Model fine-tuning

Model fine-tuning uses the QLoRA method, where we can run the following command to fine-tune the model:
Model fine-tuning uses the QLoRA/LoRA method, where we can run the following command to fine-tune the model:

```bash
python src/train/train_qlora.py --model_name_or_path <path_or_name>
python train_qlora.py --model_name_or_path <path_or_name>
```
The fine-tuned model weights are saved under the adapter folder by default. The full training script is in scripts/qlora/qlora.sh. For multi-GPU runs, since scripts/spider_qlora_finetune.sh is based on QLoRA by default, it is recommended to specify the GPU numbers up front, e.g. change `python src/train/train_qlora.py` to `CUDA_VISIBLE_DEVICES=0,1,2,3 python src/train/train_qlora.py`.

The fine-tuned model weights will be saved to the output folder by default
```bash
python train_lora.py --model_name_or_path <path_or_name>
```
The full training script is in scripts/lora/.

### 3.4. Merge weights

Run the following command to generate the final merged model:

```bash
python src/utils/merge_peft_adapters.py --base_model_name_or_path <path_or_name>
python dbgpt_hub/utils/merge_peft_adapters.py --base_model_name_or_path <path_or_name>
```
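
Merging folds the trained adapter back into the base weights so the result can be loaded without peft. A minimal sketch of what such a step involves (paths are placeholders; the adapter path echoes the default mentioned in section 3.4 of README.zh.md):

```python
# Sketch of a PEFT adapter merge (assumed flow; the repo's script may expose more options)
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder
model = PeftModel.from_pretrained(base, "./adapter/checkpoint-10/adapter_model")
model = model.merge_and_unload()         # fold LoRA weights into the base weights
model.save_pretrained("./merged-model")  # result loads as a plain transformers model
AutoTokenizer.from_pretrained("huggyllama/llama-7b").save_pretrained("./merged-model")
```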

## 4. Roadmap

We will divide the whole process into three phases:

* Stage 1:
- [ ] LLaMa/LLaMa2
  - [ ] LoRA
- [x] LLaMa/LLaMa2
  - [x] LoRA
  - [x] QLoRA
- [ ] Falcon
  - [ ] LoRA
- [x] Falcon
  - [x] LoRA
  - [x] QLoRA
- [ ] ChatGLM
- [ ] BLOOM
40 changes: 24 additions & 16 deletions README.zh.md
@@ -1,6 +1,6 @@
# DB-GPT-Hub: Text-to-SQL with LLMs

[**English**](README.md) |[**Discord**](https://discord.gg/rBgtJW8U)|[**Wechat**](https://github.com/csunny/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC)
[**English**](README.md) |[**Discord**](https://discord.gg/c2xxQ8Rq)|[**Wechat**](https://github.com/csunny/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC)

## 1. What is DB-GPT-Hub

@@ -45,12 +45,13 @@ The base models currently supported by DB-GPT-HUB are:

### 2.3. Fine-tuning methods

#### Spider+QLoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa)
#### Spider+QLoRA/LoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa)

This experimental project builds its dataset by adding table-structure information and adjusting the language model's parameters, and then fine-tunes the LLM with QLoRA, aiming to reduce the cost of fine-tuning while improving the accuracy and speed of SQL generation. It can be executed with the following command:
This experimental project builds its dataset by adding table-structure information and adjusting the language model's parameters, and then fine-tunes the LLM with QLoRA/LoRA, aiming to reduce the cost of fine-tuning while improving the accuracy and speed of SQL generation. It can be executed with the following command:

```shell
sh ./scripts/spider_qlora_finetune.sh
sh scripts/qlora/qlora.sh
sh scripts/lora/lora.sh
```

## 3. Usage
@@ -65,13 +66,13 @@ conda activate dbgpt_hub
pip install -r requirements.txt
mkdir model
```
Put the downloaded large-model files under the newly created model folder here
You can put the downloaded large-model files under the newly created model folder

### 3.2. Data preparation

DB-GPT-HUB uses the information-matching generation method for data preparation, i.e. the SQL + Repository generation method that combines table information. Incorporating the table information helps the model better understand the structure and relationships of the data tables, making it suitable for generating SQL statements that meet the requirements.

Before running, you need to create a new data directory and download the dataset into it. Here the Spider dataset is taken as an example; it consists of three main parts:
Before running, you need to download the SQL dataset and put it in this directory. Here the Spider dataset is taken as an example; it consists of three main parts:

* train_spider.json: each text-to-SQL QA pair and its related database data are stored as a JSON file
* db_id: the name of the database
@@ -115,7 +116,7 @@ DB-GPT-HUB uses the information-matching generation method for data preparation, i.e. combining table inf…
The code implementing the above data pre-processing is as follows:

```bash
python src/sql_data_process.py
python dbgpt_hub/utils/sql_data_process.py
```

When fine-tuning the model, we also customize the prompt dict to optimize the input:
@@ -137,34 +138,41 @@ SQL_PROMPT_DICT = {

### 3.3. Model fine-tuning

Model fine-tuning uses the QLoRA method; we can run the following command to fine-tune the model:
Model fine-tuning uses the QLoRA and LoRA methods; we can run the following command to fine-tune the model:

```bash
python src/train/train_qlora.py --model_name_or_path <path_or_name>
python train_qlora.py --model_name_or_path <path_or_name>
```

The fine-tuned model weights are saved under the output folder by default.
The corresponding script is scripts/spider_qlora_finetune.sh; you can add a parameter such as "--output_dir ./adapter \" to specify the output path.
The fine-tuned model weights are saved under the adapter folder by default. The full training script is in scripts/qlora/qlora.sh.
For multi-GPU runs, since scripts/spider_qlora_finetune.sh is based on QLoRA by default, it is recommended to specify the GPU numbers up front, e.g. change `python src/train/train_qlora.py` to `CUDA_VISIBLE_DEVICES=0,1,2,3 python src/train/train_qlora.py`.

When fine-tuning with LoRA, we can use the following command:

```bash
python train_lora.py --model_name_or_path <path_or_name>
```
The full training script is in scripts/lora/.

### 3.4. Merge weights

Run the following command to generate the final merged model:

```bash
python src/utils/merge_peft_adapters.py --base_model_name_or_path <path_or_name>
python dbgpt_hub/utils/merge_peft_adapters.py --base_model_name_or_path <path_or_name>
```
The output path produced in 3.3 corresponds to the "--peft_model_path" parameter here in 3.4; its default value is "./adapter/checkpoint-10/adapter_model", and the default values of the other related parameters are set in the get_arg function in merge_peft_adapters.py.

## 4. Roadmap

We will divide the whole process into three phases:

* Phase 1:
- [ ] LLaMa/LLaMa2
  - [ ] LoRA
- [x] LLaMa/LLaMa2
  - [x] LoRA
  - [x] QLoRA
- [ ] Falcon
  - [ ] LoRA
- [x] Falcon
  - [x] LoRA
  - [x] QLoRA
- [ ] ChatGLM
- [ ] BLOOM
6 changes: 0 additions & 6 deletions data/data_info.yaml
@@ -18,12 +18,6 @@ self-instruct:
  dataset_format: self-instruct
  multi_turn: False

guanaco:
  hf_hub_url: JosephusCheung/GuanacoDataset
  local_path: ''
  dataset_format: guanaco
  multi_turn: False


openassistant-guanaco:
  hf_hub_url: timdettmers/openassistant-guanaco
8 changes: 7 additions & 1 deletion dbgpt_hub/configs/__init__.py
@@ -1,5 +1,11 @@
from .data_args import DataArguments
from .gen_args import GenerationArguments
from .lora_args import LoraArguments
from .model_args import ModelArguments
from .quant_args import QuantArguments
from .train_args import TrainingArguments

__all__ = ['DataArguments', 'ModelArguments','TrainingArguments']
__all__ = [
    'DataArguments', 'GenerationArguments', 'ModelArguments',
    'TrainingArguments', 'LoraArguments', 'QuantArguments'
]
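
These argument groups are typically consumed together; a hypothetical wiring with transformers' HfArgumentParser (the parser call is an assumption — only the dataclass names come from this package):

```python
# Hypothetical usage sketch: parse all argument groups from the command line
from transformers import HfArgumentParser
from dbgpt_hub.configs import (DataArguments, GenerationArguments, LoraArguments,
                               ModelArguments, QuantArguments, TrainingArguments)

parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments,
                           LoraArguments, QuantArguments, GenerationArguments))
(model_args, data_args, training_args,
 lora_args, quant_args, gen_args) = parser.parse_args_into_dataclasses()
```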
20 changes: 18 additions & 2 deletions dbgpt_hub/configs/data_args.py
@@ -39,6 +39,7 @@ class DataArguments:
        metadata={
            'help': 'Which dataset to finetune on. See datamodule for options.'
        })

    dataset_dir: str = field(
        default=None,
        metadata={
@@ -57,10 +58,9 @@
            'help':
            'Which template to use for constructing prompts in multi-turn dataset training and inference.'
        })

    eval_dataset_size: Optional[float] = field(
        default=0.1, metadata={'help': 'Size of validation dataset.'})

    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
@@ -69,6 +69,22 @@
            'value if set.'
        },
    )
    source_max_len: int = field(
        default=1024,
        metadata={"help": "Maximum source sequence length. Sequences will be right padded (and possibly truncated)."},
    )
    target_max_len: int = field(
        default=256,
        metadata={"help": "Maximum target sequence length. Sequences will be right padded (and possibly truncated)."},
    )
    dataset: str = field(
        default='spider',
        metadata={"help": "Which dataset to finetune on. See datamodule for options."}
    )
    dataset_format: Optional[str] = field(
        default="spider",
        metadata={"help": "Which dataset format is used. [alpaca|chip2|self-instruct|hh-rlhf]"}
    )

    max_eval_samples: Optional[int] = field(
        default=None,
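
The new source_max_len and target_max_len fields cap the tokenized prompt and target. A sketch of how such caps are commonly applied (assumed usage, not necessarily this repo's exact collator; the model name and texts are placeholders):

```python
# Assumed usage of the length caps when building model inputs
from transformers import AutoTokenizer

data_args = DataArguments()
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder
prompt = "###Instruction:\nShow all singers.\n\n###Response:"     # toy example
answer = "SELECT name FROM singer"
source_ids = tokenizer(prompt, max_length=data_args.source_max_len,
                       truncation=True)["input_ids"]  # truncate over-long prompts
target_ids = tokenizer(answer, max_length=data_args.target_max_len,
                       truncation=True)["input_ids"]  # truncate over-long targets
```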
35 changes: 35 additions & 0 deletions dbgpt_hub/configs/gen_args.py
@@ -0,0 +1,35 @@
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class GenerationArguments:
    # For more hyperparameters check:
    # https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig
    # Length arguments
    max_new_tokens: Optional[int] = field(
        default=256,
        metadata={"help": "Maximum number of new tokens to be generated in evaluation or prediction loops"
                          " if predict_with_generate is set."}
    )
    min_new_tokens: Optional[int] = field(
        default=None,
        metadata={"help": "Minimum number of new tokens to generate."}
    )

    # Generation strategy
    do_sample: Optional[bool] = field(default=False)
    num_beams: Optional[int] = field(default=1)
    num_beam_groups: Optional[int] = field(default=1)
    penalty_alpha: Optional[float] = field(default=None)
    use_cache: Optional[bool] = field(default=False)

    # Hyperparameters for logit manipulation
    temperature: Optional[float] = field(default=1.0)
    top_k: Optional[int] = field(default=50)
    top_p: Optional[float] = field(default=1.0)
    typical_p: Optional[float] = field(default=1.0)
    diversity_penalty: Optional[float] = field(default=0.0)
    repetition_penalty: Optional[float] = field(default=1.0)
    length_penalty: Optional[float] = field(default=1.0)
    no_repeat_ngram_size: Optional[int] = field(default=0)
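
Presumably these fields are handed to model.generate; a hypothetical usage sketch (the wiring is an assumption — only GenerationArguments and the imported asdict come from this file):

```python
# Assumed consumption: GenerationConfig accepts the same field names,
# so a dataclass dump maps onto it directly.
from transformers import GenerationConfig

gen_args = GenerationArguments(max_new_tokens=128, do_sample=True, top_p=0.9)
generation_config = GenerationConfig(**asdict(gen_args))
```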
16 changes: 16 additions & 0 deletions dbgpt_hub/configs/lora_args.py
@@ -0,0 +1,16 @@
from dataclasses import dataclass, field


@dataclass
class LoraArguments:
    # Number of columns of matrix A and number of rows of matrix B in LoRA (the rank)
    lora_r: int = field(default=64, metadata={'help': 'Lora R dimension.'})
    # Scaling factor
    lora_alpha: float = field(default=16, metadata={'help': 'Lora alpha.'})
    lora_dropout: float = field(default=0.0,
                                metadata={'help': 'Lora dropout.'})
    # Size of memory available on each GPU, in MB (e.g. 80GB for the high-end A100)
    max_memory_MB: int = field(default=8000,
                               metadata={'help': 'Free memory per gpu.'})
    lora_weight_path: str = ''
    bias: str = 'none'
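
A sketch of how these fields would map onto a peft LoraConfig (assumed wiring; the target_modules list is an illustrative choice for LLaMA-family models and does not appear in this file):

```python
# Assumed mapping from LoraArguments to a peft LoraConfig
from peft import LoraConfig

args = LoraArguments()
peft_config = LoraConfig(
    r=args.lora_r,
    lora_alpha=args.lora_alpha,
    lora_dropout=args.lora_dropout,
    bias=args.bias,                       # 'none': keep bias terms frozen
    target_modules=["q_proj", "v_proj"],  # illustrative attention projections
    task_type="CAUSAL_LM",
)
```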
34 changes: 34 additions & 0 deletions dbgpt_hub/configs/quant_args.py
@@ -0,0 +1,34 @@
from dataclasses import dataclass, field


@dataclass
class QuantArguments:
    # Besides 8-bit adam you can switch to LION or Sophia; DeepSpeed even offers several 1-bit optimizer options
    adam8bit: bool = field(default=False, metadata={'help': 'Use 8-bit adam.'})
    # Whether to use double quantization
    double_quant: bool = field(
        default=True,
        metadata={
            'help':
            'Compress the quantization statistics through double quantization.'
        })
    # Quantization data type: either fp4 or nf4
    quant_type: str = field(
        default='nf4',
        metadata={
            'help':
            'Quantization data type to use. Should be one of `fp4` or `nf4`.'
        })
    # Bit width to use; the default is 4
    bits: int = field(default=4, metadata={'help': 'How many bits to use.'})

    def __post_init__(self):
        if self.bits is not None:
            assert self.bits in [
                4, 8
            ], 'We only accept 4-bit or 8-bit quantization.'

        if self.quant_type is not None:
            assert self.quant_type in [
                'nf4', 'fp4'
            ], 'We only accept `nf4` or `fp4` quantization type.'
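
For reference, a sketch of how these fields typically translate into a bitsandbytes quantization config (assumed wiring; the compute dtype is an illustrative choice):

```python
# Assumed mapping from QuantArguments to a bitsandbytes config
import torch
from transformers import BitsAndBytesConfig

qargs = QuantArguments()
bnb_config = BitsAndBytesConfig(
    load_in_4bit=qargs.bits == 4,
    load_in_8bit=qargs.bits == 8,
    bnb_4bit_quant_type=qargs.quant_type,          # 'nf4' or 'fp4'
    bnb_4bit_use_double_quant=qargs.double_quant,  # double quantization on/off
    bnb_4bit_compute_dtype=torch.bfloat16,         # illustrative compute dtype
)
```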
