🦙🧋🇹🇼 Finetune LLaMA-7B with Chinese instruction datasets

This repository is a tutorial for finetuning LLaMA-7B with Chinese instruction datasets! I surveyed and combined datasets and methods for finetuning my own LLM on complex NLP tasks such as summarization, question answering, text generation, and custom data augmentation.

Since the original Stanford Alpaca-7B finetune needs lots of GPU resources, I focused on methods with low GPU consumption.

Here's how to reproduce it:

Installation

  1. Install the requirements
$ pip install -r requirements.txt
  2. Install a PyTorch build compatible with your CUDA version
$ pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
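
To sanity-check the install before finetuning, here is a quick check (a minimal sketch; the version strings will differ by machine):

import torch

# Confirm PyTorch was built against the expected CUDA toolkit and that
# at least one GPU is visible before starting a finetune run.
print(torch.__version__)          # e.g. 1.13.1+cu116
print(torch.version.cuda)         # e.g. 11.6
print(torch.cuda.is_available())  # should be True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))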

Datasets

This repository combines all of the following datasets, using an English-instruction, Chinese-output construction (see the loading sketch after this list):

  1. alpaca_data.json: Original dataset from Stanford Alpaca
  2. alpaca_data_cleansed.json: Cleaned by gururise/AlpacaDataCleaned
  3. alpaca-zhCN.json: Translated to Simplified Chinese by carbonz0/alpaca-chinese-dataset
  4. alpaca-zhTW.json: Translated to Traditional Chinese using OpenCC
  5. alpaca-en-zh.json: Combines the English instruction/input with the Chinese output; the Traditional Chinese dataset was translated with the ChatGPT API (gpt-3.5-turbo) by ntunlplab/traditional-chinese-alpaca (updated 2023.03.29)
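
Each file follows the Stanford Alpaca record layout: a JSON list of objects with instruction, input, and output fields. A minimal loading sketch (the data/ path is an assumption; adjust it to where the files live in your checkout):

import json

# Each dataset file is a JSON list of Alpaca-style records; in
# alpaca-en-zh.json the instruction/input are English and the output is Chinese.
with open("data/alpaca-en-zh.json", "r", encoding="utf-8") as f:
    records = json.load(f)

print(len(records))
print(records[0]["instruction"], records[0]["input"], records[0]["output"])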

Finetune

The reference finetuning method is provided by tloen/alpaca-lora.

  1. Run on a single GPU with Colab: https://colab.research.google.com/drive/1QvtrJpikkkNKSbwwG766SIGbBw2TQRd5?usp=sharing

  2. Use torchrun for distributed training on multiple GPUs

$ cd finetune/
$ torchrun --standalone --nnodes=1 --nproc_per_node=4 finetune.py
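
Under the hood, alpaca-lora trains small LoRA adapters on top of a frozen 8-bit base model via the peft library. A minimal sketch of that setup (the checkpoint ID and hyperparameters mirror common alpaca-lora defaults and are assumptions here; see finetune/finetune.py for the actual values):

from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load the base model in 8-bit to keep GPU memory low (requires bitsandbytes).
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = prepare_model_for_int8_training(model)

# Attach LoRA adapters: only these small low-rank matrices are trained,
# not the full 7B parameters.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters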

Finetune Domain Tasks

(In progress; discussion is welcome at jiunyi.yang.abao@gmail.com. I'd like to try tasks from different domains such as investment, fraud, e-commerce, law, healthcare, ...)

Model Serving

Serve your own model through an API and a simple website UI:

  1. Model API

    $ cd serve/
    $ python api.py
  2. Demo UI

    $ cd serve/
    $ python ui.py
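
Once api.py is running, you can query the service over HTTP. A hypothetical client call (the port, route, and payload fields are assumptions; check serve/api.py for the actual ones):

import requests

# Send an Alpaca-style instruction/input pair to the local model API.
resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "instruction": "Summarize the following paragraph in Traditional Chinese.",
        "input": "LLaMA is a family of large language models released by Meta AI.",
    },
    timeout=60,
)
print(resp.json())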

Learn More

I curated many methods for running large language models with fewer GPU resources:

  • PEFT
  • LoRA
  • FlexGen ...

See full list: chatgpt-alternatives
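
As a concrete taste of the PEFT + LoRA route, here is a minimal low-memory inference sketch (the base checkpoint ID, adapter path, and prompt template are assumptions):

from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

# Load the frozen 8-bit base model, then layer the small finetuned LoRA
# adapter on top of it; only the adapter weights come from finetuning.
base = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./lora-alpaca")  # hypothetical adapter dir
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

prompt = "### Instruction:\nTranslate to Traditional Chinese: Hello!\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))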

Citation

@misc{alpaca-7b-chinese,
  author = {JiunYi Yang},
  title = {Alpaca-7B Chinese: Finetune LLaMA-7B with Chinese instruction datasets},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/A-baoYang/alpaca-7b-chinese}},
}
