# Step 0: setup
First, install the required dependencies and fetch the project source code:

In [None]:
!pip install transformers==4.28.1 sentencepiece==0.1.97 google protobuf deepspeed==0.9.2 datasets -i https://pypi.tuna.tsinghua.edu.cn/simple  --trusted-host pypi.tuna.tsinghua.edu.cn

Clone the Chinese-LLaMA-Alpaca repository and pin it to a fixed commit.

In [None]:
!git clone https://github.com/ymcui/Chinese-LLaMA-Alpaca.git ../../Chinese-LLaMA-Alpaca
!git -C ../../Chinese-LLaMA-Alpaca checkout 7bc1f3d7c426e3685d14eb1e5614066650f94838

Clone the peft library, check out a fixed commit, and install it from source.

In [None]:
!git clone https://github.com/huggingface/peft.git ../../peft
!git -C ../../peft checkout 13e53fc
!pip install ../../peft -i https://pypi.tuna.tsinghua.edu.cn/simple  --trusted-host pypi.tuna.tsinghua.edu.cn

Then we set up the project by creating the working directories, linking the model, and preprocessing the data.

In [None]:
!mkdir -p ../../cache
!mkdir -p ../../output

In [None]:
model_dir = "/workspace/llama-7b-hf"
!ln -s {model_dir} ../../

We use a computer systems textbook as the pretraining corpus. The raw text files are placed in `../../data/book`. The cell below applies a simple cleaning pass that strips each line and drops the empty ones.

In [19]:
import os

def clean_txt_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8', errors='ignore') as file:
        data = file.readlines()

    cleaned_lines = []
    for line in data:
        line = line.strip()
        if line:
            cleaned_lines.append(line)

    cleaned_data = '\n'.join(cleaned_lines)

    with open(output_file, 'w', encoding='utf-8', errors='ignore') as file:
        file.write(cleaned_data)

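def clean_txt_files_in_directory(in_directory, out_directory):
    # Create the output directory first; open(..., 'w') does not create it.
    os.makedirs(out_directory, exist_ok=True)
    for filename in os.listdir(in_directory):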
        if filename.endswith(".txt"):
            input_file = os.path.join(in_directory, filename)
            output_file = os.path.join(out_directory, "cleaned_" + filename)
            clean_txt_file(input_file, output_file)

in_directory = '../../data/book'  
out_directory = '../../data/clean_book/'  

clean_txt_files_in_directory(in_directory, out_directory)
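
As a quick sanity check, the line-level rule used above (strip each line, drop the lines that become empty) can be tried on an in-memory string. The helper name `clean_text` and the sample text are illustrative assumptions, not part of the project.

```python
def clean_text(raw: str) -> str:
    """Strip every line and drop lines that are empty after stripping."""
    stripped = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in stripped if line)

# Made-up sample with leading/trailing whitespace and blank lines.
sample = "  Chapter 1  \n\n\tA process is a running program.\n   \n"
print(clean_text(sample))
```

The result keeps one non-empty line per row, which is the form the cleaned `.txt` files take.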

The preparation is now complete. Let us define the key directories used in the following steps:

In [1]:
llama_tokenizer_dir="../../llama-7b-hf"
chinese_sp_model_file="../../Chinese-LLaMA-Alpaca/scripts/chinese_sp.model"
output_dir="../../output"
script_dir="../../Chinese-LLaMA-Alpaca/scripts"

# Step 1: merge tokens
Next, let us merge the Chinese vocabulary into the original LLaMA vocabulary.

In [27]:
!python {script_dir}/merge_tokenizers.py --llama_tokenizer_dir {llama_tokenizer_dir} --chinese_sp_model_file {chinese_sp_model_file}
!mv merged_tokenizer_hf {output_dir}
!mv merged_tokenizer_sp {output_dir}

32000 20000
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
32000
Before:32000
New model pieces: 49953
Chinese-LLaMA tokenizer has been saved to merged_tokenizer_hf
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
Test text:
 白日依山尽，黄河入海流。欲穷千里目，更上一层楼。
The primary use of LLaMA is research on large language models, including
Tokenized by LLaMA tokenizer:['▁', '白', '日', '<0xE4>', '<0xBE>', '<0x9D>', '山', '<0xE5>', '<0xB0>', '<0xBD>', '，', '黄', '河', '入', '海', '流', '。', '<0xE6>', '<0xAC>', '<0xB2>', '<0xE7>', '<0xA9>', '<0xB7>', '千', '里', '目', '，', '更', '上', '一', '<0xE5>', '<0xB1>', '<0x82>', '<0xE6>', '<0xA5>', '<0xBC>', '。', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']
Tokenized by Chinese-LLaMA tokenizer:['▁白', '日', '依', '山', '尽', '，', '黄河', '入', '海', '流', '。', '欲', '穷', '千里', '目', '，', '更', '上
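
The first line of the log reads as 32000 LLaMA pieces and 20000 Chinese pieces, and the merged model reports 49953 pieces, so 32000 + 20000 - 49953 = 2047 Chinese pieces were already present in the LLaMA vocabulary. The sketch below illustrates this append-if-missing idea on toy Python lists. It is not the SentencePiece-based code in `merge_tokenizers.py`; the function `merge_vocab` and the toy vocabularies are hypothetical.

```python
def merge_vocab(base_pieces, extra_pieces):
    """Append the pieces of extra_pieces that base_pieces does not contain yet."""
    seen = set(base_pieces)
    merged = list(base_pieces)
    for piece in extra_pieces:
        if piece not in seen:
            seen.add(piece)
            merged.append(piece)
    return merged

# Toy vocabularies (made up for illustration).
base = ["<unk>", "<s>", "</s>", "▁The", "白", "日"]
extra = ["<unk>", "<s>", "</s>", "白", "黄河", "千里"]
merged = merge_vocab(base, extra)
print(len(base), len(extra), len(merged))  # 6 6 8

# Overlap implied by the numbers in the log above.
overlap = 32000 + 20000 - 49953
print(overlap)  # 2047
```

Because new pieces are only appended, the ids of the original LLaMA tokens stay unchanged in this scheme.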

# Step 2: pretrain
Now let us pretrain the model using the prepared data.

In [1]:
# Copy our modified pretraining script, which adjusts the LoRA save path, over the original one
!cp run_clm_pt_with_peft.py {script_dir}

Before training, we define the arguments as dataclasses. Custom training arguments are added by inheriting from `TrainingArguments`.

In [2]:
from typing import Optional, List, Dict, Any, Mapping
from dataclasses import dataclass, field
from transformers import TrainingArguments, MODEL_FOR_CAUSAL_LM_MAPPING
from transformers.utils.versions import require_version

MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
    """

    model_name_or_path: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch."
            )
        },
    )
    tokenizer_name_or_path: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "The tokenizer for weights initialization. Don't set if you want to train a model from scratch."
            )
        },
    )
    model_type: Optional[str] = field(
        default=None,
        metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
    )
    config_overrides: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "Override some existing default config settings when a model is trained from scratch. Example: "
                "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
            )
        },
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": (
                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
                "with private models)."
            )
        },
    )
    torch_dtype: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "Override the default `torch.dtype` and load the model under this dtype. If `auto` is passed, the "
                "dtype will be automatically derived from the model's weights."
            ),
            "choices": ["auto", "bfloat16", "float16", "float32"],
        },
    )

    def __post_init__(self):
        if self.config_overrides is not None and (self.config_name is not None or self.model_name_or_path is not None):
            raise ValueError(
                "--config_overrides can't be used in combination with --config_name or --model_name_or_path"
            )

@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    dataset_dir: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
    validation_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )
    streaming: bool = field(default=False, metadata={"help": "Enable streaming mode"})
    block_size: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "Optional input sequence length after tokenization. "
                "The training dataset will be truncated in block of this size for training. "
                "Default to the model max input length for single sentence inputs (take into account special tokens)."
            )
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    validation_split_percentage: Optional[float] = field(
        default=0.05,
        metadata={
            "help": "The percentage of the train set used as validation set in case there's no validation split"
        },
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    keep_linebreaks: bool = field(
        default=True, metadata={"help": "Whether to keep line breaks when using TXT files or not."}
    )
    data_cache_dir: Optional[str] = field(default="./", metadata={"help": "The datasets processed stored"})

    def __post_init__(self):
        if self.streaming:
            require_version("datasets>=2.0.0", "The streaming feature requires `datasets>=2.0.0`")

@dataclass
class MyTrainingArguments(TrainingArguments):
    trainable: Optional[str] = field(default="q_proj,v_proj")
    lora_rank: Optional[int] = field(default=8)
    lora_dropout: Optional[float] = field(default=0.1)
    lora_alpha: Optional[float] = field(default=32.)
    modules_to_save: Optional[str] = field(default=None)
    debug_mode: Optional[bool] = field(default=False)
    peft_path: Optional[str] = field(default=None)
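
The inheritance pattern used by `MyTrainingArguments` can be seen in isolation with plain dataclasses. In the sketch below, `BaseArguments` is a hypothetical stand-in for `TrainingArguments`, so the example runs without `transformers` installed.

```python
from dataclasses import dataclass, field, fields
from typing import Optional

@dataclass
class BaseArguments:
    # Stand-in for transformers.TrainingArguments (assumption for illustration).
    output_dir: str = "out"
    learning_rate: float = 5e-5

@dataclass
class LoraArguments(BaseArguments):
    # Fields declared here are appended after the inherited ones.
    lora_rank: Optional[int] = field(default=8)
    lora_alpha: Optional[float] = field(default=32.0)

# Inherited and new fields are set through the same constructor.
args = LoraArguments(learning_rate=2e-4, lora_rank=16)
print([f.name for f in fields(args)])
print(args)
```

Since every inherited field has a default, the subclass can add further defaulted fields without reordering problems.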

We provide a basic set of training arguments below. You can create your own arguments by inheriting from these classes.

In [52]:
model_args = ModelArguments(model_name_or_path="../../llama-7b-hf/", 
                            tokenizer_name_or_path="../../output/merged_tokenizer_hf",
                            torch_dtype="float16")

data_args = DataTrainingArguments(dataset_dir="../../data/clean_book",
                                  data_cache_dir="../../cache",
                                  validation_split_percentage=0.001,
                                  block_size=512,
                                  preprocessing_num_workers=8)

# Keep the DeepSpeed config path in a separate variable: passing it to
# MyTrainingArguments would initialize the DeepSpeed environment in advance.
deepspeed_config_file = script_dir + "/ds_zero2_no_offload.json"
training_args =  MyTrainingArguments(per_device_train_batch_size=1,
                                    per_device_eval_batch_size=1,
                                    do_train=True,
                                    seed=100,
                                    fp16=True,
                                    max_steps=100,
                                    lr_scheduler_type="cosine",
                                    learning_rate=2e-4,
                                    warmup_ratio=0.05,
                                    weight_decay=0.01,
                                    logging_strategy="steps",
                                    logging_steps=10,
                                    logging_first_step=True,
                                    save_strategy="steps",
                                    save_steps=500,
                                    save_total_limit=3,
                                    gradient_accumulation_steps=1,
                                    gradient_checkpointing=True,
                                    ddp_find_unused_parameters=False,
                                    ddp_timeout=30000,
                                    output_dir="../../output/llama-zh",
                                    overwrite_output_dir=True,
                                    lora_rank=8,
                                    lora_alpha=32,
                                    trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj",
                                    modules_to_save="embed_tokens,lm_head",
                                    lora_dropout=0.05)
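
With `lr_scheduler_type="cosine"`, `learning_rate=2e-4`, `warmup_ratio=0.05` and `max_steps=100`, the learning rate is expected to ramp up linearly over the first 5 steps and then decay along a half cosine. The sketch below reproduces that shape in plain Python. The function `lr_at_step` is hypothetical and follows the usual linear-warmup plus cosine-decay rule as an assumption; it does not call the scheduler from `transformers`.

```python
import math

def lr_at_step(step, base_lr=2e-4, max_steps=100, warmup_ratio=0.05):
    """Linear warmup followed by a half-cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Start of warmup, end of warmup (peak), mid-training, and the last step.
for step in (0, 5, 50, 100):
    print(step, lr_at_step(step))
```

The peak value equals `learning_rate`, so changing `warmup_ratio` only moves the position of the peak, not its height.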

Now let us start pretraining using the prepared arguments.

In [10]:
!torchrun --nnodes 1 --nproc_per_node 8 {script_dir}/run_clm_pt_with_peft.py \
    --deepspeed {deepspeed_config_file} \
    --model_name_or_path {model_args.model_name_or_path} \
    --tokenizer_name_or_path {model_args.tokenizer_name_or_path} \
    --dataset_dir {data_args.dataset_dir} \
    --data_cache_dir {data_args.data_cache_dir} \
    --validation_split_percentage {data_args.validation_split_percentage} \
    --per_device_train_batch_size {training_args.per_device_train_batch_size} \
    --per_device_eval_batch_size {training_args.per_device_eval_batch_size} \
    --do_train {training_args.do_train} \
    --seed {training_args.seed} \
    --fp16 {training_args.fp16} \
    --max_steps {training_args.max_steps} \
    --lr_scheduler_type {training_args.lr_scheduler_type} \
    --learning_rate {training_args.learning_rate} \
    --warmup_ratio {training_args.warmup_ratio} \
    --weight_decay {training_args.weight_decay} \
    --logging_strategy {training_args.logging_strategy} \
    --logging_steps {training_args.logging_steps} \
    --save_strategy {training_args.save_strategy} \
    --save_total_limit {training_args.save_total_limit} \
    --save_steps {training_args.save_steps} \
    --gradient_accumulation_steps {training_args.gradient_accumulation_steps} \
    --preprocessing_num_workers {data_args.preprocessing_num_workers} \
    --block_size {data_args.block_size} \
    --output_dir {training_args.output_dir} \
    --overwrite_output_dir {training_args.overwrite_output_dir} \
    --ddp_timeout {training_args.ddp_timeout} \
    --logging_first_step {training_args.logging_first_step} \
    --lora_rank {training_args.lora_rank} \
    --lora_alpha {training_args.lora_alpha} \
    --trainable {training_args.trainable} \
    --modules_to_save {training_args.modules_to_save} \
    --lora_dropout {training_args.lora_dropout} \
    --torch_dtype {model_args.torch_dtype} \
    --gradient_checkpointing {training_args.gradient_checkpointing} \
    --ddp_find_unused_parameters {training_args.ddp_find_unused_parameters}

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************

Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/lib/python3.8/site-packages/bitsandbytes-0.39.0-py3.8.egg/bitsandbytes/libbitsandbytes_cuda114.so
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  war
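
The `torchrun` command above spells out every flag by hand as a `--name value` pair. A small helper can generate such pairs from a dataclass instead. The sketch below uses a hypothetical `DemoArgs` class and a hypothetical `to_cli_flags` function; it has not been tried on the full `TrainingArguments` object, which carries many more fields than the script needs.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class DemoArgs:
    # Hypothetical subset of the arguments defined earlier.
    learning_rate: float = 2e-4
    max_steps: int = 100
    fp16: bool = True
    peft_path: Optional[str] = None

def to_cli_flags(args, skip_none=True):
    """Turn a dataclass instance into a flat ['--name', 'value', ...] list."""
    flags = []
    for name, value in asdict(args).items():
        if value is None and skip_none:
            continue  # the literal string "None" would not parse as an int or a path
        flags += [f"--{name}", str(value)]
    return flags

print(" ".join(to_cli_flags(DemoArgs())))
```

Skipping `None` values matters here: interpolating an unset field into the shell command would pass the text `None` to the script.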

# Step 3: merge pretrained lora
Having trained the LoRA adapter, we need to merge it into the original LLaMA model.

In [None]:
# Copy our modified merge script, which exposes an interface for changing the tokenizer save path, over the original one
!cp merge_llama_with_chinese_lora.py {script_dir}
!python {script_dir}/merge_llama_with_chinese_lora.py --base_model {llama_tokenizer_dir} --tokenizer_path {output_dir}/merged_tokenizer_hf --lora_model {output_dir}/llama-zh/lora --output_type huggingface --output_dir {output_dir}/merge-lora-hf
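
Conceptually, merging folds the low-rank update into each adapted weight matrix: W' = W + (lora_alpha / lora_rank) · B · A. With `lora_rank=8` and `lora_alpha=32` as configured above, the scaling factor is 4. The sketch below applies this standard LoRA formula to a toy matrix in plain Python. It is not the code of `merge_llama_with_chinese_lora.py`; the function names and all values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, rank):
    """Return weight + (lora_alpha / rank) * (lora_b @ lora_a)."""
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x3 weight with a rank-1 adapter (values are made up).
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # shape (rank, in_features)
lora_b = [[0.5], [0.0]]      # shape (out_features, rank)

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=4.0, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, because their effect is stored in the merged weight.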

# Step 4: supervised fine-tuning
Now we can fine-tune the model on instruction data. The arguments are prepared in the same way as in the pretraining step.

In [24]:
# Similarly, copy our modified SFT script, which adjusts the LoRA save path, over the original one
!cp run_clm_sft_with_peft.py {script_dir}

In [53]:
@dataclass
class SftDataTrainingArguments(DataTrainingArguments):
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """
    max_seq_length: Optional[int] = field(default=512)

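# Reuse the pretraining arguments and adjust only what differs for SFT.
# Note: sft_model_args and sft_training_args refer to the same objects as
# model_args and training_args, so the changes below also affect those.
sft_model_args, sft_training_args = model_args, training_args
# Copy every field of data_args into the SFT data-argument class.
sft_data_args = SftDataTrainingArguments(**vars(data_args))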
sft_model_args.model_name_or_path = "../../output/merge-lora-hf/"
sft_model_args.tokenizer_name_or_path = "../../output/merged_tokenizer_hf/"

sft_data_args.dataset_dir = "../../Chinese-LLaMA-Alpaca/data/"
sft_data_args.validation_file = "../../data/alpaca_data.json"
sft_data_args.block_size = None
sft_data_args.max_seq_length = 512

sft_training_args.learning_rate = 1e-4
sft_training_args.output_dir = "../../output/llama-alpaca-zh"
sft_training_args.warmup_ratio = 0.03
sft_training_args.weight_decay = 0
sft_training_args.eval_steps = 250
sft_training_args.evaluation_strategy = "steps"
sft_training_args.do_eval = True
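
Note that in the cell above `sft_model_args` and `sft_training_args` are bound to the same objects as `model_args` and `training_args`, so the pretraining settings are overwritten. If both sets of arguments should stay available, a copy can be made instead. The sketch below shows the difference with a hypothetical `DemoArgs` dataclass and `dataclasses.replace`; applying it to the real argument classes is left as an assumption.

```python
from dataclasses import dataclass, replace

@dataclass
class DemoArgs:
    # Hypothetical stand-in for the argument classes used in this notebook.
    learning_rate: float = 2e-4
    output_dir: str = "../../output/llama-zh"

pretrain_args = DemoArgs()

alias = pretrain_args                  # same object, no copy is made
alias.learning_rate = 1e-4
print(pretrain_args.learning_rate)     # 0.0001, the original changed as well

pretrain_args = DemoArgs()
# replace() builds a new object and overrides only the listed fields.
sft_args = replace(pretrain_args, learning_rate=1e-4,
                   output_dir="../../output/llama-alpaca-zh")
print(pretrain_args.learning_rate, sft_args.learning_rate)
```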

In [56]:
!torchrun --nnodes 1 --nproc_per_node 8 {script_dir}/run_clm_sft_with_peft.py \
    --deepspeed {deepspeed_config_file} \
    --model_name_or_path {sft_model_args.model_name_or_path} \
    --tokenizer_name_or_path {sft_model_args.tokenizer_name_or_path} \
    --dataset_dir {sft_data_args.dataset_dir} \
    --per_device_train_batch_size {sft_training_args.per_device_train_batch_size} \
    --per_device_eval_batch_size {sft_training_args.per_device_eval_batch_size} \
    --do_train {sft_training_args.do_train} \
    --do_eval {sft_training_args.do_eval} \
    --seed {sft_training_args.seed} \
    --fp16 {sft_training_args.fp16} \
    --max_steps {sft_training_args.max_steps} \
    --max_seq_length {sft_data_args.max_seq_length} \
    --lr_scheduler_type {sft_training_args.lr_scheduler_type} \
    --learning_rate {sft_training_args.learning_rate} \
    --warmup_ratio {sft_training_args.warmup_ratio} \
    --weight_decay {sft_training_args.weight_decay} \
    --logging_strategy {sft_training_args.logging_strategy} \
    --logging_steps {sft_training_args.logging_steps} \
    --save_strategy {sft_training_args.save_strategy} \
    --save_total_limit {sft_training_args.save_total_limit} \
    --save_steps {sft_training_args.save_steps} \
    --evaluation_strategy {sft_training_args.evaluation_strategy} \
    --eval_steps {sft_training_args.eval_steps} \
    --gradient_accumulation_steps {sft_training_args.gradient_accumulation_steps} \
    --preprocessing_num_workers {sft_data_args.preprocessing_num_workers} \
    --overwrite_output_dir {sft_training_args.overwrite_output_dir} \
    --ddp_timeout {sft_training_args.ddp_timeout} \
    --logging_first_step {sft_training_args.logging_first_step} \
    --lora_rank {sft_training_args.lora_rank} \
    --lora_alpha {sft_training_args.lora_alpha} \
    --trainable {sft_training_args.trainable} \
    --modules_to_save {sft_training_args.modules_to_save} \
    --lora_dropout {sft_training_args.lora_dropout} \
    --torch_dtype {sft_model_args.torch_dtype} \
    --gradient_checkpointing {sft_training_args.gradient_checkpointing} \
    --ddp_find_unused_parameters {sft_training_args.ddp_find_unused_parameters} \
    --validation_file {sft_data_args.validation_file} \
    --validation_split_percentage {sft_data_args.validation_split_percentage} \
    --output_dir {sft_training_args.output_dir}


# Step 5: merge fine-tuned lora
Having fine-tuned the LoRA adapter, we merge both adapters (from pretraining and from fine-tuning) into the original LLaMA model.

In [57]:
!cp merge_llama_with_chinese_lora.py {script_dir}
!python {script_dir}/merge_llama_with_chinese_lora.py --base_model {llama_tokenizer_dir} --tokenizer_path {output_dir}/llama-zh,{output_dir}/llama-alpaca-zh --lora_model {output_dir}/llama-zh/lora,{output_dir}/llama-alpaca-zh/lora --output_type huggingface --output_dir {output_dir}/merge-alpaca-hf


CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 114
CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes-0.39.0-py3.8.egg/bitsandbytes/libbitsandb

# Step 6: try inference
Now we can test the newly trained model on some simple questions.

In [58]:
!python {script_dir}/inference_hf.py --base_model {output_dir}/merge-alpaca-hf --with_prompt --interactive


CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 114
CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes-0.39.0-py3.8.egg/bitsandbytes/libbitsandbytes_
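
The `--with_prompt` flag asks the script to wrap each question in an instruction template before generation. The sketch below shows what such wrapping looks like, assuming an Alpaca-style template; the exact wording used by `inference_hf.py` may differ, and `build_prompt` is a hypothetical helper.

```python
# Assumed Alpaca-style template; not copied from inference_hf.py.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n\n{instruction}\n\n### Response:\n\n"
)

def build_prompt(instruction: str) -> str:
    """Insert the user's question into the instruction template."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())

print(build_prompt("  什么是进程？ "))
```

The model then continues the text after the response marker, which is why the template ends there.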