# Converting and Quantizing Chinese LLaMA-2 and Alpaca-2 Models

Project page: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2

⚠️ Memory requirements (make sure the machine you are assigned has more RAM than listed below):
- 7B model: 15G+
- 13B model: 18G+
- 33B model: 22G+
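
A quick way to check whether the assigned runtime meets these thresholds is sketched below. The helper names (`total_ram_gib`, `enough_ram`) and the `REQUIRED_GIB` table are illustrative and not part of the project; the thresholds are copied from the list above, and the RAM query assumes a Linux runtime such as Colab.

```python
import os

# Minimum RAM in GiB per model size, copied from the list above.
REQUIRED_GIB = {"7B": 15, "13B": 18, "33B": 22}


def total_ram_gib() -> float:
    """Return total physical RAM in GiB (POSIX sysconf, e.g. on Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1024**3


def enough_ram(model_size: str, available_gib: float) -> bool:
    """True if available_gib meets the threshold for model_size."""
    return available_gib >= REQUIRED_GIB[model_size]


ram = total_ram_gib()
for size, need in REQUIRED_GIB.items():
    status = "OK" if enough_ram(size, ram) else "too little RAM"
    print(f"{size}: need {need}G+, have {ram:.1f}G -> {status}")
```

Running this right after connecting makes it easy to decide whether to reconnect and try for a larger machine before starting the merge.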

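💡 Tips and tricks:
- Free-tier users get only about 12G of RAM by default, which is not enough to convert a model. **In our tests, selecting a TPU runtime sometimes yields a machine with 35G of RAM**, so it is worth retrying a few times.
- Pro(+) users: choose "Runtime" -> "Change runtime type" -> "High-RAM".
- If the program crashes for no apparent reason or the session disconnects, the runtime has run out of memory.
- If RAM is still insufficient after selecting "High-RAM", try the following; sometimes you are assigned a machine with much more memory. Good luck 😄!
    - Also select a GPU or TPU (even though it will not be used).
    - When selecting a GPU, Pro(+) users can pick the "A100" GPU type.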

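*Reminder: disconnect the runtime when you are done, and pick the lowest configuration that meets the requirements, to avoid spending compute units unnecessarily (Pro includes only 100 compute units).*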

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Install dependencies

In [None]:
!pip install git+https://github.com/huggingface/peft.git@13e53fc
!pip install transformers==4.31.0
!pip install sentencepiece==0.1.97
!pip install bitsandbytes==0.39.1

Collecting git+https://github.com/huggingface/peft.git@13e53fc
  Cloning https://github.com/huggingface/peft.git (to revision 13e53fc) to /tmp/pip-req-build-hspk3le2
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-hspk3le2
  Running command git checkout -q 13e53fc
  Resolved https://github.com/huggingface/peft.git to commit 13e53fc
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers (from peft==0.3.0.dev0)
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
Collecting accelerate (from peft==0.3.0.dev0)
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)

## Clone the repository and code

In [None]:
!git clone https://github.com/ggerganov/llama.cpp

Cloning into 'llama.cpp'...
remote: Enumerating objects: 9225, done.
remote: Counting objects: 100% (3620/3620), done.
remote: Compressing objects: 100% (456/456), done.
remote: Total 9225 (delta 3375), reused 3294 (delta 3164), pack-reused 5605
Receiving objects: 100% (9225/9225), 8.41 MiB | 25.18 MiB/s, done.
Resolving deltas: 100% (6362/6362), done.


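## Merge the model (using LLaMA-2-7B as an example)

Merge the LoRA weights into the base model to produce full model weights. Each model can be given either as a 🤗 Model Hub ID or as a local path.
- Base model: `meta-llama/Llama-2-7b-hf` (requires official authorization)
    - This tutorial uses a drop-in substitute with an identical SHA256 for demonstration: `daryl149/llama-2-7b-hf`
- LoRA model: `ziqingyang/chinese-llama-2-lora-7b`
- Output format: `pth` or `huggingface`; `huggingface` is used here

The merged model is stored in the `llama-2-7b-combined` directory (the command below passes a Google Drive path as `--output_dir` instead).
If you do not need a quantized model, you can stop here and download the result or copy it to Google Drive.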
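
Conceptually, merging a LoRA adapter folds its low-rank update back into each base weight matrix: W' = W + (alpha / r) * B A. The sketch below illustrates only that arithmetic, in pure Python on a toy matrix. It is not the code of `merge_llama2_with_chinese_lora_low_mem.py`; the function names and example values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    Shapes: weight is d_out x d_in, lora_b is d_out x rank,
    lora_a is rank x d_in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # rank x d_in
B = [[0.5], [1.0]]    # d_out x rank
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the adapter is no longer needed at inference time, which is why the output can be treated as an ordinary full-weight checkpoint in the later steps.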

In [None]:
!python /content/drive/MyDrive/Llama2_ch/scripts/merge_llama2_with_chinese_lora_low_mem.py \
    --base_model /content/drive/MyDrive/Llama2_ch/models \
    --lora_model /content/drive/MyDrive/Llama2_ch/output/sft_lora_model \
    --output_type huggingface \
    --output_dir /content/drive/MyDrive/Llama2_ch/combine_models


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so...
Base model: /content/drive/MyDrive/Llama2_ch/models
LoRA model: /content/drive/MyDrive/Llama2_ch/output/sft_lora_model
Loading /content/drive/MyDrive/Llama2_ch/output/sft_lora_model
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull r

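## Check and edit config.json (skip this if you use the original Meta weights)

- The Llama-2 config was updated once after release. This tutorial uses third-party weights whose `config.json` was not updated accordingly.

- Open `config.json` in the `llama-2-7b-combined` folder (you can double-click it to open it) and change the `max_position_embeddings` field from `2048` to `4096`. Save with cmd/ctrl+s.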
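
The same edit can be scripted instead of done by hand. The following is a minimal sketch using only the standard library; the helper name `set_max_position_embeddings` is hypothetical, and the path in the usage comment must be adjusted to wherever the merged model was actually written.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, value=4096):
    """Rewrite config.json so that max_position_embeddings equals value.

    Returns the previous value (None if the field was absent).
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")
    config["max_position_embeddings"] = value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Example (adjust the path to your merged model directory):
# set_max_position_embeddings("llama-2-7b-combined/config.json")
```

Returning the previous value makes it easy to confirm that the file really contained `2048` before the change.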

## Quantize the model
Next, we use [llama.cpp](https://github.com/ggerganov/llama.cpp) to convert the full weights produced in the previous step into a quantized model.

### Build the tools

First, compile llama.cpp.

In [None]:
!cd llama.cpp && make

I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.            -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:  
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

cc  -I.            -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS   -c ggml.c -o ggml.o
g++ -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qu

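### Convert the model to GGUF format (FP16)

In this step we convert the model to FP16. With the llama.cpp build used here, `convert.py` writes a `.gguf` file (GGUF, the successor to the GGML format).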
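
Once the conversion has run, a quick sanity check is to read the first bytes of the output file: a GGUF file begins with the ASCII magic `GGUF` followed by a little-endian 32-bit format version. The helper below is a hypothetical sketch, not part of llama.cpp; the path in the usage comment is the FP16 file name that appears in the quantization command later in this notebook.

```python
import struct


def read_gguf_header(path):
    """Return (is_gguf, version) based on the first 8 bytes of a file.

    A GGUF file starts with the magic b"GGUF" followed by a
    little-endian uint32 format version.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return False, None
    (version,) = struct.unpack("<I", header[4:8])
    return True, version


# Example:
# read_gguf_header("/content/drive/MyDrive/Llama2_ch/combine_models/ggml-model-f16.gguf")
```

A `False` result usually points to a wrong path or an interrupted conversion rather than a problem with the model itself.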

In [None]:
!pip install gguf

Collecting gguf
  Downloading gguf-0.3.0-py3-none-any.whl (9.7 kB)
Installing collected packages: gguf
Successfully installed gguf-0.3.0


In [None]:
!cd llama.cpp && python convert.py /content/drive/MyDrive/Llama2_ch/combine_models

Loading model file /content/drive/MyDrive/Llama2_ch/combine_models/pytorch_model-00001-of-00003.bin
Loading model file /content/drive/MyDrive/Llama2_ch/combine_models/pytorch_model-00001-of-00003.bin
Loading model file /content/drive/MyDrive/Llama2_ch/combine_models/pytorch_model-00002-of-00003.bin
Loading model file /content/drive/MyDrive/Llama2_ch/combine_models/pytorch_model-00003-of-00003.bin
params = Params(n_vocab=55296, n_embd=5120, n_mult=6912, n_layer=40, n_ctx=4096, n_ff=13824, n_head=40, n_head_kv=40, f_norm_eps=1e-05, f_rope_freq_base=None, f_rope_scale=None, ftype=None, path_model=PosixPath('/content/drive/MyDrive/Llama2_ch/combine_models'))
Loading vocab file '/content/drive/MyDrive/Llama2_ch/combine_models/tokenizer.model', type 'spm'
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
Permuting layer 4
Permuting layer 5
Permuting layer 6
Permuting layer 7
Permuting layer 8
Permuting layer 9
Permuting layer 10
Permuting layer 11
Permuting layer 12
Per

### Quantize the FP16 model

We then quantize the FP16 model. Here we use the newer Q6_K method, whose quality is very close to FP16.
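
To get a feel for the size reduction, the back-of-the-envelope sketch below compares FP16 with Q6_K. It relies on the Q6_K block layout in llama.cpp (256 weights packed into 210 bytes, i.e. 6.5625 bits per weight) and assumes every tensor is quantized. In practice some small tensors stay at higher precision, so treat the result as a rough estimate. The parameter count is a hypothetical round number, not a measured value.

```python
def estimated_size_gib(n_params, bits_per_weight):
    """Approximate file size in GiB for n_params weights."""
    return n_params * bits_per_weight / 8 / 1024**3


# Q6_K packs 256 weights into one 210-byte block.
Q6_K_BITS = 210 * 8 / 256   # 6.5625 bits per weight
FP16_BITS = 16

n_params = 13_000_000_000   # hypothetical round number, not measured

print(f"FP16: {estimated_size_gib(n_params, FP16_BITS):.1f} GiB")
print(f"Q6_K: {estimated_size_gib(n_params, Q6_K_BITS):.1f} GiB")
print(f"ratio: {Q6_K_BITS / FP16_BITS:.2f}")
```

The estimate is also useful for checking that the Google Drive folder has enough free space before running the quantization cell.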

In [None]:
!cd llama.cpp && ./quantize /content/drive/MyDrive/Llama2_ch/combine_models/ggml-model-f16.gguf /content/drive/MyDrive/Llama2_ch/combine_models/ggml-model-q6_K.bin q6_K

main: build = 1130 (8afe228)
main: quantizing '/content/drive/MyDrive/Llama2_ch/combine_models/ggml-model-f16.gguf' to '/content/drive/MyDrive/Llama2_ch/combine_models/ggml-model-q6_K.bin' as Q6_K
llama_model_loader: loaded meta data with 17 key-value pairs and 363 tensors from /content/drive/MyDrive/Llama2_ch/combine_models/ggml-model-f16.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  5120, 55296,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight f16      [  5120, 

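### (Optional) Test decoding with the quantized model
All conversion steps are now complete.
Let's run a single command to check that the model loads and generates output correctly.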
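
If you prefer to launch the test from Python rather than a `!` shell line, for example to avoid shell-quoting issues with prompts that contain brackets or quotes, the command can be built as an argument list. This is a small sketch: `build_main_command` is a hypothetical helper, it only uses the `./main` flags that appear in the cell below (`-m`, `-p`, `-n`, `--color`), and the prompt is a placeholder.

```python
import shlex


def build_main_command(model_path, prompt, n_predict=256, color=True):
    """Build the argument list for llama.cpp's ./main binary."""
    cmd = ["./main", "-m", str(model_path), "-p", prompt, "-n", str(n_predict)]
    if color:
        cmd.append("--color")
    return cmd


cmd = build_main_command(
    "/content/drive/MyDrive/Llama2_ch/combine_models/ggml-model-q6_K.bin",
    "Hello",  # placeholder prompt
)
print(shlex.join(cmd))

# To actually run it from a notebook cell:
# import subprocess
# subprocess.run(cmd, cwd="llama.cpp", check=True)
```

Passing a list to `subprocess.run` hands the prompt to the binary as a single argument, so no manual escaping is needed.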

In [None]:
!cd llama.cpp && ./main -m /content/drive/MyDrive/Llama2_ch/combine_models/ggml-model-q6_K.bin --color -p "[]中的文字是否有旅行意圖：[星期天想去西門町走走，有什麼地方好去?]" -n 256

Log start
main: build = 1130 (8afe228)
main: seed  = 1693453497
llama_model_loader: loaded meta data with 18 key-value pairs and 363 tensors from /content/drive/MyDrive/Llama2_ch/combine_models/ggml-model-q6_K.bin (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q6_K     [  5120, 55296,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
lla