##### Copyright 2024 Google LLC.

In [1]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemma - Fine-tune with LLaMA Factory

This notebook demonstrates how to fine-tune Gemma with LLaMA Factory. [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) is a tool designed specifically for fine-tuning LLMs. It wraps Hugging Face's fine-tuning functionality behind a simple interface, which makes fine-tuning Gemma straightforward. This notebook closely follows LLaMA Factory's official [Colab notebook](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing).

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/doggy8088/gemma-cookbook/blob/zh-tw-240628/Gemma/Finetune_with_LLaMA_Factory.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

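## Setup

### Select the Colab runtime
To complete this guide, you need a Colab runtime with enough resources to run the Gemma model. In this case, a T4 GPU is sufficient:

1. In the upper-right corner of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.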
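
After switching the runtime, it can help to confirm that a GPU is actually visible before installing anything. The following is a minimal sketch, assuming the runtime exposes the standard `nvidia-smi` tool; the helper name `list_gpus` is hypothetical and not part of Colab or LLaMA Factory.

```python
import shutil
import subprocess


def list_gpus() -> list:
    """Return the names of NVIDIA GPUs reported by `nvidia-smi`, or [] if none."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA tools on PATH, so this is likely a CPU-only runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print("GPUs found:", gpus if gpus else "none (check the runtime type)")
```

If the list comes back empty, repeat the runtime selection steps before continuing.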

### Set up Gemma on Hugging Face
LLaMA Factory uses Hugging Face under the hood, so you need to:

* Get access to Gemma on [huggingface.co](https://huggingface.co) by accepting the Gemma license on the Hugging Face page of the specific model, i.e., [Gemma 2B](https://huggingface.co/google/gemma-2b).
* Generate a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) and configure it as a Colab secret named `HF_TOKEN`.

In [2]:
import os
from google.colab import userdata
# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
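
Outside Colab, `google.colab` is not importable, so the cell above fails at the import. One possible fallback, sketched here under the assumption that the token is otherwise supplied through an `HF_TOKEN` environment variable, is to try the Colab secret first and then the environment. The helper name `resolve_hf_token` is hypothetical.

```python
import os
from typing import Optional


def resolve_hf_token(secret_name: str = "HF_TOKEN") -> Optional[str]:
    """Look up the token in Colab secrets first, then in the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(secret_name)
    try:
        return userdata.get(secret_name)
    except Exception:
        # Colab raises if the secret is missing or notebook access was not granted.
        return os.environ.get(secret_name)


token = resolve_hf_token()
if token:
    os.environ["HF_TOKEN"] = token
else:
    print("HF_TOKEN not found; gated models such as google/gemma-2b will not download.")
```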

### Install LLaMA Factory

Install LLaMA Factory from source on GitHub.

In [3]:
%cd /content/
%rm -rf LLaMA-Factory
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
%ls
!pip install -e .[torch,bitsandbytes]

/content
Cloning into 'LLaMA-Factory'...
remote: Enumerating objects: 12511, done.
remote: Counting objects: 100% (1292/1292), done.
remote: Compressing objects: 100% (552/552), done.
remote: Total 12511 (delta 883), reused 1026 (delta 723), pack-reused 11219
Receiving objects: 100% (12511/12511), 218.87 MiB | 13.35 MiB/s, done.
Resolving deltas: 100% (9132/9132), done.
/content/LLaMA-Factory
assets/       docker-compose.yml  examples/  pyproject.toml  requirements.txt  src/
CITATION.cff  Dockerfile          LICENSE    README.md       scripts/          tests/
data/         evaluation/         Makefile   README_zh.md    setup.py
Obtaining file:///content/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ..
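
Because the later cells shell out to `llamafactory-cli`, a quick check that the editable install put the command on `PATH` can save a confusing failure later. This is only a sketch using the standard library; it does not verify the installed version.

```python
import shutil

# Locate the console script that the editable install is expected to create.
cli_path = shutil.which("llamafactory-cli")
if cli_path is None:
    print("llamafactory-cli not found on PATH; re-run the pip install cell.")
else:
    print(f"llamafactory-cli found at {cli_path}")
```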

## Finetune Gemma

Start fine-tuning Gemma 2B with this [demo Alpaca dataset](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/alpaca_en_demo.json). If you want to use your own dataset, follow this [guide from LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory/tree/main/data).
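
For a custom dataset, the general idea is to store Alpaca-style records in a JSON file and register that file in `dataset_info.json`. The sketch below assumes the `instruction` / `input` / `output` fields and the `columns` mapping described in the LLaMA Factory data guide at the time of writing; check the guide for the current schema. The dataset name `my_dataset` and the helper `register_dataset` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def register_dataset(data_dir, name, records):
    """Write `records` as <name>.json and add an entry to dataset_info.json."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    file_name = f"{name}.json"
    with open(data_dir / file_name, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # Keep any datasets that are already registered.
    info_path = data_dir / "dataset_info.json"
    info = {}
    if info_path.exists():
        with open(info_path, encoding="utf-8") as f:
            info = json.load(f)

    # Column mapping assumed from the LLaMA Factory data guide.
    info[name] = {
        "file_name": file_name,
        "columns": {"prompt": "instruction", "query": "input", "response": "output"},
    }
    with open(info_path, "w", encoding="utf-8") as f:
        json.dump(info, f, ensure_ascii=False, indent=2)
    return info_path


records = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Name the capital of Japan.", "input": "", "output": "Tokyo"},
]

# Demo in a temporary directory; in the notebook, pass /content/LLaMA-Factory/data instead.
with tempfile.TemporaryDirectory() as tmp:
    info_path = register_dataset(tmp, "my_dataset", records)
    registered = json.loads(info_path.read_text(encoding="utf-8"))
    print(sorted(registered))
```

Once registered in the real `data` directory, the dataset would be selected with `dataset="my_dataset"` in the training arguments.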

In [4]:
import json

args = dict(
    stage="sft",  # do supervised fine-tuning
    do_train=True,
    model_name_or_path="google/gemma-2b",  # base Gemma 2B model; quantized to 4-bit at load time (see quantization_bit)
    dataset="alpaca_en_demo",  # use the demo alpaca datasets
    template="gemma",  # use Gemma prompt template
    finetuning_type="lora",  # use LoRA adapters to save memory
    lora_target="all",  # attach LoRA adapters to all linear layers
    output_dir="gemma_lora",  # the path to save LoRA adapters
    per_device_train_batch_size=2,  # the batch size
    gradient_accumulation_steps=4,  # the gradient accumulation steps
    lr_scheduler_type="cosine",  # use cosine learning rate scheduler
    logging_steps=10,  # log every 10 steps
    warmup_ratio=0.1,  # use warmup scheduler
    save_steps=1000,  # save checkpoint every 1000 steps
    learning_rate=5e-5,  # the learning rate
    num_train_epochs=3.0,  # the epochs of training
    max_samples=500,  # use 500 examples in each dataset
    max_grad_norm=1.0,  # clip gradient norm to 1.0
    quantization_bit=4,  # use 4-bit QLoRA
    loraplus_lr_ratio=16.0,  # use LoRA+ algorithm with lambda=16.0
    fp16=True,  # use float16 mixed precision training
)

# Use a context manager so the file is flushed and closed before the CLI reads it.
with open("train_gemma.json", "w", encoding="utf-8") as f:
    json.dump(args, f, indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli train train_gemma.json

/content/LLaMA-Factory
2024-06-02 01:48:45.000610: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-02 01:48:45.000673: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-02 01:48:45.109651: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-02 01:48:45.314212: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
06/02/2024 01:48:55 - INFO - l
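
As a rough check on the training arguments used above, the number of optimizer steps can be estimated from the batch settings. This sketch assumes a single GPU and that all 500 samples are used for training. The exact count depends on how the trainer rounds partial batches (rounding down gives about 186 steps instead), so treat the result as an estimate.

```python
import math

# Values copied from the training arguments in this notebook.
max_samples = 500
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 3.0
save_steps = 1000
num_devices = 1  # assumption: one T4 GPU

# Samples consumed per optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(max_samples / effective_batch)
total_steps = int(steps_per_epoch * num_train_epochs)

print("effective batch size:", effective_batch)
print("estimated optimizer steps:", total_steps)
print("intermediate checkpoints from save_steps:", total_steps // save_steps)
```

With these settings the estimated step count stays far below `save_steps=1000`, so no intermediate checkpoint is expected; the adapter in `gemma_lora` would come from the save at the end of training.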

## Run inference in a chat setting

In [5]:
%cd /content/LLaMA-Factory/src/

from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

args = dict(
    model_name_or_path="google/gemma-2b",  # use Gemma 2B model
    adapter_name_or_path="gemma_lora",  # load the saved LoRA adapters
    template="gemma",  # same as the one used in training
    finetuning_type="lora",  # same as the one used in training
    quantization_bit=4,  # load 4-bit quantized model
)
chat_model = ChatModel(args)

messages = []
print(
    "Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application."
)
while True:
    query = input("\nUser: ")
    if query.strip() == "exit":
        break
    if query.strip() == "clear":
        messages = []
        torch_gc()
        print("History has been removed.")
        continue

    messages.append({"role": "user", "content": query})
    print("Assistant: ", end="", flush=True)

    response = ""
    for new_text in chat_model.stream_chat(messages):
        print(new_text, end="", flush=True)
        response += new_text
    print()
    messages.append({"role": "assistant", "content": response})

torch_gc()

/content/LLaMA-Factory/src
/content/LLaMA-Factory


[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,909 >> loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/tokenizer.model
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,910 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,912 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,914 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,916 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2

06/02/2024 01:59:04 - INFO - llamafactory.model.utils.quantization - Quantizing model to 4 bit.


INFO:llamafactory.model.utils.quantization:Quantizing model to 4 bit.


06/02/2024 01:59:04 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.


INFO:llamafactory.model.patcher:Using KV cache for faster generation.
[INFO|modeling_utils.py:3474] 2024-06-02 01:59:04,316 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/model.safetensors.index.json
[INFO|modeling_utils.py:1519] 2024-06-02 01:59:04,322 >> Instantiating GemmaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:962] 2024-06-02 01:59:04,324 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}

Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INFO|modeling_utils.py:4280] 2024-06-02 01:59:10,951 >> All model checkpoint weights were used when initializing GemmaForCausalLM.

[INFO|modeling_utils.py:4288] 2024-06-02 01:59:10,956 >> All the weights of GemmaForCausalLM were initialized from the model checkpoint at google/gemma-2b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GemmaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:917] 2024-06-02 01:59:10,993 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/generation_config.json
[INFO|configuration_utils.py:962] 2024-06-02 01:59:10,995 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}



06/02/2024 01:59:11 - INFO - llamafactory.model.utils.attention - Using torch SDPA for faster training and inference.


INFO:llamafactory.model.utils.attention:Using torch SDPA for faster training and inference.


06/02/2024 01:59:11 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.


INFO:llamafactory.model.adapter:Upcasting trainable params to float32.


06/02/2024 01:59:11 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA


INFO:llamafactory.model.adapter:Fine-tuning method: LoRA


06/02/2024 01:59:11 - INFO - llamafactory.model.adapter - Loaded adapter(s): gemma_lora


INFO:llamafactory.model.adapter:Loaded adapter(s): gemma_lora


06/02/2024 01:59:11 - INFO - llamafactory.model.loader - all params: 2515978240


INFO:llamafactory.model.loader:all params: 2515978240


Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.

User: where is Chicago?
Assistant: Chicago is located in the U.S. state of Illinois, and is the third most populous city in the United States.

User: exit
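
The loop above blocks on `input()`, which is awkward for scripted checks. The same history handling can be factored into a function that takes a list of queries. This is a sketch only: `run_turns` is a hypothetical helper, and `fake_stream` is a stand-in for `chat_model.stream_chat` so the example runs without a model.

```python
from typing import Callable, Dict, Iterable, List

Message = Dict[str, str]


def run_turns(
    queries: List[str],
    stream_fn: Callable[[List[Message]], Iterable[str]],
) -> List[Message]:
    """Feed queries one by one, keeping the full user/assistant history."""
    messages: List[Message] = []
    for query in queries:
        messages.append({"role": "user", "content": query})
        # Concatenate the streamed chunks into the full reply.
        response = "".join(stream_fn(messages))
        messages.append({"role": "assistant", "content": response})
    return messages


def fake_stream(messages: List[Message]) -> Iterable[str]:
    """Hypothetical stand-in for chat_model.stream_chat: echoes the last user message."""
    yield "You said: "
    yield messages[-1]["content"]


history = run_turns(["where is Chicago?", "and Boston?"], fake_stream)
for m in history:
    print(f'{m["role"]}: {m["content"]}')
```

In the notebook, passing `chat_model.stream_chat` in place of `fake_stream` should reproduce the interactive behavior turn by turn.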


## Merge the LoRA adapter and upload the fine-tuned model to Hugging Face
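
Merging folds the low-rank update into the base weights so that the adapter is no longer needed at inference time. The toy example below illustrates the standard LoRA formulation, W_merged = W + (alpha / r) * B @ A, with tiny hand-picked matrices. It is not LLaMA Factory's export code, and all values and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wv + scale * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy sizes: a 2x3 base weight and rank-1 adapter factors (arbitrary values).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # d_out x d_in
B = [[0.5], [-1.0]]                      # d_out x r
A = [[2.0, 0.0, 1.0]]                    # r x d_in
alpha, r = 2.0, 1
scale = alpha / r

# Fold the adapter into the base weight once.
W_merged = add_scaled(W, matmul(B, A), scale)

# Compare against applying the base weight and the adapter separately.
x = [[1.0], [2.0], [3.0]]  # one input column vector
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
y_merged = matmul(W_merged, x)
print("outputs match:", y_merged == y_adapter)
```

The merged weight has the same shape as the base weight, which is why the exported model can be loaded without the adapter files.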

In [6]:
import json

args = dict(
    model_name_or_path="google/gemma-2b",  # use official non-quantized Gemma 2B model
    adapter_name_or_path="gemma_lora",  # load the saved LoRA adapters
    template="gemma",  # same as the one used in training
    finetuning_type="lora",  # same as the one used in training
    export_dir="gemma_lora_merged",  # path to save the merged model
    export_size=2,  # the file shard size (in GB) of the merged model
    export_device="cpu",  # the device used in export, can be chosen from `cpu` and `cuda`
    export_hub_model_id="gemma-2b-finetuned-model-llama-factory",  # your Hugging Face hub model ID
)

# Use a context manager so the file is flushed and closed before the CLI reads it.
with open("merge_gemma.json", "w", encoding="utf-8") as f:
    json.dump(args, f, indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli export merge_gemma.json

/content/LLaMA-Factory
2024-06-02 01:59:36.861478: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-02 01:59:36.861538: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-02 01:59:36.862938: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:47,841 >> loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/tokenizer.model
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:47,842 >> loading file tokenizer.json from cache at /roo