##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemma Basics (Hugging Face)
This notebook demonstrates how to load, fine-tune, and deploy Gemma models with Hugging Face.
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/doggy8088/gemma-cookbook/blob/zh-tw-240628/Gemma/Gemma_Basics_with_HF.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

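## Setup

### Select the Colab runtime
To complete this guide, you need a Colab runtime with enough resources to run the Gemma model. A T4 GPU is sufficient here:

1. In the upper-right corner of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.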

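### Gemma setup

**Before diving into the guide, set up access to Gemma:**

1. **Hugging Face account:** If you don't already have one, [create a free Hugging Face account](https://huggingface.co/join).
2. **Gemma model access:** Go to the [Gemma model page](https://huggingface.co/google/gemma-2b) and accept the terms of use.
3. **Colab with Gemma power:** For this guide, you need a Colab runtime with enough resources to handle the Gemma 2B model. Choose an appropriate runtime when you start your Colab session.
4. **Hugging Face token:** [Generate a Hugging Face access token](https://huggingface.co/settings/tokens), preferably with `write` permission, which is needed to push the fine-tuned model to the Hub later in this guide.

**Once you've completed these steps, move on to the next section, where you'll set up environment variables in your Colab environment.**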

### Configure your HF token

Add your Hugging Face token to the Colab Secrets manager to store it securely.

1. Open your Google Colab notebook and click the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. Create a new secret named `HF_TOKEN`.
3. Copy/paste your token into the Value input box of `HF_TOKEN`.
4. Toggle the button on the left to allow the notebook to access the secret.

In [None]:
import os
from google.colab import userdata
# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
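The cell above relies on `google.colab.userdata`, which only exists inside Colab. As a minimal sketch for running the notebook elsewhere, the helper below tries Colab Secrets first and falls back to an environment variable. The name `resolve_hf_token` is hypothetical and not part of any library.

```python
import os


def resolve_hf_token(name: str = "HF_TOKEN") -> str:
    """Return the token from Colab Secrets when available, else from the environment."""
    token = None
    try:
        # `google.colab` is only importable inside a Colab runtime.
        from google.colab import userdata

        token = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing / not shared with the notebook.
        token = os.environ.get(name)
    if not token:
        raise RuntimeError(
            f"No Hugging Face token found; set the {name} environment variable."
        )
    return token
```

With this helper, the assignment in the cell above could be written as `os.environ["HF_TOKEN"] = resolve_hf_token()`.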

### Install dependencies
Run the cell below to install all required dependencies.

In [None]:
!pip install --upgrade -q transformers huggingface_hub peft \
  accelerate bitsandbytes datasets trl
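After installation, it can help to confirm which versions were actually picked up, since the APIs of these libraries change between releases. The following is a small sketch using only the standard library; the package list is copied from the pip command above, and `installed_versions` is an illustrative helper name.

```python
from importlib import metadata

# Package names taken from the pip command above.
REQUIRED = ["transformers", "huggingface_hub", "peft", "accelerate",
            "bitsandbytes", "datasets", "trl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:16s} {version or 'NOT INSTALLED'}")
```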


### Log in to Hugging Face Hub

In [None]:
from huggingface_hub import login

login(os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Get ready to explore what you can do with Gemma!

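## Instantiate the Gemma 2B model

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well suited to a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources, such as a laptop, a desktop, or your own cloud infrastructure, which democratizes access to state-of-the-art AI models and helps foster innovation for everyone.

Let's start by loading the model from the Hugging Face Hub.

### Load the model from the HF Hub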

In [None]:
model_id = "google/gemma-1.1-2b-it"
device = "cuda"

In [None]:
# Let's load the tokenizer first
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Let's quantize the model to 4-bit (NF4) to reduce its memory footprint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Let's load the final model
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"": 0}
)

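To see why 4-bit quantization matters on a T4, a rough back-of-the-envelope estimate of the memory taken by the weights alone is sketched below. The parameter count of about 2.5 billion is an assumption (it is consistent with the roughly 5 GB of 16-bit checkpoint shards downloaded in the recorded run of this notebook). Actual usage is higher, because quantization constants, any layers left unquantized, activations, and the KV cache are not counted.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 2.5 billion parameters for the 2B checkpoint.
N_PARAMS = 2.5e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: ~{weight_memory_gb(N_PARAMS, bits):.2f} GB")
```

Under this assumption the weights shrink from about 5 GB in bfloat16 to about 1.25 GB in 4-bit, which leaves more of the GPU memory available for fine-tuning.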

### Try it out

In [None]:
prompt = "My favourite color is"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=20)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

My favourite color is blue. It represents calmness, trust, and serenity. It brings me a sense of peace and tranquility


In [None]:
prompt = "What can you use an LLM for? Answer:"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=512)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

What can you use an LLM for? Answer:

**An LLM (Large Language Model) can be used for a wide range of tasks, including:**

* **Information retrieval:** Providing summaries, answering questions, and providing factual information.
* **Content creation:** Generating creative text formats, writing different kinds of content, and translating languages.
* **Summarization:** Extracting key points from large amounts of text.
* **Code generation:** Assisting developers in writing code and debugging errors.
* **Customer service:** Providing personalized and contextual support to users.
* **Education:** Providing personalized learning experiences and generating educational materials.
* **Marketing:** Creating targeted marketing campaigns and analyzing customer data.
* **Translation:** Translating documents and websites between multiple languages.
* **Creative writing:** Generating original and imaginative content.


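## Fine-tune the model with LoRA

This section of the guide focuses on training your large language model (LLM) to generate famous quotes. It walks through fine-tuning the model so that its output resembles quotes from well-known writers, philosophers, and leaders.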

In [None]:
# Let's try it out before the fine-tuning
text = "Quote: Imagination is more"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
tokenizer.decode(outputs[0], skip_special_tokens=True)

'Quote: Imagination is more than just a spark of genius; it is the fertile ground from which great art, science, and'

In [None]:
# Loading and processing the dataset
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
print("Example item:", data["train"][0])


Example item: {'quote': '“Be yourself; everyone else is already taken.”', 'author': 'Oscar Wilde', 'tags': ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']}


In [None]:
# Let's tokenize the quotes
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)


In [None]:
from peft import LoraConfig

# Define tuning parameters
lora_config = LoraConfig(
    r=8,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "o_proj",
        "k_proj",
        "v_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)
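To get a feel for how small the LoRA adapters are, the sketch below counts the trainable parameters implied by `r=8` and the seven target modules above. For a linear layer mapping `d_in` to `d_out`, LoRA adds two low-rank matrices with `r * (d_in + d_out)` parameters in total. The layer dimensions used here are assumptions about the Gemma 2B architecture and should be checked against `model.config`.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_in -> d_out linear layer."""
    return r * (d_in + d_out)


# Assumed Gemma 2B dimensions (check model.config before relying on them).
HIDDEN, INTERMEDIATE, N_LAYERS = 2048, 16384, 18
Q_OUT = 8 * 256   # 8 query heads x head_dim 256
KV_OUT = 1 * 256  # 1 key/value head x head_dim 256 (multi-query attention)

# (input dim, output dim) for each module listed in target_modules.
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, r=8) for d_in, d_out in shapes.values())
total = per_layer * N_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the adapters add roughly 9.8 million trainable parameters, a small fraction of the base model's size.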

In [None]:
def formatting_func(example):
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}<eos>"
    return [text]
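To make the training text format concrete, here is a minimal standalone sketch of the same template applied to the example item printed earlier. `format_example` and `format_batch` are illustrative names, not part of `trl`. The function above indexes `[0]`, so it assumes a batch (a dict of lists) and emits only the first row; the batched variant below emits one string per row instead. Whether the trainer passes a batch or a single example depends on the `trl` version, so treat this as an assumption to verify.

```python
def format_example(quote: str, author: str, eos: str = "<eos>") -> str:
    """Build the training text for a single quote."""
    return f"Quote: {quote}\nAuthor: {author}{eos}"


def format_batch(batch: dict) -> list:
    """Hypothetical batched variant: one training string per row in the batch."""
    return [format_example(q, a) for q, a in zip(batch["quote"], batch["author"])]


batch = {
    "quote": ["“Be yourself; everyone else is already taken.”"],
    "author": ["Oscar Wilde"],
}
print(format_batch(batch)[0])
```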

In [None]:
import transformers
from trl import SFTTrainer

# Create a trainer object that takes care of the training process
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

max_steps is given, it will override any value given in num_train_epochs


In [None]:
# Let's run the fine-tuning
trainer.train()

Step,Training Loss
1,8.4293
2,6.5725
3,7.1766
4,7.0208
5,8.272
6,7.6854
7,5.6567
8,5.4682
9,6.2324
10,7.0835


TrainOutput(global_step=10, training_loss=6.95973539352417, metrics={'train_runtime': 12.9991, 'train_samples_per_second': 3.077, 'train_steps_per_second': 0.769, 'total_flos': 16634596884480.0, 'train_loss': 6.95973539352417, 'epoch': 0.01594896331738437})
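The `epoch` value in the output above follows directly from the training arguments. As a short worked example, assuming a single GPU and the 2,508 rows that the train split of `Abirate/english_quotes` contained in this run:

```python
# Values from the TrainingArguments above; a single GPU is assumed.
per_device_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
num_devices = 1      # assumption: one T4
dataset_size = 2508  # rows in the train split in this run

effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch * max_steps
epoch_fraction = samples_seen / dataset_size

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen}")
print(f"fraction of an epoch: {epoch_fraction:.5f}")
```

Ten steps cover only 40 samples, about 1.6% of one epoch, so this run is a quick demonstration of the workflow rather than a full fine-tune. Raising `max_steps` would be the natural next step for better results.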

In [None]:
# Test the model after fine-tuning
text = "Quote: Imagination is"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Imagination is the faculty of the mind to create new things." - Albert Einstein

**Answer:**

Imagination


## Push the model to your Hugging Face Hub

Hugging Face lets you easily store your trained model on its Hub.

In [None]:
# Note: The token needs to have "write" permission.
#       You can check it here:
#       https://huggingface.co/settings/tokens
model.push_to_hub("my-gemma-2-finetuned-model")


CommitInfo(commit_url='https://huggingface.co/f33ac/my-gemma-2-finetuned-model/commit/c837075477e241519df9aaf42e6a032b1d2e6df7', commit_message='Upload GemmaForCausalLM', commit_description='', oid='c837075477e241519df9aaf42e6a032b1d2e6df7', pr_url=None, pr_revision=None, pr_num=None)

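## Serve your model with Text Generation Inference (TGI)

Text Generation Inference (TGI) is a toolkit that simplifies deploying and serving large language models (LLMs) such as Gemma. It is optimized for text generation, so models run faster and return results sooner. TGI achieves this through techniques such as tensor parallelism, which spreads the workload across multiple GPUs, and through code optimized specifically for text generation. TGI also offers features that make it suitable for production, such as distributed tracing for monitoring model performance, Prometheus metrics for detailed data collection, and safeguards such as watermarking for model outputs. See the [official documentation](https://huggingface.co/docs/text-generation-inference/en/index) to learn more about TGI.

To deploy your model with TGI, you can:

1. **Deploy locally (requires Docker):** Uncomment the `docker run` command in the code cell below to run the model on your local machine. This approach requires Docker and an attached GPU.

2. **Deploy on Google Cloud Platform with GKE:** Follow the guide [Serve Gemma open models using GPUs on GKE with Hugging Face TGI](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi) to deploy your model on Google Cloud's GKE service. This option uses GPUs for high-performance inference.

Both deployment methods give you an HTTP endpoint for sending requests to your model and receiving generated text in response.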

In [None]:
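import os

# Python variables are used here because each `!` line runs in its own shell,
# so shell variables set with `!name=value` would not persist between lines.
# The names avoid clashing with the `model` object loaded earlier.
tgi_model_id = "google/gemma-1.1-2b-it"  # ID of the model on the Hugging Face Hub
# (you can use your own fine-tuned model from the previous step)
tgi_volume = os.path.join(os.getcwd(), "data")  # Directory shared with the Docker container
# to avoid downloading weights on every run

# !docker run --gpus all --shm-size 1g -p 8080:80 \
#     -v $tgi_volume:/data ghcr.io/huggingface/text-generation-inference:2.0.3 \
#     --model-id $tgi_model_id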
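
Once a TGI server is running, requests go to its HTTP endpoint. The sketch below builds and sends a request to TGI's `/generate` route using only the standard library. It assumes the server is reachable at `http://localhost:8080`, matching the `-p 8080:80` port mapping in the Docker command above; the helper names `build_generate_request` and `generate` are illustrative.

```python
import json
import urllib.request

# Assumption: the TGI container is running locally and mapped to port 8080.
TGI_URL = "http://localhost:8080"


def build_generate_request(prompt: str, max_new_tokens: int = 20,
                           base_url: str = TGI_URL) -> urllib.request.Request:
    """Build a POST request for TGI's /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["generated_text"]


# Example (uncomment once the server is up):
# print(generate("Quote: Imagination is", max_new_tokens=20))
```

Separating request construction from sending keeps the payload easy to inspect without a live server.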