##### Copyright 2024 Google LLC.


In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://ai.google.dev/gemma/docs/distributed_tuning"><img src="https://ai.google.dev/static/site-assets/images/docs/notebook-site-button.png" height="32" width="32" />View on ai.google.dev</a>
  </td>
  <td>
    <a target="_blank" href="https://www.kaggle.com/code/nilaychauhan/keras-gemma-distributed-finetuning-and-inference"><img src="https://www.kaggle.com/static/images/logos/kaggle-logo-transparent-300.png" height="32" width="70"/>Run in Kaggle</a>
  </td>
  <td>
    <a target="_blank" href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/google/generative-ai-docs/main/site/en/gemma/docs/distributed_tuning.ipynb"><img src="https://ai.google.dev/images/cloud-icon.svg" width="40" />Open in Vertex AI</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/doggy8088/generative-ai-docs/blob/main/site/zh/gemma/docs/distributed_tuning.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>


# Distributed tuning of Gemma using Keras


## Overview

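Gemma is a family of lightweight, state-of-the-art open models built from the research and technology used to create the Google Gemini models. Gemma can be further fine-tuned to suit specific needs. However, large language models such as Gemma can be very large, and some of them may not fit on a single accelerator for fine-tuning. In that case there are two general approaches to fine-tuning them:
1. Parameter-efficient fine-tuning (PEFT), which seeks to shrink the effective model size by sacrificing some fidelity. LoRA falls into this category, and the [Fine-tune Gemma models in Keras using LoRA](https://ai.google.dev/gemma/docs/lora_tuning) tutorial demonstrates how to fine-tune the Gemma 2B model `gemma_2b_en` with LoRA using KerasNLP on a single GPU.
2. Full-parameter fine-tuning with model parallelism. Model parallelism distributes a single model's weights across multiple devices and enables horizontal scaling. You can find out more about distributed training in this [Keras guide](https://keras.io/guides/distribution/).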
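
To make the "may not fit on a single accelerator" point concrete, here is a rough back-of-the-envelope sketch. It assumes a round 7 billion parameters and counts only the memory for the weights themselves (no gradients, optimizer state, or activations). The 16 GB per-device figure is the TPU v3 per-core memory mentioned later in this tutorial, and the helper name `weight_memory_gb` is hypothetical.

```python
# Rough estimate of weight memory only; gradients, optimizer state and
# activations are deliberately ignored, so real usage is higher.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params, dtype, num_devices=1):
    """Return GB (10**9 bytes) of weights held per device when evenly sharded."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return total_bytes / num_devices / 1e9


NUM_PARAMS = 7_000_000_000  # assumption: round "7B" figure
DEVICE_MEMORY_GB = 16       # assumption: per-core memory of a TPU v3

for dtype in BYTES_PER_PARAM:
    single = weight_memory_gb(NUM_PARAMS, dtype)
    sharded = weight_memory_gb(NUM_PARAMS, dtype, num_devices=8)
    print(f"{dtype}: {single:.1f} GB on one device, "
          f"{sharded:.2f} GB per device across 8 "
          f"(device has {DEVICE_MEMORY_GB} GB)")
```

Under these assumptions the float32 weights alone exceed a single 16 GB device, while an even 8-way split leaves room for the rest of the training state.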

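This tutorial walks you through using Keras with a JAX backend to fine-tune the Gemma 7B model with LoRA and model-parallel distributed training on Google's Tensor Processing Units (TPUs). Note that LoRA can be turned off in this tutorial for slower but more accurate full-parameter tuning.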


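## Using accelerators

Technically, you can use either TPUs or GPUs for this tutorial.

### Notes on TPU environments

Google offers three products that provide TPUs:
* [Colab](https://colab.sandbox.google.com/) provides TPU v2, which is not sufficient for this tutorial.
* [Kaggle](https://www.kaggle.com/) offers TPU v3 for free, which works for this tutorial.
* [Cloud TPU](https://cloud.google.com/tpu?hl=en) offers TPU v3 and newer generations. One way to set it up is:
  1. Create a new [TPU VM](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#tpu-vms)
  2. Set up [SSH port forwarding](https://cloud.google.com/solutions/connecting-securely#port-forwarding-over-ssh) for the intended Jupyter server port
  3. Install Jupyter and start it on the TPU VM, then connect to Colab through "Connect to a local runtime"

### Notes on multi-GPU setup

Although this tutorial focuses on the TPU use case, you can easily adapt it to your own needs if you have a multi-GPU machine.

If you prefer to work through Colab, you can also provision a multi-GPU VM for Colab directly through "Connect to a custom GCE VM" in the Colab Connect menu.


This tutorial focuses on using the **free TPU from Kaggle**.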


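## Before you begin


### Kaggle credentials

Gemma models are hosted by Kaggle. To use Gemma, request access on Kaggle:

- Sign in or register at [kaggle.com](https://www.kaggle.com)
- Open the [Gemma model card](https://www.kaggle.com/models/google/gemma) and select "_Request Access_"
- Complete the consent form and accept the terms and conditions

Then, to use the Kaggle API, create an API token:

- Open the [Kaggle settings](https://www.kaggle.com/settings)
- Select "_Create New Token_"
- A `kaggle.json` file is downloaded. It contains your Kaggle credentials

Run the following cell and enter your Kaggle credentials when prompted.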


In [None]:
# If you are using Kaggle, you don't need to login again.
!pip install ipywidgets
import kagglehub

kagglehub.login()


If `kagglehub.login()` doesn't work for you, an alternative is to set `KAGGLE_USERNAME` and `KAGGLE_KEY` in your environment.
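
One possible way to set these two variables is to read them from the downloaded `kaggle.json`. The sketch below assumes the file holds a `username` and a `key` field and that it was saved to `~/.kaggle/kaggle.json`. Both the path and the helper name `load_kaggle_credentials` are illustrative assumptions.

```python
import json
import os


def load_kaggle_credentials(path):
    """Copy the username and key from a kaggle.json file into the environment."""
    with open(path, encoding="utf-8") as f:
        creds = json.load(f)
    # Assumption: kaggle.json holds a "username" and a "key" field.
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]


# Hypothetical location; adjust to wherever you saved the downloaded file.
token_path = os.path.expanduser("~/.kaggle/kaggle.json")
if os.path.exists(token_path):
    print("Loaded credentials for", load_kaggle_credentials(token_path))
else:
    print("No token file at", token_path)
```

Run this before any call that downloads the model so the variables are already in place.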


## Installation

Install Keras and KerasNLP with the Gemma model.


In [None]:
!pip install -q -U keras-nlp
# Work around an import error with tensorflow-hub. The library is not used.
!pip install -q -U tensorflow-hub
# Install tensorflow-cpu so tensorflow does not attempt to access the TPU.
!pip install -q -U tensorflow-cpu
# Install keras 3 last. See https://keras.io/getting_started for details.
!pip install -q -U keras

### Set up the Keras JAX backend


Import JAX and run a sanity check on the TPU. Kaggle offers TPUv3-8 devices, which have 8 TPU cores with 16 GB of memory each.


In [None]:
import jax

jax.devices()

[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0),
 TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1),
 TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0),
 TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1),
 TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0),
 TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1),
 TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0),
 TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]

In [None]:
import os

# The Keras 3 distribution API is only implemented for the JAX backend for now
os.environ["KERAS_BACKEND"] = "jax"
# Pre-allocate 90% of TPU memory to minimize memory fragmentation and allocation
# overhead
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"
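
Keras picks its backend when it is first imported, so `KERAS_BACKEND` has to be set before `import keras` runs. A minimal sketch of a guard for this follows. `select_keras_backend` is a hypothetical helper, not part of Keras.

```python
import os
import sys


def select_keras_backend(name):
    """Set KERAS_BACKEND, refusing if Keras was already imported."""
    if "keras" in sys.modules:
        # Too late: Keras has already picked its backend in this process.
        raise RuntimeError(
            "keras is already imported; restart the runtime and set "
            "KERAS_BACKEND before importing it")
    os.environ["KERAS_BACKEND"] = name
    return name


try:
    select_keras_backend("jax")
    print("KERAS_BACKEND =", os.environ["KERAS_BACKEND"])
except RuntimeError as err:
    print("Backend not changed:", err)
```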

## Load model


In [None]:
import keras
import keras_nlp

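### Notes for mixed-precision training on NVIDIA GPUs

When training on NVIDIA GPUs, mixed precision (`keras.mixed_precision.set_global_policy('mixed_bfloat16')`) can be used to speed up training with minimal effect on training quality. In most cases, enabling mixed precision is recommended because it saves both memory and time. However, be aware that at small batch sizes it can inflate memory usage by 1.5x (the weights are loaded twice, at half precision and at full precision).

For inference, half precision (`keras.config.set_floatx("bfloat16")`) works and saves memory, while mixed precision is not applicable.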


In [None]:
# Uncomment the line below if you want to enable mixed precision training on GPUs
# keras.mixed_precision.set_global_policy('mixed_bfloat16')
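
The 1.5x figure follows from simple byte counting: under mixed precision each weight is held once in full precision (4 bytes for float32) and once in half precision (2 bytes for bfloat16), versus 4 bytes alone for pure float32. A small sketch of that arithmetic, counting weights only and assuming a round 7 billion parameters:

```python
FLOAT32_BYTES = 4
BFLOAT16_BYTES = 2


def weight_bytes(num_params, mixed_precision):
    """Bytes used by the weights alone, ignoring activations and optimizer state."""
    per_param = FLOAT32_BYTES
    if mixed_precision:
        # A full-precision copy plus a half-precision compute copy.
        per_param += BFLOAT16_BYTES
    return num_params * per_param


num_params = 7_000_000_000  # assumption: round "7B" figure
full = weight_bytes(num_params, mixed_precision=False)
mixed = weight_bytes(num_params, mixed_precision=True)
print(f"float32 only: {full / 1e9:.0f} GB")
print(f"mixed precision: {mixed / 1e9:.0f} GB ({mixed / full:.1f}x)")
```

At larger batch sizes the half-precision activations usually outweigh this overhead, which is why the caveat applies mainly to small batches.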

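To load the model with the weights and tensors distributed across TPUs, first create a new `DeviceMesh`. `DeviceMesh` represents a collection of hardware devices configured for distributed computation and was introduced in Keras 3 as part of the unified distribution API.

The distribution API enables data and model parallelism, allowing efficient scaling of deep learning models across multiple accelerators and hosts. It leverages the underlying framework (for example JAX) to distribute the program and tensors according to the sharding directives through a procedure called single program, multiple data (SPMD) expansion. See the new [Keras 3 distribution API guide](https://keras.io/guides/distribution/) for more details.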


In [None]:
# Create a device mesh with (1, 8) shape so that the weights are sharded across
# all 8 TPUs.
device_mesh = keras.distribution.DeviceMesh(
    (1, 8),
    ["batch", "model"],
    devices=keras.distribution.list_devices())
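
Conceptually, a device mesh is the list of devices arranged into a grid with named axes. The pure-Python sketch below does not use Keras, and `make_mesh` is a hypothetical helper. It shows how eight devices map onto the `(1, 8)` shape used above, and onto a `(2, 4)` alternative that would combine 2-way data parallelism with 4-way model parallelism. The device names are placeholders.

```python
def make_mesh(devices, shape, axis_names):
    """Arrange a flat device list into a 2D grid with named axes."""
    rows, cols = shape
    if rows * cols != len(devices):
        raise ValueError(f"shape {shape} needs {rows * cols} devices, "
                         f"got {len(devices)}")
    grid = [devices[r * cols:(r + 1) * cols] for r in range(rows)]
    return {"axis_names": tuple(axis_names), "shape": shape, "grid": grid}


devices = [f"tpu:{i}" for i in range(8)]  # placeholder device names

mesh_1x8 = make_mesh(devices, (1, 8), ["batch", "model"])
mesh_2x4 = make_mesh(devices, (2, 4), ["batch", "model"])

for mesh in (mesh_1x8, mesh_2x4):
    # Show the size of each named axis, then the grid itself.
    print(mesh["shape"], dict(zip(mesh["axis_names"], mesh["shape"])))
    for row in mesh["grid"]:
        print("  ", row)
```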

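`LayoutMap` from the distribution API specifies how weights and tensors should be sharded or replicated, using string keys such as `token_embedding/embeddings` below, which are treated as regexes to match tensor paths. Matched tensors are sharded along the model dimension (8 TPUs); the others are fully replicated.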


In [None]:
model_dim = "model"

layout_map = keras.distribution.LayoutMap(device_mesh)

# Weights that match 'token_embedding/embeddings' will be sharded on 8 TPUs
layout_map["token_embedding/embeddings"] = (model_dim, None)
# Regex to match against the query, key and value matrices in the decoder
# attention layers
layout_map["decoder_block.*attention.*(query|key|value).*kernel"] = (
    model_dim, None, None)

layout_map["decoder_block.*attention_output.*kernel"] = (
    model_dim, None, None)
layout_map["decoder_block.*ffw_gating.*kernel"] = (None, model_dim)
layout_map["decoder_block.*ffw_linear.*kernel"] = (model_dim, None)
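
To see which rule a given weight falls under, the regex lookup can be imitated with the standard `re` module. This is a sketch of the idea rather than the Keras implementation: it assumes search-style matching in insertion order, with the first matching key winning and an unmatched path meaning full replication. The variable paths are the ones printed for `decoder_block_1` later in this tutorial.

```python
import re

MODEL = "model"
# Same keys and layouts as the layout map above, kept in a plain dict.
rules = {
    "token_embedding/embeddings": (MODEL, None),
    "decoder_block.*attention.*(query|key|value).*kernel": (MODEL, None, None),
    "decoder_block.*attention_output.*kernel": (MODEL, None, None),
    "decoder_block.*ffw_gating.*kernel": (None, MODEL),
    "decoder_block.*ffw_linear.*kernel": (MODEL, None),
}


def lookup_layout(path):
    """Return the layout of the first matching rule, or None (replicated)."""
    for pattern, layout in rules.items():
        if re.search(pattern, path):
            return layout
    return None


paths = [
    "decoder_block_1/pre_attention_norm/scale",
    "decoder_block_1/attention/query/kernel",
    "decoder_block_1/attention/attention_output/kernel",
    "decoder_block_1/ffw_gating_2/kernel",
    "decoder_block_1/ffw_linear/kernel",
]
for path in paths:
    print(f"{path:<52} {lookup_layout(path)}")
```

Note that `ffw_gating_2` is picked up by the `ffw_gating` pattern, while the norm scales match nothing and stay replicated.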

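`ModelParallel` lets you shard model weights or activation tensors across all devices in the `DeviceMesh`. In this case, some of the Gemma 7B model weights are sharded across the 8 TPU cores according to the `layout_map` defined above. Now load the model in a distributed way.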


In [None]:
model_parallel = keras.distribution.ModelParallel(
    device_mesh, layout_map, batch_dim_name="batch")

keras.distribution.set_distribution(model_parallel)
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_7b_en")


Attaching 'config.json' from model 'keras/gemma/keras/gemma_7b_en/1' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_7b_en/1' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_7b_en/1' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_7b_en/1' to your Kaggle notebook...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


Now verify that the model has been partitioned correctly. Take `decoder_block_1` as an example.


In [None]:
decoder_block_1 = gemma_lm.backbone.get_layer('decoder_block_1')
print(type(decoder_block_1))
for variable in decoder_block_1.weights:
  print(f'{variable.path:<58}  {str(variable.shape):<16}  {str(variable.value.sharding.spec)}')

<class 'keras_nlp.src.models.gemma.gemma_decoder_block.GemmaDecoderBlock'>
decoder_block_1/pre_attention_norm/scale                    (3072,)           PartitionSpec(None,)
decoder_block_1/attention/query/kernel                      (16, 3072, 256)   PartitionSpec(None, 'model', None)
decoder_block_1/attention/key/kernel                        (16, 3072, 256)   PartitionSpec(None, 'model', None)
decoder_block_1/attention/value/kernel                      (16, 3072, 256)   PartitionSpec(None, 'model', None)
decoder_block_1/attention/attention_output/kernel           (16, 256, 3072)   PartitionSpec(None, None, 'model')
decoder_block_1/pre_ffw_norm/scale                          (3072,)           PartitionSpec(None,)
decoder_block_1/ffw_gating/kernel                           (3072, 24576)     PartitionSpec('model', None)
decoder_block_1/ffw_gating_2/kernel                         (3072, 24576)     PartitionSpec('model', None)
decoder_block_1/ffw_linear/kernel                           (
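
Each `PartitionSpec` entry names the mesh axis that the corresponding tensor dimension is split over, with `None` meaning that dimension is not split. As a rough illustration, the per-device shard shape can be worked out by dividing each sharded dimension by the size of its mesh axis. The helper `shard_shape` below is hypothetical, and the shapes and specs are copied from the output above.

```python
MESH_AXIS_SIZES = {"batch": 1, "model": 8}  # the (1, 8) mesh used in this tutorial


def shard_shape(shape, spec):
    """Divide every sharded dimension by the size of its mesh axis."""
    result = []
    for dim, axis in zip(shape, spec):
        size = 1 if axis is None else MESH_AXIS_SIZES[axis]
        if dim % size != 0:
            raise ValueError(f"dimension {dim} is not divisible by {size}")
        result.append(dim // size)
    return tuple(result)


examples = {
    "attention/query/kernel": ((16, 3072, 256), (None, "model", None)),
    "attention/attention_output/kernel": ((16, 256, 3072), (None, None, "model")),
    "ffw_gating/kernel": ((3072, 24576), ("model", None)),
    "pre_attention_norm/scale": ((3072,), (None,)),
}
for name, (shape, spec) in examples.items():
    print(f"{name:<36} {shape} -> {shard_shape(shape, spec)}")
```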

## Inference before fine-tuning


In [None]:
gemma_lm.generate("Best comedy movies in the 90s ", max_length=64)

'Best comedy movies in the 90s 1. The Naked Gun 2½: The Smell of Fear (1991) 2. Wayne’s World (1992) 3. The Naked Gun 33⅓: The Final Insult (1994)'

The model generates a list of great comedy movies from the 90s to watch. Now fine-tune the Gemma model to change the output style.


## Fine-tune with IMDB


In [None]:
import tensorflow_datasets as tfds

imdb_train = tfds.load(
    "imdb_reviews",
    split="train",
    as_supervised=True,
    batch_size=2,
)
# Drop labels.
imdb_train = imdb_train.map(lambda x, y: x)

imdb_train.unbatch().take(1).get_single_element().numpy()


b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

In [None]:
# Use a subset of the dataset for faster training.
imdb_train = imdb_train.take(2000)

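Perform fine-tuning using [Low Rank Adaptation](https://arxiv.org/abs/2106.09685) (LoRA). LoRA is a fine-tuning technique that greatly reduces the number of trainable parameters for downstream tasks by freezing the full weights of the model and inserting a smaller number of new trainable weights into it. In essence, LoRA reparameterizes the larger full weight matrices with two smaller low-rank matrices AxB to train, which makes training much faster and more memory-efficient.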


In [None]:
# Enable LoRA for the model and set the LoRA rank to 4.
gemma_lm.backbone.enable_lora(rank=4)
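
To get a feel for the savings, consider one weight matrix of shape `d x k`. Full fine-tuning trains all `d * k` entries, while a rank-`r` LoRA update trains two factors of shapes `d x r` and `r x k`, that is `r * (d + k)` entries. The sketch below applies this to a `3072 x 24576` matrix (the shape of the `ffw_gating` kernel printed earlier) at rank 4. Which layers Keras actually attaches LoRA to is not shown here, so treat this as per-matrix arithmetic only.

```python
def full_params(d, k):
    """Trainable entries when the whole d x k matrix is updated."""
    return d * k


def lora_params(d, k, rank):
    """Trainable entries in the two low-rank factors (d x rank and rank x k)."""
    return rank * (d + k)


d, k, rank = 3072, 24576, 4
full = full_params(d, k)
lora = lora_params(d, k, rank)
print(f"full: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"ratio: {lora / full:.4%}")
```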

In [None]:
# Fine-tune on the IMDb movie reviews dataset.

# Limit the input sequence length to 128 to control memory usage.
gemma_lm.preprocessor.sequence_length = 128
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.summary()
gemma_lm.fit(imdb_train, epochs=1)

See an explanation at https://jax.readthedocs.io/en/latest/faq.html#buffer-donation.


2000/2000 ━━━━━━━━━━━━━━━━━━━━ 358s 163ms/step - loss: 2.7145 - sparse_categorical_accuracy: 0.4329


<keras.src.callbacks.history.History at 0x7e9cac7f41c0>

Note that enabling LoRA reduces the number of trainable parameters significantly, from 7 billion to only 11 million.


## Inference after fine-tuning


In [None]:
gemma_lm.generate("Best comedy movies in the 90s ", max_length=64)

"Best comedy movies in the 90s \n\nThis is the movie that made me want to be a director. It's a great movie, and it's still funny today. The acting is superb, the writing is excellent, the music is perfect for the movie, and the story is great."

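After fine-tuning, the model has learned the style of movie reviews and now generates output in that style in the context of 90s comedy movies.


## What's next

In this tutorial, you learned how to use the KerasNLP JAX backend to fine-tune a Gemma model on the IMDb dataset in a distributed manner on powerful TPUs. Here are a few suggestions for what else to learn:

* Learn how to [get started with Keras Gemma](https://ai.google.dev/gemma/docs/get_started).
* Learn how to [fine-tune the Gemma model on GPU](https://ai.google.dev/gemma/docs/lora_tuning).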
