# Distilling Step-by-Step Tutorial
### Step-by-step guide

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/doggy8088/generative-ai/blob/main/language/tuning/distilling_step_by_step/distilling_step_by_step.zh.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/doggy8088/generative-ai/blob/main/language/tuning/distilling_step_by_step/distilling_step_by_step.zh.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/doggy8088/generative-ai/blob/main/language/tuning/distilling_step_by_step/distilling_step_by_step.zh.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
</table>


| | |
|-|-|
|Author | [Anirudh Haritas Murali](https://github.com/anihm136) |


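# Overview


**Distillation** is a machine learning technique that transfers what a large model has learned into a smaller model. This improves scalability: the smaller model needs fewer resources to run and takes less time to generate inferences, while still reaching accuracy close to that of the larger model.

Traditionally, distillation uses the internal outputs of the larger model (specifically, its logits) to train the smaller model. However, some of today's best-performing large language models (LLMs), including Google's [PaLM 2](https://ai.google/discover/palm2/) model, are offered to consumers as an API, with no access to those internals. Until recently, this ruled out using such models as teacher models for distillation.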
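
For contrast with the approach used in this notebook, here is a minimal sketch of the classic logit-based objective mentioned above. It is not part of this notebook's pipeline. The function names, the temperature value, and the toy logits are illustrative assumptions. It computes the KL divergence between temperature-softened teacher and student distributions, which is only possible when the teacher's logits are available.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(
    teacher_logits: List[float], student_logits: List[float], temperature: float = 2.0
) -> float:
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy logits over three classes (illustrative values only).
teacher = [4.0, 1.0, 0.5]
print(soft_label_loss(teacher, [3.5, 1.2, 0.4]))  # student close to the teacher
print(soft_label_loss(teacher, [0.5, 1.0, 4.0]))  # student disagrees with the teacher
```

An API-only teacher exposes generated text rather than logits, so this loss cannot be computed for it.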


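## Objective

In this notebook, we walk through the technique described in the paper [Distilling Step-by-Step](https://blog.research.google/2023/09/distilling-step-by-step-outperforming.html), which presents a new way to distill the knowledge of a large LLM into a smaller LLM without needing the large model's internal parameters. The source code for this research is available at [https://github.com/google-research/distilling-step-by-step](https://github.com/google-research/distilling-step-by-step).

We will train a small (student) model, step by step, to mimic the reasoning ability of a larger (teacher) model. By training the student to reproduce the teacher's reasoning rather than only its final output, the small model can generalize better to unseen inputs.

The steps performed include:

- Preparing a dataset for distillation
- Setting up the distillation pipeline
- Training the student model with PaLM as the teacher model
- Evaluating the performance of the distilled model
- Deploying the distilled model to Vertex AI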


## Costs
This tutorial uses billable components of Google Cloud:
- Vertex AI
- Cloud Storage
- Artifact Registry
- Cloud Build

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), [Artifact Registry pricing](https://cloud.google.com/artifact-registry/pricing), and [Cloud Build pricing](https://cloud.google.com/build/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


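# Getting started


## (Colab only) Authenticate as a user
On Colab, we authenticate as a user who has access to the Google Cloud resources listed above. This access is required when we deploy the model.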


In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

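## Download supporting files
To simplify running this demo, some supporting files are provided: the PaLM outputs for the dataset and the code for building the model serving container.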


In [None]:
! gsutil -m cp -r gs://github-repo/distillation/* .
! wget https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/tuning/distilling_step_by_step/requirements.txt
! wget https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/tuning/distilling_step_by_step/prediction_container/Dockerfile -P prediction_container
! wget https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/tuning/distilling_step_by_step/prediction_container/app/main.py -P prediction_container/app
! wget https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/tuning/distilling_step_by_step/prediction_container/app/requirements.txt -P prediction_container/app
! wget https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/tuning/distilling_step_by_step/prediction_container/app/requirements-torch.txt -P prediction_container/app
! wget https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/tuning/distilling_step_by_step/prediction_container/app/prestart.sh -P prediction_container/app

## Install required libraries


In [None]:
! pip install -r requirements.txt

## Enable the required Google Cloud APIs
To make cleanup easier, you can create a new project for this tutorial and delete it when you are done.


In [None]:
PROJECT = ""  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

In [None]:
!gcloud services enable aiplatform.googleapis.com --project {PROJECT}
!gcloud services enable artifactregistry.googleapis.com --project {PROJECT}
!gcloud services enable cloudbuild.googleapis.com --project {PROJECT}

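# Step 1: Data preparation


Our dataset needs three fields:
1. The input prompt for the LLM
2. The ground-truth label, which is the expected output
3. The "rationale", which is the reasoning generated by the teacher model (using chain-of-thought, or CoT, prompting)

Here we use the [Common Sense Explanations](https://huggingface.co/datasets/cos_e) dataset from HuggingFace to train our student model. This dataset contains about 10k training examples and 1.2k test examples. The rationales were pre-generated with the PaLM model acting as the teacher, and we preprocess the dataset to match the structure above.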


In [None]:
from datasets import DatasetDict, load_dataset
from typing import Dict, Any, List

In [None]:
SOURCE_DATASET = "cos_e"  # @param {type:"string"}
SOURCE_DATASET_VERSION = "v1.11"  # @param {type:"string"}

dataset = load_dataset(SOURCE_DATASET, SOURCE_DATASET_VERSION)
dataset["test"] = dataset["validation"]
del dataset["validation"]

In [None]:
def prepare_input(example: Dict[str, Any]) -> Dict[str, Any]:
    question = example["question"]
    c_0 = example["choices"][0]
    c_1 = example["choices"][1]
    c_2 = example["choices"][2]
    c_3 = example["choices"][3]
    c_4 = example["choices"][4]

    # Named input_text to avoid shadowing the built-in input()
    input_text = f"{question}\nAnswer Choices:\n(a) {c_0}\n(b) {c_1}\n(c) {c_2}\n(d) {c_3}\n(e) {c_4}"

    example["input"] = input_text
    example["label"] = example["answer"]

    return example


dataset = dataset.map(
    prepare_input,
    remove_columns=[
        "id",
        "question",
        "choices",
        "answer",
        "abstractive_explanation",
        "extractive_explanation",
    ],
)

In [None]:
LLM_OUTPUTS_FILE_PREFIX = "PaLM_CoT"  # @param {type:"string"}
LLM_OUTPUTS_FILE = LLM_OUTPUTS_FILE_PREFIX + "_{split}.json"


def add_llm_outputs(dataset: DatasetDict, split: str) -> None:
    llm_ds = load_dataset("json", data_files=LLM_OUTPUTS_FILE.format(split=split))[
        "train"
    ]

    def _add(example: Dict[str, Any], idx: int) -> Dict[str, Any]:
        example["llm_rationale"] = llm_ds[idx]["rationale"]
        example["llm_label"] = llm_ds[idx]["label"]
        return example

    dataset[split] = dataset[split].map(_add, with_indices=True)


for split in ["train", "test"]:
    add_llm_outputs(dataset, split)
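
The teacher outputs are attached to the dataset by row index, so the two sources have to line up exactly. The following is a minimal sketch of a sanity check that could run before the join. The helper name `check_alignment` and the toy rows are assumptions for illustration, and the required keys mirror the `rationale` and `label` fields read in `_add`.

```python
from typing import Any, Dict, Sequence


def check_alignment(
    examples: Sequence[Dict[str, Any]], llm_rows: Sequence[Dict[str, Any]]
) -> None:
    """Raise if the teacher outputs cannot be joined to the examples by index."""
    if len(examples) != len(llm_rows):
        raise ValueError(
            f"Row count mismatch: {len(examples)} examples "
            f"vs {len(llm_rows)} teacher outputs"
        )
    for idx, row in enumerate(llm_rows):
        missing = {"rationale", "label"} - row.keys()
        if missing:
            raise ValueError(f"Teacher row {idx} is missing fields: {sorted(missing)}")


# Toy rows standing in for dataset[split] and the loaded JSON file.
examples = [{"input": "q1", "label": "a"}, {"input": "q2", "label": "b"}]
llm_rows = [
    {"rationale": "because ...", "label": "a"},
    {"rationale": "because ...", "label": "c"},
]
check_alignment(examples, llm_rows)  # passes silently when the rows line up
```

This only checks counts and field names. It cannot detect rows that are present but stored in a different order.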

# Step 2: Build the model


In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)
import pandas as pd
import torch

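Here we use a T5 model as the pretrained base for distillation, along with its corresponding tokenizer. You can replace the model name with another model from the HuggingFace Hub to use a different pretrained model (and its tokenizer), or train a custom model and tokenizer from scratch on your own dataset. Note that training a good model from scratch requires considerably more data and compute.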


In [None]:
PRETRAINED_BASE_MODEL = "google/flan-t5-base"  # @param {type:"string"}
MAX_INPUT_LENGTH = 1024  # @param {type:"integer"}
MAX_OUTPUT_LENGTH = 256  # @param {type:"integer"}

## a) Prepare the tokenizer and tokenize the dataset


In [None]:
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_BASE_MODEL)


def tokenize_function(examples: Dict[str, List[Any]]):
    # Encode input to generate predictions and rationales
    model_inputs = tokenizer(
        ["predict: " + text for text in examples["input"]],
        max_length=MAX_INPUT_LENGTH,
        truncation=True,
    )
    expl_model_inputs = tokenizer(
        ["explain: " + text for text in examples["input"]],
        max_length=MAX_INPUT_LENGTH,
        truncation=True,
    )
    model_inputs["expl_input_ids"] = expl_model_inputs["input_ids"]
    model_inputs["expl_attention_mask"] = expl_model_inputs["attention_mask"]

    # Encode target label and target rationale
    label_output_encodings = tokenizer(
        text_target=examples["label"], max_length=MAX_OUTPUT_LENGTH, truncation=True
    )
    rationale_output_encodings = tokenizer(
        text_target=examples["llm_rationale"],
        max_length=MAX_OUTPUT_LENGTH,
        truncation=True,
    )
    model_inputs["labels"] = label_output_encodings["input_ids"]
    model_inputs["expl_labels"] = rationale_output_encodings["input_ids"]

    return model_inputs


tokenized_dataset = dataset.map(
    tokenize_function,
    remove_columns=["input", "llm_rationale", "label", "llm_label"],
    batched=True,
)
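
To make the task prefixes concrete, the sketch below shows the two input strings that a single example expands to before tokenization. The helper name `build_task_inputs` and the sample question are made up for illustration. The `predict: ` and `explain: ` prefixes are the ones used in `tokenize_function`.

```python
from typing import Dict


def build_task_inputs(text: str) -> Dict[str, str]:
    """Return the two prefixed inputs derived from one example."""
    return {
        "predict": "predict: " + text,  # paired with the ground-truth label
        "explain": "explain: " + text,  # paired with the teacher rationale
    }


# Illustrative question in the format produced by prepare_input.
sample = (
    "Where would you store milk?\nAnswer Choices:\n"
    "(a) fridge\n(b) oven\n(c) shelf\n(d) car\n(e) desk"
)
for task, prompt in build_task_inputs(sample).items():
    print(f"[{task}] {prompt.splitlines()[0]}")
```

The same underlying text is therefore seen twice per training step, once per task, and the prefix is the only signal telling the model which target to produce.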

## b) Prepare the model


In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(PRETRAINED_BASE_MODEL)
# Uncomment if you have more than one GPU to enable parallelism
# model.parallelize()

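## c) Prepare the data collator for multi-task training
Because we need predictions for both the answer and the rationale, each training and prediction step uses a custom DataCollator. It takes a batch of features and returns two sets of features and labels, one for the answer and one for the rationale.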


In [None]:
class TaskPrefixDataCollator(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None):
        features_df = pd.DataFrame(features)

        # Generate features for answers
        ans_features = features_df.loc[
            :, features_df.columns.isin(["labels", "input_ids", "attention_mask"])
        ].to_dict("records")
        ans_features = super().__call__(ans_features, return_tensors)

        # Generate features for explanations
        expl_features = (
            features_df.loc[
                :,
                features_df.columns.isin(
                    ["expl_labels", "expl_input_ids", "expl_attention_mask"]
                ),
            ]
            .rename(
                columns={
                    "expl_labels": "labels",
                    "expl_input_ids": "input_ids",
                    "expl_attention_mask": "attention_mask",
                }
            )
            .to_dict("records")
        )
        expl_features = super().__call__(expl_features, return_tensors)

        return {
            "ans": ans_features,
            "expl": expl_features,
        }


data_collator = TaskPrefixDataCollator(tokenizer=tokenizer, model=model)
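
The column selection and renaming inside the collator can be hard to follow through the pandas calls. Here is a dependency-free sketch of the same split, before any padding or tensor conversion is applied. The helper name `split_tasks` and the token ids are illustrative assumptions.

```python
from typing import Any, Dict, List, Tuple

ANSWER_KEYS = ("input_ids", "attention_mask", "labels")


def split_tasks(
    features: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split each feature dict into an answer view and a rationale view."""
    ans = [{k: f[k] for k in ANSWER_KEYS} for f in features]
    # Drop the "expl_" prefix so both views share the same key names.
    expl = [{k: f["expl_" + k] for k in ANSWER_KEYS} for f in features]
    return ans, expl


# One toy feature with made-up token ids.
features = [
    {
        "input_ids": [1, 2, 3],
        "attention_mask": [1, 1, 1],
        "labels": [7],
        "expl_input_ids": [4, 2, 3],
        "expl_attention_mask": [1, 1, 1],
        "expl_labels": [8, 9],
    }
]
ans, expl = split_tasks(features)
print(ans[0]["labels"], expl[0]["labels"])  # [7] [8, 9]
```

Because both views end up with the standard key names, each one can be passed to the parent collator unchanged.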

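## d) Prepare the trainer for multi-task training
Similarly, we use a custom trainer that accounts for two losses: one from answer generation and one from rationale generation. The hyperparameter `alpha` controls the relative contribution of each loss to the overall model loss.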


In [None]:
class TaskPrefixTrainer(Seq2SeqTrainer):
    def __init__(self, alpha, output_rationale, **kwargs):
        super().__init__(**kwargs)
        self.alpha = alpha
        self.output_rationale = output_rationale

    def compute_loss(self, model, inputs, return_outputs=False):
        ans_outputs = model(**inputs["ans"])
        expl_outputs = model(**inputs["expl"])

        loss = self.alpha * ans_outputs.loss + (1.0 - self.alpha) * expl_outputs.loss

        return (
            (loss, {"ans": ans_outputs, "expl": expl_outputs})
            if return_outputs
            else loss
        )

    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        ans_outputs = super().prediction_step(
            model, inputs["ans"], prediction_loss_only=False, ignore_keys=ignore_keys
        )
        if self.output_rationale:
            expl_outputs = super().prediction_step(
                model,
                inputs["expl"],
                prediction_loss_only=False,
                ignore_keys=ignore_keys,
            )
        else:
            expl_outputs = ans_outputs  # placeholder only

        loss = self.alpha * ans_outputs[0] + (1 - self.alpha) * expl_outputs[0]

        return (
            loss,
            [ans_outputs[1], expl_outputs[1]],
            [ans_outputs[2], expl_outputs[2]],
        )
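
The weighting applied in `compute_loss` can be isolated as a small function. The loss values below are illustrative numbers, not measurements from this notebook, and the range check on `alpha` is an added assumption that the trainer itself does not enforce.

```python
def combined_loss(ans_loss: float, expl_loss: float, alpha: float) -> float:
    """Weighted sum of the answer loss and the rationale loss."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * ans_loss + (1.0 - alpha) * expl_loss


# alpha=1.0 trains on answers only, alpha=0.0 on rationales only.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, combined_loss(ans_loss=0.4, expl_loss=1.2, alpha=alpha))
```

With `alpha = 0.5`, as set later in this notebook, both tasks contribute equally to the gradient.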

# Step 3: Train the model


In [None]:
from transformers import Seq2SeqTrainingArguments
from transformers.trainer_utils import set_seed
import numpy as np

In [None]:
RUN_ID = 0  # @param {type:"integer"}
CONFIG_DIR = "distillation_outputs"  # @param {type:"string"}
CKPT_DIR = f"{CONFIG_DIR}/ckpts/{RUN_ID}"  # for model checkpoints
LOG_DIR = f"{CONFIG_DIR}/logs/{RUN_ID}"  # for training logs

EVAL_STEPS = 500  # @param {type:"integer"}
SAVE_STEPS = 1000  # @param {type:"integer"}
MAX_STEPS = 4000  # @param {type:"integer"}

LEARNING_RATE = 5e-5
BATCH_SIZE = 16

ALPHA = 0.5

In [None]:
set_seed(RUN_ID)

training_args = Seq2SeqTrainingArguments(
    CKPT_DIR,
    remove_unused_columns=False,
    evaluation_strategy="steps",
    eval_steps=EVAL_STEPS,
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    logging_dir=LOG_DIR,
    logging_strategy="steps",
    logging_steps=EVAL_STEPS,
    max_steps=MAX_STEPS,
    learning_rate=LEARNING_RATE,
    gradient_accumulation_steps=1,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    predict_with_generate=True,
    seed=RUN_ID,
    local_rank=-1,
    bf16=False,
    generation_max_length=64,
    prediction_loss_only=False,
)
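
Training length here is set in steps rather than epochs. As a rough guide, the sketch below converts the settings above into an approximate number of passes over the training set. The helper name is hypothetical, and the figure of 10,000 examples is the approximate training set size quoted earlier, not an exact count.

```python
def approx_epochs(
    max_steps: int,
    batch_size: int,
    num_examples: int,
    grad_accum_steps: int = 1,
    num_devices: int = 1,
) -> float:
    """Estimate how many passes over the training data a step budget covers."""
    examples_seen = max_steps * batch_size * grad_accum_steps * num_devices
    return examples_seen / num_examples


print(approx_epochs(max_steps=4000, batch_size=16, num_examples=10_000))  # 6.4
```

Lowering `MAX_STEPS` shortens training proportionally, which may be useful for a quick trial run before committing to the full budget.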

In [None]:
from typing import Tuple, Callable
from transformers import AutoTokenizer


def compute_metrics_text(tokenizer: AutoTokenizer) -> Callable:
    def compute_metrics(eval_pred: Tuple[np.ndarray, np.ndarray]) -> Dict[str, float]:
        predictions, labels = eval_pred
        decoded_preds = tokenizer.batch_decode(predictions[0], skip_special_tokens=True)

        labels = np.where(labels[0] != -100, labels[0], tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        acc = np.mean(np.array(decoded_preds) == np.array(decoded_labels))

        return {"accuracy": acc}

    return compute_metrics


compute_metrics = compute_metrics_text(tokenizer)
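
Two details of the metric are easy to miss: padded label positions are stored as `-100` and must be mapped back to a real token id before decoding, and accuracy is an exact string match. The sketch below reproduces both steps with plain lists. The helper names, token ids, and strings are illustrative assumptions.

```python
from typing import List

IGNORE_INDEX = -100  # value used for padded label positions


def restore_padding(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace IGNORE_INDEX with the pad token id so the ids can be decoded."""
    return [
        [pad_token_id if t == IGNORE_INDEX else t for t in row] for row in label_ids
    ]


def exact_match(preds: List[str], labels: List[str]) -> float:
    """Fraction of predictions that equal their label exactly."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


print(restore_padding([[5, 6, -100], [7, -100, -100]], pad_token_id=0))
print(exact_match(["fridge", "oven"], ["fridge", "shelf"]))  # 0.5
```

Exact matching is strict: a prediction that differs from the label only in casing or surrounding whitespace counts as wrong.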

In [None]:
trainer_kwargs = {
    "alpha": ALPHA,
    "output_rationale": False,
    "model": model,
    "args": training_args,
    "train_dataset": tokenized_dataset["train"],
    "eval_dataset": {
        "test": tokenized_dataset["test"],
    },
    "data_collator": data_collator,
    "tokenizer": tokenizer,
    "compute_metrics": compute_metrics,
}

In [None]:
trainer = TaskPrefixTrainer(**trainer_kwargs)
trainer.train()

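# Step 4: Evaluate the model


Now let's compare the performance of our distilled student model with that of the PaLM model. We also generate outputs from the base student model, to see how much difference the distillation training makes.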


In [None]:
from transformers import pipeline

In [None]:
CHECKPOINT = f"{CKPT_DIR}/checkpoint-4000"

distilled_tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
distilled_model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

base_tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_BASE_MODEL)
base_model = AutoModelForSeq2SeqLM.from_pretrained(PRETRAINED_BASE_MODEL)

In [None]:
distill_generator = pipeline(
    "text2text-generation", model=distilled_model, tokenizer=distilled_tokenizer
)
base_generator = pipeline(
    "text2text-generation", model=base_model, tokenizer=base_tokenizer
)


def generate_answers(sample: Dict[str, Any]) -> Dict[str, Any]:
    sample["distill_label"] = distill_generator(["predict: " + sample["input"]])[0][
        "generated_text"
    ]
    sample["base_label"] = base_generator(sample["input"])[0]["generated_text"]
    return sample


output_dataset = dataset["test"].map(generate_answers)

In [None]:
output_df = output_dataset.to_pandas().drop("llm_rationale", axis=1)
display_df = output_df.copy().rename(
    columns={
        "input": "Question",
        "label": "True answer",
        "llm_label": "PaLM answer",
        "base_label": "T5 answer",
        "distill_label": "Distilled T5 answer",
    }
)
display_df.head(10)

In [None]:
print(
    "The accuracy of PaLM model is {:.2f}%".format(
        output_df[output_df["label"] == output_df["llm_label"]]["label"].count()
        / len(output_df)
        * 100
    )
)
print(
    "The accuracy of raw student model is {:.2f}%".format(
        output_df[output_df["label"] == output_df["base_label"]]["label"].count()
        / len(output_df)
        * 100
    )
)
print(
    "The accuracy of distilled student model is {:.2f}%".format(
        output_df[output_df["label"] == output_df["distill_label"]]["label"].count()
        / len(output_df)
        * 100
    )
)

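As we can see, the original pretrained student model is unable to produce the answers. However, with only a few training examples and epochs, the much smaller T5 model already approaches the accuracy of the PaLM model.


# Step 5: Deploy the model to Vertex AI
*Note: The following steps create a Cloud Storage bucket and an Artifact Registry Docker repository with the specified names. If you want to use an existing bucket or repository, provide its name below and comment out the steps that create the resource, as indicated.*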


In [None]:
STAGING_BUCKET = ""  # @param {type:"string"}
ARTIFACTS_DIR = f"{STAGING_BUCKET}/distilled-t5"
CHECKPOINT_STEP = 4000  # @param {type:"integer"}
CHECKPOINT = f"{CKPT_DIR}/checkpoint-{CHECKPOINT_STEP}"
DOCKER_REPO_NAME = "distill-step-by-step"  # @param {type:"string"}

## Upload the artifacts to Cloud Storage


In [None]:
! gsutil mb gs://{STAGING_BUCKET} # comment out to use an existing bucket
! gsutil -m cp {CHECKPOINT}/* gs://{ARTIFACTS_DIR}

## Build the model serving container


In [None]:
!gcloud artifacts repositories create {DOCKER_REPO_NAME} --location {REGION} --repository-format=docker  # comment out to use an existing repository
!gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet
!gcloud builds submit --tag {REGION}-docker.pkg.dev/{PROJECT}/{DOCKER_REPO_NAME}/distilled-flan-t5:latest ./prediction_container

## Upload the model


In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT, location=REGION, staging_bucket=STAGING_BUCKET)

DEPLOY_IMAGE = (
    f"{REGION}-docker.pkg.dev/{PROJECT}/{DOCKER_REPO_NAME}/distilled-flan-t5:latest"
)
HEALTH_ROUTE = "/health"
PREDICT_ROUTE = "/predict"
SERVING_CONTAINER_PORTS = [7080]

model = aiplatform.Model.upload(
    display_name="distilled-flan-t5",
    description="Distilled Flan T5 model using Step-By-Step Distillation",
    serving_container_image_uri=DEPLOY_IMAGE,
    serving_container_predict_route=PREDICT_ROUTE,
    serving_container_health_route=HEALTH_ROUTE,
    serving_container_ports=SERVING_CONTAINER_PORTS,
    artifact_uri=f"gs://{ARTIFACTS_DIR}",
)
print(model.resource_name)

## Deploy the model


In [None]:
model = aiplatform.Model(model.resource_name)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    traffic_split={"0": 100},
    min_replica_count=1,
    max_replica_count=1,
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    traffic_percentage=100,
    deploy_request_timeout=1200,
    sync=True,
)
endpoint.wait()
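
Once the endpoint is up, it can be queried with prompts in the same format used during training. The sketch below only builds the request payload. The `text` field name is a hypothetical placeholder, since the schema is defined by `prediction_container/app/main.py`, which is not shown here. Whether the container adds the task prefix itself is also not known from this notebook, so check that file before relying on this.

```python
from typing import Dict, List

# Hypothetical field name: check prediction_container/app/main.py for the
# schema the serving container actually expects.
INSTANCE_KEY = "text"


def build_instances(inputs: List[str], task: str = "predict") -> List[Dict[str, str]]:
    """Prefix each input with the task name, as done during training."""
    return [{INSTANCE_KEY: f"{task}: {text}"} for text in inputs]


# Illustrative question in the format produced by prepare_input.
question = (
    "Where would you store milk?\nAnswer Choices:\n"
    "(a) fridge\n(b) oven\n(c) shelf\n(d) car\n(e) desk"
)
instances = build_instances([question])
print(instances[0][INSTANCE_KEY].splitlines()[0])

# With the deployed endpoint from the previous cell, the request would be:
# response = endpoint.predict(instances=instances)
# print(response.predictions)
```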

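# Summary
In this notebook, we learned how a large teacher LLM can teach a smaller student LLM to reason, which can noticeably improve the smaller model's performance compared with simple instruction fine-tuning.

If you are interested in running a similar distillation process for LLMs on Google Cloud, see [Distill text models on Google Cloud](https://cloud.google.com/vertex-ai/docs/generative-ai/models/distill-text-models/).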
