##### Copyright 2024 Google LLC.


In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

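# Image captioning using PaliGemma
In this notebook, we'll explore image captioning with PaliGemma, a state-of-the-art vision-language model developed by Google. PaliGemma is designed to understand both images and text, which makes it well suited to generating accurate, descriptive captions for a wide variety of images.

Image captioning plays a vital role in making the web accessible to everyone, especially people who are blind or have low vision. While alternative text (alt text) gives a concise description of an image, a caption offers a fuller explanation, conveying context, detail, and nuance that short alt text can miss. This helps all users, regardless of visual ability, understand and appreciate the image content on a website, which makes for a more inclusive and equitable online experience.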

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/doggy8088/gemma-cookbook/blob/zh-tw/PaliGemma/Image_captioning_using_PaliGemma.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>


## Setup

### Select the Colab runtime
To complete this tutorial, you'll need a Colab runtime with sufficient resources to run the PaliGemma model. In this case, use an L4 GPU or an A100 GPU, as a T4 will be insufficient:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **L4 GPU** or **A100 GPU**.

### Gemma setup on Kaggle
To complete this tutorial, you'll first need to complete the setup instructions at [Gemma setup](https://ai.google.dev/gemma/docs/setup), as PaliGemma is a Gemma variant.

In brief, you will need to:

* Get access to Gemma on kaggle.com.
* Generate and configure a Kaggle username and API key.

After you've completed the Gemma setup, move on to the next section, where you'll set your username and API key as environment variables for your Colab environment.


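## Access Kaggle credentials

We need to provide a Kaggle username and API key to download the PaliGemma model from Kaggle.

The code below reads these credentials from Google Colab user data, which avoids exposing them directly in the notebook.

If you haven't already done so, store your Kaggle username and API key in your Colab user data under the names `KAGGLE_USERNAME` and `KAGGLE_KEY`.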


In [None]:
import os
from google.colab import userdata

os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")
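
As a quick sanity check before any download is attempted, it can help to confirm that both environment variables ended up set. The sketch below is only an illustration: `missing_kaggle_credentials` and `REQUIRED_VARS` are hypothetical names introduced here, not part of Colab or Kaggle.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(env=os.environ):
    """Return the names of required Kaggle variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_kaggle_credentials()
if missing:
    print("Missing Kaggle credentials:", ", ".join(missing))
else:
    print("Kaggle credentials found.")
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.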

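## Install the required libraries

Before we dive into PaliGemma, let's make sure all the required libraries are installed. The commands below upgrade `keras-cv`, `keras-nlp`, and `keras` to their latest versions, so that we have the most recent features and improvements for working with vision and language models.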


In [None]:
!pip install --upgrade keras-cv
!pip install --upgrade keras-nlp
!pip install --upgrade keras

Collecting keras-cv
  Downloading keras_cv-0.9.0-py3-none-any.whl (650 kB)
Collecting keras-core (from keras-cv)
  Downloading keras_core-0.1.7-py3-none-any.whl (950 kB)
Collecting namex (from keras-core->keras-cv)
  Downloading namex-0.0.8-py3-none-any.whl (5.8 kB)
Installing collected packages: namex, keras-core, keras-cv
Successfully installed keras-core-0.1.7 keras-cv-0.9.0 na
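
Because `--upgrade` pulls whatever is newest at install time, it may be worth recording which versions actually ended up installed. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the three packages upgraded in the cell above.
for name, version in installed_versions(["keras-cv", "keras-nlp", "keras"]).items():
    print(f"{name}: {version or 'not installed'}")
```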

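## Load PaliGemma and set the image size

Now we'll load the PaliGemma model itself. We'll use a preset configuration to keep things simple and to make sure we have a model that is compatible with image captioning.

Today we'll use the **pali_gemma_3b_mix_448** model, which requires our images to be 448x448 pixels. Fortunately, we can specify that size later, when we load the image.

>⚠️ This is crucial because PaliGemma expects images in a specific format in order to generate accurate captions.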

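For future reference, the presets differ mainly in three respects:

1. **Image size:**
  - `_224`: Trained on, and expects, 224x224 pixel input images. Suitable for smaller images, with lower compute requirements.
  - `_448`: Trained on, and expects, 448x448 pixel input images. A balance between detail and compute cost.
  - `_896`: Trained on, and expects, 896x896 pixel input images. The highest level of detail, but with higher compute requirements.
2. **Training type:**
  - `_pt`: *Pretrained* on a large dataset of image-text pairs. A good starting point for general image captioning tasks. (In the Keras preset names listed below, pretrained variants simply omit the `_mix` marker.)
  - `_mix`: *Mix fine-tuned* on a diverse set of vision-language tasks. Expected to perform well across a broader range of tasks, but generally intended for research use only.
3. **Text sequence length:** \
This refers to the maximum length of the generated caption. Presets with larger image sizes generally have longer text sequence lengths, since they may provide more detailed descriptions.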

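At the time of writing (2024/05/28), the available presets are as follows.

Preset name | Parameters | Description
------------|------------|----------------
pali_gemma_3b_mix_224 | 2.92B | Image size 224, mix fine-tuned, text sequence length 256
pali_gemma_3b_mix_448 | 2.92B | Image size 448, mix fine-tuned, text sequence length 512
pali_gemma_3b_224 | 2.92B | Image size 224, pretrained, text sequence length 128
pali_gemma_3b_448 | 2.92B | Image size 448, pretrained, text sequence length 512
pali_gemma_3b_896 | 2.93B | Image size 896, pretrained, text sequence length 512

You can always check the latest list in the Keras documentation [here](https://keras.io/api/keras_nlp/models/pali_gemma/pali_gemma_causal_lm/#frompreset-method).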
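
The naming convention in the table can also be read programmatically. The sketch below assumes every preset follows the pattern `pali_gemma_3b[_mix]_<image size>`, which is inferred only from the names listed above and may not hold for presets added later; `parse_preset` and `PRESET_PATTERN` are hypothetical names.

```python
import re

# Assumed naming pattern, inferred from the preset table:
# pali_gemma_3b[_mix]_<image size>
PRESET_PATTERN = re.compile(r"^pali_gemma_3b(?P<mix>_mix)?_(?P<size>\d+)$")


def parse_preset(name):
    """Split a preset name into its training type and square image size."""
    match = PRESET_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized preset name: {name!r}")
    size = int(match.group("size"))
    training = "mix" if match.group("mix") else "pretrained"
    return {"training": training, "image_size": (size, size)}


print(parse_preset("pali_gemma_3b_mix_448"))
# {'training': 'mix', 'image_size': (448, 448)}
```

Raising on an unrecognized name is a deliberate choice here: a silent fallback could hand the model an image of the wrong size.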


In [None]:
import keras_nlp

# load paligemma from a preset
#
# for more info and options to use, see the docs:
# https://keras.io/api/keras_nlp/models/pali_gemma/pali_gemma_causal_lm/#frompreset-method
model_name = "pali_gemma_3b_mix_448"
pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset(model_name)

# we need to resize the image to the size expected by the model
# we're assuming the model name ends with _NUM here
target_size_x = int(model_name[model_name.rfind("_") + 1 :])
target_size = (target_size_x, target_size_x)

Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/metadata.json...
100%|██████████| 143/143 [00:00<00:00, 191kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/task.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/config.json...
100%|██████████| 861/861 [00:00<00:00, 1.02MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/model.weights.h5...
100%|██████████| 5.45G/5.45G [07:10<00:00, 13.6MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/preprocessor.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/tokenizer.json...
100%|██████████| 410/410 [00:00<00:00, 494kB/s]
Downloading from https://www.kaggle.com/api/

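## Load and prepare the image

Let's load an image and get it ready for PaliGemma. In this example we'll use a sample image of a cat (my cat!).

The code below loads the image from a URL, resizes it to the size the PaliGemma model expects, and converts it to a Tensor object, which is the format the model requires as input.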


In [None]:
from keras.preprocessing.image import load_img, img_to_array
import tensorflow as tf

# here we're loading an image of my cat because that's easier than finding a
# creative commons image
image_path = tf.keras.utils.get_file(
    "juice.jpg", "https://jethac.github.io/assets/juice.jpg"
)
keras_img = load_img(image_path, target_size=target_size)

# convert image to NumPy array
img_array = img_to_array(keras_img)

# convert NumPy array to Tensor object
img_tensor = tf.convert_to_tensor(img_array)

Downloading data from https://jethac.github.io/assets/juice.jpg
251543/251543 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
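
One thing to keep in mind: with its default settings, `load_img` with `target_size` resizes straight to the requested dimensions, so a non-square photo is stretched to fit the square. If that distortion matters for your images, one option is to crop to a centered square first. The helper below is a hypothetical sketch, not part of Keras; it only computes the crop box, in the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects.

```python
def center_square_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


print(center_square_box(1600, 1200))  # (200, 0, 1400, 1200)
```

To use it, you would load the image without `target_size`, crop it with the returned box, and then resize the square result to the model's expected size.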


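## Generate the image caption

Finally, we'll use PaliGemma to generate a caption for our image. We'll give the model the image tensor and a prompt instructing it to describe the image.

Because we're not using an instruction-tuned model, we need to manually remove the prompt from the model's output to get a clean caption.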


In [None]:
# define prompt separately so we can measure its length later
prompt = "Caption the image:"

# pass images and prompts to paligemma
response = pali_gemma_lm.generate({"images": [img_tensor], "prompts": [prompt]})

# we're not using an instruction-trained model so we have to cut the prompt off
# the front of our output
filtered = response[0][len(prompt) :]
print(filtered)

A black and white cat sits comfortably on a black backpack, its eyes open and its paw resting on the bag. The cat's white fur and black nose are prominent features in the image. The backpack is open, revealing the cat's black and white paws and the black strap on the side. The cat's eyes are green, and its whiskers are white. The cat's head is tilted slightly towards the camera, and its ears are perked up. The cat's black and white coat is contrasted by its white chest and paws. The cat's eyes are bright and alert, and its nose is wrinkled in concentration.
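
Slicing by `len(prompt)` assumes the output always begins with the prompt verbatim. A slightly more defensive variant, sketched below with a hypothetical helper named `strip_prompt`, removes the prompt only when it is actually present and trims surrounding whitespace.

```python
def strip_prompt(response, prompt):
    """Remove a leading prompt from generated text, if present."""
    if response.startswith(prompt):
        response = response[len(prompt):]
    return response.strip()


print(strip_prompt("Caption the image:\nA cat on a backpack.", "Caption the image:"))
# A cat on a backpack.
```

In the notebook this would be called as `strip_prompt(response[0], prompt)`.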
