##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemma - deploy with vLLM

This notebook demonstrates how you can deploy a Gemma model with [vLLM](https://github.com/vllm-project/vllm) and query it. vLLM is a fast and easy-to-use library for LLM inference and serving, and has built-in support for Gemma deployment.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/doggy8088/gemma-cookbook/blob/zh-tw-240628/Gemma/Deploy_with_vLLM.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

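## Setup

### Select the Colab runtime
To complete this guide, you need a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right corner of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.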

### Gemma setup on Hugging Face
vLLM loads the model from Hugging Face under the hood, so you need to:

* Get access to Gemma on [huggingface.co](https://huggingface.co) by accepting the Gemma license on the Hugging Face page of the specific model, i.e., [Gemma 2B](https://huggingface.co/google/gemma-2b).
* Generate a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) and configure it as a Colab secret named `HF_TOKEN`.
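If the download later fails with an authorization error, one option is to copy the secret into the `HF_TOKEN` environment variable before creating the engine. The sketch below assumes the notebook has been granted access to the secret. The helper name `load_hf_token` and its `env` parameter are illustrative and not part of any library. Outside Colab it falls back to the environment.

```python
import os
from typing import Mapping, Optional


def load_hf_token(name: str = "HF_TOKEN",
                  env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the token from Colab secrets, or from `env` when that fails."""
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook: fall back to the given mapping.
        return env.get(name)


token = load_hf_token()
if token:
    # Libraries built on huggingface_hub read this variable.
    os.environ["HF_TOKEN"] = token
```

If `token` is `None` at this point, the secret was not found, and gated downloads such as `google/gemma-2b` are likely to be rejected.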

## Install


In [None]:
!pip install vllm



## Inference

### Import vLLM

In [None]:
from vllm import LLM

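Instantiate the vLLM engine. Note that the free Colab T4 GPU does not support `bfloat16`, Gemma's default dtype, so this example loads the model in `float32`.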

In [None]:
llm = LLM(model="google/gemma-2b", dtype="float32")



config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

INFO 05-24 06:23:04 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='google/gemma-2b', speculative_config=None, tokenizer='google/gemma-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float32, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=google/gemma-2b)


tokenizer_config.json:   0%|          | 0.00/33.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

INFO 05-24 06:23:06 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 05-24 06:23:08 selector.py:69] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-24 06:23:08 selector.py:32] Using XFormers backend.
INFO 05-24 06:23:10 weight_utils.py:199] Using model weights format ['*.safetensors']


model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

INFO 05-24 06:24:08 model_runner.py:175] Loading model weights took 9.3440 GB
INFO 05-24 06:24:16 gpu_executor.py:114] # GPU blocks: 1724, # CPU blocks: 7281
INFO 05-24 06:24:19 model_runner.py:937] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 05-24 06:24:19 model_runner.py:941] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 05-24 06:24:29 model_runner.py:1017] Graph capturing finished in 11 secs.
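The memory figures in the log above can be roughly checked with some arithmetic. This is a back-of-the-envelope sketch, not a description of how vLLM computes them. It assumes the checkpoint shards store 2 bytes per parameter (`bfloat16`) and that the KV cache uses vLLM's default block size of 16 tokens. Neither value is shown in the log.

```python
GIB = 2**30

# Shard sizes from the download progress bars (decimal units).
shard_bytes = 4.95e9 + 67.1e6

# Assumption: the checkpoint stores 2 bytes per parameter (bfloat16).
approx_params = shard_bytes / 2

# Loading with dtype="float32" takes 4 bytes per parameter.
weights_gib = approx_params * 4 / GIB

# KV cache capacity. Assumption: default block size of 16 tokens.
gpu_blocks = 1724          # "# GPU blocks" from the log
block_size = 16
kv_tokens = gpu_blocks * block_size

# max_seq_len from the engine config line in the log.
max_seq_len = 8192
full_sequences = kv_tokens // max_seq_len

print(f"~{approx_params / 1e9:.2f}B parameters")
print(f"~{weights_gib:.2f} GiB of float32 weights")
print(f"{kv_tokens} tokens fit in the GPU KV cache")
print(f"{full_sequences} sequences of {max_seq_len} tokens fit at once")
```

Under these assumptions the weight estimate lands close to the 9.3440 GB reported by `model_runner.py`, and the cache holds only a few full-length sequences. That is the cost of running in `float32` on a 16 GB T4.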


Run inference with a single prompt

In [None]:
prompt = "The capital of USA is"
output = llm.generate(prompt)
print()
print(f"Prompt: {prompt}, Gemma output: {output[0].outputs[0].text}")

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.42it/s]


Prompt: The capital of USA is, Gemma output:  planning to invest in sustainable transportation and is ready to accept transformational advances in technology
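The completion above is short and does not name a city. This is partly because `google/gemma-2b` is a pretrained base model, and partly because `generate` was called without sampling parameters. To my knowledge, vLLM's `SamplingParams` defaults to a temperature of 1.0 and a limit of 16 new tokens in the 0.4.x series, but treat those values as an assumption. A minimal sketch of greedy decoding with a longer limit follows. The helper names are illustrative, and `vllm` is imported lazily so the first helper can be used on its own.

```python
def greedy_sampling_kwargs(max_tokens: int = 32) -> dict:
    """Keyword arguments for deterministic (greedy) decoding."""
    # temperature=0.0 makes vLLM always pick the most likely token.
    return {"temperature": 0.0, "max_tokens": max_tokens}


def generate_greedy(llm, prompt: str, max_tokens: int = 32) -> str:
    """Generate one greedy completion with an existing vLLM `LLM` engine."""
    # Imported here so that defining the helpers does not require vLLM.
    from vllm import SamplingParams

    params = SamplingParams(**greedy_sampling_kwargs(max_tokens))
    outputs = llm.generate(prompt, params)
    return outputs[0].outputs[0].text
```

With the engine created earlier, `print(generate_greedy(llm, "The capital of USA is"))` gives a repeatable completion. Whether it is correct still depends on the model.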





Run batch inference

In [None]:
prompts = [
    "The best thing about California is",
    "Physics studies",
    "The best place for sightseeing in Japan is",
]

In [None]:
outputs = llm.generate(prompts)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print()
    print(f"Prompt: {prompt}, Gemma output: {generated_text}")

Processed prompts: 100%|██████████| 3/3 [00:00<00:00,  3.59it/s]


Prompt: The best thing about California is, Gemma output:  that there are always new ways to spend time and enjoy the magnificent state as a

Prompt: Physics studies, Gemma output:  matter, its properties, how it interacts with other matter, and other fields of

Prompt: The best place for sightseeing in Japan is, Gemma output:  Kyoto. It is the very first city you will visit in Japan when you get
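For later processing it can be handy to collect the batch results into a dictionary instead of printing them. The sketch below relies only on the two attributes used in the loop above (`prompt` and `outputs[0].text`). The `fake_outputs` list is a hypothetical stand-in, so the helper can be tried without loading a model.

```python
from types import SimpleNamespace


def collect_completions(outputs):
    """Map each prompt to the text of its first completion."""
    return {out.prompt: out.outputs[0].text for out in outputs}


# Hypothetical stand-ins that mimic the attributes read above.
fake_outputs = [
    SimpleNamespace(prompt="Physics studies",
                    outputs=[SimpleNamespace(text=" matter")]),
    SimpleNamespace(prompt="The best thing about California is",
                    outputs=[SimpleNamespace(text=" that there")]),
]

completions = collect_completions(fake_outputs)
for prompt, text in completions.items():
    print(f"{prompt!r} -> {text!r}")
```

In the notebook, `collect_completions(outputs)` can be called directly on the list returned by `llm.generate(prompts)`. Duplicate prompts overwrite each other in the dictionary, so keep the list form if the batch may contain repeats.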



