# Fine-tuning with LLaMA Factory on SageMaker
# 2. Local inference with vLLM

## Install dependencies

In [1]:
!pip install vllm bitsandbytes

Collecting vllm
  Downloading vllm-0.4.3-cp310-cp310-manylinux1_x86_64.whl.metadata (7.8 kB)
Collecting cmake>=3.21 (from vllm)
  Downloading cmake-3.29.3-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.1 kB)
Collecting openai (from vllm)
  Downloading openai-1.31.2-py3-none-any.whl.metadata (21 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl.metadata (13 kB)
Collecting lm-format-enforcer==0.10.1 (from vllm)
  Downloading lm_format_enforcer-0.10.1-py3-none-any.whl.metadata (16 kB)
Collecting outlines==0.0.34 (from vllm)
  Downloading outlines-0.0.34-py3-none-any.whl.metadata (13 kB)
Collecting ray>=2.9 (from vllm)
  Downloading ray-2.23.0-cp310-cp310-manylinux2014_x86_64.whl.metadata (13 kB)
Collecting nvidia-ml-py (from vllm)
  Downloading nvidia_ml_py-12.555.43-py3-none-any.whl.metadata (8.6 kB)
Collecting torch==2.3.0 (from vllm)
  Downloading torch-2.3.0-cp310-cp310-manyli

### Download the model files from S3 to local disk

In [2]:
import boto3
import pprint
from tqdm import tqdm
import sagemaker
# Create a SageMaker session; the region, execution role, and default bucket are derived from it.
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [9]:
!aws s3 sync s3://{default_bucket}/llama3-8b-qlora/ ./local_model

download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/adapter_config.json to local_model/finetuned_model/adapter_config.json
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/all_results.json to local_model/finetuned_model/all_results.json
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/checkpoint-500/special_tokens_map.json to local_model/finetuned_model/checkpoint-500/special_tokens_map.json
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/checkpoint-500/adapter_config.json to local_model/finetuned_model/checkpoint-500/adapter_config.json
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/checkpoint-500/README.md to local_model/finetuned_model/checkpoint-500/README.md
download: s3://sagemaker-us-east-1-434444145045/llama3-8b-qlora/finetuned_model/README.md to local_model/finetuned_model/README.md
download: s3://sagemaker-us-east-1-4344441
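
Before pointing vLLM at the synced directory, it can help to confirm that the adapter files actually arrived. The following is a minimal sketch, not part of the original notebook: `missing_adapter_files` is a hypothetical helper, and it checks only for `adapter_config.json`, the one file the download log above confirms. The weight file name is not visible in the truncated log, so it is deliberately left out of the default list.

```python
from pathlib import Path


def missing_adapter_files(adapter_dir, required=("adapter_config.json",)):
    """Return the names in `required` that are not regular files in `adapter_dir`.

    `required` defaults to the single file confirmed by the download log;
    extend it once you know which weight file your training run produced.
    """
    root = Path(adapter_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_adapter_files('./local_model/finetuned_model')` should return an empty list if the sync completed for that file.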

## Load the model tokenizer

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

In [4]:
model_id = 'TechxGenus/Meta-Llama-3-8B-Instruct-AWQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Load sample data for comparison

In [5]:
from datasets import load_dataset
from random import randrange
# dataset_name = "zxbsmk/webnovel_cn"
dataset_name = "hfl/ruozhiba_gpt4"
# Load dataset from the hub
train_dataset = load_dataset(dataset_name, split="train")

In [28]:

print(f"Training size: {len(train_dataset)}")
print("\nTraining sample:\n")
num_samples = 200
print(train_dataset[randrange(num_samples)])

Training size: 4898

Training sample:

{'output': '首先，‘女鬼压床’是指睡眠瘫痪症，这是一种在睡觉时突然无法动弹且伴有幻觉的不适感，并非真正的鬼魂现象。多数情况下睡眠瘫痪是暂时的，数秒至数分钟内会自行恢复。这种现象通常与压力大、睡眠不足、作息不规律等因素有关。假如你经历了这种情况，首先不用过度恐慌，应保持冷静，意识到这是一个科学现象而不是超自然事件。其次，改善生活习惯，如保持规律的作息时间，避免过度疲劳和压力，能够有效减少这种现象的发生。因此，‘拒绝’女鬼的方法就是通过健康的生活习惯预防和缓解睡眠瘫痪症的发生。如果频繁出现，可以考虑咨询睡眠专家以寻求更专业的帮助。', 'input': '', 'instruction': '睡觉时被女鬼压床我已经有老婆了我该怎么拒绝'}
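
Each record carries `instruction`, `input`, and `output` fields, so a training sample can be replayed as a chat prompt and its `output` used as the reference answer. The sketch below is one possible way to do that; `record_to_messages` is a hypothetical helper, and joining a non-empty `input` to the instruction with a newline is an assumption rather than something the dataset prescribes.

```python
def record_to_messages(record, system_prompt=None):
    """Convert an instruction-style record into a list of chat messages.

    Assumption: a non-empty `input` field is appended to the instruction
    on a new line. The `output` field is not included; keep it separately
    as the reference answer.
    """
    user_text = record["instruction"]
    if record.get("input"):
        user_text = f"{user_text}\n{record['input']}"

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return messages
```

The result has the same shape as the `messages` list passed to `tokenizer.apply_chat_template` later in this notebook.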


In [17]:
sql_lora_path = './local_model/finetuned_model'

## Deploy locally with vLLM

In [8]:
from vllm.lora.request import LoRARequest
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


In [9]:
model_id = 'TechxGenus/Meta-Llama-3-8B-Instruct-AWQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
# enable_lora=True lets the same engine serve both the base model and LoRA adapters.
llm = LLM(model=model_id, max_model_len=4096, enable_lora=True)

INFO 06-06 15:27:38 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='TechxGenus/Meta-Llama-3-8B-Instruct-AWQ', speculative_config=None, tokenizer='TechxGenus/Meta-Llama-3-8B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TechxGenus/Meta-Llama-3-8B-Instruct-AWQ)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-06 15:27:39 weight_utils.py:207] Using model weights format ['*.safetensors']
INFO 06-06 15:27:40 model_runner.py:146] Loading model weights took 5.3479 GB
INFO 06-06 15:27:43 gpu_executor.py:83] # GPU blocks: 6586, # CPU blocks: 2048
INFO 06-06 15:27:45 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-06 15:27:45 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-06 15:27:53 model_runner.py:924] Graph capturing finished in 8 secs.
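
The `# GPU blocks` figure in the log gives a rough idea of how much KV cache is available. The sketch below converts it into tokens, assuming a block size of 16 tokens (this value is not shown in the log, so check it against your own engine configuration). `kv_cache_capacity` is a hypothetical helper written for illustration.

```python
def kv_cache_capacity(num_gpu_blocks, block_size=16, max_model_len=4096):
    """Estimate KV-cache capacity from the block count reported at start-up.

    Assumption: each block holds `block_size` tokens (16 here).
    Returns (total cacheable tokens, number of full-length sequences that fit).
    """
    total_tokens = num_gpu_blocks * block_size
    return total_tokens, total_tokens // max_model_len


total_tokens, full_length_seqs = kv_cache_capacity(6586)
print(total_tokens, full_length_seqs)
```

Under that assumption, the 6586 blocks reported above correspond to roughly 105,000 cached tokens, enough for about 25 concurrent sequences at the full `max_model_len` of 4096.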


In [29]:
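# Test message 1: "Who are you? What do you do?"
messages = [
    {"role": "system", "content": "请始终用中文回答"},  # "Always answer in Chinese"
    {"role": "user", "content": "你是谁？你是干嘛的"},
]

# Test message 2: "What should I do when a female ghost pins me down in my sleep?"
# This assignment overrides test message 1, so only this prompt is sent to the model.
messages = [
    {"role": "system", "content": "请始终用中文回答"},  # "Always answer in Chinese"
    {"role": "user", "content": "睡觉时被女鬼压床我该怎么办？"},
]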


inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
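
To make it clearer what `apply_chat_template` produces here, the sketch below rebuilds the same prompt layout in plain Python. It is an illustration only: `render_llama3_prompt` is a hypothetical function that mirrors the prompt string printed in the generation output of this notebook, and the tokenizer's own template remains the source of truth.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the Llama 3 Instruct prompt layout.

    Each message becomes a role header, a blank line, the content, and an
    end-of-turn token. With add_generation_prompt=True, an open assistant
    header is appended so the model continues with its reply.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

Because `tokenize=False` is passed above, `inputs` is a plain string of this shape rather than a list of token IDs, which is why it can be handed directly to `llm.generate`.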

### Inference with the base model

In [30]:
# A low temperature keeps the output close to deterministic, which makes the comparison fairer.
sampling_params = SamplingParams(temperature=0.1, top_p=0.95, max_tokens=512)

outputs = llm.generate(inputs, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt:\n{prompt!r}")
    print(f"Response:\n{generated_text!r}")


Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.37s/it, Generation Speed: 66.08 toks/s]

Prompt:
'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n请始终用中文回答<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n睡觉时被女鬼压床我该怎么办？<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
Response:
'如果您睡觉时被女鬼压床，以下是一些可能有助的建议：\n\n1.保持冷静：在这种情况下，保持冷静和镇定非常重要。不要 Panic 或恐惧，因为这可能会使情况变得更糟。\n2.呼吸深长：深呼吸可以帮助您保持冷静和平静。深呼吸可以减少压力和焦虑。\n3.祈祷或念佛：祈祷或念佛可以帮助您感受到安全和保护。您可以祈祷女鬼离开您，也可以念佛以求保护。\n4.寻求帮助：如果您感到女鬼的压床非常强烈或您感到不安全，可以寻求帮助。您可以向家人或朋友寻求帮助，也可以寻求专业人士的帮助，如心理医生或灵媒。\n5.保持洁净：保持洁净和整洁的环境可以帮助您避免女鬼的攻击。您可以每天清洁房间，也可以在睡前祈祷以求保护。\n6.避免在黑暗中睡觉：女鬼通常在黑暗中活动，因此避免在黑暗中睡觉可能有助于避免女鬼的攻击。\n7.寻求专业人士的帮助：如果您感到女鬼的压床非常强烈或您感到不安全，可以寻求专业人士的帮助，如灵媒或占卜师。他们可以帮助您驱逐女鬼或提供其他帮助。\n\n记住，女鬼的压床可能是精神或心理问题的表现，因此寻求专业人士的帮助也非常重要。'





### Inference with the LoRA adapter

In [31]:
sql_lora_path = './local_model/finetuned_model'

In [32]:
# LoRARequest takes an adapter name, a unique positive integer ID, and the local adapter path.
outputs = llm.generate(inputs, sampling_params, lora_request=LoRARequest("adapter", 1, sql_lora_path))

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt:\n{prompt!r}")
    print(f"Response:\n{generated_text!r}")

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.07s/it, Generation Speed: 62.72 toks/s]

Prompt:
'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n请始终用中文回答<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n睡觉时被女鬼压床我该怎么办？<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
Response:
'睡觉时被女鬼压床是一种非常奇怪且不可能的现象。女鬼是根据传统文化和宗教信仰存在的虚构概念，而在现实中，睡眠时不会有任何非自然的力量压床。因此，用户不需要担心这种情况。'
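
The two generation cells above differ only in the `lora_request` argument, so the comparison can be wrapped in a small helper. This is a sketch under the assumption that a single prompt is passed and only the first completion is wanted; `generate_text` is a hypothetical name, not a vLLM function.

```python
def generate_text(llm, prompt, sampling_params, lora_request=None):
    """Generate one completion and return its text.

    With lora_request=None the base model answers; passing a LoRARequest
    routes the same prompt through the fine-tuned adapter.
    """
    outputs = llm.generate(prompt, sampling_params, lora_request=lora_request)
    # One prompt in, so one RequestOutput back; take its first completion.
    return outputs[0].outputs[0].text
```

With the objects defined in this notebook, `generate_text(llm, inputs, sampling_params)` would return the base model's answer and `generate_text(llm, inputs, sampling_params, LoRARequest("adapter", 1, sql_lora_path))` the adapter's answer, which makes it easy to print both side by side.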



