# SageMaker Endpoint Inference and a Simple Benchmark

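### Prerequisites
- This notebook records the results of running against a SageMaker Endpoint that was deployed after fine-tuning the Llama-7B model in [20-Fine-Tune-Llama-7B-INF2](../../20-Fine-Tune-Llama-7B-INF2/README.md).
- You can also run it after deploying a SageMaker Endpoint for another model in the Llama 2 family.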


Test environment: this notebook was tested in SageMaker Studio Code Editor.
- Kernel: base (Python 3.10.13)

---

# 0. Install Required Packages

In [1]:
install_needed = True
if install_needed:
    ! pip install -q transformers==4.31.0
    ! pip list | grep transformers

transformers                          4.31.0


# 1. Environment Setup

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import sys, os

def add_python_path(module_path):
    if os.path.abspath(module_path) not in sys.path:
        sys.path.append(os.path.abspath(module_path))
        print(f"python path: {os.path.abspath(module_path)} is added")
    else:
        print(f"python path: {os.path.abspath(module_path)} already exists")
    print("sys.path: ", sys.path)

module_path = ".."
add_python_path(module_path)


python path: /home/sagemaker-user/aws-ai-ml-workshop-kr/genai/aws-gen-ai-kr/40_inference/90_benchmark is added
sys.path:  ['/home/sagemaker-user/aws-ai-ml-workshop-kr/genai/aws-gen-ai-kr/40_inference/90_benchmark/10-Getting-Started', '/opt/conda/lib/python310.zip', '/opt/conda/lib/python3.10', '/opt/conda/lib/python3.10/lib-dynload', '', '/opt/conda/lib/python3.10/site-packages', '/home/sagemaker-user/aws-ai-ml-workshop-kr/genai/aws-gen-ai-kr/40_inference/90_benchmark']


In [4]:
from benchmark_utils.benchmark import (print_ww, 
                                       pretty_print_json,
                                       invoke_endpoint_sagemaker
                                       )
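
The source of `invoke_endpoint_sagemaker` is not shown in this notebook. As a rough sketch, assuming it wraps the boto3 `sagemaker-runtime` client's `invoke_endpoint` call and sends the payload as JSON, a helper with the same keyword arguments could look like the following. The name `invoke_endpoint_sketch`, the `client` parameter, and the `application/json` content type are assumptions for illustration, not the actual `benchmark_utils` implementation.

```python
import json


def invoke_endpoint_sketch(endpoint_name, pay_load, client=None):
    """Hypothetical stand-in for invoke_endpoint_sagemaker (assumed behavior)."""
    if client is None:
        # Imported lazily so the sketch can be exercised with a stub client.
        import boto3
        client = boto3.client("sagemaker-runtime")

    # invoke_endpoint is the boto3 SageMaker Runtime call for real-time inference.
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",  # assumption: the container accepts JSON
        Body=json.dumps(pay_load),
    )
    # The response body is a stream; read it fully and decode it to a string.
    return response["Body"].read().decode("utf-8")
```

Accepting an optional `client` makes it possible to reuse one client across many calls and to substitute a stub when testing without AWS credentials.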

# 2. Create the pay_load

In [5]:
def create_payload_llama_7b_fine_tuned_model(prompt, param):
    # prompt="What is a machine learning?"
    input_data = f"<s>[INST] <<SYS>>\nAs a data scientist\n<</SYS>>\n{prompt} [/INST]"
    pay_load = {"inputs": input_data, "parameters": param}
    return pay_load


prompt = "What happened to the dinosaurs? "
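# Note: "do_sample" is passed as the string "False", not the boolean False.
# How the serving container interprets a string value here is not verified in this notebook.
param = {"max_new_tokens":300, "temperature": 0.1 , "do_sample":"False", "stop" : ["</s>"]}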
pay_load = create_payload_llama_7b_fine_tuned_model(prompt, param)
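
The prompt template above hard-codes the system message. A minimal sketch of a variant that takes the system message as an argument is shown below. `build_llama2_payload` and its `system` parameter are hypothetical names, and the template simply mirrors the `[INST]` / `<<SYS>>` layout used in the cell above.

```python
import json


def build_llama2_payload(prompt, param, system="As a data scientist"):
    """Hypothetical helper: same template as above, with a configurable system message."""
    inputs = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n{prompt} [/INST]"
    return {"inputs": inputs, "parameters": param}


example = build_llama2_payload(
    prompt="What happened to the dinosaurs? ",
    param={"max_new_tokens": 300, "temperature": 0.1, "stop": ["</s>"]},
)
# Serialize the payload the same way it would be sent in a request body.
print(json.dumps(example, indent=4))
```

With the default `system` value, the `inputs` string is identical to the one produced by `create_payload_llama_7b_fine_tuned_model`.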



# 3. Invoke the SageMaker Endpoint
### [Important] Enter your endpoint_name below.
As in the example screenshot, copy the name of your SageMaker endpoint and paste it below.
- ![sagemaker_ep_console.png](img/sagemaker_ep_console.png)

In [6]:
endpoint_name = '<Type Your SageMaker Endpoint Name>'
endpoint_name = 'lmi-model-2024-04-13-14-53-53-788'

## Calling the SageMaker Endpoint

In [7]:
import time
from termcolor import colored

s = time.perf_counter()

# Use the endpoint_name variable set in the earlier cell instead of repeating the literal.
response = invoke_endpoint_sagemaker(endpoint_name = endpoint_name, 
                         pay_load = pay_load)    

elapsed_async = time.perf_counter() - s

print(f"elapsed time: {round(elapsed_async,3)} second")
print("## payload: ") 
pretty_print_json(pay_load)
print("## inference response: ")                      
print_ww(colored(response, "green"))                         

elapsed time: 3.754 second
## payload: 
{
    "inputs": "<s>[INST] <<SYS>>\nAs a data scientist\n<</SYS>>\nWhat happened to the dinosaurs?  [/INST]",
    "parameters": {
        "max_new_tokens": 300,
        "temperature": 0.1,
        "do_sample": "False",
        "stop": [
            "</s>"
        ]
    }
}
## inference response: 
{"generated_text": " The dinosaurs went extinct about65 million years ago. The most widely
accepted theory is that a massive asteroid impact caused a global cooling event that led to the
extinction of the dinosaurs. This event is known as the K-Pg extinction event.\n\nThe K-Pg
extinction event occurred about65 million years ago, during the Cretaceous-Paleogene period. It is
believed that a massive asteroid impact caused a global cooling event, which led to the extinction
of many species, including the dinosaurs. The impact is believed to have occurred in the Yucatan
Peninsula in Mexico, and it is estimated that the impact was about10 kilometers in di

# 4. Counting Tokens

- Load the Llama 2 tokenizer that was used to train the "NousResearch/Llama-2-7b-chat-hf" model.
- For details, see the [LlamaTokenizer documentation](https://huggingface.co/docs/transformers/v4.31.0/model_doc/llama2#transformers.LlamaTokenizer).

In [8]:
from transformers import (
    AutoTokenizer
)
# Load LLaMA tokenizer
model_name = "NousResearch/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

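def count_tokens(text, tokenizer):
    # Tokenize the text and return the token count along with the token strings.
    # encode() adds special tokens by default, so the count includes the leading '<s>' (BOS).
    tokens = tokenizer.encode(text)
    tokens_text = tokenizer.convert_ids_to_tokens(tokens)
    # print(tokens_text)
    return len(tokens), tokens_text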



text = "Hello, how are you doing today?"
token_count, tokens_text = count_tokens(text=text, tokenizer = tokenizer)
print(f"Number of tokens: {token_count}")
print(f"Tokens: \n {tokens_text}")

Number of tokens: 9
Tokens: 
 ['<s>', '▁Hello', ',', '▁how', '▁are', '▁you', '▁doing', '▁today', '?']
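
The output above shows that `encode` prepends the `<s>` (BOS) token, so it is included in the count. If you want to count only the tokens of the text itself, one option is to pass `add_special_tokens=False` to `encode`. The sketch below uses a hypothetical `count_text_tokens` helper and a tiny whitespace stub tokenizer so that it runs without downloading a model. The stub is not the Llama tokenizer; with the real one, pass the `tokenizer` object loaded in the cell above.

```python
def count_text_tokens(text, tokenizer):
    """Hypothetical helper: count tokens without special tokens such as BOS."""
    return len(tokenizer.encode(text, add_special_tokens=False))


class WhitespaceStubTokenizer:
    """Illustrative stand-in only; it splits on whitespace and is not the Llama tokenizer."""

    def encode(self, text, add_special_tokens=True):
        ids = list(range(1, len(text.split()) + 1))
        # Mimic a BOS id (0) being prepended when special tokens are requested.
        return [0] + ids if add_special_tokens else ids


stub = WhitespaceStubTokenizer()
text = "Hello, how are you doing today?"
with_bos = len(stub.encode(text))
without_bos = count_text_tokens(text, stub)
print(with_bos, without_bos)
```

The two counts differ by exactly one, which is the BOS token.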


## Counting Input and Output Tokens for the Fine-Tuned Model

In [9]:
import time
from termcolor import colored

s = time.perf_counter()

# Use the endpoint_name variable set in the earlier cell instead of repeating the literal.
response = invoke_endpoint_sagemaker(endpoint_name = endpoint_name, 
                         pay_load = pay_load)    

elapsed_async = time.perf_counter() - s

print(f"elapsed time: {round(elapsed_async,3)} second")
print("## payload: ") 
pretty_print_json(pay_load)
print("## inference response: ")                      
print_ww(colored(response, "green"))                         

elapsed time: 2.924 second
## payload: 
{
    "inputs": "<s>[INST] <<SYS>>\nAs a data scientist\n<</SYS>>\nWhat happened to the dinosaurs?  [/INST]",
    "parameters": {
        "max_new_tokens": 300,
        "temperature": 0.1,
        "do_sample": "False",
        "stop": [
            "</s>"
        ]
    }
}
## inference response: 
{"generated_text": " The dinosaurs went extinct about65 million years ago. The most widely
accepted theory is that a massive asteroid impact caused a global cooling of the Earth's climate,
which led to the extinction of the dinosaurs. This event is known as the K-Pg extinction
event.\n\nOther theories include volcanic eruptions, climate change, and disease. However, the
asteroid impact theory is currently the most widely accepted explanation for the extinction of the
dinosaurs.\n\nIt is worth noting that the dinosaurs were a diverse group of animals, and not all
species went extinct at the same time. Some species of birds, which are believed to be th

## Summarizing Metrics as JSON

In [10]:
import json

def set_metrics(pay_load,response, elapsed_async, tokenizer):
    # Count the tokens in the prompt that was sent to the endpoint.
    prompt = pay_load["inputs"]
    prompt_token_count, prompt_tokens_text = count_tokens(text=prompt, tokenizer = tokenizer)

    # Count the tokens in the completion returned by the endpoint.
    completion = json.loads(response)["generated_text"]
    completion_token_count, completion_tokens_text = count_tokens(text=completion, tokenizer = tokenizer)

    # Latency in seconds and generation speed in completion tokens per second.
    latency = round(elapsed_async,3)
    completion_tokens_per_sec = round(completion_token_count/latency,3)

    return dict(prompt_token_count = prompt_token_count,
                completion_token_count = completion_token_count,
                latency = latency,
                completion_tokens_per_sec = completion_tokens_per_sec,
                )

metrics = set_metrics(pay_load,response, elapsed_async, tokenizer)
pretty_print_json(metrics)


{
    "prompt_token_count": 35,
    "completion_token_count": 241,
    "latency": 2.924,
    "completion_tokens_per_sec": 82.421
}
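
As a quick check of the arithmetic, the reported `completion_tokens_per_sec` can be reproduced from the other two values in the output above. The per-token figure in milliseconds is a derived value added here for illustration and is not part of the original metrics. Note that `latency` is the end-to-end request time measured in the notebook, so it includes more than pure token generation.

```python
# Values copied from the metrics output above.
completion_token_count = 241
latency = 2.924  # seconds

# Same formula as set_metrics: completion tokens divided by latency.
completion_tokens_per_sec = round(completion_token_count / latency, 3)

# Average end-to-end time per generated token, in milliseconds (derived value).
ms_per_token = round(latency / completion_token_count * 1000, 2)

print(completion_tokens_per_sec)
print(ms_per_token)
```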


# 5. A Simple Benchmark

In [11]:
from benchmark_utils.benchmark import Benchmark

BM = Benchmark(endpoint_name)
BM.run_benchmark(
    num_inferences = 1,
    num_threads = 1,
    pay_load = pay_load,
    tokenizer = tokenizer,
    verbose = False,
)

## total execution time: 3.625 second
total_completion_token_count:  299
Throughput was 82.477 tokens per second.
Latency p50 was 3.624 sec
Latency p95 was 3.624 sec
Latency p99 was 3.624 sec
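
The `Benchmark` class comes from `benchmark_utils` and its source is not shown in this notebook. To make the `num_inferences` and `num_threads` parameters and the p50/p95/p99 output more concrete, here is a minimal sketch of one way such a run could be structured, assuming a thread pool and a nearest-rank percentile. `run_benchmark_sketch`, `percentile`, and `invoke_fn` are hypothetical names, and the actual implementation may differ.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(sorted_values, pct):
    """Nearest-rank percentile on an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def run_benchmark_sketch(invoke_fn, num_inferences, num_threads):
    """Call invoke_fn num_inferences times using num_threads worker threads."""

    def timed_call(_):
        start = time.perf_counter()
        invoke_fn()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        latencies = sorted(pool.map(timed_call, range(num_inferences)))
    total_time = time.perf_counter() - wall_start

    return {
        "total_time": total_time,
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }


# Stand-in for an endpoint call: sleep briefly instead of invoking SageMaker.
result = run_benchmark_sketch(lambda: time.sleep(0.01), num_inferences=8, num_threads=4)
print(result)
```

To try it against a real endpoint, `invoke_fn` could be a small function that calls `invoke_endpoint_sagemaker` with your `endpoint_name` and `pay_load`.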


In [14]:
BM.run_benchmark(
    num_inferences = 12,
    num_threads = 2,
    pay_load = pay_load,
    tokenizer = tokenizer,
    verbose = False,    
)

## total execution time: 21.449 second
total_completion_token_count:  8814
Throughput was 410.926 tokens per second.
Latency p50 was 3.558 sec
Latency p95 was 3.823 sec
Latency p99 was 5.916 sec


In [15]:
BM.run_benchmark(
    num_inferences = 12,
    num_threads = 3,
    pay_load = pay_load,
    tokenizer = tokenizer,
    verbose = False,    
)

## total execution time: 16.217 second
total_completion_token_count:  12834
Throughput was 791.401 tokens per second.
Latency p50 was 3.55 sec
Latency p95 was 6.256 sec
Latency p99 was 6.584 sec


In [16]:
BM.run_benchmark(
    num_inferences = 12,
    num_threads = 4,
    pay_load = pay_load,
    tokenizer = tokenizer,
    verbose = False,    
)

## total execution time: 11.875 second
total_completion_token_count:  16593
Throughput was 1397.359 tokens per second.
Latency p50 was 3.558 sec
Latency p95 was 5.699 sec
Latency p99 was 6.55 sec


In [19]:
BM.run_benchmark(
    num_inferences = 64,
    num_threads = 2,
    pay_load = pay_load,
    tokenizer = tokenizer,
    verbose = False,    
)

## total execution time: 109.746 second
total_completion_token_count:  67597
Throughput was 615.939 tokens per second.
Latency p50 was 3.532 sec
Latency p95 was 6.098 sec
Latency p99 was 6.568 sec


In [18]:
BM.run_benchmark(
    num_inferences = 64,
    num_threads = 4,
    pay_load = pay_load,
    tokenizer = tokenizer,
    verbose = False,    
)

## total execution time: 65.651 second
total_completion_token_count:  48957
Throughput was 745.717 tokens per second.
Latency p50 was 3.567 sec
Latency p95 was 6.182 sec
Latency p99 was 6.644 sec
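
Judging from the numbers above, the reported throughput appears to be the total completion token count divided by the total execution time. This is an inference from the printed values, since the `Benchmark` source is not shown here. The sketch below recomputes it for each run using only figures copied from the outputs above.

```python
# (total execution time in seconds, total_completion_token_count, reported throughput),
# copied from the benchmark outputs above.
runs = [
    (3.625, 299, 82.477),
    (21.449, 8814, 410.926),
    (16.217, 12834, 791.401),
    (11.875, 16593, 1397.359),
    (109.746, 67597, 615.939),
    (65.651, 48957, 745.717),
]

recomputed = [round(tokens / seconds, 3) for seconds, tokens, _ in runs]
for (seconds, tokens, reported), value in zip(runs, recomputed):
    print(f"{tokens} tokens / {seconds} s = {value} (reported {reported})")
```

The recomputed values agree with the reported ones to within a small margin, which would be consistent with the printed execution times being rounded to three decimals.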
