## Hosting a real-time endpoint with the Transformers NeuronX backend engine and TorchServe
This notebook shows how to use [TorchServe](https://pytorch.org/serve/) and Neuron on EC2 Inf2 and Trn1 instances, using the model weights that were compiled to NEFF format in 2-1.

It walks through serving a supported model with TorchServe on EC2 Inf2/Trn1 instances. We run inference with the llama3-8b model compiled in the previous step.

## Reviewing the TorchServe code

In Jupyter Lab, open `torchserve_inf2.py` and review the handler code that runs inference through TorchServe.

In [None]:
!cat torchserve_inf2.py

Open `torchserve_config.yaml` and review the settings.

In [None]:
!cat torchserve_config.yaml

## Serving the model with TorchServe in streaming mode

Start the server. Normally it is best to run it in a separate console, but in this demo we redirect the output to a file.

Install the required dependencies.

In [23]:
!pip install torch-model-archiver torchserve setuptools==69.5.1

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


Install Java, which TorchServe needs in order to run.

In [10]:
!sudo apt-get update && sudo apt-get install -y java-common
!wget https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.deb
!sudo dpkg --install amazon-corretto-17-x64-linux-jdk.deb

Hit:1 http://ap-northeast-1.ec2.archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://ap-northeast-1.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:3 http://ap-northeast-1.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:4 https://download.docker.com/linux/ubuntu jammy InRelease                 
Get:5 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  InRelease [1484 B]
Get:6 https://apt.corretto.aws stable InRelease [10.7 kB]                      
Hit:7 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  InRelease
Hit:8 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64  InRelease      
Get:9 https://apt.corretto.aws stable/main amd64 Packages [17.1 kB]            
Hit:10 https://apt.repos.neuron.amazonaws.com jammy InRelease                  
Get:11 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Fetched 286 kB in 1s (229 kB/s)     
Reading package lists... Done
W: https://download.d

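Archive the model, move the archive into the `model_store` directory together with the tokenizer and config files, and then copy the compiled Neuron artifacts into it as `neuron_cache`.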

In [33]:
%%sh

rm -rf model_store
mkdir model_store
torch-model-archiver --model-name meta-llama-3-8b-neuronx --version 1.0 --handler torchserve_inf2.py -r requirements.txt --config-file torchserve_config.yaml --extra-files "meta-llama/Meta-Llama-3-8B/config.json,meta-llama/Meta-Llama-3-8B/generation_config.json,meta-llama/Meta-Llama-3-8B/model-00001-of-00004.safetensors,meta-llama/Meta-Llama-3-8B/model-00002-of-00004.safetensors,meta-llama/Meta-Llama-3-8B/model-00003-of-00004.safetensors,meta-llama/Meta-Llama-3-8B/model-00004-of-00004.safetensors,meta-llama/Meta-Llama-3-8B/model.safetensors.index.json,meta-llama/Meta-Llama-3-8B/special_tokens_map.json,meta-llama/Meta-Llama-3-8B/tokenizer.json,meta-llama/Meta-Llama-3-8B/tokenizer_config.json,torchserve_inf2.py" --archive-format no-archive
mv meta-llama-3-8b-neuronx model_store/
cp -r neuron_artifacts model_store/meta-llama-3-8b-neuronx/
mv model_store/meta-llama-3-8b-neuronx/neuron_artifacts model_store/meta-llama-3-8b-neuronx/neuron_cache 
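
Before starting the server, it can help to confirm that the unpacked archive contains what the handler will look for. The sketch below is one way to do that. It assumes that `torch-model-archiver` placed the extra files at the top level of the model directory, and both the helper name `missing_entries` and the short list of expected entries are illustrative choices rather than part of TorchServe.

```python
from pathlib import Path

# Entries the cell above is expected to leave in the model directory.
# This is an illustrative subset; extend it to match your --extra-files list.
EXPECTED_ENTRIES = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "neuron_cache",
]


def missing_entries(model_dir, expected=EXPECTED_ENTRIES):
    """Return the names in `expected` that do not exist under `model_dir`."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).exists()]


# Example usage, run from the notebook's working directory:
#   print(missing_entries("model_store/meta-llama-3-8b-neuronx"))
```

An empty list means every expected entry was found. Any name that is returned points at a file or directory to re-check in the archiving cell.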

## Serving the model with TorchServe
Next, start TorchServe with the model configuration defined above to create the endpoint.

In [21]:
# Each `!` command runs in its own subshell, so use %env to make the variable visible to torchserve.
%env TS_INSTALL_PY_DEP_PER_MODEL=true
!torchserve --ncs --start --model-store model_store --models meta-llama-3-8b-neuronx
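
`torchserve --start` returns before the server is ready to accept requests, so it can be useful to wait for it before sending anything. The sketch below polls the TorchServe health check (`GET /ping` on the inference port) until it reports `Healthy`. It assumes the default inference address `localhost:8080`; the function names, the timeout, and the polling interval are illustrative choices.

```python
import json
import time
import urllib.error
import urllib.request

# Default TorchServe inference address; adjust if your configuration overrides it.
PING_URL = "http://localhost:8080/ping"


def fetch_status(url=PING_URL):
    """Return the "status" field of the ping response, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            body = json.loads(response.read())
    except (urllib.error.URLError, OSError, ValueError):
        # The server is not listening yet, or the body was not valid JSON.
        return None
    return body.get("status") if isinstance(body, dict) else None


def wait_until_healthy(fetch=fetch_status, timeout_s=600.0, interval_s=5.0):
    """Poll `fetch` until it returns "Healthy" or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch() == "Healthy":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example usage, after the torchserve command above has been issued:
#   print("ready" if wait_until_healthy() else "timed out")
```

Passing the status function in as `fetch` keeps the polling loop independent of the network, which makes it easy to try out with a stub. Loading a large model can take several minutes, so a generous timeout is a reasonable default here.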

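## Inference test
Once the TorchServe endpoint is up, you can run real-time streaming predictions against it.
- Use the Python script below to submit inference requests and receive the responses.

Submit an inference request to the model server and check the result.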

In [32]:
# Run single inference request
!python utils/llm_streaming.py -m meta-llama-3-8b-neuronx -o 50 -t 2 -n 4 --prompt-text "Today the weather is really nice and I am planning on "

Tasks are completed
payload={'prompt': 'Today the weather is really nice and I am planning on ', 'max_new_tokens': 50}
, output=Today the weather is really nice and I am planning on 2 hours of walking. I am going to walk to the park and then to the beach. I am going to walk to the park and then to the beach. I am going to walk to the park and then to the beach. I am going

payload={'prompt': 'Today the weather is really nice and I am planning on ', 'max_new_tokens': 50}
, output=Today the weather is really nice and I am planning on 2 hours of walking. I am going to walk to the park and then to the beach. I am going to walk to the park and then to the beach. I am going to walk to the park and then to the beach. I am going

payload={'prompt': 'Today the weather is really nice and I am planning on ', 'max_new_tokens': 50}
, output=Today the weather is really nice and I am planning on 2 hours of walking. I am going to walk to the park and then to the beach. I am going to walk to the park a

In [38]:
# Run single inference request (Stream)
!python utils/llm_streaming.py -m meta-llama-3-8b-neuronx --demo-streaming --prompt-text "Today the weather is really nice and I am planning on "

payload={'prompt': 'Today the weather is really nice and I am planning on ', 'max_new_tokens': 64}
, output=
^C
Traceback (most recent call last):
  File "/home/ubuntu/inferentia2-llm/utils/llm_streaming.py", line 174, in <module>
    main()
  File "/home/ubuntu/inferentia2-llm/utils/llm_streaming.py", line 165, in main
    predictor.join()
  File "/usr/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt


In [None]:
!curl -X POST "http://localhost:8080/predictions/meta-llama-3-8b-neuronx" -H "Content-Type: application/json" -d '{"inputs": "Today the weather is really nice and I am planning on", "parameters": {"max_new_tokens": 50, "prompt_randomize": false}}'
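
The same request can also be sent from Python, which makes it easier to print the output as it arrives. The sketch below mirrors the URL and JSON body of the `curl` command above and reads the response incrementally. It uses only the standard library; the function names and the chunk size are illustrative, and whether the pieces arrive one by one depends on the handler actually streaming its response.

```python
import codecs
import json
import urllib.request

# Same endpoint as the curl command above.
ENDPOINT = "http://localhost:8080/predictions/meta-llama-3-8b-neuronx"


def build_request(prompt, max_new_tokens=50, url=ENDPOINT):
    """Build a POST request with the same JSON body as the curl command."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "prompt_randomize": False},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_response(request, chunk_size=64):
    """Yield decoded pieces of the response body as soon as they are available."""
    # An incremental decoder avoids breaking multi-byte characters across reads.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with urllib.request.urlopen(request) as response:
        while True:
            chunk = response.read1(chunk_size)  # returns what is ready, b"" at EOF
            if not chunk:
                break
            text = decoder.decode(chunk)
            if text:
                yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Example usage (requires the server from the earlier cells to be running):
#   req = build_request("Today the weather is really nice and I am planning on")
#   for piece in stream_response(req):
#       print(piece, end="", flush=True)
```

`read1` is used instead of `read` so that each piece is yielded when it arrives rather than after `chunk_size` bytes have accumulated.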