# SageMaker Inference: BiEncoder RoBERTa from a Local Environment

This notebook deploys the [KLUE RoBERTa](https://huggingface.co/klue/roberta-base) model to a SageMaker endpoint and runs inference against it.

---

## [Prerequisite] Save the AWS role ARN in a .env file as shown below
```
SAGEMAKER_ROLE_ARN=arn:aws:iam::XXXXXX:role/gonsoomoon-sm-inference
```
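
A missing or malformed role ARN tends to surface only later, as a deployment error, so it can be worth checking the value as soon as it is loaded. The following is a minimal sketch and not part of the original notebook: the helper name `validate_role_arn` and the regular expression are assumptions, and the pattern covers only the common IAM role ARN shape (partition, 12-digit account ID, role name).

```python
import re

# Assumed pattern: arn:<partition>:iam::<12-digit account id>:role/<role name or path>
_ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def validate_role_arn(value):
    """Return the ARN unchanged if it looks like an IAM role ARN, else raise ValueError."""
    if not value:
        raise ValueError("SAGEMAKER_ROLE_ARN is not set; check the .env file")
    if not _ROLE_ARN_PATTERN.match(value):
        raise ValueError(f"Unexpected role ARN format: {value!r}")
    return value
```

In the notebook this could be called as `validate_role_arn(os.getenv('SAGEMAKER_ROLE_ARN'))` once the `.env` file has been loaded.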

## 0. Environment check

In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('..')


### Load the role ARN from .env

In [2]:
from dotenv import load_dotenv
import os

# Load SAGEMAKER_ROLE_ARN from the .env file in the parent directory
load_dotenv('../.env')
SAGEMAKER_ROLE_ARN = os.getenv('SAGEMAKER_ROLE_ARN')


## 1. Test the inference functions locally

First, confirm that inference.py works correctly on the local machine before deploying it.

In [3]:
!python ../src/test_inference.py

Testing SageMaker BiEncoder Inference Functions

1. Testing model_fn...
Loading BiEncoder model from .
Using device: cuda
Downloading model from HuggingFace: klue/roberta-base
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
BiEncoder model loaded successfully
   ✓ BiEncoder model loaded

2. Testing input_fn...
Received content_type: application/json
   ✓ Processed 3 queries and 3 documents

3. Testing predict_fn...
Starting BiEncoder prediction
Processing 3 query(s) and 3 document(s)
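
The test output above exercises the handler functions that the SageMaker PyTorch serving container looks for in inference.py (`model_fn`, `input_fn`, `predict_fn`, and `output_fn`). The actual inference.py is not shown in this notebook, so the following is only a sketch of what the two JSON handlers might look like, assuming the `queries`/`documents` payload used in the later cells. The string-to-list convenience and the constant name `JSON_CONTENT_TYPE` are assumptions.

```python
import json

JSON_CONTENT_TYPE = "application/json"


def input_fn(request_body, content_type=JSON_CONTENT_TYPE):
    """Parse a JSON request into lists of query and document strings."""
    if content_type != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported content type: {content_type}")
    payload = json.loads(request_body)
    queries = payload.get("queries", [])
    documents = payload.get("documents", [])
    # Accept a bare string as a single-item list for convenience (assumption)
    if isinstance(queries, str):
        queries = [queries]
    if isinstance(documents, str):
        documents = [documents]
    return {"queries": queries, "documents": documents}


def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    """Serialize the prediction dictionary back to a JSON string."""
    if accept != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction, ensure_ascii=False)
```

Rejecting unknown content types early keeps malformed requests from reaching the model.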

## 2. Environment setup

In [4]:
import json
import time
import boto3
import sagemaker
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

sagemaker_session = sagemaker.Session()
role = SAGEMAKER_ROLE_ARN
bucket = sagemaker_session.default_bucket()

print(f"Bucket: {bucket}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ubuntu/.config/sagemaker/config.yaml
Bucket: sagemaker-us-east-1-057716757052


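## 3. Create the model artifact

SageMaker recognizes the artifact when it is packaged with the model.tar.gz layout below: model files at the archive root and inference code under `code/`.
model.tar.gz layout: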
```
model.tar.gz/
├── config.json
├── model.safetensors
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── vocab.txt
└── code/
    ├── inference.py
    └── requirements.txt
```

Reference: [SageMaker PyTorch Documentation](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#deploy-pytorch-models)

### Create the model.tar.gz file

In [5]:
!rm -rf ../model_artifact
!mkdir -p ../model_artifact/code
!cp ../src/inference.py ../model_artifact/code/
!cp ../src/requirements.txt ../model_artifact/code/
!cp ../model/* ../model_artifact/

!cd ../model_artifact && tar -czf ../model.tar.gz *
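
A quick way to catch packaging mistakes is to open the archive and confirm that the expected entries sit at the right depth. The sketch below uses only the standard-library `tarfile` module. The helper name `missing_members` and the choice of which files count as required are assumptions made for illustration.

```python
import tarfile

# Assumed minimum: the model config at the root plus the entry point under code/
REQUIRED_MEMBERS = ("config.json", "code/inference.py")


def missing_members(tar_path, required=REQUIRED_MEMBERS):
    """Return the required paths that are absent from the archive."""
    with tarfile.open(tar_path, "r:gz") as archive:
        # Normalize names such as "./code/inference.py" to "code/inference.py"
        names = {
            name[2:] if name.startswith("./") else name
            for name in archive.getnames()
        }
    return [path for path in required if path not in names]
```

For the archive created above, `missing_members('../model.tar.gz')` would be expected to return an empty list if the layout matches the tree shown earlier.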



## 4. Create a local SageMaker endpoint

In [6]:
import os

model_data_dir = ".."
local_model_path = os.path.join(model_data_dir, 'model.tar.gz')
print("local_model_path: ", local_model_path)

local_model_path:  ../model.tar.gz


### Check GPU resources

In [7]:
import subprocess

# Default to CPU local mode; switch to GPU only if nvidia-smi runs successfully
instance_type = "local"
try:
    if subprocess.call("nvidia-smi") == 0:
        instance_type = "local_gpu"
except OSError:
    # nvidia-smi is not installed, so no GPU is available
    pass

print("Instance type = " + instance_type)

Sat Sep 27 13:49:55 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      On  |   00000000:36:00.0 Off |                    0 |
| N/A   38C    P8             16W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

Instance type = local_gpu


### Create the local SageMaker endpoint

In [8]:
endpoint_name = "local-endpoint-dual-encoder-{}".format(int(time.time()))

local_pytorch_model = PyTorchModel(
    model_data=local_model_path,
    role=role,
    entry_point='inference.py',
    source_dir='../src',
    framework_version='2.5',
    py_version='py311',
    model_server_workers=1,
)

local_predictor = local_pytorch_model.deploy(
    instance_type=instance_type,
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    wait=True,
    log=False,
)

RootlessDocker not detected, falling back to remote host IP or localhost.


Attaching to xrwfv8abms-algo-1-p6d1o
xrwfv8abms-algo-1-p6d1o  | CUDA compat package should be installed for NVIDIA driver smaller than 550.163.01
xrwfv8abms-algo-1-p6d1o  | Current installed NVIDIA driver version is 570.133.20
xrwfv8abms-algo-1-p6d1o  | Skipping CUDA compat setup as newer NVIDIA driver is installed
xrwfv8abms-algo-1-p6d1o  |   import pkg_resources
xrwfv8abms-algo-1-p6d1o  | Collecting transformers>=4.30.0 (from -r /opt/ml/model/code/requirements.txt (line 1))
xrwfv8abms-algo-1-p6d1o  |   Downloading transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
xrwfv8abms-algo-1-p6d1o  | Collecting huggingface-hub<1.0,>=0.34.0 (from transformers>=4.30.0->-r /opt/ml/model/code/requirements.txt (line 1))
xrwfv8abms-algo-1-p6d1o  |   Downloading huggingface_hub-0.35.1-py3-none-any.whl.metadata (14 kB)
xrwfv8abms-algo-1-p6d1o  | Collecting regex!=2019.12.17 (from transformers>=4.30.0->-r /opt/ml/model/code/requirements.txt (line 1))
xrwfv8abms-algo-1-p6d1o  |   Downloading regex-20

## 5. Endpoint inference

In [9]:
local_predictor.serializer = JSONSerializer()
local_predictor.deserializer = JSONDeserializer()

# BiEncoder: test a single query-document pair
result = local_predictor.predict({
    "queries": ["맛있는 한국 전통 음식 김치찌개"],
    "documents": ["김치찌개와 된장찌개는 한국의 대표 전통 음식입니다."]
})

print(f"Query embeddings shape: ({result['num_queries']}, {result['embedding_dim']})")
print(f"Document embeddings shape: ({result['num_documents']}, {result['embedding_dim']})")


import numpy as np

# Compute similarity (embeddings are already normalized, so the dot product suffices)
query_emb = np.array(result["query_embeddings"])[0]
doc_emb = np.array(result["doc_embeddings"])[0]

similarity = np.dot(query_emb, doc_emb)

print(f"\nCosine similarity: {similarity:.4f}")

RootlessDocker not detected, falling back to remote host IP or localhost.


xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,662 [INFO ] epollEventLoopGroup-3-2 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:model,model_version:default|#hostname:2240db8f9a7b,timestamp:1758981179
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,663 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1758981179663
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,665 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1758981179
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,665 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Received content_type: application/json
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,665 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Received content_type: application/json
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,665 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Starting BiEncoder prediction
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,665 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -

xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,888 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - BiEncoder prediction completed successfully
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,889 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - BiEncoder prediction completed successfully
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,889 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Formatting output with accept: application/json
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,889 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Formatting output with accept: application/json
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,889 [INFO ] W-9000-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:224.48|#ModelName:model,Level:Model|#type:GAUGE|#hostname:2240db8f9a7b,1758981179,6a311f04-2a5d-4617-b85a-1015d77444a8, pattern=[METRICS]
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,891 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.ms:224.48|#ModelN
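
The cell above relies on the fact that, for L2-normalized vectors, the dot product equals the cosine similarity. The sketch below illustrates that identity with NumPy on arbitrary vectors. The function names are hypothetical, and the sketch does not inspect how the deployed model normalizes its outputs.

```python
import numpy as np


def l2_normalize(vector):
    """Scale a vector to unit length."""
    vector = np.asarray(vector, dtype=np.float64)
    norm = np.linalg.norm(vector)
    if norm == 0:
        raise ValueError("Cannot normalize a zero vector")
    return vector / norm


def cosine_similarity(a, b):
    """Cosine similarity computed from the definition, without assuming unit length."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Once both vectors are normalized, the plain dot product matches the full formula
a, b = [3.0, 4.0], [4.0, 3.0]
dot_of_normalized = float(np.dot(l2_normalize(a), l2_normalize(b)))
```

If an endpoint returned unnormalized embeddings, the full `cosine_similarity` form would be the safer choice.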

### Inference on three samples and similarity comparison

In [10]:
from src.utils import test_biencoder_pairs

query_doc_pairs = [
    (
        "맛있는 한국 전통 음식 김치찌개",
        "김치찌개와 된장찌개는 한국의 대표 전통 음식입니다."
    ),
    (
        "최신 기술 발전",
        "인공지능 기술이 빠르게 발전하고 있습니다."
    ),
    (
        "색깔",
        "파리의 에펠탑은 프랑스의 상징입니다."
    )
]

# Run the BiEncoder pair-wise similarity test
test_biencoder_pairs(local_predictor, query_doc_pairs)

RootlessDocker not detected, falling back to remote host IP or localhost.


xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,972 [INFO ] epollEventLoopGroup-3-2 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:model,model_version:default|#hostname:2240db8f9a7b,timestamp:1758981179
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,972 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1758981179972
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,973 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1758981179
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,973 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Received content_type: application/json
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,974 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Received content_type: application/json
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,974 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Starting BiEncoder prediction
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:52:59,974 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -
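
The source of `test_biencoder_pairs` in `src.utils` is not shown in this notebook. As a rough idea of what a pair-wise helper could do, the sketch below sends all pairs in one request and scores each query against its own document. The function name `pairwise_similarities` is hypothetical, and the response keys are assumed to match those used in the single-pair cell (`query_embeddings`, `doc_embeddings`).

```python
def pairwise_similarities(predict, pairs):
    """Send all pairs in one request and return one dot-product score per pair.

    `predict` is any callable that accepts the request dictionary,
    for example `local_predictor.predict`.
    """
    queries = [query for query, _ in pairs]
    documents = [document for _, document in pairs]
    result = predict({"queries": queries, "documents": documents})

    scores = []
    for q_emb, d_emb in zip(result["query_embeddings"], result["doc_embeddings"]):
        # Embeddings are assumed L2-normalized, so the dot product is the cosine similarity
        scores.append(sum(q * d for q, d in zip(q_emb, d_emb)))
    return scores
```

Passing the predictor's `predict` method as a callable keeps the helper easy to test with a stub instead of a live endpoint.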

### Batch inference on eight query-document pairs

In [11]:
batch_result = local_predictor.predict({
    "queries": [
        "맛있는 한국 전통 음식 김치찌개",
        "최신 기술 발전", 
        "색깔",
        "여행 계획",
        "스포츠 경기",
        "영화 추천",
        "날씨 정보",
        "건강 관리"
    ],
    "documents": [
        "김치찌개와 된장찌개는 한국의 대표 전통 음식입니다.",
        "인공지능 기술이 빠르게 발전하고 있습니다.",
        "파리의 에펠탑은 프랑스의 상징입니다.",
        "제주도는 한국의 인기 여행지입니다.",
        "축구 경기가 오늘 저녁에 있습니다.",
        "최근 개봉한 영화가 좋은 평가를 받고 있습니다.",
        "내일은 맑은 날씨가 예상됩니다.",
        "규칙적인 운동이 건강에 좋습니다."
    ]
})

print("Batch inference completed:")
print(f"  Queries: {batch_result['num_queries']}")
print(f"  Documents: {batch_result['num_documents']}")
print(f"  Embedding dim: {batch_result['embedding_dim']}\n")

# Compute the cosine similarity of each pair
query_embs = np.array(batch_result['query_embeddings'])
doc_embs = np.array(batch_result['doc_embeddings'])

print("Pair-wise cosine similarities:")
for i in range(len(query_embs)):
    similarity = np.dot(query_embs[i], doc_embs[i])
    print(f"  Pair {i+1}: {similarity:.4f}")

RootlessDocker not detected, falling back to remote host IP or localhost.


xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:53:00,078 [INFO ] epollEventLoopGroup-3-2 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:model,model_version:default|#hostname:2240db8f9a7b,timestamp:1758981180
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:53:00,078 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1758981180078
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:53:00,079 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1758981180
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:53:00,080 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Received content_type: application/json
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:53:00,080 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Received content_type: application/json
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:53:00,080 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Starting BiEncoder prediction
xrwfv8abms-algo-1-p6d1o  | 2025-09-27T13:53:00,080 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -
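
The loop in the cell above compares only query *i* with document *i*. For a retrieval-style check, the same embeddings can be compared all-against-all with a single matrix product. The sketch below assumes the embeddings are L2-normalized, as noted in the single-pair cell, and the function name `rank_documents` is hypothetical.

```python
import numpy as np


def rank_documents(query_embeddings, doc_embeddings):
    """Return (similarity matrix, index of the best-matching document per query)."""
    queries = np.asarray(query_embeddings, dtype=np.float64)
    documents = np.asarray(doc_embeddings, dtype=np.float64)
    # Entry [i, j] is the dot product of query i and document j
    similarity_matrix = queries @ documents.T
    best_document = similarity_matrix.argmax(axis=1)
    return similarity_matrix, best_document
```

With the batch result above, `rank_documents(batch_result['query_embeddings'], batch_result['doc_embeddings'])` would give an 8×8 matrix whose diagonal holds the pair-wise scores printed by the loop.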

## 6. Delete the local endpoint


In [12]:
local_predictor.delete_endpoint(delete_endpoint_config=True)
print("✅ Endpoint deleted")

Gracefully Stopping... press Ctrl+C again to force


✅ Endpoint deleted
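
If an earlier cell fails, the local container can be left running because the deletion cell is never reached. One way to guard against that in a script is to wrap the predictor in a context manager. The sketch below is an assumption about how this could be arranged, not something the notebook itself does, and the name `temporary_endpoint` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor):
    """Yield the predictor and delete its endpoint when the block exits."""
    try:
        yield predictor
    finally:
        # Runs on normal exit and on exceptions alike
        predictor.delete_endpoint(delete_endpoint_config=True)
```

Usage would look like `with temporary_endpoint(local_pytorch_model.deploy(...)) as predictor:` followed by the inference calls inside the block.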
