# 1. SageMaker Training with Experiments and Processing For AutoGluon

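## Overview of the training job notebook

- Adding SageMaker Experiments to SageMaker Training lets you compare the results of multiple experiments.
    - [Import the libraries required to run the job](#작업-실행-시-필요-라이브러리-import)
    - [Define the SageMaker session, role, and bucket](#SageMaker-세션과-Role,-사용-버킷-정의)
    - [Define the hyperparameters](#하이퍼파라미터-정의)
    - [Define the training job](#학습-실행-작업-정의)
        - Training script name
        - Training code directory
        - Framework type and version used by the training code
        - Training instance type and count
        - SageMaker session
        - Training job hyperparameters
        - S3 bucket settings for training job artifacts
    - [Specify the training dataset](#학습-데이터셋-지정)
        - S3 URI of the dataset used for training
    - [Set up SageMaker Experiments](#SageMaker-실험-설정)
    - [Run training](#학습-실행)
    - [Dataset description](#데이터-세트-설명)
    - [View the experiment results](#실험-결과-보기)
    - [Run evaluation](#Evaluation-하기)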

### Import the libraries required to run the job

In [2]:
!pip install -U sagemaker-experiments


In [18]:
import os
import json
import pandas as pd
import boto3
import sagemaker

In [19]:
from ag_model import (
    AutoGluonTraining,
    AutoGluonInferenceModel,
    AutoGluonTabularPredictor,
    AutoGluonFramework
)

### Define the SageMaker session, role, and bucket

In [20]:
sagemaker_session = sagemaker.session.Session()
# Use the public attribute rather than the private _region_name.
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

In [21]:
bucket = sagemaker_session.default_bucket()
code_location = f's3://{bucket}/autogluon/code'
output_path = f's3://{bucket}/autogluon/output'

### Define the hyperparameters

In [22]:
hyperparameters = {
    "config_name": "config-med.yaml",
}
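
In SageMaker script mode, each entry in `hyperparameters` is normally handed to the entry point as a command-line argument (here `--config_name config-med.yaml`). The sketch below shows one way a training script could read that value. The `parse_hyperparameters` helper and its `None` default are illustrative assumptions and are not taken from `autogluon_starter_script.py`.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters that arrive as command-line arguments."""
    parser = argparse.ArgumentParser()
    # Name of the YAML file expected in the "config" input channel.
    parser.add_argument("--config_name", type=str, default=None)
    # Tolerate any extra arguments the container may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(["--config_name", "config-med.yaml"])
print(args.config_name)
```

Using `parse_known_args` keeps the script from failing when additional arguments are passed alongside the ones it knows about.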

### Specify the training dataset

In [23]:
data_path=f's3://{bucket}/autogluon/dataset'
config_path = f's3://{bucket}/autogluon/config'
!aws s3 sync ../data/dataset/ $data_path
!aws s3 sync ./config/ $config_path

data_path

upload: ../data/dataset/test.csv to s3://sagemaker-us-east-1-238312515155/autogluon/dataset/test.csv
upload: ../data/dataset/test_no_header.csv to s3://sagemaker-us-east-1-238312515155/autogluon/dataset/test_no_header.csv
upload: ../data/dataset/.ipynb_checkpoints/train-checkpoint.csv to s3://sagemaker-us-east-1-238312515155/autogluon/dataset/.ipynb_checkpoints/train-checkpoint.csv
upload: ../data/dataset/train.csv to s3://sagemaker-us-east-1-238312515155/autogluon/dataset/train.csv
upload: config/config/.ipynb_checkpoints/config-full-checkpoint.yaml to s3://sagemaker-us-east-1-238312515155/autogluon/config/config/.ipynb_checkpoints/config-full-checkpoint.yaml
upload: config/config/.ipynb_checkpoints/config-med-checkpoint.yaml to s3://sagemaker-us-east-1-238312515155/autogluon/config/config/.ipynb_checkpoints/config-med-checkpoint.yaml
upload: config/config/config-med.yaml to s3://sagemaker-us-east-1-238312515155/autogluon/config/config/config-med.yaml
upload: config/config/config-full

's3://sagemaker-us-east-1-238312515155/autogluon/dataset'

### Define the training job

In [24]:
instance_count = 1
instance_type = "ml.m5.large"
# instance_type = 'local'
max_run = 1*60*60

use_spot_instances = False
if use_spot_instances:
    max_wait = 1*60*60
else:
    max_wait = None
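
With managed spot training, SageMaker expects `max_wait` to be at least as large as `max_run`, because the wait budget covers both the time spent waiting for capacity and the training time itself. A minimal sketch of that rule follows. The `spot_settings` helper is hypothetical and only assembles the values passed to the estimator in the cells above.

```python
def spot_settings(use_spot_instances, max_run, max_wait=None):
    """Return estimator keyword arguments for on-demand or spot training."""
    if not use_spot_instances:
        # max_wait only applies to managed spot training.
        return {"use_spot_instances": False, "max_run": max_run, "max_wait": None}
    if max_wait is None:
        # Smallest valid value: no extra time budgeted for waiting on capacity.
        max_wait = max_run
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {"use_spot_instances": True, "max_run": max_run, "max_wait": max_wait}


print(spot_settings(use_spot_instances=True, max_run=1 * 60 * 60))
```

The returned dictionary could be unpacked into the estimator constructor in place of the separate `max_run`, `use_spot_instances`, and `max_wait` arguments.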

In [25]:
if instance_type == 'local':
    from sagemaker.local import LocalSession
    
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
    local_data_path = "file://" + os.getcwd().replace('/lab_1_training', '') + "/data/dataset"
    
    data_channels = {
        "inputdata": local_data_path, 
        "config" : "file://" + os.getcwd() + '/config'
    }
    
else:
    sess = boto3.Session()
    sagemaker_session = sagemaker.Session()
    sm = sess.client('sagemaker')
    
    data_channels = {
        "inputdata": data_path, 
        "config" : config_path
    }
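
Each key in `data_channels` becomes a named input channel. Inside the training container, the SageMaker training toolkit exposes the local directory of every channel through an `SM_CHANNEL_<NAME>` environment variable, conventionally under `/opt/ml/input/data/<name>`. The sketch below shows how a training script might resolve those directories. The `channel_dir` helper is an illustrative assumption, not part of the notebook's source code.

```python
import os


def channel_dir(channel_name, environ=None):
    """Return the local directory where a training input channel is mounted."""
    environ = os.environ if environ is None else environ
    env_key = f"SM_CHANNEL_{channel_name.upper()}"
    # Fall back to the conventional mount point when the variable is absent.
    return environ.get(env_key, f"/opt/ml/input/data/{channel_name}")


for name in ("inputdata", "config"):
    print(name, "->", channel_dir(name))
```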

In [26]:
ag_estimator = AutoGluonTraining(
    entry_point="autogluon_starter_script.py",
    source_dir=os.getcwd() + "/src",
    role=role,
    # region=region,
    sagemaker_session=sagemaker_session,
    output_path=output_path,
    code_location=code_location,
    hyperparameters=hyperparameters,
    instance_count=instance_count,
    instance_type=instance_type,
    framework_version="0.4",
    py_version="py38",
    max_run=max_run,
    use_spot_instances=use_spot_instances,  # use Spot Instances
    max_wait=max_wait,
)

### Set up SageMaker Experiments

In [27]:
experiment_name='autogluon-poc-1'

In [28]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from time import strftime

In [29]:
def create_experiment(experiment_name):
    # Load the experiment if it already exists; otherwise create it.
    try:
        sm_experiment = Experiment.load(experiment_name)
    except Exception:
        sm_experiment = Experiment.create(experiment_name=experiment_name)
    return sm_experiment

In [30]:
def create_trial(experiment_name):
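    # Note: lowercase "%s" is a platform-specific directive that expands to epoch
    # seconds on glibc. It keeps trial names unique but is not portable; "%S" is
    # the standard two-digit seconds directive.
    create_date = strftime("%m%d-%H%M%s")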

    sm_trial = Trial.create(trial_name=f'{experiment_name}-{create_date}',
                            experiment_name=experiment_name)

    job_name = f'{sm_trial.trial_name}'
    return job_name
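
Because the trial name is reused as the training job name, it has to satisfy the job naming rule: 1 to 63 characters, alphanumerics and hyphens only, per the pattern in the `CreateTrainingJob` API reference. The sketch below builds a name with the portable `%S` directive and validates it before any API call. `make_job_name` and `JOB_NAME_PATTERN` are hypothetical names introduced for illustration.

```python
import re
from datetime import datetime

# Pattern documented for SageMaker training job names (1 to 63 characters).
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(experiment_name, now=None):
    """Build a timestamped job name and check it against the naming rule."""
    now = datetime.now() if now is None else now
    # %S is the portable two-digit seconds directive.
    name = f"{experiment_name}-{now.strftime('%m%d-%H%M%S')}"
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"invalid job name: {name!r}")
    return name


print(make_job_name("autogluon-poc-1", datetime(2022, 7, 22, 14, 42, 47)))
# prints "autogluon-poc-1-0722-144247"
```

Validating locally surfaces a bad experiment name (for example, one containing an underscore) before the job is submitted.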

### Run training

In [31]:
data_channels

{'inputdata': 's3://sagemaker-us-east-1-238312515155/autogluon/dataset',
 'config': 's3://sagemaker-us-east-1-238312515155/autogluon/config'}

In [32]:
create_experiment(experiment_name)
job_name = create_trial(experiment_name)

ag_estimator.fit(inputs = data_channels,
                  job_name = job_name,
                  experiment_config={
                      'TrialName': job_name,
                      'TrialComponentDisplayName': job_name,
                  },
                  wait=False)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: autogluon-poc-1-0722-14421658500967


In [33]:
ag_estimator.logs()

2022-07-22 14:42:48 Starting - Starting the training job...
2022-07-22 14:43:12 Starting - Preparing the instances for trainingProfilerReport-1658500968: InProgress
.........
2022-07-22 14:44:32 Downloading - Downloading input data...
2022-07-22 14:45:13 Training - Downloading the training image.........
2022-07-22 14:46:40 Training - Training image download completed. Training in progress..
2022-07-22 14:46:42,551 sagemaker-training-toolkit INFO     Imported framework sagemaker_mxnet_container.training
2022-07-22 14:46:42,554 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-07-22 14:46:42,568 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"config_name":"config-med.yaml"}', 'SM_USER_ENTRY_POINT': 'autogluon_starter_script.py', 'SM_FRAMEWORK_PARAMS': '{}', 'SM_RESOURCE_CONFIG': '{"current_group_name":"homogeneousCluster","curre

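### View the experiment results
Check the results of the experiment run above.
- For each training job trial, you can inspect the training data used, the input hyperparameters, the model evaluation metrics, the location of the model artifacts, and so on.
- **Everything below can also be viewed intuitively in SageMaker Studio.**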

In [34]:
!rm -rf ./autogluon/
!mkdir -p ./autogluon/result
!aws s3 cp {ag_estimator.model_data} ./autogluon/

download: s3://sagemaker-us-east-1-238312515155/autogluon/output/autogluon-poc-1-0722-14421658500967/output/model.tar.gz to autogluon/model.tar.gz


In [35]:
!ls -alF ./autogluon/model.tar.gz

-rw-r--r-- 1 root root 11118023 Jul 22 14:49 ./autogluon/model.tar.gz


In [36]:
!tar -xzf ./autogluon/model.tar.gz -C ./autogluon/result/
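
The same extraction can be done from Python with the standard `tarfile` module, which also makes it possible to reject archive members that would be written outside the target directory. This is a minimal sketch; the `extract_model` helper is an assumption introduced here, not something the notebook defines.

```python
import tarfile
from pathlib import Path


def extract_model(archive_path, dest_dir):
    """Extract a model.tar.gz archive and return the sorted member names."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            # Refuse members that would land outside dest_dir.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
        return sorted(tar.getnames())
```

For the files downloaded above, the call would be `extract_model("./autogluon/model.tar.gz", "./autogluon/result")`.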

### Endpoint Deployment

In [37]:
instance_type = "ml.m5.2xlarge"
# instance_type = 'local'

In [38]:
if instance_type == 'local':
    from sagemaker.local import LocalSession
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
else:
    sess = boto3.Session()
    sagemaker_session = sagemaker.Session()

In [39]:
model = AutoGluonInferenceModel(
    source_dir=os.getcwd() + "/src",
    entry_point="autogluon_serve.py",
    model_data=ag_estimator.model_data,
    instance_type=instance_type,
    role=role,
    sagemaker_session=sagemaker_session,
    # region=region,
    framework_version="0.4",
    py_version="py38",
    predictor_cls=AutoGluonTabularPredictor
)

In [40]:
from sagemaker.serializers import CSVSerializer

predictor = model.deploy(
    initial_instance_count=1, serializer=CSVSerializer(), instance_type=instance_type
)

INFO:sagemaker:Creating model with name: autogluon-inference-2022-07-22-14-50-04-093
INFO:sagemaker:Creating endpoint-config with name autogluon-inference-2022-07-22-14-50-04-721
INFO:sagemaker:Creating endpoint with name autogluon-inference-2022-07-22-14-50-04-721


------!

### Predict on unlabeled test data

Remove target variable (`fraud`) from the data and get predictions for a sample of 100 rows using the deployed endpoint.

In [41]:
df = pd.read_csv("../data/dataset/test.csv")
data = df.drop(columns="fraud")[:100].values

In [42]:
preds = predictor.predict(data)
pred_df = pd.DataFrame(json.loads(preds))

In [43]:
# Reset both indexes so predictions and labels align row by row.
# Calling reset_index(inplace=True) on a temporary slice has no lasting effect.
pred_df = pred_df.reset_index(drop=True)
df = df.reset_index(drop=True)

In [44]:
p = pd.DataFrame({"preds": pred_df['fraud'], "actual": df["fraud"][: len(pred_df)]})
p.head()

|   | preds | actual |
|---|-------|--------|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |


In [45]:
print(f"{(p.preds==p.actual).astype(int).sum()}/{len(p)} are correct")

97/100 are correct
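
A single accuracy figure does not show which kind of mistakes were made, and for a fraud label it matters whether errors are missed positives or false alarms. The sketch below counts the four outcomes for a binary label. The `confusion_counts` helper and the toy labels are illustrative assumptions, not values from the notebook's dataset.

```python
def confusion_counts(preds, actual, positive=1):
    """Count true/false positives and negatives for a binary label."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, a in zip(preds, actual):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Toy labels for illustration only.
print(confusion_counts(preds=[0, 0, 1, 0, 1], actual=[0, 1, 1, 0, 0]))
# prints {'tp': 1, 'fp': 1, 'tn': 2, 'fn': 1}
```

With the DataFrame built above, the call would be `confusion_counts(p.preds, p.actual)`.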


### Cleanup Endpoint

In [46]:
# Delete the endpoint so it stops incurring charges.
predictor.delete_endpoint()


# Batch Transform

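Deploying a trained model to a hosted endpoint has been available in SageMaker since launch, and it is a good way to serve real-time predictions to a service such as a website or mobile app. However, if the goal is to generate predictions from a trained model on a large dataset where minimizing latency is not a concern, the batch transform feature may be easier, more scalable, and more appropriate.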

[Read more about Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html).

In [47]:
instance_type = "ml.m5.2xlarge"

In [48]:
# model = AutoGluonInferenceModel(
#     source_dir=os.getcwd() + "/src",
#     entry_point="autogluon_serve.py",
#     model_data=ag_estimator.model_data,
#     instance_type=instance_type,
#     role=role,
#     sagemaker_session=sagemaker_session,
#     region=region,
#     framework_version="0.4",
#     py_version="py38",    
#     predictor_cls=AutoGluonTabularPredictor,
# )

In [49]:
transformer = model.transformer(
    instance_count=1,
    instance_type=instance_type,
    strategy="MultiRecord",
    max_payload=6,
    max_concurrent_transforms=1,
    output_path=output_path,
    accept="application/json",
    assemble_with="Line",
)


INFO:sagemaker:Creating model with name: autogluon-inference-2022-07-22-14-53-30-823


Prepare data for batch transform

In [50]:
pd.read_csv("../data/dataset/test.csv")[:100].to_csv("../data/dataset/test_no_header.csv", header=False, index=False)

In [51]:
# key_prefix is relative to the session's default bucket, so it must not repeat the bucket name.
test_input = transformer.sagemaker_session.upload_data(
    path=os.path.join("../data/dataset", "test_no_header.csv"), key_prefix="autogluon/dataset"
)
test_input

's3://sagemaker-us-east-1-238312515155/autogluon/dataset/test_no_header.csv'
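
The transform call in the next cell uses two JSONPath filters: `input_filter="$[1:]"` treats each CSV record as an array and drops its first element, and `output_filter="$['fraud']"` keeps only the `fraud` key of the JSON response. The sketch below emulates both in plain Python to show the effect. It assumes the target is the first column of `test_no_header.csv`, and the sample row and response shape are hypothetical.

```python
import json


def apply_input_filter(csv_line):
    """Emulate input_filter="$[1:]": drop the first column of a CSV record."""
    # Simple split; assumes no quoted fields that contain commas.
    fields = csv_line.rstrip("\n").split(",")
    return ",".join(fields[1:])


def apply_output_filter(json_record, key="fraud"):
    """Emulate output_filter="$['fraud']": keep one key of a JSON record."""
    return json.loads(json_record)[key]


# Hypothetical record: target first, then three feature values.
print(apply_input_filter("0,35,1200.5,3"))
# Hypothetical response shape with one extra key that the filter discards.
print(apply_output_filter('{"fraud": {"0": 0, "1": 1}, "other": {}}'))
```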

In [52]:
transformer.transform(
    test_input,
    input_filter="$[1:]",  # filter-out target variable
    split_type="Line",
    content_type="text/csv",
    output_filter="$['fraud']",  # keep only prediction class in the output
)

transformer.wait()

INFO:sagemaker:Creating transform job with name: autogluon-inference-2022-07-22-14-53-31-885


2022-07-22T14:58:44,122 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /usr/local/lib/python3.8/dist-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 8
Max heap size: 7045 M
Python executable: /usr/bin/python3
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 8
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2022-07-22T14:58:44,177 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStrea

Download the batch transform results.

In [53]:
!rm -rf ./autogluon_batch_result
!mkdir ./autogluon_batch_result

In [54]:
transformer.output_path

's3://sagemaker-us-east-1-238312515155/autogluon/output'

In [55]:
!aws s3 cp {transformer.output_path}/test_no_header.csv.out ./autogluon_batch_result/

download: s3://sagemaker-us-east-1-238312515155/autogluon/output/test_no_header.csv.out to autogluon_batch_result/test_no_header.csv.out


In [56]:
p = pd.concat(
    [
        pd.read_json("./autogluon_batch_result/test_no_header.csv.out", orient="index")
        .sort_index()
        .rename(columns={0: "preds"}),
        pd.read_csv("../data/dataset/test.csv")[["fraud"]].iloc[:100].rename(columns={"fraud": "actual"}),
    ],
    axis=1,
)
p.head()

|   | preds | actual |
|---|-------|--------|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |


In [57]:
print(f"{(p.preds==p.actual).astype(int).sum()}/{len(p)} are correct")

97/100 are correct


### Run evaluation with SageMaker Processing
You can run evaluation code with SageMaker Processing. Applying Processing in an MLOps workflow lets steps such as preprocessing and evaluation run in a serverless fashion.

In [58]:
from sagemaker.processing import FrameworkProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.estimator import Framework

In [59]:
instance_count = 1
instance_type = "ml.m5.large"
# instance_type = 'local'

In [60]:
from sagemaker import image_uris

image_uri = image_uris.retrieve(
    "autogluon",
    region=region,
    version="0.4",
    py_version="py38",
    image_scope="training",
    instance_type=instance_type,
)
image_uri


'763104351884.dkr.ecr.us-east-1.amazonaws.com/autogluon-training:0.4-cpu-py38'
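
The returned URI encodes the choices passed to `image_uris.retrieve`: the ECR registry account and region, the repository (`autogluon-training` for `image_scope="training"`), and a tag that combines the framework version, processor type, and Python version. The sketch below splits a URI of this form into its parts. `parse_image_uri` and its regular expression are assumptions written for this example, not part of the SageMaker SDK.

```python
import re

# Matches <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
IMAGE_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_image_uri(uri):
    """Split an ECR image URI into account, region, repository, and tag."""
    match = IMAGE_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return match.groupdict()


parts = parse_image_uri(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/autogluon-training:0.4-cpu-py38"
)
print(parts["repository"], parts["tag"])
# prints "autogluon-training 0.4-cpu-py38"
```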

In [61]:
script_eval = FrameworkProcessor(
    AutoGluonFramework,
    framework_version="0.4",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    instance_type=instance_type,
    instance_count=instance_count
)

In [62]:
detect_outputpath = f's3://{bucket}/autogluon/processing'

In [63]:
source_dir='src'

if instance_type == 'local':
    from sagemaker.local import LocalSession
    from pathlib import Path
    
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
    source_dir = f'{Path.cwd()}/src'
    s3_test_path=f'../data/dataset/test.csv'
else:
    sagemaker_session = sagemaker.Session()
    s3_test_path = data_path + '/test.csv'

In [64]:
create_experiment(experiment_name)
job_name = create_trial(experiment_name)

script_eval.run(
    code="autogluon_evaluation.py",
    source_dir=source_dir,
    inputs=[ProcessingInput(source=s3_test_path, input_name="test_data", destination="/opt/ml/processing/test"),
            ProcessingInput(source=ag_estimator.model_data, input_name="model_weight", destination="/opt/ml/processing/model")
    ],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/output", output_name='evaluation', destination=detect_outputpath + "/" + job_name),
    ],
    job_name=job_name,
    experiment_config={
        'TrialName': job_name,
        'TrialComponentDisplayName': job_name,
    },
    wait=False
)

INFO:sagemaker.processing:Uploaded src to s3://sagemaker-us-east-1-238312515155/autogluon-poc-1-0722-14591658501959/source/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-us-east-1-238312515155/autogluon-poc-1-0722-14591658501959/source/runproc.sh
INFO:sagemaker:Creating processing-job with name autogluon-poc-1-0722-14591658501959



Job Name:  autogluon-poc-1-0722-14591658501959
Inputs:  [{'InputName': 'test_data', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-238312515155/autogluon/dataset/test.csv', 'LocalPath': '/opt/ml/processing/test', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'model_weight', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-238312515155/autogluon/output/autogluon-poc-1-0722-14421658500967/output/model.tar.gz', 'LocalPath': '/opt/ml/processing/model', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-238312515155/autogluon-poc-1-0722-14591658501959/source/sourcedir.tar.gz', 'LocalPath': '/opt/ml/processing/input/code/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': '

In [65]:
script_eval.latest_job.wait()

...............................485/500 are correct
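
A Processing script reads its inputs from the `destination` paths given to `ProcessingInput` and writes results under the `source` path of `ProcessingOutput`, which SageMaker uploads to S3 when the job finishes. The sketch below covers only the reporting step of such a script. The `write_evaluation_report` helper and the `evaluation.json` file name are hypothetical and do not reflect the actual contents of `autogluon_evaluation.py`.

```python
import json
import tempfile
from pathlib import Path


def write_evaluation_report(preds, actual, output_dir):
    """Write a small accuracy report as JSON and return a summary line."""
    correct = sum(int(p == a) for p, a in zip(preds, actual))
    total = len(actual)
    report = {
        "correct": correct,
        "total": total,
        "accuracy": correct / total if total else 0.0,
    }
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical file name; anything written to output_dir would be uploaded.
    (out / "evaluation.json").write_text(json.dumps(report))
    return f"{correct}/{total} are correct"


# Local demonstration with toy labels and a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    print(write_evaluation_report([0, 1, 0, 0], [0, 1, 1, 0], tmp_dir))
```

Inside the Processing container, `output_dir` would be `/opt/ml/processing/output`, matching the `ProcessingOutput` source configured above.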

