# <B> # SageMaker Training with Experiments and Processing </B>
* Container: codna_python3

## 학습 작업의 실행 노트북 개요

- SageMaker Training에 SageMaker 실험을 추가하여 여러 실험의 결과를 비교할 수 있습니다.
    - [작업 실행 시 필요 라이브러리 import](#작업-실행-시-필요-라이브러리-import)
    - [SageMaker 세션과 Role, 사용 버킷 정의](#SageMaker-세션과-Role,-사용-버킷-정의)
    - [하이퍼파라미터 정의](#하이퍼파라미터-정의)
    - [학습 실행 작업 정의](#학습-실행-작업-정의)
        - 학습 코드 명
        - 학습 코드 폴더 명
        - 학습 코드가 사용한 Framework 종류, 버전 등
        - 학습 인스턴스 타입과 개수
        - SageMaker 세션
        - 학습 작업 하이퍼파라미터 정의
        - 학습 작업 산출물 관련 S3 버킷 설정 등
    - [학습 데이터셋 지정](#학습-데이터셋-지정)
        - 학습에 사용하는 데이터셋의 S3 URI 지정
    - [SageMaker 실험 설정](#SageMaker-실험-설정)
    - [학습 실행](#학습-실행)
    - [데이터 세트 설명](#데이터-세트-설명)
    - [실험 결과 보기](#실험-결과-보기)

## AutoReload

In [1]:
%load_ext autoreload
%autoreload 2

## 0. Install packages

In [3]:
install_needed = False  # should only be True once
# install_needed = False

In [4]:
%%bash
#!/bin/bash

DAEMON_PATH="/etc/docker"
MEMORY_SIZE=10G

FLAG=$(cat $DAEMON_PATH/daemon.json | jq 'has("data-root")')
# echo $FLAG

if [ "$FLAG" == true ]; then
    echo "Already revised"
else
    echo "Add data-root and default-shm-size=$MEMORY_SIZE"
    sudo cp $DAEMON_PATH/daemon.json $DAEMON_PATH/daemon.json.bak
    sudo cat $DAEMON_PATH/daemon.json.bak | jq '. += {"data-root":"/home/ec2-user/SageMaker/.container/docker","default-shm-size":"'$MEMORY_SIZE'"}' | sudo tee $DAEMON_PATH/daemon.json > /dev/null
    sudo service docker restart
    echo "Docker Restart"
fi

Already revised


In [5]:
import sys
import IPython

if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U pip
    !{sys.executable} -m pip install -U smdebug sagemaker-experiments
    !{sys.executable} -m pip install -U sagemaker
    
    IPython.Application.instance().kernel.do_shutdown(True)

## 1. parameter store 설정

In [6]:
import boto3
from utils.ssm import parameter_store

In [7]:
strRegionName=boto3.Session().region_name
pm = parameter_store(strRegionName)
strPrefix = pm.get_params(key="PREFIX")

In [8]:
strBucketName = pm.get_params(key="-".join([strPrefix, "BUCKET"]))
strExecutionRole = pm.get_params(key="-".join([strPrefix, "SAGEMAKER-ROLE-ARN"]))

In [9]:
print (f'strBucketName: {strBucketName}')
print (f'strExecutionRole: {strExecutionRole}')

strBucketName: sagemaker-us-east-1-419974056037
strExecutionRole: arn:aws:iam::419974056037:role/service-role/AmazonSageMaker-ExecutionRole-20221206T163436


## 2. Dataset

In [10]:
import os

In [11]:
strS3DataPath = f's3://{strBucketName}/DJ-SM-PIPELINE-DATA'
strLocalDataPath = os.path.join(os.getcwd(), "dataset")

## 3.Training-job

In [14]:
import os
import sagemaker
from sagemaker.xgboost.estimator import XGBoost

* **Set Up SageMaker Experiment**
    - Create or load [SageMaker Experiment](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) for the example training job. This will create an experiment trial object in SageMaker.

In [15]:
from time import strftime
from smexperiments.trial import Trial
from smexperiments.experiment import Experiment

In [16]:
def create_experiment(experiment_name):
    try: sm_experiment = Experiment.load(experiment_name)
    except: sm_experiment = Experiment.create(experiment_name=experiment_name)

In [17]:
def create_trial(experiment_name):
    create_date = strftime("%m%d-%H%M%s")
    sm_trial = Trial.create(trial_name=f'{experiment_name}-{create_date}',
                            experiment_name=experiment_name)
    job_name = f'{sm_trial.trial_name}'
    return job_name

* params for training job

In [26]:
# Set to True to enable SageMaker to run locally
local_mode = True

if local_mode:
    
    from sagemaker.local import LocalSession
    
    strInstanceType = "local"
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
        
    dicDataChannels = {
        "TR": f'file://{os.path.join(strLocalDataPath, "abalone.csv")}',
        "TE": f'file://{os.path.join(strLocalDataPath, "abalone.csv")}',
    }
    
else:
    strInstanceType = "ml.m5.2xlarge"
    
    sagemaker_session = sagemaker.Session()
    dicDataChannels = {
        "TR": os.path.join(strS3DataPath, "abalone.csv"), 
        "TE": os.path.join(strS3DataPath, "abalone.csv"), 
    }

nInstanceCount = 1

bSpotTraining = False
if bSpotTraining:
    nMaxWait = 1*60*60
    nMaxRun = 1*60*60
    
else:
    nMaxWait = None
    nMaxRun = 1*60*60
    

bUseTrainWarmPool = False ## training image 다운받지 않음, 속도 빨라진다
if bUseTrainWarmPool: nKeepAliveSeconds = 3600 ## 최대 1시간 동안!!, service quota에서 warmpool을 위한 request 필요
else: nKeepAliveSeconds = None
if bSpotTraining:
    bUseTrainWarmPool = False # warmpool은 spot instance 사용시 활용 할 수 없음
    nKeepAliveSeconds = None
    


strOutputPath = os.path.join(
    "s3://{}".format(strBucketName),
    strPrefix,
    "training",
    "model-output"
)

strCodeLocation = os.path.join(
    "s3://{}".format(strBucketName),
    strPrefix,
    "training",
    "backup_codes"
)

strExperimentName = '-'.join([strPrefix, "experiments"])

## You can't override the metric definitions for Amazon SageMaker algorithms. 
# strNumeticRegEx = "([0-9\\.]+)(e-?[[01][0-9])?"
# listMetricDefinitions = [
#     {"Name": "train_loss", "Regex": f"loss={strNumeticRegEx}"},
#     {"Name": "wer", "Regex": f"wer:{strNumeticRegEx}"}
# ]

# dicGitConfig = {
#     'repo': f'https://{pm.get_params(key="-".join([prefix, "CODE_REPO"]))}',
#     'branch': 'main',
#     'username': pm.get_params(key="-".join([prefix, "CODECOMMIT-USERNAME"]), enc=True),
#     'password': pm.get_params(key="-".join([prefix, "CODECOMMIT-PWD"]), enc=True)
# }  

kwargs = {}

In [27]:
print (f'strInstanceType: {strInstanceType}')
print (f'nInstanceCount: {nInstanceCount}')
print (f'sagemaker_session: {sagemaker_session}')
print (f'bSpotTraining: {bSpotTraining}')
print (f'strExperimentName: {strExperimentName}')
print (f'dicDataChannels: {dicDataChannels}')
print (f'strOutputPath: {strOutputPath}')
print (f'strCodeLocation: {strCodeLocation}')
print (f'bUseTrainWarmPool: {bUseTrainWarmPool}/{nKeepAliveSeconds}')

strInstanceType: local
nInstanceCount: 1
sagemaker_session: <sagemaker.local.local_session.LocalSession object at 0x7fb230116170>
bSpotTraining: False
strExperimentName: DJ-SM-PIPELINE-experiments
dicDataChannels: {'TR': 'file:///home/ec2-user/SageMaker/mlops-step-alert/1.building-component/dataset/abalone.csv', 'TE': 'file:///home/ec2-user/SageMaker/mlops-step-alert/1.building-component/dataset/abalone.csv'}
strOutputPath: s3://sagemaker-us-east-1-419974056037/DJ-SM-PIPELINE/training/model-output
strCodeLocation: s3://sagemaker-us-east-1-419974056037/DJ-SM-PIPELINE/training/backup_codes
bUseTrainWarmPool: False/None


* Define training job

In [28]:
dicHyperparameters = {  
    "max_depth": "10",
    "eta": "0.3",
    "objective": "reg:squarederror",
    "num_round": "100",
}

In [29]:
estimator = XGBoost(
    entry_point="xgboost_regression.py",
    source_dir="sources/train/",
    output_path=strOutputPath,
    code_location=strCodeLocation,
    hyperparameters=dicHyperparameters, ## Contatiner내 env. variable로 들어 감
    role=strExecutionRole,
    sagemaker_session=sagemaker_session,
    instance_count=nInstanceCount,
    instance_type=strInstanceType,
    framework_version="1.3-1",
    image_uri = 419974056037.dkr.ecr.us-east-1.amazonaws.com/mlops-custom-docker:latest
    max_run=nMaxRun,
    use_spot_instances=bSpotTraining,
    max_wait=nMaxWait,
    keep_alive_period_in_seconds=nKeepAliveSeconds,
    enable_sagemaker_metrics=True,
    #metric_definitions=listMetricDefinitions,
    volume_size=256, ## GB
)

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: local.


* run

In [33]:
if strInstanceType =='local_gpu': estimator.checkpoint_s3_uri = None

create_experiment(strExperimentName)
job_name = create_trial(strExperimentName)

estimator.fit(
    inputs=dicDataChannels, 
    job_name=job_name,
    experiment_config={
      'TrialName': job_name,
      'TrialComponentDisplayName': job_name,
    },
    wait=True,
)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Creating training-job with name: DJ-SM-PIPELINE-experiments-0509-05531683611610
INFO:sagemaker.local.local_session:Starting training job
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.local.image:No AWS credentials found in session but credentials from EC2 Metadata Service are available.
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-msvk6:
    command: train
    container_name: f9x2j3j42d-algo-1-msvk6
    environment:
    - '[Masked]'
    - '[Masked]'
    image: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.3-1
    networks:
      sagemaker-local:
        aliases:
        - algo-1-msvk6
    stdin_open: true
    tty: true
    vol

Creating f9x2j3j42d-algo-1-msvk6 ... 
Creating f9x2j3j42d-algo-1-msvk6 ... done
Attaching to f9x2j3j42d-algo-1-msvk6
[36mf9x2j3j42d-algo-1-msvk6 |[0m [2023-05-09 05:53:33.814 5e169d7d63a0:1 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[36mf9x2j3j42d-algo-1-msvk6 |[0m [2023-05-09 05:53:33.844 5e169d7d63a0:1 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
[36mf9x2j3j42d-algo-1-msvk6 |[0m [2023-05-09:05:53:33:INFO] Imported framework sagemaker_xgboost_container.training
[36mf9x2j3j42d-algo-1-msvk6 |[0m [2023-05-09:05:53:33:INFO] No GPUs detected (normal if no gpus installed)
[36mf9x2j3j42d-algo-1-msvk6 |[0m [2023-05-09:05:53:33:INFO] Invoking user training script.
[36mf9x2j3j42d-algo-1-msvk6 |[0m [2023-05-09:05:53:33:INFO] Installing module with the following command:
[36mf9x2j3j42d-algo-1-msvk6 |[0m /miniconda3/bin/python3 -m pip install . 
[36mf9x2j3j42d-algo-1-msvk6 |[0m Processing /op

INFO:root:creating /tmp/tmpnibhpue7/artifacts/output/data
INFO:root:copying /tmp/tmpnibhpue7/algo-1-msvk6/output/data/metrics.json -> /tmp/tmpnibhpue7/artifacts/output/data
INFO:root:copying /tmp/tmpnibhpue7/model/xgboost-model -> /tmp/tmpnibhpue7/artifacts/model


[36mf9x2j3j42d-algo-1-msvk6 exited with code 0
[0mAborting on container exit...




===== Job Complete =====


* save model-path, experiment-name

In [18]:
pm.put_params(key="-".join([strPrefix, "MODEL-PATH"]), value=estimator.model_data, overwrite=True)
pm.put_params(key="-".join([strPrefix, "EXPERI-NAME"]), value=strExperimentName, overwrite=True)

'Store suceess'

* show experiments

In [19]:
from sagemaker.analytics import ExperimentAnalytics
import pandas as pd
#pd.options.display.max_columns = 50
#pd.options.display.max_rows = 10
#pd.options.display.max_colwidth = 100

In [20]:
trial_component_training_analytics = ExperimentAnalytics(
    sagemaker_session= sagemaker_session,
    experiment_name= strExperimentName,
    sort_by="metrics.validation:auc.max",        
    sort_order="Descending",
    metric_names=["validation:auc"]
)

trial_component_training_analytics.dataframe()[['Experiments', 'Trials', 'validation:auc - Min', 'validation:auc - Max',
                                                'validation:auc - Avg', 'validation:auc - StdDev', 'validation:auc - Last', 
                                                'eta', 'max_depth', 'num_round', 'scale_pos_weight']]

Unnamed: 0,Experiments,Trials,validation:auc - Min,validation:auc - Max,validation:auc - Avg,validation:auc - StdDev,validation:auc - Last,eta,max_depth,num_round,scale_pos_weight
0,[DJ-SM-IMD-experiments],[DJ-SM-IMD-experiments-0424-04371682311053],1.0,1.0,1.0,0.0,1.0,"""0.3""","""2""","""100""","""19"""
1,[DJ-SM-IMD-experiments],[DJ-SM-IMD-experiments-0424-04281682310513],1.0,1.0,1.0,0.0,1.0,"""0.3""","""2""","""100""","""19"""
2,[DJ-SM-IMD-experiments],[DJ-SM-IMD-experiments-0412-10121681294361],0.821124,0.821124,0.821124,0.0,0.821124,"""0.3""","""2""","""100""","""19"""
3,[DJ-SM-IMD-experiments],[DJ-SM-IMD-experiments-0416-06421681627343],0.821124,0.821124,0.821124,0.0,0.821124,"""0.3""","""2""","""100""","""19"""
4,[DJ-SM-IMD-experiments],[DJ-SM-IMD-experiments-0419-04191681877971],0.821124,0.821124,0.821124,0.0,0.821124,"""0.3""","""2""","""100""","""19"""
