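# Train an ML Model with SageMaker
Churn prediction offers insight into why players leave a game and what changes the game needs in order to retain them. For example, a machine learning (ML) based churn prediction system can retain players by triggering tailored in-game offers or targeted promotions at the right moment, before a player decides to leave. In this notebook, you will train a classification model with XGBoost on the dataset prepared in the [previous](02-preprocess.ipynb) notebook. The following diagram shows the training process within the MLOps context: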

![training notebook](img/sagemaker-mlops-model-train-diagram.jpg)

In [None]:
%pip install scikit-learn s3fs==0.4.2 sagemaker xgboost mlflow==2.13.2 sagemaker-mlflow==0.1.0

Import the relevant libraries.

In [None]:
import os
import boto3
import sagemaker
import mlflow
from time import gmtime, strftime, sleep
import xgboost as xgb
import tarfile
import json
from botocore.exceptions import ClientError

from sagemaker.inputs import TrainingInput

# Define helper functions

In [None]:
def download_from_s3(s3_client, local_file_path, bucket_name, s3_file_path):
    try:
        # Download the file
        s3_client.download_file(bucket_name, s3_file_path, local_file_path)
        print(f"File downloaded successfully to {local_file_path}")
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == "404":
            print("The object does not exist.")
        else:
            print(f"An error occurred: {e}")
        return False
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return False

def upload_to_s3(s3_client, local_file_path, bucket_name, s3_file_path=None):
    # If S3 file path is not specified, use the basename of the local file
    if s3_file_path is None:
        s3_file_path = os.path.basename(local_file_path)

    try:
        # Upload the file
        s3_client.upload_file(local_file_path, bucket_name, s3_file_path)
        print(f"File {local_file_path} uploaded successfully to {bucket_name}/{s3_file_path}")
        return True
    except ClientError as e:
        print(f"ClientError: {e}")
        return False
    except FileNotFoundError:
        print(f"The file {local_file_path} was not found")
        return False
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return False
        
def write_params(s3_client, step_name, params, notebook_param_s3_bucket_prefix):
    local_file_path = f"{step_name}.json"
    with open(local_file_path, "w") as f:
        f.write(json.dumps(params))
    base_local_file_path = os.path.basename(local_file_path)
    bucket_name = notebook_param_s3_bucket_prefix.split("/")[2] # Format: s3://<bucket_name>/..
    s3_file_path = os.path.join("/".join(notebook_param_s3_bucket_prefix.split("/")[3:]), base_local_file_path)
    upload_to_s3(s3_client, local_file_path, bucket_name, s3_file_path)
    
def read_params(s3_client, notebook_param_s3_bucket_prefix, step_name):
    local_file_path = f"{step_name}.json"
    base_local_file_path = os.path.basename(local_file_path)
    bucket_name = notebook_param_s3_bucket_prefix.split("/")[2] # Format: s3://<bucket_name>/..
    s3_file_path = os.path.join("/".join(notebook_param_s3_bucket_prefix.split("/")[3:]), base_local_file_path)
    downloaded = download_from_s3(s3_client, local_file_path, bucket_name, s3_file_path)
    # Fail fast instead of reading a missing or stale local file
    if not downloaded:
        raise RuntimeError(f"Could not download s3://{bucket_name}/{s3_file_path}")
    with open(local_file_path, "r") as f:
        params = json.load(f)
    return params
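
The helpers above split the `s3://<bucket_name>/...` prefix by hand with `split("/")`. As a minimal sketch of the same idea, the standard library's `urllib.parse.urlparse` can separate the bucket from the key prefix. `split_s3_uri`, `join_s3_key`, and the `my-bucket` name are hypothetical and introduced only for illustration; they are not part of this notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # parsed.path starts with '/', which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


def join_s3_key(prefix, file_name):
    """Join a key prefix and a file name with '/', regardless of the local OS."""
    return f"{prefix.rstrip('/')}/{file_name}" if prefix else file_name


# 'my-bucket' is a placeholder bucket name (assumption for this example)
bucket, prefix = split_s3_uri("s3://my-bucket/player-churn/xgboost/params")
key = join_s3_key(prefix, "02-preprocess.json")
print(bucket, key)
```

Joining with an explicit `/` rather than `os.path.join` keeps the key format independent of the operating system the notebook runs on.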


In [None]:
# helper function to load XGBoost model into xgboost.Booster
def load_model(model_data_s3_uri):
    model_file = "./xgboost-model.tar.gz"
    bucket, key = model_data_s3_uri.replace("s3://", "").split("/", 1)
    boto3.client("s3").download_file(bucket, key, model_file)
    
    with tarfile.open(model_file, "r:gz") as t:
        t.extractall(path=".")
    
    # Load model
    model = xgb.Booster()
    model.load_model("xgboost-model")

    return model
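
`load_model` unpacks the downloaded `.tar.gz` artifact and expects a member named `xgboost-model` inside it. The sketch below shows only that unpacking step, using the standard library on a locally built stand-in archive. The archive holds dummy bytes rather than a real model, and `read_member` is a hypothetical helper.

```python
import io
import os
import tarfile
import tempfile


def read_member(archive_path, member_name):
    """Return the bytes of one named member from a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        member = archive.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file")
        return member.read()


# Build a stand-in archive with dummy bytes in place of a real model file
payload = b"dummy model bytes"
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "xgboost-model.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    info = tarfile.TarInfo(name="xgboost-model")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

model_bytes = read_member(archive_path, "xgboost-model")
print(len(model_bytes))
```

Reading one named member avoids writing every archive entry to the working directory, which can be preferable when the archive comes from a location you do not fully control.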

In [None]:
def get_xgb_estimator(
    session,
    instance_type,
    output_s3_url,
    base_job_name,
):
    # Instantiate an XGBoost estimator object
    estimator = sagemaker.estimator.Estimator(
        image_uri=XGBOOST_IMAGE_URI,
        role=sagemaker.get_execution_role(), 
        instance_type=instance_type,
        instance_count=1,
        output_path=output_s3_url,
        sagemaker_session=session,
        base_job_name=base_job_name
    )
    
    # Define algorithm hyperparameters
    estimator.set_hyperparameters(
        num_round=100, # the number of rounds to run the training
        max_depth=3, # maximum depth of a tree
        eta=0.5, # step size shrinkage used in updates to prevent overfitting
        alpha=2.5, # L1 regularization term on weights
        objective="binary:logistic",
        eval_metric="auc", # evaluation metrics for validation data
        subsample=0.8, # subsample ratio of the training instance
        colsample_bytree=0.8, # subsample ratio of columns when constructing each tree
        min_child_weight=3, # minimum sum of instance weight (hessian) needed in a child
        early_stopping_rounds=10, # the model trains until the validation score stops improving
        verbosity=1, # verbosity of printing messages
    )

    return estimator

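# Initialize variables
As in the previous notebook, the variables used throughout this notebook are defined in the following cell. Besides serving as hardcoded values, these variables can be passed to the notebook as parameters when it is scheduled to run remotely, for example as a SageMaker Pipeline job or in a CI/CD pipeline through a SageMaker Project. A later lab covers how to pass parameters to this notebook in more detail. For more information about notebook parameterization, see [this document](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run-troubleshoot-override.html).

As with the `02-preprocess.ipynb` notebook, the values for the following variables can be obtained through the SageMaker Studio launcher. That notebook provides guidance and screenshots if you need further help.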
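
When the notebook runs remotely, the hardcoded values act as defaults that passed-in parameters can replace. The following is a rough sketch of that pattern only; it is not how SageMaker injects parameters. The `resolve_params` helper and the override value are hypothetical.

```python
def resolve_params(defaults, overrides=None):
    """Return defaults updated with overrides, rejecting unknown parameter names."""
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**defaults, **overrides}


# Default values mirroring a few of the notebook's variables
defaults = {
    "region": "us-east-1",
    "experiment_name": "player-churn-model-experiment",
    "train_instance_type": "ml.m5.xlarge",
}

# Hypothetical override, as a scheduled run might supply
params = resolve_params(defaults, {"train_instance_type": "ml.m5.2xlarge"})
print(params["train_instance_type"])
```

Rejecting unknown names catches a misspelled parameter early, before it silently falls back to the default.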

In [None]:
region = "us-east-1"
os.environ["AWS_DEFAULT_REGION"] = region
boto_session = boto3.Session(region_name=region)
sess = sagemaker.Session(boto_session=boto_session)
bucket_name = sess.default_bucket()
bucket_prefix = "player-churn/xgboost"
notebook_param_s3_bucket_prefix=f"s3://{bucket_name}/{bucket_prefix}/params"
experiment_name = "player-churn-model-experiment"
mlflow_tracking_server_arn = "" # Provide a valid mlflow tracking server ARN. You can find the value in the output from 00-start-here.ipynb
run_id = None
train_instance_type = "ml.m5.xlarge"

In [None]:
assert len(mlflow_tracking_server_arn) > 0, "Set mlflow_tracking_server_arn to a valid MLflow tracking server ARN"

In [None]:
# Define the output S3 location for storing the model artifacts.
output_s3_url = f"s3://{bucket_name}/{bucket_prefix}/output"

In [None]:
# Look up the built-in XGBoost container image for the region configured above
XGBOOST_IMAGE_URI = sagemaker.image_uris.retrieve(
    "xgboost",
    region=region,
    version="1.7-1",
)

Retrieve the step variables from the previous notebook.

In [None]:
preprocess_step_name = "02-preprocess"
s3_client = boto3.client("s3", region_name=region)
preprocess_step_params = read_params(s3_client, notebook_param_s3_bucket_prefix, preprocess_step_name)

In [None]:
# Use sagemaker.Session() in the estimator to run the training job immediately
estimator = get_xgb_estimator(
    session=sagemaker.Session(),
    instance_type=train_instance_type,
    output_s3_url=output_s3_url,
    base_job_name="player-churn-xgboost-train",
)

In [None]:
# Set up the training inputs using the outputs from preprocess function
training_inputs = {
    "train": TrainingInput(
        preprocess_step_params['train_data'],
        content_type="text/csv",
    ),
    "validation": TrainingInput(
        preprocess_step_params['validation_data'],
        content_type="text/csv",
    ),
}

The next cell integrates the MLflow tracking server with this model training job.

In [None]:
# Run the training job and track it in MLflow
suffix = strftime('%d-%H-%M-%S', gmtime())
mlflow.set_tracking_uri(mlflow_tracking_server_arn)
mlflow.set_experiment(experiment_name)

with mlflow.start_run(
    run_name=f"training-{suffix}",
    description="training in the notebook with a training job") as run:
    mlflow.log_params(estimator.hyperparameters())
    
    estimator.fit(training_inputs)

    mlflow.log_param("training job name", estimator.latest_training_job.name)
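    # Log each metric reported by the training job, replacing ':' in metric names with '_'
    metrics_df = estimator.training_job_analytics.dataframe()
    mlflow.log_metrics({row['metric_name'].replace(':', '_'): row['value'] for _, row in metrics_df.iterrows()})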
    mlflow.xgboost.log_model(load_model(estimator.model_data), artifact_path="xgboost")
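
The metric logging line in the cell above builds a dictionary keyed by metric name, with `:` replaced by `_`. The sketch below isolates that transformation on plain dictionaries so it can be checked without a training job. The records are made-up values, and `to_mlflow_metrics` is a hypothetical helper.

```python
def to_mlflow_metrics(records):
    """Map metric records to a {name: value} dict, replacing ':' with '_' in names."""
    metrics = {}
    for record in records:
        name = record["metric_name"].replace(":", "_")
        # A later record with the same name overwrites an earlier one
        metrics[name] = float(record["value"])
    return metrics


# Made-up records shaped like the rows used above (metric_name, value)
records = [
    {"metric_name": "train:auc", "value": 0.91},
    {"metric_name": "validation:auc", "value": 0.88},
]
metrics = to_mlflow_metrics(records)
print(metrics)
```

Because the result is a dictionary, only one value per metric name survives; if a metric is reported several times, the last record wins.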

# Save parameters
In the following cells, we save the relevant parameters to the S3 bucket so that they can be passed on to subsequent steps.

In [None]:
params = {}
params['model_s3_path'] = estimator.model_data

In [None]:
step_name = "03-train"
write_params(s3_client, step_name, params, notebook_param_s3_bucket_prefix)
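
Each step hands its outputs to the next through a small JSON file named after the step. The sketch below reproduces that round trip with a local temporary directory standing in for the S3 prefix, so it runs without AWS access. `write_params_local`, `read_params_local`, and the example path are hypothetical.

```python
import json
import os
import tempfile


def write_params_local(directory, step_name, params):
    """Write params to <directory>/<step_name>.json and return the file path."""
    path = os.path.join(directory, f"{step_name}.json")
    with open(path, "w") as f:
        json.dump(params, f)
    return path


def read_params_local(directory, step_name):
    """Read back the params written for step_name."""
    with open(os.path.join(directory, f"{step_name}.json"), "r") as f:
        return json.load(f)


shared_dir = tempfile.mkdtemp()  # stands in for the S3 params prefix
# Placeholder path for illustration, not a real artifact
written = {"model_s3_path": "s3://my-bucket/player-churn/xgboost/output/model.tar.gz"}
params_file = write_params_local(shared_dir, "03-train", written)
restored = read_params_local(shared_dir, "03-train")
print(restored == written)
```

Only JSON-serializable values such as strings and numbers survive this hand-off, which is why the notebook stores S3 paths rather than in-memory objects.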

In [None]:
mlflow.end_run()