## Amazon SageMaker Processing jobs

*This notebook was originally published as a Korean translation of [Amazon SageMaker Processing jobs (original English notebook)](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb).*

Amazon SageMaker Processing jobs let you run data pre-processing, post-processing, and model evaluation workloads on the Amazon SageMaker platform in a simplified, managed environment.

A processing job downloads input data from Amazon Simple Storage Service (Amazon S3), then uploads its outputs to Amazon S3 during or after the job.

<img src="Processing-1.jpg">

This notebook shows how to perform the following four tasks.

1. Run a processing job that executes a scikit-learn script to clean the data, pre-process it, perform feature engineering, and split the input data into training and test sets.
2. Run a training job on the pre-processed training data to train a model.
3. Run a processing job on the pre-processed test data to evaluate the trained model's performance.
4. Use your own custom container to run processing jobs with your own Python libraries and dependencies.

The dataset used in this notebook is the [Census-Income KDD Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29). You select features from this dataset, clean the data, transform it into features that a training algorithm can use to train a binary classification model, and split it into training and test sets. The task is to predict whether a census respondent's annual income is greater than `$50,000` or less than `$50,000`. The dataset is heavily class imbalanced, with most records falling below `$50,000`. After training a logistic regression model, you evaluate it against a hold-out test dataset and save the classification evaluation metrics, including precision, recall, and F1 score for each label, plus the accuracy and ROC AUC of the model.

## Data pre-processing and feature engineering

To run the scikit-learn pre-processing script as a processing job, create a `SKLearnProcessor`, which lets you run scripts inside processing jobs using the provided scikit-learn image.

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name

role = get_execution_role()
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

Before introducing the script used for data cleaning, pre-processing, and feature engineering, inspect the first 10 rows of the dataset. The goal is to predict the `income` category. The features selected from the dataset are `age`, `education`, `major industry code`, `class of worker`, `num persons worked for employer`, `capital gains`, `capital losses`, and `dividends from stocks`.

In [2]:
import pandas as pd

input_data = 's3://sagemaker-sample-data-{}/processing/census/census-income.csv'.format(region)
df = pd.read_csv(input_data, nrows=10)
df.head(n=10)

Unnamed: 0,age,class of worker,detailed industry recode,detailed occupation recode,education,wage per hour,enroll in edu inst last wk,marital stat,major industry code,major occupation code,...,country of birth father,country of birth mother,country of birth self,citizenship,own business or self employed,fill inc questionnaire for veteran's admin,veterans benefits,weeks worked in year,year,income
0,73,Not in universe,0,0,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
1,58,Self-employed-not incorporated,4,34,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
2,18,Not in universe,0,0,10th grade,0,High school,Never married,Not in universe or children,Not in universe,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
3,9,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
4,10,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
5,48,Private,40,10,Some college but no degree,1200,Not in universe,Married-civilian spouse present,Entertainment,Professional specialty,...,Philippines,United-States,United-States,Native- Born in the United States,2,Not in universe,2,52,95,- 50000.
6,42,Private,34,3,Bachelors degree(BA AB BS),0,Not in universe,Married-civilian spouse present,Finance insurance and real estate,Executive admin and managerial,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
7,28,Private,4,40,High school graduate,0,Not in universe,Never married,Construction,Handlers equip cleaners etc,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,30,95,- 50000.
8,47,Local government,43,26,Some college but no degree,876,Not in universe,Married-civilian spouse present,Education,Adm support including clerical,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,95,- 50000.
9,34,Private,4,37,Some college but no degree,0,Not in universe,Married-civilian spouse present,Construction,Machine operators assmblrs & inspctrs,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.


This notebook cell writes the file `preprocessing.py`, which contains the pre-processing script. You can update the script and rerun this cell to overwrite `preprocessing.py`. You run it as a processing job in the next cell. The script does the following:

* Removes duplicate rows and rows with conflicting data.
* Converts the target variable `income` into a column containing only two labels.
* Bins the numerical variables `age` and `num persons worked for employer` to convert them into categorical variables.
* Scales the continuous variables `capital gains`, `capital losses`, and `dividends from stocks`.
* Encodes the categorical variables `education`, `major industry code`, and `class of worker`.
* Splits the data into training and test datasets, and saves the training features and labels and the test features and labels.
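To make the transformations in this list concrete, here is a minimal standard-library sketch of scaling, one-hot encoding, and binning on a few toy values. The helper names (`standard_scale`, `one_hot`, `bin_index`), the toy rows, and the fixed bin edges are all assumptions made for illustration; the actual script relies on scikit-learn's `StandardScaler`, `OneHotEncoder`, and `KBinsDiscretizer`, whose bin edges are learned from the data rather than fixed by hand.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Center values to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one row of 0.0/1.0 indicators per value."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


def bin_index(value, edges):
    """Return the index of the bin containing value, given ascending inner edges."""
    return sum(value >= e for e in edges)


# Hypothetical toy rows, not taken from the census data.
capital_gains = [0.0, 0.0, 5000.0, 15000.0]
worker_class = ["Private", "Local government", "Private"]
ages = [9, 34, 73]

scaled_gains = standard_scale(capital_gains)
categories, encoded_class = one_hot(worker_class)
age_bins = [bin_index(a, edges=[18, 40, 65]) for a in ages]  # hand-picked edges

print(scaled_gains)
print(categories, encoded_class)
print(age_bins)
```

After scaling, the column has zero mean and unit variance. After encoding, each category becomes its own indicator column, which is why the processed feature matrix ends up with many more columns than the raw data.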

The training script uses the pre-processed training features and labels to train a model, and the evaluation script uses the trained model and the pre-processed test features and labels to evaluate it.

In [4]:
%%writefile preprocessing.py

import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import make_column_transformer

from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)


columns = ['age', 'education', 'major industry code', 'class of worker', 'num persons worked for employer',
           'capital gains', 'capital losses', 'dividends from stocks', 'income']
class_labels = [' - 50000.', ' 50000+.']

def print_shape(df):
    negative_examples, positive_examples = np.bincount(df['income'])
    print('Data shape: {}, {} positive examples, {} negative examples'.format(df.shape, positive_examples, negative_examples))

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
    args, _ = parser.parse_known_args()
    
    print('Received arguments {}'.format(args))

    input_data_path = os.path.join('/opt/ml/processing/input', 'census-income.csv')
    
    print('Reading input data from {}'.format(input_data_path))
    df = pd.read_csv(input_data_path)
    df = pd.DataFrame(data=df, columns=columns)
    df.dropna(inplace=True)
    df.drop_duplicates(inplace=True)
    df.replace(class_labels, [0, 1], inplace=True)
    
    negative_examples, positive_examples = np.bincount(df['income'])
    print('Data after cleaning: {}, {} positive examples, {} negative examples'.format(df.shape, positive_examples, negative_examples))
    
    split_ratio = args.train_test_split_ratio
    print('Splitting data into train and test sets with ratio {}'.format(split_ratio))
    X_train, X_test, y_train, y_test = train_test_split(df.drop('income', axis=1), df['income'], test_size=split_ratio, random_state=0)

    preprocess = make_column_transformer(
        (['age', 'num persons worked for employer'], KBinsDiscretizer(encode='onehot-dense', n_bins=10)),
        (['capital gains', 'capital losses', 'dividends from stocks'], StandardScaler()),
        (['education', 'major industry code', 'class of worker'], OneHotEncoder(sparse=False))
    )
    print('Running preprocessing and feature engineering transformations')
    train_features = preprocess.fit_transform(X_train)
    test_features = preprocess.transform(X_test)
    
    print('Train data shape after preprocessing: {}'.format(train_features.shape))
    print('Test data shape after preprocessing: {}'.format(test_features.shape))
    
    train_features_output_path = os.path.join('/opt/ml/processing/train', 'train_features.csv')
    train_labels_output_path = os.path.join('/opt/ml/processing/train', 'train_labels.csv')
    
    test_features_output_path = os.path.join('/opt/ml/processing/test', 'test_features.csv')
    test_labels_output_path = os.path.join('/opt/ml/processing/test', 'test_labels.csv')
    
    print('Saving training features to {}'.format(train_features_output_path))
    pd.DataFrame(train_features).to_csv(train_features_output_path, header=False, index=False)
    
    print('Saving test features to {}'.format(test_features_output_path))
    pd.DataFrame(test_features).to_csv(test_features_output_path, header=False, index=False)
    
    print('Saving training labels to {}'.format(train_labels_output_path))
    y_train.to_csv(train_labels_output_path, header=False, index=False)
    
    print('Saving test labels to {}'.format(test_labels_output_path))
    y_test.to_csv(test_labels_output_path, header=False, index=False)


Writing preprocessing.py


Run this script as a processing job using the `SKLearnProcessor.run()` method. Pass one `ProcessingInput` in the `inputs` argument: its `source` is the census dataset in Amazon S3, and its `destination` is the local path the script reads the data from (`/opt/ml/processing/input`). Note that these local paths inside the processing container must begin with `/opt/ml/processing/`.

Also pass `ProcessingOutput` instances in the `outputs` argument of `run()`, where `source` is the path the script writes its output data to. For outputs, `destination` defaults to a location in an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`. Each `ProcessingOutput` is also given an `output_name`, which makes these output artifacts easier to retrieve after the job has run.
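As a rough illustration of that default destination format, the sketch below assembles the URI from its parts. The function `default_output_uri` is a hypothetical helper written for this example, not part of the SageMaker Python SDK. The sample values are the region, account ID, and job name that appear in the job output printed later in this notebook.

```python
def default_output_uri(region, account_id, job_name, output_name):
    """Build the default S3 output location described in the text (illustrative only)."""
    return "s3://sagemaker-{}-{}/{}/output/{}".format(
        region, account_id, job_name, output_name)


# Values taken from the pre-processing job output shown in this notebook.
train_uri = default_output_uri(
    region="us-east-1",
    account_id="143656149352",
    job_name="sagemaker-scikit-learn-2019-12-06-00-19-04-996",
    output_name="train_data",
)
print(train_uri)
```

In practice you do not need to build this string yourself; the notebook reads the actual `S3Uri` back from the job description, which stays correct even if a custom `destination` is supplied.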

Finally, the `arguments` parameter of the `run()` method holds the command-line arguments passed to the `preprocessing.py` script.
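The following sketch shows how a list such as `['--train-test-split-ratio', '0.2']` is interpreted by the same `argparse` setup used in `preprocessing.py`. The wrapper function `parse_args` is an assumption added here so the parsing can be exercised locally without launching a job.

```python
import argparse


def parse_args(argv):
    """Parse script arguments the same way preprocessing.py does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
    # parse_known_args ignores any extra arguments instead of raising an error.
    args, _ = parser.parse_known_args(argv)
    return args


explicit = parse_args(['--train-test-split-ratio', '0.2'])
fallback = parse_args([])
print(explicit.train_test_split_ratio, fallback.train_test_split_ratio)
```

Every element of the list is a string, so numeric values are written as `'0.2'` and converted by `type=float`. When the flag is omitted, the script falls back to its default of `0.3`.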

In [5]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source=input_data,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/test')],
                      arguments=['--train-test-split-ratio', '0.2']
                     )

preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'train_data':
        preprocessed_training_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'test_data':
        preprocessed_test_data = output['S3Output']['S3Uri']


Job Name:  sagemaker-scikit-learn-2019-12-06-00-19-04-996
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-sample-data-us-east-1/processing/census/census-income.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-143656149352/sagemaker-scikit-learn-2019-12-06-00-19-04-996/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-143656149352/sagemaker-scikit-learn-2019-12-06-00-19-04-996/output/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'test_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-

Now inspect the output of the pre-processing job, which consists of the processed features.

In [6]:
training_features = pd.read_csv(preprocessed_training_data + '/train_features.csv', nrows=10)
print('Training features shape: {}'.format(training_features.shape))
training_features.head(n=10)

Training features shape: (10, 73)


Unnamed: 0,0.0,0.0.1,0.0.2,0.0.3,0.0.4,1.0,0.0.5,0.0.6,0.0.7,0.0.8,...,0.0.56,0.0.57,0.0.58,0.0.59,0.0.60,1.0.4,0.0.61,0.0.62,0.0.63,0.0.64
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


## Training using the pre-processed data

Create a `SKLearn` instance, which you use to run a training job with the training script `train.py`.

In [7]:
from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point='train.py',
    train_instance_type="ml.m5.xlarge",
    role=role)

The training script `train.py` trains a logistic regression model on the training data and saves the model to the `/opt/ml/model` directory. Amazon SageMaker compresses the contents of this directory into a `model.tar.gz` file and uploads it to S3 at the end of the training job.
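The packaging step can be imitated locally with the standard `tarfile` module. This is only a sketch of the idea, assuming the files sit at the root of the archive; the helpers `pack_model_dir` and `unpack_model` and the placeholder file contents are made up for this example, and SageMaker performs the real packaging for you.

```python
import os
import tarfile
import tempfile


def pack_model_dir(model_dir, archive_path):
    """Compress every file in model_dir into a gzip tarball, stored at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            tar.add(os.path.join(model_dir, name), arcname=name)


def unpack_model(archive_path, target_dir):
    """Extract the tarball and return the names of the extracted files."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=target_dir)
    return sorted(os.listdir(target_dir))


with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "model")
    os.makedirs(model_dir)
    # Placeholder bytes stand in for a real serialized model.
    with open(os.path.join(model_dir, "model.joblib"), "wb") as f:
        f.write(b"placeholder bytes, not a real model")

    archive_path = os.path.join(workdir, "model.tar.gz")
    pack_model_dir(model_dir, archive_path)
    extracted = unpack_model(archive_path, os.path.join(workdir, "extracted"))

print(extracted)
```

The evaluation script later in this notebook relies on the same layout: it extracts `model.tar.gz` and loads `model.joblib` directly from the extraction directory.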

In [8]:
%%writefile train.py

import os

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib

if __name__=="__main__":
    training_data_directory = '/opt/ml/input/data/train'
    train_features_data = os.path.join(training_data_directory, 'train_features.csv')
    train_labels_data = os.path.join(training_data_directory, 'train_labels.csv')
    print('Reading input data')
    X_train = pd.read_csv(train_features_data, header=None)
    y_train = pd.read_csv(train_labels_data, header=None)

    model = LogisticRegression(class_weight='balanced', solver='lbfgs')
    print('Training LR model')
    model.fit(X_train, y_train)
    model_output_directory = os.path.join('/opt/ml/model', "model.joblib")
    print('Saving model to {}'.format(model_output_directory))
    joblib.dump(model, model_output_directory)

Writing train.py
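The training script above passes `class_weight='balanced'` to `LogisticRegression`, which counters the class imbalance in this dataset. scikit-learn documents the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python; the helper `balanced_class_weights` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * count) for each class label."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 8 negatives and 2 positives.
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so mistakes on high-income records cost more during training than mistakes on the majority class.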


Run the training job using `train.py` on the preprocessed training data.

In [9]:
sklearn.fit({'train': preprocessed_training_data})
training_job_description = sklearn.jobs[-1].describe()
model_data_s3_uri = '{}{}/{}'.format(
    training_job_description['OutputDataConfig']['S3OutputPath'],
    training_job_description['TrainingJobName'],
    'output/model.tar.gz')

2019-12-06 00:23:18 Starting - Starting the training job...
2019-12-06 00:23:20 Starting - Launching requested ML instances.........
2019-12-06 00:24:55 Starting - Preparing the instances for training...
2019-12-06 00:25:40 Downloading - Downloading input data...
2019-12-06 00:26:10 Training - Training image download completed. Training in progress..
2019-12-06 00:26:11,247 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-12-06 00:26:11,249 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-12-06 00:26:11,258 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-12-06 00:26:18,658 sagemaker-containers INFO     Module train does not provide a setup.py.
Generating setup.py
2019-12-06 00:26:18,659 sagemaker-containers INFO     Generating setup.cfg
2019-12-06 00:26:18,659 sagemaker-containers INFO     Generating MANIFEST.in

## Model Evaluation

`evaluation.py` is the model evaluation script. Because the script depends on scikit-learn, run it with the `SKLearnProcessor` you created earlier. The script takes the trained model and the test dataset as input and produces a JSON file containing classification evaluation metrics, including precision, recall, and F1 score for each label, plus the accuracy and ROC AUC of the model.

In [10]:
%%writefile evaluation.py

import json
import os
import tarfile

import pandas as pd

from sklearn.externals import joblib
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score

if __name__=="__main__":
    model_path = os.path.join('/opt/ml/processing/model', 'model.tar.gz')
    print('Extracting model from path: {}'.format(model_path))
    with tarfile.open(model_path) as tar:
        tar.extractall(path='.')
    print('Loading model')
    model = joblib.load('model.joblib')

    print('Loading test input data')
    test_features_data = os.path.join('/opt/ml/processing/test', 'test_features.csv')
    test_labels_data = os.path.join('/opt/ml/processing/test', 'test_labels.csv')

    X_test = pd.read_csv(test_features_data, header=None)
    y_test = pd.read_csv(test_labels_data, header=None)
    predictions = model.predict(X_test)

    print('Creating classification evaluation report')
    report_dict = classification_report(y_test, predictions, output_dict=True)
    report_dict['accuracy'] = accuracy_score(y_test, predictions)
    report_dict['roc_auc'] = roc_auc_score(y_test, predictions)

    print('Classification report:\n{}'.format(report_dict))

    evaluation_output_path = os.path.join('/opt/ml/processing/evaluation', 'evaluation.json')
    print('Saving classification report to {}'.format(evaluation_output_path))

    with open(evaluation_output_path, 'w') as f:
        f.write(json.dumps(report_dict))

Writing evaluation.py


In [11]:
import json
from sagemaker.s3 import S3Downloader

sklearn_processor.run(code='evaluation.py',
                      inputs=[ProcessingInput(
                                  source=model_data_s3_uri,
                                  destination='/opt/ml/processing/model'),
                              ProcessingInput(
                                  source=preprocessed_test_data,
                                  destination='/opt/ml/processing/test')],
                      outputs=[ProcessingOutput(output_name='evaluation',
                                  source='/opt/ml/processing/evaluation')]
                     )                    
evaluation_job_description = sklearn_processor.jobs[-1].describe()


Job Name:  sagemaker-scikit-learn-2019-12-06-00-27-00-424
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-143656149352/sagemaker-scikit-learn-2019-12-06-00-23-18-098/output/model.tar.gz', 'LocalPath': '/opt/ml/processing/model', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-143656149352/sagemaker-scikit-learn-2019-12-06-00-19-04-996/output/test_data', 'LocalPath': '/opt/ml/processing/test', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-143656149352/sagemaker-scikit-learn-2019-12-06-00-27-00-424/input/code/evaluation.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated',

Now retrieve the file `evaluation.json`, which contains the evaluation report, from Amazon S3.

In [17]:
evaluation_output_config = evaluation_job_description['ProcessingOutputConfig']
for output in evaluation_output_config['Outputs']:
    if output['OutputName'] == 'evaluation':
        evaluation_s3_uri = output['S3Output']['S3Uri'] + '/evaluation.json'
        break

evaluation_output = S3Downloader.read_file(evaluation_s3_uri)
evaluation_output_dict = json.loads(evaluation_output)
print(json.dumps(evaluation_output_dict, sort_keys=True, indent=4))

{
    "0": {
        "f1-score": 0.8389297724060626,
        "precision": 0.9404501748251748,
        "recall": 0.757191871206123,
        "support": 11367
    },
    "1": {
        "f1-score": 0.5136129506990433,
        "precision": 0.3873473917869034,
        "recall": 0.7620087336244541,
        "support": 2290
    },
    "accuracy": 0.7579995606648605,
    "macro avg": {
        "f1-score": 0.676271361552553,
        "precision": 0.6638987833060391,
        "recall": 0.7596003024152885,
        "support": 13657
    },
    "micro avg": {
        "f1-score": 0.7579995606648605,
        "precision": 0.7579995606648605,
        "recall": 0.7579995606648605,
        "support": 13657
    },
    "roc_auc": 0.7596003024152885,
    "weighted avg": {
        "f1-score": 0.7843807849484165,
        "precision": 0.8477061334429062,
        "recall": 0.7579995606648605,
        "support": 13657
    }
}
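The numbers in this report are related to one another, and a short sketch can make those relationships explicit. The values below are copied from the report above; the helper `f1_score` is defined here for illustration and is not the scikit-learn function of the same name. One observation worth hedging: `roc_auc` equals the macro-averaged recall here, which is what one would expect when `roc_auc_score` receives hard class predictions from `model.predict` rather than probabilities, since the ROC curve then has a single operating point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation report above.
recall_0, support_0 = 0.757191871206123, 11367
precision_1, recall_1, support_1 = 0.3873473917869034, 0.7620087336244541, 2290

f1_1 = f1_score(precision_1, recall_1)

# Unweighted mean of per-class recall (macro avg recall).
macro_recall = (recall_0 + recall_1) / 2

# Support-weighted mean of per-class recall, which equals overall accuracy.
weighted_recall = (recall_0 * support_0 + recall_1 * support_1) / (support_0 + support_1)

print(f1_1, macro_recall, weighted_recall)
```

If a ranking-based ROC AUC is wanted, a common alternative is to score the test set with class probabilities (for example, the positive-class column of `predict_proba`) and pass those to `roc_auc_score`. That would change the reported value, so it is mentioned only as an option.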


## Running processing jobs with your own dependencies

Above, you used a processing container that has scikit-learn installed, but you can also run your own processing container in a processing job and still provide a script to run inside it.

The following cells show how to build a processing container and how to use a `ScriptProcessor` to run your own code inside it. Create a scikit-learn container and run a processing job using the same `preprocessing.py` script you used above. You can provide your own dependencies inside this container to run your processing script.

In [12]:
!mkdir docker

Write the Dockerfile that defines the processing container. It installs `pandas` and `scikit-learn`; you can install your own dependencies in the same way.

In [13]:
%%writefile docker/Dockerfile

FROM python:3.7-slim-buster

RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

Writing docker/Dockerfile


This code block builds the container with the `docker` command, creates an Amazon Elastic Container Registry (Amazon ECR) repository, and pushes the image to Amazon ECR.

In [14]:
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-processing-container'
tag = ':latest'
processing_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)

# Create ECR repository and push docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

Sending build context to Docker daemon  2.048kB
Step 1/4 : FROM python:3.7-slim-buster
3.7-slim-buster: Pulling from library/python

ee12ec04: Pulling fs layer
d83f8229: Pulling fs layer
0bee82a3: Pulling fs layer
dcedfc84: Pulling fs layer
1cccc7f9: Pull complete
Digest: sha256:59af1bb7fb92ff97c9a23abae23f6beda13a95dbfd8100c7a2f71d150c0dc6e5
Status: Downloaded newer image for python:3.7-slim-buster
 ---> 9f4008bf3f11
Step 2/4 : RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3
 ---> Running in 9d4660d08bdc
Collecting pandas==0.25.3
  Downloading https://files.pythonh

The `ScriptProcessor` class lets you run a command inside this container, which you can use to run your own script.

In [15]:
from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(command=['python3'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge')

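Run the same `preprocessing.py` script you ran above, but now the code runs inside the Docker container you built in this notebook rather than the scikit-learn image maintained by Amazon SageMaker. You can add dependencies to the Docker image and run your own pre-processing, feature engineering, and model evaluation scripts inside this container.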

In [16]:
script_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source=input_data,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/test')],
                      arguments=['--train-test-split-ratio', '0.2']
                     )
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)


Job Name:  sagemaker-processing-container-2019-12-06-00-33-00-387
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-sample-data-us-east-1/processing/census/census-income.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-143656149352/sagemaker-processing-container-2019-12-06-00-33-00-387/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-143656149352/sagemaker-processing-container-2019-12-06-00-33-00-387/output/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'test_data', 'S3Output': {'S3Uri': 