### 1. Modelling

#### [SageMaker XGBoost](https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/xgboost.html)
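#### XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to predict a target variable accurately by combining the estimates of a set of simpler, weaker models.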
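
To make the idea of "combining a set of simpler, weaker models" concrete, here is a minimal sketch of gradient boosting for squared error, using one-split decision stumps as the weak models. It is not how XGBoost is implemented (XGBoost adds regularization, second-order gradients, and deeper trees); the helper names `fit_stump`, `predict_stump`, and `boost` are hypothetical, and the toy sine data is an assumption made purely for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on 1-D x that best fits residual (squared error)."""
    best = None
    # Exclude the largest value so the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, num_round=20, eta=0.2):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(num_round):
        residual = y - prediction            # what the ensemble still gets wrong
        stump = fit_stump(x, residual)
        prediction += eta * predict_stump(stump, x)  # eta shrinks each step
        stumps.append(stump)
    return base, stumps, prediction

# Toy data (an assumption for illustration only).
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)

base, stumps, fitted = boost(x, y)
baseline_mse = np.mean((y - base) ** 2)
boosted_mse = np.mean((y - fitted) ** 2)
print(baseline_mse, boosted_mse)
```

The `num_round` and `eta` arguments play the same role as the hyperparameters of the same name passed to the estimator later in this notebook: more rounds add more weak models, and a smaller `eta` makes each one contribute less.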

##### Enter your ACCOUNT_ID.

In [None]:
S3_BUCKET_POSTFIX = '123456789'

##### Train

In [None]:
import boto3
import sagemaker
import pandas as pd
import numpy as np

from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.predictor import csv_serializer
from sklearn.metrics import accuracy_score

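# This notebook is written against version 1 of the SageMaker Python SDK;
# version 2 renamed several of these names (for example, s3_input became TrainingInput).

# IAM role this notebook runs under; SageMaker assumes it for training and hosting.
role = get_execution_role()
region = boto3.Session().region_name
# Built-in XGBoost container image (version 0.90-1) for the current region.
container = get_image_uri(region, 'xgboost', '0.90-1')

# Workshop bucket; the suffix is the value entered above.
s3_bucket = 's3://analytics-hol-' + S3_BUCKET_POSTFIX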

In [None]:
s3_input_train = sagemaker.s3_input(s3_data=s3_bucket + '/train', content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=s3_bucket + '/validation', content_type='csv')

In [None]:
sess = sagemaker.Session()

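# Generic estimator that runs the XGBoost container on a single training instance.
xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path=s3_bucket + '/output',
                                    sagemaker_session=sess)
# binary:logistic makes the model output a probability for the positive class;
# num_round is the number of boosting rounds (trees).
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

# Launch the training job with the train and validation channels.
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})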

##### Deploy

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

##### Evaluate

In [None]:
# Send requests as CSV and keep the raw response bytes (no deserializer).
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

In [None]:
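def predict(data, rows=500):
    """Score data against the endpoint in batches of roughly `rows` rows."""
    # Split into batches so that each request stays small.
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is a comma-separated string of scores, returned as bytes.
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    # Drop the leading comma left by the first join, then parse the scores.
    return np.fromstring(predictions[1:], sep=',')

test_data = pd.read_csv(s3_bucket + '/test/test.csv')
# The first column holds the label; the remaining columns are features.
actual = test_data.iloc[:, 0]
# Round the predicted probabilities to class labels 0 or 1.
predictions = np.round(predict(test_data.values[:, 1:]))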
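
The batching logic in `predict` can be tried without a live endpoint. The sketch below is only an illustration: `fake_endpoint_predict` is a hypothetical stand-in for `xgb_predictor.predict` that returns comma-separated bytes, and the 1200-row array is made-up data. It shows how `np.array_split` divides the rows and how the per-batch responses are stitched back together.

```python
import numpy as np

def fake_endpoint_predict(batch):
    # Hypothetical stand-in for xgb_predictor.predict: returns CSV bytes,
    # one score per row (here simply the row mean).
    return ','.join(str(v) for v in batch.mean(axis=1)).encode('utf-8')

def predict_in_batches(data, rows=500):
    # Same batch count as in the notebook: floor(n / rows) + 1.
    n_batches = int(data.shape[0] / float(rows) + 1)
    scores = []
    for batch in np.array_split(data, n_batches):
        if batch.shape[0] == 0:
            continue  # nothing to send for an empty batch
        text = fake_endpoint_predict(batch).decode('utf-8')
        scores.extend(float(v) for v in text.split(','))
    return np.array(scores)

# Made-up input: 1200 rows, 3 features.
data = np.arange(1200 * 3, dtype=float).reshape(1200, 3)

batch_sizes = [b.shape[0] for b in np.array_split(data, int(1200 / 500.0 + 1))]
scores = predict_in_batches(data)
print(batch_sizes, scores.shape)
```

Note that `np.array_split` produces near-equal batches rather than batches of exactly `rows` rows, so `rows` acts as an upper bound on the batch size rather than a fixed size.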

In [None]:
pd.crosstab(index=actual, columns=predictions, rownames=['actual'], colnames=['predictions'])
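
As a reading aid for the table above, the following sketch builds the same kind of cross-tabulation from a handful of made-up labels (the values are assumptions chosen for illustration, not results from this model). Rows are the actual classes, columns are the predicted classes, so the diagonal counts the correct predictions.

```python
import numpy as np
import pandas as pd

# Made-up labels and rounded predictions, for illustration only.
actual = pd.Series([0, 0, 1, 1, 1, 0, 1, 0])
predictions = np.array([0., 1., 1., 1., 0., 0., 1., 0.])

confusion = pd.crosstab(index=actual, columns=predictions,
                        rownames=['actual'], colnames=['predictions'])

true_negative = confusion.loc[0, 0.0]   # actual 0, predicted 0
false_positive = confusion.loc[0, 1.0]  # actual 0, predicted 1
false_negative = confusion.loc[1, 0.0]  # actual 1, predicted 0
true_positive = confusion.loc[1, 1.0]   # actual 1, predicted 1
print(confusion)
```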

In [None]:
accuracy_score(actual, predictions)
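
Accuracy is simply the fraction of predictions that match the actual labels. The sketch below recomputes it by hand on made-up scores (an assumption for illustration) and also shows one detail of the `np.round` step used earlier: NumPy rounds halves to the nearest even value, so a score of exactly 0.5 becomes 0, whereas an explicit `>= 0.5` threshold would assign it to class 1.

```python
import numpy as np

# Made-up probabilities and labels, for illustration only.
scores = np.array([0.1, 0.5, 0.51, 0.9, 0.49])
actual = np.array([0, 1, 1, 1, 0])

rounded = np.round(scores)                    # 0.5 rounds to 0.0 (half to even)
thresholded = (scores >= 0.5).astype(float)   # 0.5 counts as class 1

# Accuracy is the mean of the element-wise matches.
accuracy_rounded = np.mean(rounded == actual)
accuracy_thresholded = np.mean(thresholded == actual)
print(accuracy_rounded, accuracy_thresholded)
```

Scores of exactly 0.5 are uncommon in practice, so the two approaches usually agree; an explicit threshold mainly makes the cut-off visible and easy to change.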

##### Clean up

In [None]:
# Delete the hosted endpoint so that it stops incurring charges.
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)