# Module 3. Training on Amazon SageMaker
---

In this module, you train a model by calling the Amazon SageMaker API. If you are more interested in multi-GPU distributed training, or are already familiar with basic SageMaker usage, you can skip this module and go straight to Module 4.

Unlike the previous modules, an inexpensive instance type is sufficient for the SageMaker notebook instance. Choose a GPU-based instance type only for the training instance.

In [1]:
import boto3
import sagemaker

boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)

In [12]:
from sagemaker.pytorch import PyTorch
role = sagemaker.get_execution_role()

In [13]:
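estimator = PyTorch(entry_point='train.py',  # training script, located inside source_dir
                    source_dir='src',
                    role=role,
                    # SageMaker Python SDK v1 argument names; SDK v2 renames
                    # these to instance_type and instance_count.
                    train_instance_type='ml.p3.2xlarge',  # GPU instance for training
                    train_instance_count=1,
                    framework_version='1.5.0',
                    py_version='py3',
                    # Forwarded to train.py as command-line arguments.
                    hyperparameters={'num_epochs': 1,
                                     'num_folds': 5,
                                     'vld_fold_idx': 4,
                                     'batch_size': 256,
                                     'lr': 0.001,
                                     'log_interval': 10})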
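In script mode, SageMaker forwards each entry of `hyperparameters` to the entry point as a command-line flag (for example `--num_epochs 1`). The actual `src/train.py` is not shown in this module, so the following is only a minimal sketch of how such a script might read those values. The `parse_args` helper, the `--model_dir` flag, and the local `./model` fallback are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed as CLI flags (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Flag names and defaults mirror the hyperparameters dict in the estimator.
    parser.add_argument('--num_epochs', type=int, default=1)
    parser.add_argument('--num_folds', type=int, default=5)
    parser.add_argument('--vld_fold_idx', type=int, default=4)
    parser.add_argument('--batch_size', type=int, default=256)
    parser.add_argument('--lr', type=float, default=0.001)
    parser.add_argument('--log_interval', type=int, default=10)
    # SM_MODEL_DIR is set inside the training container; the './model'
    # fallback is an assumption so the script also runs locally.
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', './model'))
    # parse_known_args ignores any extra flags instead of raising an error.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(['--num_epochs', '1', '--lr', '0.001'])
print(args.num_epochs, args.batch_size, args.lr)
```

Passing an explicit list to `parse_args` makes it possible to try the parsing logic in a notebook cell before launching a training job.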

In [14]:
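bucket = sagemaker_session.default_bucket()  # reuse the session created in the first cell
prefix = 'bangali/train'
# sagemaker.s3_input is the SDK v1 name; SDK v2 replaces it with sagemaker.inputs.TrainingInput.
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}'.format(bucket, prefix), content_type='csv')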
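When a single input is passed to `fit` (rather than a dict of named channels), SageMaker uses the channel name `training`, downloads the S3 objects to `/opt/ml/input/data/training` inside the container, and exposes that path as the `SM_CHANNEL_TRAINING` environment variable. The sketch below shows one way a training script could locate the downloaded files. The helper name `list_training_csvs` is hypothetical, and it assumes the CSV files sit directly under the prefix.

```python
import glob
import os


def list_training_csvs(channel='training'):
    """Return sorted CSV paths for a SageMaker input channel (hypothetical helper)."""
    env_key = 'SM_CHANNEL_{}'.format(channel.upper())
    # Fall back to the conventional container path if the variable is unset.
    data_dir = os.environ.get(env_key, os.path.join('/opt/ml/input/data', channel))
    return sorted(glob.glob(os.path.join(data_dir, '*.csv')))


# Outside a training container this usually prints an empty list.
print(list_training_csvs())
```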

In [15]:
estimator.fit(s3_input_train)

2020-08-03 02:46:37 Starting - Starting the training job...
2020-08-03 02:46:39 Starting - Launching requested ML instances.........
2020-08-03 02:48:20 Starting - Preparing the instances for training......
2020-08-03 02:49:28 Downloading - Downloading input data......
2020-08-03 02:50:30 Training - Downloading the training image.....
2020-08-03 02:51:19 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-08-03 02:51:20,649 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-08-03 02:51:20,677 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-08-03 02:51:20,682 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-08-03 02:51:20,984 sagemaker-containers INFO     Module default_user_module_name does 

=== Getting Pre-trained model ===
=== Start Training ===
[2020-08-03 02:57:36.251 algo-1:53 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2020-08-03 02:57:36.252 algo-1:53 INFO hook.py:183] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2020-08-03 02:57:36.252 algo-1:53 INFO hook.py:228] Saving to /opt/ml/output/tensors
[2020-08-03 02:57:36.298 algo-1:53 INFO hook.py:364] Monitoring the collections: losses
[2020-08-03 02:57:36.299 algo-1:53 INFO hook.py:422] Hook is writing from the hook with pid: 53
[Epoch 0 Batch 10/628] loss: 3.1253
[Epoch 0 Batch 20/628] loss: 2.6783
[Epoch 0 Batch 30/628] loss: 2.5658
[Epoch 0 Batch 40/628] loss: 2.3485
[Epoch 0 Batch 50/628] loss: 2.0930
[Epoch 0 Batch 60/628] loss: 2.1415
[Epoch 0 Batch 70/628] loss: 1.9556
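The `[Epoch … Batch …/…] loss: …` lines in the log above are printed every `log_interval` batches. To check the loss trend without reading the raw output, those lines can be pulled out with a regular expression. This is a rough sketch: the `parse_losses` helper is hypothetical, and the pattern assumes the exact line format shown in this log.

```python
import re

# Matches lines such as "[Epoch 0 Batch 10/628] loss: 3.1253".
LOSS_LINE = re.compile(r'\[Epoch (\d+) Batch (\d+)/(\d+)\] loss: (\d+(?:\.\d+)?)')


def parse_losses(log_text):
    """Return (epoch, batch, total_batches, loss) tuples found in log_text."""
    records = []
    for line in log_text.splitlines():
        # search() tolerates extra characters before or after the pattern.
        match = LOSS_LINE.search(line)
        if match:
            epoch, batch, total, loss = match.groups()
            records.append((int(epoch), int(batch), int(total), float(loss)))
    return records


sample_log = """[Epoch 0 Batch 10/628] loss: 3.1253
[Epoch 0 Batch 20/628] loss: 2.6783
[Epoch 0 Batch 30/628] loss: 2.5658"""
for epoch, batch, total, loss in parse_losses(sample_log):
    print(epoch, batch, total, loss)
```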