# Fine-tuning Naver Movie Review Sentiment Classification with KoBERT on Amazon SageMaker
---

*Note: If you want to inspect the training results interactively or need background on BERT (Bidirectional Encoder Representations from Transformers), please refer to [this notebook](module1_kobert_nsmc_finetuning.ipynb) first.*

### Introduction
Training on Amazon SageMaker is straightforward: in most cases you only need to specify your training script as the entry point.

Some cases do require building a new Docker container with BYOC (Bring Your Own Container). In this example, however, we train with the training script alone, without building a Docker container or pushing it to Amazon ECR. Additional packages are easy to install: include the code snippet below in your training script.

```python
import subprocess
import sys

# Install/Update Packages
subprocess.call([sys.executable, '-m', 'pip', 'install', 'gluonnlp', 'torch', 'sentencepiece',
                 'onnxruntime', 'transformers', 'git+https://git@github.com/SKTBrain/KoBERT.git@master'])
```
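
A slightly more defensive variant of this install step, as a sketch: it builds the pip command in a helper and uses `subprocess.check_call`, so a failed install stops the job instead of being silently ignored. The names `build_pip_command`, `install_packages`, and `PACKAGES` are illustrative and are not part of the original training script.

```python
import subprocess
import sys

# Same package list as the install snippet in this section.
PACKAGES = ['gluonnlp', 'torch', 'sentencepiece', 'onnxruntime', 'transformers',
            'git+https://git@github.com/SKTBrain/KoBERT.git@master']


def build_pip_command(packages):
    """Return the pip command as an argument list (hypothetical helper)."""
    return [sys.executable, '-m', 'pip', 'install', *packages]


def install_packages(packages):
    """Run pip with the current interpreter; raises CalledProcessError on failure."""
    subprocess.check_call(build_pip_command(packages))
```

Calling `install_packages(PACKAGES)` at the top of the training script, before the packages are imported, would then perform the installation inside the training container.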

<br>

## 1. Preparing Data
---

Just download the data and upload it to Amazon S3. The data is already a well-organized table, so it can be used for training without any pre-processing; wrangling real-world data is usually more complicated.
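
To give a feel for the table, here is a minimal sketch of reading it, assuming the usual NSMC layout of a tab-separated file with a header row of `id`, `document`, and `label` (0 for negative, 1 for positive). The two sample rows are made up for illustration, and `read_nsmc` is a hypothetical helper, not a function from the training script.

```python
import csv
import io

# Made-up rows in the assumed NSMC layout: id <TAB> document <TAB> label
SAMPLE = "id\tdocument\tlabel\n1\t재미있어요\t1\n2\t지루한 영화\t0\n"


def read_nsmc(handle):
    """Parse a tab-separated NSMC-style file into (text, label) pairs."""
    # QUOTE_NONE keeps quotation marks inside reviews as literal characters.
    reader = csv.DictReader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    return [(row['document'], int(row['label'])) for row in reader]


pairs = read_nsmc(io.StringIO(SAMPLE))
print(pairs)
```

For the downloaded file, the same function could be applied to `open('train.txt', encoding='utf-8', newline='')`.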

In [1]:
import io, os
import random
import pandas as pd
import numpy as np
import mxnet as mx
from mxnet.gluon import nn, rnn
from mxnet import gluon, autograd
import gluonnlp as nlp
from mxnet import nd 
import time
import itertools

from kobert.mxnet_kobert import get_mxnet_kobert_model
from kobert.utils import get_tokenizer

import warnings
warnings.filterwarnings('ignore')

In [2]:
import boto3
import sagemaker

boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)

In [3]:
!wget -O train.txt https://www.dropbox.com/s/374ftkec978br3d/ratings_train.txt?dl=1
!wget -O validation.txt https://www.dropbox.com/s/977gbwh542gdy94/ratings_test.txt?dl=1

--2020-05-20 11:34:10--  https://www.dropbox.com/s/374ftkec978br3d/ratings_train.txt?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.1, 2620:100:601c:1::a27d:601
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/374ftkec978br3d/ratings_train.txt [following]
--2020-05-20 11:34:10--  https://www.dropbox.com/s/dl/374ftkec978br3d/ratings_train.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uccccc63e56e2e1470aa07e29329.dl.dropboxusercontent.com/cd/0/get/A4FTSoL6Pm25UKnNATp7zS3PpxLNvYZvOOxW9vhIIwWP-PCrBLbYRVs0qr-FThhjP5xlBLgJW0i6O5Ecwx7UtveBCqyyNzcU_VuOrYC0DqJUtJDizKATLYJ_ZQzYDSQlHfw/file?dl=1# [following]
--2020-05-20 11:34:10--  https://uccccc63e56e2e1470aa07e29329.dl.dropboxusercontent.com/cd/0/get/A4FTSoL6Pm25UKnNATp7zS3PpxLNvYZvOOxW9vhIIwWP-PCrBLbYRVs0qr-FThhjP5xlBLgJW0i6O5Ecwx7UtveBCqyyNzcU_

In [4]:
bucket = sagemaker_session.default_bucket()
prefix = 'data/KoBERT-nsmc'

In [5]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.txt')).upload_file('train.txt')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.txt')).upload_file('validation.txt')

In [6]:
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, 'train.txt')
s3_validation_data = 's3://{}/{}/validation/{}'.format(bucket, prefix, 'validation.txt')
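
The upload cell builds S3 keys with `os.path.join`, which uses the local OS separator. That is fine on a Linux notebook instance, but on Windows it would produce backslashes. A small sketch of an OS-independent alternative follows; `s3_key` and `s3_uri` are hypothetical helpers and `my-bucket` is a placeholder bucket name.

```python
import posixpath


def s3_key(prefix, *parts):
    """Join key components with '/' regardless of the local OS."""
    return posixpath.join(prefix, *parts)


def s3_uri(bucket, key):
    """Format a bucket and key as an s3:// URI."""
    return 's3://{}/{}'.format(bucket, key)


key = s3_key('data/KoBERT-nsmc', 'train', 'train.txt')
uri = s3_uri('my-bucket', key)  # 'my-bucket' is a placeholder
print(uri)
```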

<br>

## 2. Training
---

### Environment Variables 
When SageMaker runs training with `estimator.fit()`, the root path inside the training Docker container is `/opt/ml`, and the data is stored in subdirectories of `/opt/ml`.
When training completes, the model parameters in the model folder (`/opt/ml/model`) are automatically compressed into `model.tar.gz` and saved to Amazon S3.
For reference, the directory layout of the training container is as follows.

```
/opt/ml/
    input/
        config/
        data/
    model/
    output/
        failure/
```

Typical environment variables are as follows.
 
- `SM_MODEL_DIR`: The path where the finished model artifacts are stored, `/opt/ml/model`. Model parameters in this folder are automatically compressed into `model.tar.gz` and stored in Amazon S3, with no extra handling required in your script.
- `SM_CHANNEL_TRAIN`: The path where the training data is stored, `/opt/ml/input/data/train`.
- `SM_CHANNEL_VALIDATION`: The path where the validation data is stored, `/opt/ml/input/data/validation`.
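
As a sketch of how a script might read these variables, the helper below falls back to the container paths listed above when a variable is not set, which can be convenient when running the script outside SageMaker. `DEFAULTS` and `resolve_dirs` are illustrative names, not part of any SageMaker API.

```python
import os

# Container paths from the list above, used when a variable is not set.
DEFAULTS = {
    'SM_MODEL_DIR': '/opt/ml/model',
    'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
    'SM_CHANNEL_VALIDATION': '/opt/ml/input/data/validation',
}


def resolve_dirs(environ=None):
    """Return each directory, preferring the environment over the default."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


# Example: override only the training channel, as in a local test run.
dirs = resolve_dirs({'SM_CHANNEL_TRAIN': './data/train'})
print(dirs['SM_CHANNEL_TRAIN'], dirs['SM_MODEL_DIR'])
```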

The following table shows example S3 paths and the container path each one maps to, together with the corresponding environment variable.

|  S3 path  |  Environment variable  | Container path |
| :---- | :---- | :----| 
|  s3://bucket_name/prefix/train  |  `SM_CHANNEL_TRAIN`  | `/opt/ml/input/data/train`  |
|  s3://bucket_name/prefix/validation  |  `SM_CHANNEL_VALIDATION`  | `/opt/ml/input/data/validation`  |
|  s3://bucket_name/prefix/eval  |  `SM_CHANNEL_EVAL`  | `/opt/ml/input/data/eval`  |
|  s3://bucket_name/prefix/model.tar.gz  |  `SM_MODEL_DIR`  |  `/opt/ml/model`  |
|  s3://bucket_name/prefix/output.tar.gz  |  `SM_OUTPUT_DATA_DIR`  |  `/opt/ml/output/data`  |
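
The channel rows of the table follow a simple naming pattern, sketched below: the channel name given to `fit()` (for example `train`) determines both the environment variable and the container directory. The two functions are hypothetical helpers that only reproduce the pattern shown in the table.

```python
def channel_env_var(channel):
    """Environment variable name for a channel, e.g. 'train' -> 'SM_CHANNEL_TRAIN'."""
    return 'SM_CHANNEL_' + channel.upper()


def channel_container_path(channel):
    """Container directory for a channel, e.g. 'train' -> '/opt/ml/input/data/train'."""
    return '/opt/ml/input/data/' + channel


for name in ('train', 'validation', 'eval'):
    print(name, channel_env_var(name), channel_container_path(name))
```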

Please refer to the document below for more details.
https://github.com/aws/sagemaker-containers#list-of-provided-environment-variables-by-sagemaker-containers


### Training Script

The only code you need to add to the training script is the handling of the SageMaker environment variables:
```python
import argparse
import os

parser = argparse.ArgumentParser()

# Hyperparameters sent by the client are passed as command-line arguments to the script.
parser.add_argument('--num_epochs', type=int, default=4)
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--lr', type=float, default=5e-5)
parser.add_argument('--log_interval', type=int, default=50)

# SageMaker environment variables
parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
parser.add_argument('--model_output_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
```
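
To illustrate how hyperparameters reach the parser, the sketch below converts a hyperparameter dictionary into `--name value` arguments and parses them. `hyperparameters_to_argv` is a hypothetical helper that only approximates what the framework container does; it is not the container's actual code.

```python
import argparse


def hyperparameters_to_argv(hyperparameters):
    """Turn {'num_epochs': 1} into ['--num_epochs', '1'] (illustrative only)."""
    argv = []
    for name, value in hyperparameters.items():
        argv += ['--' + name, str(value)]
    return argv


parser = argparse.ArgumentParser()
parser.add_argument('--num_epochs', type=int, default=4)
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--lr', type=float, default=5e-5)

argv = hyperparameters_to_argv({'num_epochs': 1, 'lr': 5e-5})
args = parser.parse_args(argv)
print(args.num_epochs, args.batch_size, args.lr)
```

Values that are not passed, such as `batch_size` here, keep the defaults declared in the parser.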

### Training

We fine-tune the KoBERT model using MXNet and GluonNLP as the backend deep learning framework.
With a custom training script in hand, training on Amazon SageMaker only requires creating an MXNet estimator.

In [7]:
from sagemaker.mxnet import MXNet
role = sagemaker.get_execution_role()

mxnet_estimator = MXNet(entry_point='train.py',
                        source_dir='src',
                        role=role,
                        train_instance_type='ml.p3.2xlarge',
                        train_instance_count=1,
                        framework_version='1.6.0',
                        py_version='py3',
                        hyperparameters={'num_epochs': 1,
                                         'batch_size': 64,
                                         'lr': 5e-5,
                                         'log_interval': 50})

Provisioning a training instance and installing dependencies such as GluonNLP and KoBERT takes a while, so one epoch on `ml.p3.2xlarge` takes about 18 minutes in total. Note, however, that you are not charged for all of this time: you are billed only while the training instance is actually running training.
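
One way to check this after a job has finished is to compare the wall-clock time with the billable time reported for the job. The sketch below assumes `description` is the dictionary returned by the SageMaker `describe_training_job` API call and that it contains `BillableTimeInSeconds`; the billable value used in the example is made up for illustration and is not a result from this notebook.

```python
def unbilled_seconds(wall_clock_seconds, description):
    """Seconds of elapsed time that were not billed, given a job description dict."""
    return wall_clock_seconds - description['BillableTimeInSeconds']


# About 18 minutes of wall-clock time; the billable figure is a made-up example.
example_description = {'BillableTimeInSeconds': 600}
print(unbilled_seconds(18 * 60, example_description))
```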

*[Note]
Since August 2019, SageMaker can significantly reduce training costs by running training jobs on EC2 Spot Instances. This feature is called Managed Spot Training, and you can enable it by setting `train_use_spot_instances=True`. For more information, see the AWS blog post below. <br>
https://aws.amazon.com/blogs/aws/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/*
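
A sketch of the estimator arguments for Managed Spot Training, assuming the same SageMaker Python SDK parameter style used in this notebook (`train_use_spot_instances`, `train_max_run`, `train_max_wait`). The maximum wait time must be at least as long as the maximum run time; `spot_training_kwargs` is a hypothetical helper and the durations in the example are arbitrary.

```python
def spot_training_kwargs(max_run_seconds, max_wait_seconds):
    """Build keyword arguments that enable Managed Spot Training on an estimator."""
    if max_wait_seconds < max_run_seconds:
        raise ValueError('max_wait_seconds must be >= max_run_seconds')
    return {
        'train_use_spot_instances': True,   # a boolean, not the string 'True'
        'train_max_run': max_run_seconds,   # upper bound on training time
        'train_max_wait': max_wait_seconds, # upper bound on waiting plus training
    }


# Arbitrary example: train for up to 1 hour, wait for capacity up to 2 hours in total.
kwargs = spot_training_kwargs(3600, 7200)
print(kwargs)
```

These could be passed to the estimator alongside the existing arguments, for example `MXNet(entry_point='train.py', ..., **kwargs)`.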

In [8]:
%%time
mxnet_estimator.fit({'train': s3_train_data, 'validation': s3_validation_data})

2020-05-20 11:34:13 Starting - Starting the training job...
2020-05-20 11:34:20 Starting - Launching requested ML instances......
2020-05-20 11:35:34 Starting - Preparing the instances for training......
2020-05-20 11:36:34 Downloading - Downloading input data
2020-05-20 11:36:34 Training - Downloading the training image......
2020-05-20 11:37:30 Training - Training image download completed. Training in progress.
2020-05-20 11:37:31,455 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
2020-05-20 11:37:31,484 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"batch_size":64,"log_interval":50,"lr":5e-05,"num_epochs":1}', 'SM_USER_ENTRY_POINT': 'train.py', 'SM_FRAMEWORK_PARAMS': '{}', 'SM_RESOURCE_CONFIG': '{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}', 'SM_INPUT_DATA_CONFIG': '{"train":{"RecordWrapperType":"None"

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/3b/88/49e772d686088e1278766ad68a463513642a2a877487decbd691dec02955/sentencepiece-0.1.90-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting onnxruntime
  Downloading https://files.pythonhosted.org/packages/5b/37/572986fb63e0df4e026c5f4c11f6a8977344293587b451d9210a429f5882/onnxruntime-1.3.0-cp36-cp36m-manylinux1_x86_64.whl (3.9MB)
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/22/97/7db72a0beef1825f82188a4b923e62a146271ac2ced7928baa4d47ef2467/transformers-2.9.1-py3-none-any.whl (641kB)
Collecting future
  Downloading https://files.pythonhosted.org/packages/45/0b/38b06fd9b92dc2b68d58b75f900e97884c45bedd2ff83203d933cf5851c9/future-0.18.2.tar.gz (829kB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x8

=== Getting Data ===
/opt/ml/input/data/train/train.txt
/opt/ml/input/data/validation/validation.txt
using cached model
=== Start Training ===
[2020-05-20 11:40:04.991 ip-10-0-208-220.ec2.internal:39 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2020-05-20 11:40:04.991 ip-10-0-208-220.ec2.internal:39 INFO hook.py:170] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2020-05-20 11:40:04.991 ip-10-0-208-220.ec2.internal:39 INFO hook.py:215] Saving to /opt/ml/output/tensors
[2020-05-20 11:40:05.016 ip-10-0-208-220.ec2.internal:39 INFO hook.py:351] Monitoring the collections: losses
[2020-05-20 11:40:05.165 ip-10-0-208-220.ec2.internal:39 INFO hook.py:226] Registering hook for block softmaxcrossentropyloss0
ERROR:root:'NoneType' object has no attribute 'write'
[Epoch 0 Batch 50/234