<center><h1>Train model with SageMaker</h1></center>

## 1. Create a Training Job

In [None]:
import boto3
from time import strftime, gmtime
import json

In [None]:
## Set a sagemaker role  
try:
    # if you are on a sagemaker notebook instance
    import sagemaker
    role = sagemaker.get_execution_role()
except Exception:
    # if running locally, create a SageMaker execution role in the AWS console and set its name here
    iam = boto3.client('iam')
    role_name = "YOUR_EXECUTIONROLE_FOR_SAGEMAKER"
    role = iam.get_role(RoleName=role_name)['Role']['Arn']

If you are running this notebook locally, make sure to replace the value of `role_name` with the name of your SageMaker execution role. <br>
To create a role, sign in to the [AWS Management Console](https://console.aws.amazon.com/iam/) and choose Roles in the left navigation pane.

In [None]:
region = boto3.Session().region_name # get the region name
account = boto3.Session().client('sts').get_caller_identity()['Account'] # get the account id
sm = boto3.Session().client('sagemaker') # create a SageMaker client
print("role: {}".format(role))
print("region: {}".format(region))
print("account: {}".format(account))

**Specify the data and model location** <br>
Please change the parameters in the following cells according to the location of your data and where you want to store the model artefacts.

In [None]:
# Data location
bucket_name = "sm-transformers-datasets" # Bucket name where the data is located
train_prefix = "data/dataset_multilabel_500" # folder of train data
models_prefix = "models" # folder where model will be saved
train_s3_uri = "s3://{}/{}".format(bucket_name, train_prefix)
models_s3_uri = "s3://{}/{}".format(bucket_name, models_prefix)
print("Train data location : {}".format(train_s3_uri))
print("Models data location : {}".format(models_s3_uri))

**Specify the docker image name**

In [None]:
image_name = "sm-transformers-gpu"
image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, image_name)
print("image of model: {}".format(image))
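The image URI follows the usual Amazon ECR pattern of account, region, repository and tag. As a minimal sketch, the same string can be built by a small helper; `ecr_image_uri` and the account ID in the example are hypothetical, and the `amazonaws.com` suffix is assumed (some partitions, such as the China regions, use a different one).

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Build the URI of an image stored in a private ECR repository."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)


# Example with a made-up account ID
example_image = ecr_image_uri("123456789012", "eu-west-1", "sm-transformers-gpu")
print(example_image)
```

Pinning a specific tag instead of `latest` makes it easier to know which image a given training job used.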

**Set the training job name**

The following cell generates a name from the image name and a timestamp. To set the name explicitly, skip that cell and assign your own value to `training_job_name`.

In [None]:
# Training job name 
training_job_name = "{}-{}".format(image_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
# shorten it (SageMaker allows at most 63 characters); keep the tail so the timestamp survives
if len(training_job_name) > 63:
    training_job_name = training_job_name[-63:].lstrip("-")
print("training job name : {}".format(training_job_name))
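The naming logic can also be wrapped in a reusable function. The sketch below is one possible version, not part of the original notebook: `make_job_name` is a hypothetical helper, and it assumes that training job names may contain only letters, digits and hyphens, must start with a letter or digit, and may be at most 63 characters long.

```python
import re
from time import gmtime, strftime

MAX_NAME_LENGTH = 63  # assumed SageMaker limit for training job names


def make_job_name(prefix, max_length=MAX_NAME_LENGTH):
    """Build '<prefix>-<UTC timestamp>' and trim it to max_length characters."""
    # Replace every character other than letters, digits and hyphens
    safe_prefix = re.sub(r"[^a-zA-Z0-9-]", "-", prefix)
    name = "{}-{}".format(safe_prefix, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
    # Keep the tail so the timestamp survives, then drop any leading hyphen
    return name[-max_length:].lstrip("-")


print(make_job_name("sm-transformers-gpu"))
```

Trimming from the front keeps the timestamp, which is the part that makes the name unique.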

**Set checkpoints path**

Optionally, you can point the checkpoint path at an earlier training job in order to resume it.

In [None]:
checkpoints_s3_uri = "s3://{}/{}/{}/checkpoints".format(bucket_name, models_prefix, training_job_name) # to resume, replace training_job_name with the name of the earlier job
print("checkpoints will be saved in {}".format(checkpoints_s3_uri))
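A small helper can make the resume option explicit. This is only a sketch: `checkpoints_uri` and its `resume_from` parameter are hypothetical, and whether training actually continues from the old checkpoints depends on the training script inside the container loading them from the local checkpoint directory.

```python
def checkpoints_uri(bucket, prefix, job_name, resume_from=None):
    """Return the S3 checkpoint location for a training job.

    If resume_from names an earlier training job, that job's checkpoint
    folder is returned instead of a fresh folder for job_name.
    """
    source_job = resume_from or job_name
    return "s3://{}/{}/{}/checkpoints".format(bucket, prefix, source_job)


# New job with its own checkpoint folder
print(checkpoints_uri("sm-transformers-datasets", "models", "new-job"))
# New job that reuses the checkpoints of an earlier (hypothetical) job
print(checkpoints_uri("sm-transformers-datasets", "models", "new-job", resume_from="old-job"))
```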

**Define Metrics**

In [None]:
## Metrics to follow during training (by parsing the logs!)
metrics = [
            {
            "Name": "training:epoch",
            "Regex": "'epoch': (.*?)}"
            },
            {
            "Name": "evaluation:loss",
            "Regex": "'eval_loss': (.*?),"
            },
            {
            "Name": "evaluation:accuracy",
            "Regex": "'eval_accuracy': (.*?)," # eval_mse(regression), eval_accuracy (classif), eval_accuracy_score(ner)
            }
        ]
metrics
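Because the metrics are extracted by applying these regular expressions to the training logs, it is worth testing them locally before launching a job. The sketch below uses Python's `re` module on a made-up log line; the line's shape is an assumption about what the training container prints, and SageMaker's own regex matching may differ in details from Python's.

```python
import re

# Hypothetical log line; the real output of the training container may differ
sample_log = "{'eval_loss': 0.5123, 'eval_accuracy': 0.81, 'epoch': 1.0}"

metric_definitions = [
    {"Name": "training:epoch", "Regex": "'epoch': (.*?)}"},
    {"Name": "evaluation:loss", "Regex": "'eval_loss': (.*?),"},
    {"Name": "evaluation:accuracy", "Regex": "'eval_accuracy': (.*?),"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: captured text} for every regex that matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = match.group(1)
    return found


parsed = extract_metrics(sample_log, metric_definitions)
print(parsed)
```

Note that a pattern ending in a comma, such as the one for `eval_accuracy`, does not match when that key is the last entry of the logged dictionary, so the patterns should be checked against real log lines.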

You can adjust or extend the hyperparameters of the model. <br>
For example, you can decrease the batch size if you get out-of-memory (OOM) errors, increase or decrease the max sequence length, etc.

In [None]:
## List of hyperparameters during training (optional)
hyperparameters = {
    "task_name": "multilabel-classif",
    "model_name": "bert-base-uncased",
    "max_steps": "1000",
    "use_bbox": "false",
    "per_device_train_batch_size": "10",
    "per_device_eval_batch_size": "10"
}
#allenai/longformer-base-4096
#bert-base-uncased
#microsoft/layoutlm-base-uncased
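All values in the dictionary above are strings because the `HyperParameters` field of `create_training_job` is a string-to-string map. If you prefer to write numbers and booleans natively, a conversion step such as the following sketch can be applied first; `to_sagemaker_hyperparameters` is a hypothetical helper, and the lowercase `true`/`false` spelling simply mirrors the `use_bbox` value used above.

```python
def to_sagemaker_hyperparameters(params):
    """Convert every value to a string; booleans become 'true' or 'false'."""
    converted = {}
    for key, value in params.items():
        if isinstance(value, bool):
            converted[key] = str(value).lower()
        else:
            converted[key] = str(value)
    return converted


example_hyperparameters = to_sagemaker_hyperparameters({
    "task_name": "multilabel-classif",
    "max_steps": 1000,
    "use_bbox": False,
    "per_device_train_batch_size": 10,
})
print(example_hyperparameters)
```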

Pick an instance type for training

In [None]:
# GPU: ml.g4dn.xlarge (cpu: 4, gpu: 1xT4, cpu-ram: 16, gpu-ram: 16, training/hour: $0.822)
## Classification
# bert: batch = 10 is ok (with text > 512) (70% of GPU RAM used)
# longformer: batch = 2 is ok if the text size after tokenization is < 2048; batch = 1 is ok up to the limit (4096) (89% of GPU RAM used)

## Token classification (NER)
# bert: batch = 10 is ok (with text > 512) (70% of GPU RAM used)
# longformer: same as classification: batch = 2 is ok if the text size after tokenization is < 2048; batch = 1 is ok up to the limit (4096) (89% of GPU RAM used)
# layoutlm: same as bert: batch = 10 is ok (with text > 512) (70% of GPU RAM used)

In [None]:
## List of GPU instances to choose from
#name            CPUs   GPU     RAM  GPU-RAM  TrainingPrice/hour
#ml.p3.2xlarge    8    1xV100    61    16         $4.627         
#ml.p2.xlarge     4     1xK80    61    12         $1.361
#ml.g4dn.xlarge   4     1xT4     16    16         $0.822
#ml.g4dn.2xlarge  8     1xT4     32    16         $1.173  <-
#ml.g4dn.4xlarge  16    1xT4     64    16         $1.879
#ml.g4dn.8xlarge  32    1xT4     128   16         $3.396
#ml.g4dn.12xlarge 48    4xT4     192   64         $6.107
#ml.g4dn.16xlarge 64    1xT4     256   16         $6.794

instance_type = "ml.g4dn.xlarge" # "ml.c4.4xlarge" 
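To compare instance types, a rough upper bound on the cost of a job can be computed from the hourly prices listed in the comments above. This is only a sketch: `estimate_cost` is a hypothetical helper, the prices are copied from this notebook and change by region and over time, and managed spot training is usually billed below the on-demand rate.

```python
# On-demand training prices in USD per hour, copied from the comment table above
HOURLY_PRICE_USD = {
    "ml.p3.2xlarge": 4.627,
    "ml.p2.xlarge": 1.361,
    "ml.g4dn.xlarge": 0.822,
    "ml.g4dn.2xlarge": 1.173,
    "ml.g4dn.4xlarge": 1.879,
    "ml.g4dn.8xlarge": 3.396,
    "ml.g4dn.12xlarge": 6.107,
    "ml.g4dn.16xlarge": 6.794,
}


def estimate_cost(instance_type, hours, instance_count=1):
    """Return the on-demand cost in USD, rounded to cents."""
    return round(HOURLY_PRICE_USD[instance_type] * hours * instance_count, 2)


# Cost of running for the full 24-hour runtime limit on one instance
for name in ("ml.g4dn.xlarge", "ml.p3.2xlarge"):
    print("{}: ${}".format(name, estimate_cost(name, hours=24)))
```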

Specify some additional parameters for the training job:
- Training image
- Role ARN
- Model Location
- Instance type for the training job
- Data config:
    - Location for the training data (and potentially test data if needed)

In [None]:
#cpu: ml.c4.4xlarge (16 cpus)

common_training_params = \
{
    "TrainingJobName": training_job_name,
    "AlgorithmSpecification": {
        "TrainingImage": image,
        "TrainingInputMode": "File",
        "MetricDefinitions" : metrics
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": models_s3_uri
    },
    "TensorBoardOutputConfig": { 
      #"LocalPath": "/opt/ml/output/tensorboard", #default value is /opt/ml/output/tensorboard
      "S3OutputPath": models_s3_uri
    },
    "ResourceConfig": {
        "InstanceCount": 1,   
        "InstanceType": instance_type,
        "VolumeSizeInGB": 60
    },
    "HyperParameters": hyperparameters,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400,
        "MaxWaitTimeInSeconds": 86400
    },
    "EnableManagedSpotTraining": True,
    "CheckpointConfig": { 
      #"LocalPath": "/opt/ml/checkpoints/", #default value is /opt/ml/checkpoints/
      "S3Uri": checkpoints_s3_uri
   },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": train_s3_uri,
                    "S3DataDistributionType": "FullyReplicated" 
                }
            },
            "ContentType": "text/plain",
            "CompressionType": "None"
        }
    ]
}

print(json.dumps(common_training_params, indent=4))
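Before submitting the request, a few settings can be sanity-checked locally. The sketch below is a hypothetical helper, not an AWS API: it relies on the documented rule that `MaxWaitTimeInSeconds` must be greater than or equal to `MaxRuntimeInSeconds`, and it treats a missing `CheckpointConfig` with spot training as a warning, since an interrupted job would then have nothing to resume from.

```python
def check_spot_settings(params):
    """Return a list of problems found in a create_training_job request dict."""
    problems = []
    stopping = params.get("StoppingCondition", {})
    max_run = stopping.get("MaxRuntimeInSeconds")
    max_wait = stopping.get("MaxWaitTimeInSeconds")
    if params.get("EnableManagedSpotTraining"):
        # The wait time covers both waiting for capacity and running
        if max_run is not None and max_wait is not None and max_wait < max_run:
            problems.append("MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds")
        if "CheckpointConfig" not in params:
            problems.append("no CheckpointConfig: an interrupted spot job cannot resume")
    return problems


example_params = {
    "EnableManagedSpotTraining": True,
    "StoppingCondition": {"MaxRuntimeInSeconds": 86400, "MaxWaitTimeInSeconds": 3600},
}
for problem in check_spot_settings(example_params):
    print("WARNING: " + problem)
```

In the notebook, the same check could be run as `check_spot_settings(common_training_params)` just before creating the job.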

**Create a training job**

In [None]:
%%time
sm.create_training_job(**common_training_params)

In [None]:
%%time
# monitor the training job
status = sm.describe_training_job(TrainingJobName=training_job_name)['TrainingJobStatus']
print(status)

sm.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=training_job_name)
status = sm.describe_training_job(TrainingJobName=training_job_name)['TrainingJobStatus']
print("Training job ended with status: " + status)
if status == 'Failed':
    message = sm.describe_training_job(TrainingJobName=training_job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')
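The waiter blocks silently until the job ends. If you would rather see progress while waiting, a simple polling loop over `describe_training_job` is an alternative. The following is a minimal sketch: `wait_for_training_job` is a hypothetical helper, the client is passed in as a parameter (for example the `sm` client created earlier), and the 60-second interval is an arbitrary choice.

```python
import time


def wait_for_training_job(client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status."""
    last_seen = None
    while True:
        info = client.describe_training_job(TrainingJobName=job_name)
        status = info["TrainingJobStatus"]
        secondary = info.get("SecondaryStatus")
        # Print only when something changed, to keep the output short
        if (status, secondary) != last_seen:
            print("{} / {}".format(status, secondary))
            last_seen = (status, secondary)
        if status in ("Completed", "Failed", "Stopped"):
            return info
        sleep(poll_seconds)
```

A call such as `info = wait_for_training_job(sm, training_job_name)` returns the final job description, from which `FailureReason` can be read when the status is `Failed`.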

## 2. Create a model from Training Job

Once the training is finished, we can retrieve the trained model.

In [None]:
#training_job_name = "ner-bert-base-cased-gpu-2020-06-29-09-09-18"
print(training_job_name)

**Note** that you can specify a different docker image for inference than the one used for training. <br>
For example, to use `CPU` instead of `GPU` resources at inference time, set it explicitly by changing the value of the `image` variable: <br>
`image_name = "classif-bert-base-uncased-cpu"` <br>
`image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, image_name)`

In [None]:
#### Uncomment if you want to use the CPU based image for Creating the model #####

#image_name = "classif-bert-base-multilingual-cased-cpu"
#image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, image_name)

######################################  End ########################################


# set the model name
model_name = training_job_name + '-m'
print("model_name : {}".format(model_name))

# get model artifacts location
info = sm.describe_training_job(TrainingJobName=training_job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print("model_data : {}".format(model_data))
    
primary_container = {
    'Image': image,
    'ModelDataUrl': model_data
}
print("primary_container : {}".format(primary_container))

# Create model
create_model_response = sm.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])
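Appending `-m` to a training job name that is already close to the length limit can produce a model name that is too long. As a hedged sketch, assuming that model names are limited to 63 characters like training job names, a helper such as the hypothetical `make_model_name` below trims the job name before adding the suffix.

```python
def make_model_name(training_job_name, suffix="-m", max_length=63):
    """Append suffix to the job name, trimming the front so the result fits."""
    keep = max_length - len(suffix)
    # Keep the tail (the timestamp) and avoid starting with a hyphen
    return training_job_name[-keep:].lstrip("-") + suffix


print(make_model_name("sm-transformers-gpu-2020-06-29-09-09-18"))
print(len(make_model_name("x" * 63)))
```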