<center><h1>Train model with SageMaker</h1></center>

## 1. Create a Training Job

In [None]:
import boto3
from time import strftime, gmtime
import json

In [3]:
## Set a sagemaker role  
try:
    # if you are on a sagemaker notebook instance
    import sagemaker
    role = sagemaker.get_execution_role()
except Exception:
    # if running locally, create a SageMaker execution role in the AWS console and assign it here
    iam = boto3.client('iam')
    role_name = "YOUR SAGEMAKER ROLE"
    role = iam.get_role(RoleName=role_name)['Role']['Arn']

If you are using a local notebook, make sure to replace the value of `role_name` with the name of your SageMaker execution role. <br>
To create a role, sign in to the [AWS Management Console](https://console.aws.amazon.com/iam/) and choose **Roles** in the left navigation pane.

In [4]:
region = boto3.Session().region_name # get the region name
account = boto3.Session().client('sts').get_caller_identity()['Account'] # get the account id
sm = boto3.Session().client('sagemaker') # create a SageMaker client
print("role: {}".format(role))
print("region: {}".format(region))
print("account: {}".format(account))

role: arn:aws:iam::612233423258:role/service-role/AmazonSageMaker-ExecutionRole-20200324T172595
region: eu-central-1
account: 612233423258


**Specify the data and model location** <br>
Please change the parameters in the following cells according to the location of your data and where you want to store the model artefacts.

In [5]:
# Data location
bucket_name = "sm-transformers-datasets" # Bucket name where the data is located
train_prefix = "data/dataset_multiclass_500" # folder of train data
models_prefix = "models" # folder where model will be saved
train_s3_uri = "s3://{}/{}".format(bucket_name, train_prefix)
models_s3_uri = "s3://{}/{}".format(bucket_name, models_prefix)
print("Train data location : {}".format(train_s3_uri))
print("Models data location : {}".format(models_s3_uri))

Train data location : s3://sm-transformers-datasets/data/dataset_multiclass_500
Models data location : s3://sm-transformers-datasets/models


**Specify the docker image name**

In [6]:
image_name = "sm-transformers-gpu"
image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, image_name)
print("image of model: {}".format(image))

image of model: 612233423258.dkr.ecr.eu-central-1.amazonaws.com/sm-transformers-gpu:latest


**Set the training job name**

To set the training job name explicitly, skip the following cell and assign your own value to `training_job_name`.

In [7]:
# Training job name 
training_job_name = "{}-{}".format(image_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
# SageMaker limits job names to 63 characters; keep the tail, which holds the timestamp
if len(training_job_name) > 63:
    training_job_name = training_job_name[-63:].lstrip("-")  # a name cannot start with a hyphen
print("training job name : {}".format(training_job_name))

training job name : sm-transformers-gpu-2021-02-17-09-55-35
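The naming rule can also be checked before submitting the job. The sketch below is a minimal example, assuming the pattern and 63-character limit given in the `CreateTrainingJob` API reference; `make_job_name` and `JOB_NAME_PATTERN` are hypothetical names that are not part of this notebook.

```python
import re
from time import gmtime, strftime

# Pattern and length limit as documented for SageMaker training job names
# (assumption: taken from the CreateTrainingJob API reference).
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(prefix, max_len=63):
    """Hypothetical helper: append a UTC timestamp, trimming the prefix if needed."""
    stamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
    # Reserve room for the timestamp and the joining hyphen.
    room = max_len - len(stamp) - 1
    trimmed = prefix[:room].strip("-")
    name = "{}-{}".format(trimmed, stamp)
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError("invalid training job name: {}".format(name))
    return name


print(make_job_name("sm-transformers-gpu"))
```

Unlike the cell above, this version trims the prefix rather than the start of the whole name, so the readable part of the name is kept as long as possible.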


**Set checkpoints path**

Optionally, you can use the name of an earlier training job here to resume it from its checkpoints.

In [8]:
checkpoints_s3_uri = "s3://{}/{}/{}/checkpoints".format(bucket_name, models_prefix, training_job_name) # replace training_job_name with an old job name to resume it
print("checkpoints will be saved in {}".format(checkpoints_s3_uri))

checkpoints will be saved in s3://sm-transformers-datasets/models/sm-transformers-gpu-2021-02-17-09-55-35/checkpoints
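To resume an earlier job, the checkpoint URI has to point at that job's folder. A minimal sketch, assuming the same folder layout as above; `checkpoints_uri` and `old_training_job` are hypothetical names, and whether training actually restarts from the checkpoint depends on the training container reading `/opt/ml/checkpoints`.

```python
def checkpoints_uri(bucket_name, models_prefix, job_name):
    """Build the S3 prefix that holds the checkpoints of a given training job."""
    return "s3://{}/{}/{}/checkpoints".format(bucket_name, models_prefix, job_name)


# Hypothetical earlier job whose checkpoints should be reused.
old_training_job = "sm-transformers-gpu-2021-02-17-09-55-35"
resume_uri = checkpoints_uri("sm-transformers-datasets", "models", old_training_job)
print(resume_uri)
```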


**Define Metrics**

In [9]:
## Metrics to follow during training (by parsing the logs!)
metrics = [
            {
            "Name": "training:epoch",
            "Regex": "'epoch': (.*?)}"
            },
            {
            "Name": "evaluation:loss",
            "Regex": "'eval_loss': (.*?),"
            },
            {
            "Name": "evaluation:accuracy",
            "Regex": "'eval_accuracy': (.*?)," # eval_mse(regression), eval_accuracy (classif), eval_accuracy_score(ner)
            }
        ]
metrics

[{'Name': 'training:epoch', 'Regex': "'epoch': (.*?)}"},
 {'Name': 'evaluation:loss', 'Regex': "'eval_loss': (.*?),"},
 {'Name': 'evaluation:accuracy', 'Regex': "'eval_accuracy': (.*?),"}]
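These patterns can be tried locally before launching a job. The sketch below applies them with Python's `re` module to a made-up log line; the line format is an assumption, since the real one depends on what the training container prints.

```python
import re

metric_regexes = {
    "training:epoch": "'epoch': (.*?)}",
    "evaluation:loss": "'eval_loss': (.*?),",
    "evaluation:accuracy": "'eval_accuracy': (.*?),",
}

# Hypothetical log line; the real format depends on the training container.
log_line = "{'eval_loss': 0.52, 'eval_accuracy': 0.81, 'epoch': 1.0}"

parsed = {}
for name, pattern in metric_regexes.items():
    match = re.search(pattern, log_line)
    if match:
        # The first capture group holds the metric value.
        parsed[name] = float(match.group(1))

print(parsed)
```

Note that the loss and accuracy patterns expect a trailing comma, so they would not match a key that is printed last in the dictionary.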

You can adjust or extend the hyperparameters of the model. <br>
For example, you can decrease the batch size if you get OOM errors, or increase/decrease the max sequence length.

In [10]:
## List of hyperparameters during training (optional)
hyperparameters = {
    "task_name": "classif",
    "model_name": "bert-base-uncased",
    "max_steps": "500",
    "use_bbox": "false",
    "per_device_train_batch_size": "10",
    "per_device_eval_batch_size": "10"
}
#allenai/longformer-base-4096
#bert-base-uncased
#microsoft/layoutlm-base-uncased
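All values in the cell above are strings because the `HyperParameters` field of the training request is a string-to-string map. A small sketch of a converter, with the hypothetical name `to_sagemaker_hyperparameters`, for cases where the values start out as Python numbers or booleans:

```python
def to_sagemaker_hyperparameters(params):
    """Convert Python values to the string-to-string map the API expects."""
    converted = {}
    for key, value in params.items():
        if isinstance(value, bool):
            # Lowercase booleans, matching the "false" style used above.
            converted[key] = str(value).lower()
        else:
            converted[key] = str(value)
    return converted


example = to_sagemaker_hyperparameters(
    {"task_name": "classif", "max_steps": 500, "use_bbox": False}
)
print(example)
```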

Pick an instance type for training

In [11]:
# GPU : ml.g4dn.xlarge (ml.g4dn.xlarge   cpu:4     gpu:1xT4     cpu-ram:16    gpu-ram:16         training/hour$0.822)
## classif: 
# bert : batch = 10 is ok (with text > 512) (70% GPU RAM busy)
# Longformer: batch = 2 ok if text size after tokenization is < 2048 / batch 1: ok till limit! (4096) (89% GPU RAM utilised)

## Token classif (ner)
# bert: batch 10 is ok (with text > 512) (70% GPU RAM busy)
# longformer: idem classif : batch = 2 ok if text size after tokenization is < 2048 / batch 1: ok till limit! (4096) (89% GPU RAM utilised)
# layoutlm: batch 10, same as bert: ok (with text > 512) (70% GPU RAM busy)

In [12]:
## GPU instance types to choose from
#name            CPUs   GPU     RAM  GPU-RAM  TrainingPrice/hour
#ml.p3.2xlarge    8    1xV100    61    16         $4.627         
#ml.p2.xlarge     4     1xK80    61    12         $1.361
#ml.g4dn.xlarge   4     1xT4     16    16         $0.822
#ml.g4dn.2xlarge  8     1xT4     32    16         $1.173  <-
#ml.g4dn.4xlarge  16    1xT4     64    16         $1.879
#ml.g4dn.8xlarge  32    1xT4     128   16         $3.396
#ml.g4dn.12xlarge 48    4xT4     192   64         $6.107
#ml.g4dn.16xlarge 64    1xT4     256   16         $6.794

instance_type = "ml.g4dn.xlarge" # "ml.c4.4xlarge" 
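The hourly prices in the table give a quick upper bound on the cost of a run. This is a rough sketch only: the prices are copied from the comments above and may be out of date, and `estimate_cost` is a hypothetical helper that ignores spot discounts and storage.

```python
# On-demand training prices per hour, copied from the table above (USD).
PRICE_PER_HOUR = {
    "ml.g4dn.xlarge": 0.822,
    "ml.g4dn.2xlarge": 1.173,
    "ml.p3.2xlarge": 4.627,
}


def estimate_cost(instance_type, runtime_seconds, instance_count=1):
    """Rough on-demand cost; spot discounts and storage are not included."""
    hours = runtime_seconds / 3600.0
    return PRICE_PER_HOUR[instance_type] * hours * instance_count


# Two hours on a single ml.g4dn.xlarge instance.
print("{:.3f}".format(estimate_cost("ml.g4dn.xlarge", 2 * 3600)))
```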

Specify some additional parameters for the training job:
- Training image
- Role ARN
- Model Location
- Instance type for the training job
- Data config:
    - Location for the training data (and potentially test data if needed)

In [13]:
#cpu: ml.c4.4xlarge (16 cpus)

common_training_params = \
{
    "TrainingJobName": training_job_name,
    "AlgorithmSpecification": {
        "TrainingImage": image,
        "TrainingInputMode": "File",
        "MetricDefinitions" : metrics
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": models_s3_uri
    },
    "TensorBoardOutputConfig": { 
      #"LocalPath": "/opt/ml/output/tensorboard", #default value is /opt/ml/output/tensorboard
      "S3OutputPath": models_s3_uri
    },
    "ResourceConfig": {
        "InstanceCount": 1,   
        "InstanceType": instance_type,
        "VolumeSizeInGB": 60
    },
    "HyperParameters": hyperparameters,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400,
        "MaxWaitTimeInSeconds": 86400
    },
    "EnableManagedSpotTraining": True,
    "CheckpointConfig": { 
      #"LocalPath": "/opt/ml/checkpoints/", #default value is /opt/ml/checkpoints/
      "S3Uri": checkpoints_s3_uri
   },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": train_s3_uri,
                    "S3DataDistributionType": "FullyReplicated" 
                }
            },
            "ContentType": "text/plain",
            "CompressionType": "None"
        }
    ]
}

print(json.dumps(common_training_params, indent=4))

{
    "TrainingJobName": "sm-transformers-gpu-2021-02-17-09-55-35",
    "AlgorithmSpecification": {
        "TrainingImage": "612233423258.dkr.ecr.eu-central-1.amazonaws.com/sm-transformers-gpu:latest",
        "TrainingInputMode": "File",
        "MetricDefinitions": [
            {
                "Name": "training:epoch",
                "Regex": "'epoch': (.*?)}"
            },
            {
                "Name": "evaluation:loss",
                "Regex": "'eval_loss': (.*?),"
            },
            {
                "Name": "evaluation:accuracy",
                "Regex": "'eval_accuracy': (.*?),"
            }
        ]
    },
    "RoleArn": "arn:aws:iam::612233423258:role/service-role/AmazonSageMaker-ExecutionRole-20200324T172595",
    "OutputDataConfig": {
        "S3OutputPath": "s3://sm-transformers-datasets/models"
    },
    "TensorBoardOutputConfig": {
        "S3OutputPath": "s3://sm-transformers-datasets/models"
    },
    "ResourceConfig": {
        "InstanceCount

**Create a training job**

In [14]:
%%time
sm.create_training_job(**common_training_params)

CPU times: user 15.5 ms, sys: 3.57 ms, total: 19.1 ms
Wall time: 588 ms


{'TrainingJobArn': 'arn:aws:sagemaker:eu-central-1:612233423258:training-job/sm-transformers-gpu-2021-02-17-09-55-35',
 'ResponseMetadata': {'RequestId': '1228fb38-14f5-419d-a96b-57ce7d9a4f08',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '1228fb38-14f5-419d-a96b-57ce7d9a4f08',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '117',
   'date': 'Wed, 17 Feb 2021 09:56:22 GMT'},
  'RetryAttempts': 0}}
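A request like the one above can be sanity-checked before it is submitted. The sketch below covers only the spot-training settings and reflects my reading of the API documentation (the wait time must be at least the run time); `check_spot_settings` is a hypothetical helper, not part of boto3.

```python
def check_spot_settings(params):
    """Return a list of problems found in a CreateTrainingJob-style request dict."""
    problems = []
    if params.get("EnableManagedSpotTraining"):
        stopping = params.get("StoppingCondition", {})
        max_run = stopping.get("MaxRuntimeInSeconds")
        max_wait = stopping.get("MaxWaitTimeInSeconds")
        if max_wait is None:
            problems.append("MaxWaitTimeInSeconds should be set with spot training")
        elif max_run is not None and max_wait < max_run:
            problems.append("MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds")
        if "CheckpointConfig" not in params:
            # Without checkpoints an interrupted spot job has nothing to resume from.
            problems.append("no CheckpointConfig for an interruptible job")
    return problems


sample = {
    "EnableManagedSpotTraining": True,
    "StoppingCondition": {"MaxRuntimeInSeconds": 86400, "MaxWaitTimeInSeconds": 86400},
    "CheckpointConfig": {"S3Uri": "s3://my-bucket/checkpoints"},  # placeholder URI
}
print(check_spot_settings(sample))
```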

In [15]:
%%time
# monitor the training job
status = sm.describe_training_job(TrainingJobName=training_job_name)['TrainingJobStatus']
print(status)

sm.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=training_job_name)
status = sm.describe_training_job(TrainingJobName=training_job_name)['TrainingJobStatus']
print("Training job ended with status: " + status)
if status == 'Failed':
    message = sm.describe_training_job(TrainingJobName=training_job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')

Completed
Training job ended with status: Completed
CPU times: user 37.3 ms, sys: 49.1 ms, total: 86.4 ms
Wall time: 713 ms


## 2. Create a model from Training Job

Once the training is finished, we can get the trained model

In [16]:
#training_job_name = "ner-bert-base-cased-gpu-2020-06-29-09-09-18"
print(training_job_name)

sm-transformers-gpu-2021-02-17-09-55-35


**Note** that the docker image used for inference can differ from the one used for training. <br>
For example, to serve the model on `CPU` instead of `GPU` resources, set the image variable explicitly: <br>
`image_name = "sm-transformers-cpu"` <br>
`image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, image_name)`

In [17]:
#### Uncomment if you want to use the CPU based image for Creating the model #####

#image_name = "sm-transformers-cpu"
#image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, image_name)

######################################  End ########################################


# set the model name
model_name = training_job_name + '-m'
print("model_name : {}".format(model_name))

# get model artifacts location
info = sm.describe_training_job(TrainingJobName=training_job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print("model_data : {}".format(model_data))
    
primary_container = {
    'Image': image,
    'ModelDataUrl': model_data
}
print("primary_container : {}".format(primary_container))

# Create model
create_model_response = sm.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

model_name : sm-transformers-gpu-2021-02-17-09-55-35-m
model_data : s3://sm-transformers-datasets/models/sm-transformers-gpu-2021-02-17-09-55-35/output/model.tar.gz
primary_container : {'Image': '612233423258.dkr.ecr.eu-central-1.amazonaws.com/sm-transformers-gpu:latest', 'ModelDataUrl': 's3://sm-transformers-datasets/models/sm-transformers-gpu-2021-02-17-09-55-35/output/model.tar.gz'}
arn:aws:sagemaker:eu-central-1:612233423258:model/sm-transformers-gpu-2021-02-17-09-55-35-m
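To inspect the model archive locally, the `model_data` URI first has to be split into a bucket and a key. A minimal sketch using only the standard library; `split_s3_uri` is a hypothetical helper.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# URI copied from the model_data output above.
bucket, key = split_s3_uri(
    "s3://sm-transformers-datasets/models/"
    "sm-transformers-gpu-2021-02-17-09-55-35/output/model.tar.gz"
)
print(bucket, key)
```

The resulting pair can then be passed to `boto3.client('s3').download_file(bucket, key, 'model.tar.gz')`.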


## 3. Serve the model

### 3.1 Create a Sagemaker Endpoint

Create the endpoint configuration

In [18]:
# set endpoint name
end_point_config_name = "end-point-config-name-{}".format(model_name)
end_point_config_name = end_point_config_name[max(len(end_point_config_name)-50,0):]
end_point_name = "point-{}".format(model_name)
end_point_name = end_point_name[max(len(end_point_name)-50,0):]
print("end_point_config_name: {}".format(end_point_config_name))
print("end_point_name: {}".format(end_point_name))

end_point_config_name: fig-name-sm-transformers-gpu-2021-02-17-09-55-35-m
end_point_name: point-sm-transformers-gpu-2021-02-17-09-55-35-m


In [19]:
instance_type = "ml.g4dn.2xlarge" 
create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName = end_point_config_name,
    ProductionVariants=[{
        'InstanceType': instance_type,
        'InitialVariantWeight': 1,
        'InitialInstanceCount': 1,
        'ModelName': model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Endpoint Config Arn: arn:aws:sagemaker:eu-central-1:612233423258:endpoint-config/fig-name-sm-transformers-gpu-2021-02-17-09-55-35-m


Create Endpoint

In [20]:
import time
create_endpoint_response = sm.create_endpoint(
    EndpointName=end_point_name,
    EndpointConfigName=end_point_config_name)
print("Endpoint Arn: {}".format(create_endpoint_response['EndpointArn']))

resp = sm.describe_endpoint(EndpointName=end_point_name)
status = resp['EndpointStatus']
print("Status: " + status)

Endpoint Arn: arn:aws:sagemaker:eu-central-1:612233423258:endpoint/point-sm-transformers-gpu-2021-02-17-09-55-35-m
Status: Creating


In [21]:
while status=='Creating':
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=end_point_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:eu-central-1:612233423258:endpoint/point-sm-transformers-gpu-2021-02-17-09-55-35-m
Status: InService
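The loop above runs until the status leaves `Creating`, with no upper bound on the waiting time. A generic polling helper with a timeout could look like the sketch below; `wait_for_status` is a hypothetical name, and the describe call is replaced by a stand-in so that the example runs without AWS access.

```python
import time


def wait_for_status(describe, done_statuses, poll_seconds=30,
                    timeout_seconds=3600, sleep=time.sleep):
    """Call describe() until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = describe()
        if status in done_statuses:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("still {} after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for a real describe call, so the sketch runs offline.
fake_statuses = iter(["Creating", "Creating", "InService"])
final = wait_for_status(
    lambda: next(fake_statuses),
    done_statuses={"InService", "Failed"},
    sleep=lambda seconds: None,  # skip the real sleep in this demo
)
print(final)
```

With a real client, `describe` could be `lambda: sm.describe_endpoint(EndpointName=end_point_name)['EndpointStatus']`. boto3 also ships an `endpoint_in_service` waiter, used the same way as the training job waiter earlier in this notebook.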


### 3.2 Invoke Endpoint

With **JSON** input and **JSON** output

In [25]:
import json

data = {"data":[{'text': 'hello I am Alexis'}, 
                {'text': "how are you"},
                {'text': "are you doing fine??"}]}
query_body = json.dumps(data)


runtime_client = boto3.client('sagemaker-runtime')
response = runtime_client.invoke_endpoint(EndpointName = end_point_name,
                                 ContentType = 'application/json', 
                                 #Accept='application/json', # by default it will return json
                                 Body = query_body)
result = response['Body'].read().decode('utf-8') # utf-8 also handles non-ASCII text
print(result)

{"predictions": [{"pred": "tech", "proba": 0.553991973400116}, {"pred": "tech", "proba": 0.3334293067455292}, {"pred": "tech", "proba": 0.5057740211486816}]}
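The returned string is plain JSON, so it can be turned back into Python objects. A short sketch that parses the response shown above and extracts the labels; the `pred` and `proba` field names are taken from that output.

```python
import json

# Response body copied from the output above.
result = ('{"predictions": [{"pred": "tech", "proba": 0.553991973400116}, '
          '{"pred": "tech", "proba": 0.3334293067455292}, '
          '{"pred": "tech", "proba": 0.5057740211486816}]}')

predictions = json.loads(result)["predictions"]
labels = [p["pred"] for p in predictions]
# Prediction with the highest probability.
best = max(predictions, key=lambda p: p["proba"])
print(labels, best["proba"])
```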


With **JSONLINES** input and **JSONLINES** output

In [27]:
import json

data = [{'text': 'hello I am Alexis'}, 
        {'text': "how are you"},
        {'text': "are you doing fine??"}]
query_body = "\n".join([json.dumps(d) for d in data])

runtime_client = boto3.client('sagemaker-runtime')
response = runtime_client.invoke_endpoint(EndpointName = end_point_name,
                                 ContentType = 'application/jsonlines', 
                                 Accept='application/jsonlines',
                                 Body = query_body)
result = response['Body'].read().decode('utf-8') # utf-8 also handles non-ASCII text
print(result)

{"predictions": [{"pred": "tech", "proba": 0.553991973400116}, {"pred": "tech", "proba": 0.3334293067455292}, {"pred": "tech", "proba": 0.5057740211486816}]}
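JSON Lines is simply one JSON document per line, so encoding and decoding it takes a few lines of code. A minimal sketch with the hypothetical helpers `to_jsonlines` and `from_jsonlines`; whether the container answers with one object per line or with a single document, as in the output above, depends on the serving code.

```python
import json


def to_jsonlines(records):
    """Serialize a list of dicts as one JSON document per line."""
    return "\n".join(json.dumps(r) for r in records)


def from_jsonlines(text):
    """Parse JSON Lines text back into a list, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = [{"text": "hello I am Alexis"}, {"text": "how are you"}]
body = to_jsonlines(records)
print(body)
print(from_jsonlines(body) == records)
```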


### 3.3 Batch Transform

| ContentType | Recommended SplitType |
| --- | --- |
| application/jsonlines | Line |
| application/json | None |



| Accept | Recommended AssembleWith |
| --- | --- |
| application/jsonlines | Line |
| application/json | None |
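The two tables can be written down as lookups, so that the split and assemble settings always follow the chosen content types. A small sketch; `transform_io_settings` is a hypothetical helper whose keys mirror the transform request fields used later in this notebook.

```python
# Recommended pairings from the two tables above.
SPLIT_TYPE_FOR = {"application/jsonlines": "Line", "application/json": "None"}
ASSEMBLE_WITH_FOR = {"application/jsonlines": "Line", "application/json": "None"}


def transform_io_settings(content_type, accept):
    """Return the recommended SplitType and AssembleWith for the given MIME types."""
    return {
        "ContentType": content_type,
        "SplitType": SPLIT_TYPE_FOR[content_type],
        "Accept": accept,
        "AssembleWith": ASSEMBLE_WITH_FOR[accept],
    }


settings = transform_io_settings("application/jsonlines", "application/jsonlines")
print(settings)
```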

With **JSONLINES** files as input

To perform a batch transform, you need an S3 location (folder) containing one or multiple .jsonl files:

```
s3_folder/
    - file1.jsonl
    - file2.jsonl
    - file3.jsonl
    ...
```


Each file must be in the **jsonlines** format, that is, every line is a JSON object such as:


```
{"text": "hello I am someone who wants a prediction"}
```
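Such files can be produced with the standard library. The sketch below writes one `.jsonl` file to a temporary folder and reads it back; `write_jsonl` is a hypothetical helper and the second sentence is a made-up example.

```python
import json
import os
import tempfile


def write_jsonl(path, texts):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for text in texts:
            handle.write(json.dumps({"text": text}) + "\n")


folder = tempfile.mkdtemp()
path = os.path.join(folder, "file1.jsonl")
write_jsonl(path, ["hello I am someone who wants a prediction", "how are you"])

# Read the file back to check that every line is valid JSON.
with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines[0])
```

The file can then be uploaded to the input folder, for example with `boto3.client('s3').upload_file(path, bucket_name, key)`, where `key` is the target object key.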

In [41]:
# Data location
bucket_name = "sm-transformers-datasets" 
input_prefix = "batch_data/dataset_multiclass_500/jsonl"
output_prefix = "batch_output/dataset_multiclass_500/jsonl_assembleNone"
batch_input_s3_uri = "s3://{}/{}".format(bucket_name, input_prefix)
batch_output_s3_uri = "s3://{}/{}".format(bucket_name, output_prefix)
print("batch input uri: {}".format(batch_input_s3_uri))
print("batch output uri: {}".format(batch_output_s3_uri))

batch input uri: s3://sm-transformers-datasets/batch_data/dataset_multiclass_500/jsonl
batch output uri: s3://sm-transformers-datasets/batch_output/dataset_multiclass_500/jsonl_assembleNone


In [42]:
batch_job_name = "batch-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
batch_job_name = batch_job_name[max(len(batch_job_name)-50,0):]
print("batch job name: {}".format(batch_job_name))

batch job name: batch-2021-02-17-11-34-53


In [46]:
content_type_input = "application/jsonlines"
max_payload_in_mb = 2
batch_strategy = "MultiRecord" # MultiRecord | SingleRecord
split_type = "Line" # None | Line
# None : input data files are not split, and request payloads contain the entire contents of an input object
# Line : depends on the values of the BatchStrategy and MaxPayloadInMB parameters. 
# If BatchStrategy = MultiRecord, Amazon SageMaker sends the maximum number of records in each request, 
# up to the MaxPayloadInMB limit. 
# If BatchStrategy = SingleRecord, Amazon SageMaker sends individual records in each request.
content_type_output = "application/jsonlines"
assemble_with = "Line" # None | Line
# To concatenate the results in binary format, specify None. 
# To add a newline character at the end of every transformed record, specify Line.

## NB: BatchStrategy=MultiRecord + SplitType=Line + AssembleWith=Line gives one output file per input file,
## with one result line per input record (provided the model returns one result per record)
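The difference between the two batch strategies is easier to see on a toy example. The sketch below is a simplified model of the behaviour described in the comments above, not SageMaker's actual implementation: `pack_records` is a hypothetical function that greedily groups lines into payloads under a size limit.

```python
def pack_records(lines, max_payload_bytes):
    """Greedily group lines into payloads that stay under max_payload_bytes."""
    batches, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if current and size + line_size > max_payload_bytes:
            # The current payload is full: close it and start a new one.
            batches.append(current)
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        batches.append(current)
    return batches


lines = ['{"text": "a"}', '{"text": "bb"}', '{"text": "ccc"}']
multi = pack_records(lines, max_payload_bytes=32)   # MultiRecord-like grouping
single = [[line] for line in lines]                 # SingleRecord: one line per request
print(len(multi), len(single))
```

In this toy model, a larger limit means fewer requests per file, which is the trade-off controlled by `max_payload_in_mb` in the cell above.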

In [47]:
# gpu: ml.p2.xlarge
# cpu: ml.m4.xlarge

request = \
{
    "TransformJobName": batch_job_name,
    "ModelName": model_name,
    "BatchStrategy": batch_strategy,
    "MaxPayloadInMB": max_payload_in_mb,
    "ModelClientConfig": { 
      "InvocationsTimeoutInSeconds": 120
       },
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": batch_input_s3_uri 
            }
        },
        "ContentType": content_type_input,
        "SplitType": split_type,
        "CompressionType": "None"
    },
    "TransformOutput": {
        "S3OutputPath": batch_output_s3_uri,
        "Accept": content_type_output,
        "AssembleWith": assemble_with
    },
    "TransformResources": {
            "InstanceType": "ml.p2.xlarge",
            "InstanceCount": 1
    }
}

In [48]:
sm.create_transform_job(**request)

{'TransformJobArn': 'arn:aws:sagemaker:eu-central-1:612233423258:transform-job/batch-2021-02-17-11-34-53',
 'ResponseMetadata': {'RequestId': 'a86aef69-d95b-4d87-ba4b-01cee12a4cb7',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'a86aef69-d95b-4d87-ba4b-01cee12a4cb7',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '105',
   'date': 'Wed, 17 Feb 2021 11:35:15 GMT'},
  'RetryAttempts': 0}}

In [39]:
import time 

while True:
    response = sm.describe_transform_job(TransformJobName=batch_job_name)
    status = response['TransformJobStatus']
    # 'Stopped' is also a terminal status, so stop polling on it as well
    if status in ('Completed', 'Stopped'):
        print("Transform job ended with status: " + status)
        break
    if status == 'Failed':
        message = response['FailureReason']
        print('Transform failed with the following error: {}'.format(message))
        raise Exception('Transform job failed')
    print("Transform job is still in status: " + status)
    time.sleep(30)

Transform job is still in status: InProgress
Transform job is still in status: InProgress
Transform job is still in status: InProgress
Transform job is still in status: InProgress
Transform job is still in status: InProgress
Transform job is still in status: InProgress


KeyboardInterrupt: 
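Once the transform job has completed, the results can be located from the input file names. The sketch below assumes the usual batch transform convention of writing each result under the output path with `.out` appended to the input file name; `transform_output_uri` is a hypothetical helper and `file1.jsonl` is the illustrative file name from the folder listing above.

```python
import posixpath


def transform_output_uri(input_uri, output_prefix_uri):
    """Guess where the result for one input object is written (assumes the '.out' suffix)."""
    file_name = posixpath.basename(input_uri)
    return "{}/{}.out".format(output_prefix_uri.rstrip("/"), file_name)


uri = transform_output_uri(
    "s3://sm-transformers-datasets/batch_data/dataset_multiclass_500/jsonl/file1.jsonl",
    "s3://sm-transformers-datasets/batch_output/dataset_multiclass_500/jsonl_assembleNone",
)
print(uri)
```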