# Build a BERT SageMaker Pipeline

https://github.com/kubeflow/pipelines/blob/master/samples/contrib/aws-samples/mnist-kmeans-sagemaker/mnist-classification-pipeline.py

https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/05_Kubeflow_Pipeline/05_04_Pipeline_SageMaker.ipynb

## Install AWS Python SDK (`boto3`)

In [1]:
!pip install boto3



## Install Kubeflow Pipelines SDK

In [2]:
!pip install https://storage.googleapis.com/ml-pipeline/release/0.1.29/kfp.tar.gz --upgrade

Collecting https://storage.googleapis.com/ml-pipeline/release/0.1.29/kfp.tar.gz
  Using cached https://storage.googleapis.com/ml-pipeline/release/0.1.29/kfp.tar.gz
Building wheels for collected packages: kfp
  Building wheel for kfp (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-4l0092do/wheels/81/b7/33/00ef9dd992b13add014c4875a2c130d9d70288127a793c4af6
Successfully built kfp
Installing collected packages: kfp
  Found existing installation: kfp 0.1.29
    Uninstalling kfp-0.1.29:
      Successfully uninstalled kfp-0.1.29
Successfully installed kfp-0.1.29


In [None]:
# Restart the kernel to pick up pip installed libraries
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [1]:
import boto3

AWS_REGION_AS_SLIST=!curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/'
AWS_REGION = AWS_REGION_AS_SLIST.s
print('Region: {}'.format(AWS_REGION))

AWS_ACCOUNT_ID=boto3.client('sts').get_caller_identity().get('Account')
print('Account ID: {}'.format(AWS_ACCOUNT_ID))

S3_BUCKET='sagemaker-{}-{}'.format(AWS_REGION, AWS_ACCOUNT_ID)
print('S3 Bucket: {}'.format(S3_BUCKET))

Region: us-east-1
Account ID: 835319576252
S3 Bucket: sagemaker-us-east-1-835319576252
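
The cell above derives the region by stripping the trailing zone letter from the availability zone reported by the instance metadata endpoint. As a rough sketch, the same string handling can be written in plain Python, which may be easier to reuse outside a notebook. The helper names `region_from_availability_zone` and `default_bucket_name` are hypothetical and not part of `boto3`.

```python
import re


def region_from_availability_zone(availability_zone):
    """Drop the trailing zone letter, e.g. 'us-east-1a' -> 'us-east-1'."""
    # Same effect as the sed expression in the cell above for well-formed
    # availability zone names (a region name followed by one lowercase letter).
    return re.sub(r"^(.*)[a-z]$", r"\1", availability_zone.strip())


def default_bucket_name(region, account_id):
    """Build the 'sagemaker-<region>-<account-id>' name used in this notebook."""
    return "sagemaker-{}-{}".format(region, account_id)


region = region_from_availability_zone("us-east-1a")
print(region)
print(default_bucket_name(region, "835319576252"))
```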


## Copy `data` and `valid_data.csv` into your S3 bucket.

In [2]:
!aws s3 cp s3://kubeflow-pipeline-data/mnist_kmeans_example/data s3://$S3_BUCKET/mnist_kmeans_example/data
!aws s3 cp s3://kubeflow-pipeline-data/mnist_kmeans_example/input/valid_data.csv s3://$S3_BUCKET/mnist_kmeans_example/input/

copy: s3://kubeflow-pipeline-data/mnist_kmeans_example/data to s3://sagemaker-us-east-1-835319576252/mnist_kmeans_example/data
copy: s3://kubeflow-pipeline-data/mnist_kmeans_example/input/valid_data.csv to s3://sagemaker-us-east-1-835319576252/mnist_kmeans_example/input/valid_data.csv


# Build Pipeline

In [3]:
import kfp
from kfp import components
from kfp import dsl
from kfp.aws import use_aws_secret

In [4]:
sagemaker_process_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/3ebd075212e0a761b982880707ec497c36a99d80/components/aws/sagemaker/process/component.yaml')


In [5]:
sagemaker_train_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/3ebd075212e0a761b982880707ec497c36a99d80/components/aws/sagemaker/train/component.yaml')


In [6]:
sagemaker_model_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/3ebd075212e0a761b982880707ec497c36a99d80/components/aws/sagemaker/model/component.yaml')


In [7]:
sagemaker_deploy_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/3ebd075212e0a761b982880707ec497c36a99d80/components/aws/sagemaker/deploy/component.yaml')


In [8]:
SAGEMAKER_ROLE_ARN='arn:aws:iam::{}:role/TeamRole'.format(AWS_ACCOUNT_ID)

# Configure your S3 bucket.
S3_PIPELINE_PATH='s3://{}/bert-kubeflow-pipeline'.format(S3_BUCKET)

# TODO:  Implement the other region checks
# Only us-west-2 and us-east-1 are handled; AWS_ECR_REGISTRY stays None in any other region.
AWS_ECR_REGISTRY=None
if AWS_REGION == 'us-west-2':
    AWS_ECR_REGISTRY='174872318107.dkr.ecr.us-west-2.amazonaws.com'
elif AWS_REGION == 'us-east-1':
    AWS_ECR_REGISTRY='382416733822.dkr.ecr.us-east-1.amazonaws.com'

# Setup Pre-Processing Code 

In [9]:
processing_code_s3_uri = 's3://{}/processing_code/preprocess-scikit-text-to-bert.py'.format(S3_BUCKET)
print(processing_code_s3_uri)

!aws s3 cp ./preprocess-scikit-text-to-bert.py $processing_code_s3_uri


s3://sagemaker-us-east-1-835319576252/processing_code/preprocess-scikit-text-to-bert.py
upload: ./preprocess-scikit-text-to-bert.py to s3://sagemaker-us-east-1-835319576252/processing_code/preprocess-scikit-text-to-bert.py


# Setup Training Code

In [10]:
!tar -cvzf sourcedir.tar.gz -C code .

./
./inference.py
./requirements.txt
./tf_bert_reviews.py


In [11]:
training_code_s3_uri = 's3://{}/training_code/'.format(S3_BUCKET)
print(training_code_s3_uri)

!aws s3 cp sourcedir.tar.gz $training_code_s3_uri

s3://sagemaker-us-east-1-835319576252/training_code/
upload: ./sourcedir.tar.gz to s3://sagemaker-us-east-1-835319576252/training_code/sourcedir.tar.gz


In [12]:
def processing_input(input_name, s3_uri, local_path):
    return {
        "InputName": input_name,
        "S3Input": {
            "S3Uri": s3_uri,
            "LocalPath": local_path,
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
        },
    }

def processing_output(output_name, s3_uri, local_path):
    return {
        "OutputName": output_name,
        "S3Output": {
            "S3Uri": s3_uri,
            "LocalPath": local_path,
            "S3UploadMode": "EndOfJob",
        },
    }
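
The pipeline definition later in this notebook carries a `TODO:  Add ShardedByS3Key` note on the raw input. In the SageMaker `CreateProcessingJob` API, an `S3Input` accepts an `S3DataDistributionType` field with the values `FullyReplicated` and `ShardedByS3Key`. A minimal sketch of a helper that sets it follows. The name `sharded_processing_input` and the bucket `my-bucket` are hypothetical, and whether the pinned Kubeflow component forwards this field unchanged is an assumption that should be checked.

```python
def sharded_processing_input(input_name, s3_uri, local_path,
                             distribution="ShardedByS3Key"):
    """Like processing_input, plus an explicit S3 data distribution type."""
    if distribution not in ("FullyReplicated", "ShardedByS3Key"):
        raise ValueError("Unsupported distribution: {}".format(distribution))
    return {
        "InputName": input_name,
        "S3Input": {
            "S3Uri": s3_uri,
            "LocalPath": local_path,
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
            # With ShardedByS3Key, each instance receives a subset of the
            # objects under the prefix instead of a full copy.
            "S3DataDistributionType": distribution,
        },
    }


raw_input = sharded_processing_input(
    "raw_input", "s3://my-bucket/amazon-reviews-pds/tsv/", "/opt/ml/processing/input"
)
print(raw_input["S3Input"]["S3DataDistributionType"])
```

Sharding only makes sense for the data input. The `code` input should stay fully replicated so that every processing instance has a copy of the script.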

In [13]:
def training_input(input_name, s3_uri):
    return {
        "ChannelName": input_name,
        "DataSource": {"S3DataSource": {"S3Uri": s3_uri, "S3DataType": "S3Prefix"}},
    }

# Setup Pipeline

In [21]:
@dsl.pipeline(
    name="BERT Pipeline",
    description="BERT Pipeline",
)
def bert_pipeline(role_arn=SAGEMAKER_ROLE_ARN, bucket_name=S3_BUCKET, region=AWS_REGION):
    
    processing_image='763104351884.dkr.ecr.{}.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04'.format(region)
    train_image='763104351884.dkr.ecr.{}.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04'.format(region)
    serve_image='763104351884.dkr.ecr.{}.amazonaws.com/tensorflow-inference:1.15.2-cpu'.format(region)

    import time
    pipeline_name = 'kubeflow-pipeline-sagemaker-{}'.format(int(time.time()))

    network_isolation=False

    max_seq_length=64
    train_split_percentage=0.90
    validation_split_percentage=0.05
    test_split_percentage=0.05
    balance_dataset=True
    processing_instance_count=2
    processing_instance_type='ml.c5.2xlarge'

    raw_input_data_s3_uri = 's3://{}/amazon-reviews-pds/tsv/'.format(S3_BUCKET)

    processed_train_data_s3_uri = 's3://{}/{}/processing/output/bert-train'.format(S3_BUCKET, pipeline_name)
    processed_validation_data_s3_uri = 's3://{}/{}/processing/output/bert-validation'.format(S3_BUCKET, pipeline_name)
    processed_test_data_s3_uri = 's3://{}/{}/processing/output/bert-test'.format(S3_BUCKET, pipeline_name)

    # Run the pre-processing script as a SageMaker Processing job
    process = sagemaker_process_op(
        role=role_arn,
        region=region,
        image=processing_image, #"763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-cpu-py36-ubuntu16.04",
        network_isolation=network_isolation,
        instance_type=processing_instance_type,
        instance_count=processing_instance_count,
        container_arguments=['--train-split-percentage', str(train_split_percentage),
                             '--validation-split-percentage', str(validation_split_percentage),
                             '--test-split-percentage', str(test_split_percentage),
                             '--max-seq-length', str(max_seq_length),
                             '--balance-dataset', str(balance_dataset)],
        container_entrypoint=[
            "python3",
            "/opt/ml/processing/input/code/preprocess-scikit-text-to-bert.py",
        ],
        input_config=[
            processing_input(
                "raw_input",
                "{}".format(raw_input_data_s3_uri),
                "/opt/ml/processing/input",
                # TODO:  Add ShardedByS3Key                
            ),
            processing_input(
                "code",
                "{}".format(processing_code_s3_uri),
                "/opt/ml/processing/input/code",
            ),
        ],
        output_config=[
            processing_output(
                "bert-train",
                "{}".format(processed_train_data_s3_uri),
                "/opt/ml/processing/output/bert/train",
                # TODO:  Add EndOfJob                
            ),
            processing_output(
                "bert-validation",
                "{}".format(processed_validation_data_s3_uri),
                "/opt/ml/processing/output/bert/validation",
                # TODO:  Add EndOfJob
            ),
            processing_output(
                "bert-test",
                "{}".format(processed_test_data_s3_uri),
                "/opt/ml/processing/output/bert/test",
                # TODO:  Add EndOfJob                
            ),
        ],
    )
    
    train_channels = [
        training_input("train", 
                       processed_train_data_s3_uri
                       # TODO:  Add ShardedByS3Key                
        ),
        training_input("validation", 
                       processed_validation_data_s3_uri
                       # TODO:  Add ShardedByS3Key
        ),                       
        training_input("test", 
                       processed_test_data_s3_uri
                       # TODO:  Add ShardedByS3Key                       
        )
    ]

    train_output_location = "s3://{}/{}/output".format(S3_BUCKET, pipeline_name)

    epochs=3
    learning_rate=0.00001
    epsilon=0.00000001
    train_batch_size=128
    validation_batch_size=128
    test_batch_size=128
    train_steps_per_epoch=100
    validation_steps=100
    test_steps=100
    train_volume_size=1024
    use_xla=True
    use_amp=True
    freeze_bert_layer=False
    enable_sagemaker_debugger=False
    enable_checkpointing=False
    enable_tensorboard=False
    input_mode='Pipe'
    run_validation=True
    run_test=True
    run_sample_predictions=True
    
    metrics_definitions = [
        {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
        {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
        {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
        {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
    ]
    
    train_instance_count=1
    train_instance_type='ml.c5.9xlarge'
    
    # .after(process) is explicitly appended below
    training = sagemaker_train_op(
        region=region,
        image=train_image,
        network_isolation=network_isolation,        
        instance_type=train_instance_type,
        instance_count=train_instance_count,
        hyperparameters={'epochs': '{}'.format(epochs),
                         'learning_rate': '{}'.format(learning_rate),
                         'epsilon': '{}'.format(epsilon),
                         'train_batch_size': '{}'.format(train_batch_size),
                         'validation_batch_size': '{}'.format(validation_batch_size),
                         'test_batch_size': '{}'.format(test_batch_size),                                             
                         'train_steps_per_epoch': '{}'.format(train_steps_per_epoch),
                         'validation_steps': '{}'.format(validation_steps),
                         'test_steps': '{}'.format(test_steps),
                         'use_xla': '{}'.format(use_xla),
                         'use_amp': '{}'.format(use_amp),                                             
                         'max_seq_length': '{}'.format(max_seq_length),
                         'freeze_bert_layer': '{}'.format(freeze_bert_layer),
                         'enable_sagemaker_debugger': '{}'.format(enable_sagemaker_debugger),
                         'enable_checkpointing': '{}'.format(enable_checkpointing),
                         'enable_tensorboard': '{}'.format(enable_tensorboard),                                        
                         'run_validation': '{}'.format(run_validation),
                         'run_test': '{}'.format(run_test),
                         'run_sample_predictions': '{}'.format(run_sample_predictions)
                        },
        training_input_mode=input_mode,    
        channels=train_channels,        
        model_artifact_path=train_output_location,
        # TODO:  Add metric definitions and overcome this error
        # for key, val in args['metric_definitions'].items():
        # AttributeError: 'list' object has no attribute 'items'        
#        metric_definitions=metrics_definitions,
        # TODO:  Add rules
        role=role_arn,
        
    ).after(process)

    # .after(training) is implied because we depend on training.outputs[]
    create_model = sagemaker_model_op(
        region=region,
        model_name=training.outputs["job_name"],
        image=serve_image, # training.outputs["training_image"],
        model_artifact_url=training.outputs["model_artifact_url"],
        role=role_arn,
    )

    deploy_instance_count=1
    deploy_instance_type='ml.m5.4xlarge'

    # .after(create_model) is implied because we depend on create_model.output
    sagemaker_deploy_op(
        region=region,
        model_name_1=create_model.output,
        instance_type_1=deploy_instance_type,
        initial_instance_count_1=deploy_instance_count        
    )
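
The commented-out `metric_definitions` argument in the training step fails because the component calls `.items()` on the value, as the quoted `AttributeError` shows. That suggests the component expects a mapping of metric name to regex rather than a list of `Name`/`Regex` dictionaries. One possible workaround is sketched below, assuming the component accepts such a dict (this has not been verified against the pinned component version). The helper name `metric_definitions_as_dict` is hypothetical.

```python
def metric_definitions_as_dict(definitions):
    """Turn [{'Name': n, 'Regex': r}, ...] into {n: r, ...}."""
    return {item["Name"]: item["Regex"] for item in definitions}


metrics_definitions = [
    {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]

# A dict has .items(), which is what the failing line in the component needs.
metric_definitions = metric_definitions_as_dict(metrics_definitions)
print(metric_definitions)
```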


# Compile Kubeflow Pipeline

In [22]:
kfp.compiler.Compiler().compile(bert_pipeline, 'bert-pipeline.zip')

In [23]:
!ls -al ./bert-pipeline.zip

-rw-r--r-- 1 root users 2240 Aug 29 05:14 ./bert-pipeline.zip


In [24]:
!unzip -o ./bert-pipeline.zip

Archive:  ./bert-pipeline.zip
  inflating: pipeline.yaml           
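
As the `unzip` output shows, the compiled package is a zip archive containing a single `pipeline.yaml`. If shell tools are unavailable, the same file can be read with the Python standard library. This is a small sketch; the function name `read_pipeline_yaml` is hypothetical.

```python
import os
import zipfile


def read_pipeline_yaml(package_path, member="pipeline.yaml"):
    """Return the text of the compiled workflow stored inside the package."""
    with zipfile.ZipFile(package_path) as package:
        return package.read(member).decode("utf-8")


# Print the start of the workflow only if the package was compiled here.
if os.path.exists("bert-pipeline.zip"):
    print(read_pipeline_yaml("bert-pipeline.zip")[:500])
```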


In [25]:
!cat pipeline.yaml

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    pipelines.kubeflow.org/pipeline_spec: '{"description": "BERT Pipeline", "inputs":
      [{"default": "arn:aws:iam::835319576252:role/TeamRole", "name": "role_arn"},
      {"default": "sagemaker-us-east-1-835319576252", "name": "bucket_name"}, {"default":
      "us-east-1", "name": "region"}], "name": "BERT Pipeline"}'
  generateName: bert-pipeline-
spec:
  arguments:
    parameters:
    - name: role-arn
      value: arn:aws:iam::835319576252:role/TeamRole
    - name: bucket-name
      value: sagemaker-us-east-1-835319576252
    - name: region
      value: us-east-1
  entrypoint: bert-pipeline
  serviceAccountName: pipeline-runner
  templates:
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: region
            value: '{{inputs.parameters.region}}'
          - name: role-arn
            value: '{{inputs.parameters.role-arn}}'
          - name: sagemake

# Launch Pipeline on Kubernetes Cluster

In [26]:
client = kfp.Client()

aws_experiment = client.create_experiment(name='aws')

my_run = client.run_pipeline(aws_experiment.id, 
                             'bert-pipeline', 
                             'bert-pipeline.zip')

## Training

_Note:  The above training job may take 5-10 minutes.  Please be patient._

In the meantime, open the SageMaker Console to monitor the progress of your training job.

![SageMaker Training Job Console](img/sagemaker-training-job-console.png)

## Get the Name of the Deployed Prediction Endpoint
First, get the name of the newly deployed SageMaker prediction endpoint.

Open the AWS console, go to the SageMaker service, and find the endpoint name as shown in the following picture.

![download-pipeline](images/sm-endpoint.jpg)
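
The endpoint name can also be looked up programmatically instead of through the console. The sketch below uses the `list_endpoints` call of the `boto3` SageMaker client and assumes that the most recently created endpoint in the region is the one deployed by this pipeline, which may not hold in a shared account. The function name `latest_endpoint_name` is hypothetical.

```python
def latest_endpoint_name(sagemaker_client):
    """Return the most recently created endpoint name, or None if there is none."""
    response = sagemaker_client.list_endpoints(
        SortBy="CreationTime", SortOrder="Descending", MaxResults=1
    )
    endpoints = response.get("Endpoints", [])
    return endpoints[0]["EndpointName"] if endpoints else None


# In the notebook (requires AWS credentials):
#   import boto3
#   ENDPOINT_NAME = latest_endpoint_name(
#       boto3.client('sagemaker', region_name=AWS_REGION))
```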

# Make a Prediction

# _YOU MUST COPY/PASTE THE `ENDPOINT_NAME` BEFORE CONTINUING_
Make sure to preserve the single quotes as shown below.

In [20]:
# import pickle, gzip, numpy, urllib.request, json
# from urllib.parse import urlparse
# import json
# import io
# import boto3

# #################################
# #################################
# # Replace ENDPOINT_NAME with the endpoint name in the SageMaker console.
# # Surround with single quotes.
# ENDPOINT_NAME= # 'Endpoint-<your-endpoint-name>'
# #################################
# #################################

# # Load the dataset
# urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
# with gzip.open('mnist.pkl.gz', 'rb') as f:
#     train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

# # Simple function to create a csv from our numpy array
# def np2csv(arr):
#     csv = io.BytesIO()
#     numpy.savetxt(csv, arr, delimiter=',', fmt='%g')
#     return csv.getvalue().decode().rstrip()

# runtime = boto3.Session(region_name=AWS_REGION).client('sagemaker-runtime')

# payload = np2csv(train_set[0][30:31])

# response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
#                                    ContentType='text/csv',
#                                    Body=payload)
# result = json.loads(response['Body'].read().decode())
# print(result)

## Clean up

Go to the SageMaker console and delete the `endpoint` and the `model`.
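
The same cleanup can be scripted with the `boto3` SageMaker client. The sketch below also removes the endpoint configuration, which goes beyond the console steps described here, and it looks up the model names from that configuration before deleting anything. The function name `delete_endpoint_resources` is hypothetical.

```python
def delete_endpoint_resources(sagemaker_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models it references."""
    endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sagemaker_client.describe_endpoint_config(
        EndpointConfigName=config_name
    )
    model_names = [v["ModelName"] for v in config["ProductionVariants"]]

    # Delete in dependency order: endpoint, then its config, then the models.
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sagemaker_client.delete_model(ModelName=model_name)
    return model_names


# In the notebook (requires AWS credentials):
#   import boto3
#   delete_endpoint_resources(
#       boto3.client('sagemaker', region_name=AWS_REGION), ENDPOINT_NAME)
```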