# MNIST classification pipeline using SageMaker

The `mnist-classification-pipeline.py` sample runs a pipeline that trains a classification model using K-Means on the MNIST dataset with SageMaker.

All required steps are covered here. For other details, such as how to get the source data, see the [documentation](https://github.com/kubeflow/pipelines/tree/master/samples/contrib/aws-samples/mnist-kmeans-sagemaker).


This sample is based on [Train a Model with a Built-in Algorithm and Deploy it](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1.html).

The sample trains and deploys a model based on the [MNIST dataset](http://www.deeplearning.net/tutorial/gettingstarted.html).



## Prerequisites

1. Copy dataset

Create an S3 bucket, then run the following commands to copy `data` and `valid_data.csv` into it.

```shell
aws s3 cp s3://kubeflow-pipeline-data/mnist_kmeans_example/data s3://your_bucket/mnist_kmeans_example/data
aws s3 cp s3://kubeflow-pipeline-data/mnist_kmeans_example/input/valid_data.csv s3://your_bucket/mnist_kmeans_example/input/
```
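
The pipeline defined later in this notebook derives all of its default S3 locations from the single `mnist_kmeans_example` prefix. As a rough sketch of that layout, which may help when checking that the copies above landed in the right place (the helper name `pipeline_paths` is hypothetical and not part of the sample):

```python
def pipeline_paths(bucket: str, prefix: str = "mnist_kmeans_example") -> dict:
    """Return the S3 URIs this sample reads from and writes to."""
    root = "s3://{}/{}".format(bucket, prefix)
    return {
        "data": root + "/data",      # training data copied above
        "input": root + "/input",    # holds valid_data.csv for batch transform
        "model": root + "/model",    # trained model artifacts
        "output": root + "/output",  # batch transform results
    }


# "your_bucket" is the same placeholder used in the copy commands.
paths = pipeline_paths("your_bucket")
print(paths["data"])
```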

2. Grant SageMaker permission

To run this pipeline, you need an IAM role that SageMaker jobs can use, and you pass its `role_arn` to the pipeline. See the [SageMaker roles documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for details.

This pipeline also uses `aws-secret` to access SageMaker services, so make sure an `aws-secret` exists in the `kubeflow` namespace.


```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
  namespace: kubeflow
type: Opaque
data:
  AWS_ACCESS_KEY_ID: YOUR_BASE64_ACCESS_KEY
  AWS_SECRET_ACCESS_KEY: YOUR_BASE64_SECRET_ACCESS
```

> Note: To get the base64 string, run `echo -n $AWS_ACCESS_KEY_ID | base64`.
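
If you prefer to stay inside the notebook, the same encoding can be done in Python. This is a minimal sketch; the helper name `to_secret_value` and the placeholder credential are illustrative only. Like `echo -n`, it encodes the value without a trailing newline.

```python
import base64


def to_secret_value(raw: str) -> str:
    """Base64-encode a credential for the `data` field of a Kubernetes Secret."""
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


# Placeholder value for illustration only; never commit real credentials.
print(to_secret_value("EXAMPLE_ACCESS_KEY_ID"))
```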


3. Create a SageMaker service role and get its ARN.
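
A quick sanity check on the value you copied can catch a leftover placeholder before a run fails. The sketch below assumes the usual IAM role ARN shape (`arn:<partition>:iam::<12-digit account id>:role/<name>`); the pattern and the helper name `looks_like_role_arn` are illustrative and do not verify that the role exists.

```python
import re

# Assumed shape: arn:<partition>:iam::<account id>:role/<optional path/><role name>
ROLE_ARN_PATTERN = re.compile(r"arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+")


def looks_like_role_arn(value: str) -> bool:
    """Return True if the string has the general shape of an IAM role ARN."""
    return ROLE_ARN_PATTERN.fullmatch(value) is not None


# The account id and role name here are made up for illustration.
print(looks_like_role_arn("arn:aws:iam::123456789012:role/MySageMakerRole"))
print(looks_like_role_arn("<your_sagemaker_role>"))
```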

## Build pipeline

1. Run the following cell to load the Kubeflow Pipelines SDK.

In [None]:
import kfp
from kfp import components
from kfp import dsl
from kfp.aws import use_aws_secret

2. Load the reusable SageMaker components.

In [None]:
sagemaker_train_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/0ad6c28d32e2e790e6a129b7eb1de8ec59c1d45f/components/aws/sagemaker/train/component.yaml')
sagemaker_model_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/0ad6c28d32e2e790e6a129b7eb1de8ec59c1d45f/components/aws/sagemaker/model/component.yaml')
sagemaker_deploy_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/0ad6c28d32e2e790e6a129b7eb1de8ec59c1d45f/components/aws/sagemaker/deploy/component.yaml')
sagemaker_batch_transform_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/0ad6c28d32e2e790e6a129b7eb1de8ec59c1d45f/components/aws/sagemaker/batch_transform/component.yaml')
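
The four URLs above differ only in the component directory, and all pin the same commit. One way to keep that pin in a single place is a small URL builder. This is only a sketch; `COMPONENTS_COMMIT` and `sagemaker_component_url` are hypothetical names introduced here.

```python
# Commit hash copied from the URLs used in this sample.
COMPONENTS_COMMIT = "0ad6c28d32e2e790e6a129b7eb1de8ec59c1d45f"


def sagemaker_component_url(name: str, commit: str = COMPONENTS_COMMIT) -> str:
    """Build the raw GitHub URL of a SageMaker component definition."""
    return (
        "https://raw.githubusercontent.com/kubeflow/pipelines/"
        "{}/components/aws/sagemaker/{}/component.yaml".format(commit, name)
    )


for name in ("train", "model", "deploy", "batch_transform"):
    print(sagemaker_component_url(name))
```

Each resulting URL can be passed to `components.load_component_from_url` exactly as in the cell above.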

3. Create the pipeline.

The pipeline first runs a training job, which persists the trained model to S3 when it finishes.

A second step then creates a `Model` resource in SageMaker from those artifacts.

Two downstream steps consume this model: the batch transform job uses it to predict on other datasets, and the prediction step deploys it behind an endpoint.
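
The ordering described above forms a small dependency graph. As an illustration only (this is plain Python, not the Kubeflow Pipelines API), the sketch below derives one valid execution order using the step names from this notebook's pipeline code. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps to the set of steps it depends on.
dependencies = {
    "training": set(),
    "create_model": {"training"},
    "prediction": {"create_model"},
    "batch_transform": {"create_model"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

In the actual pipeline these dependencies are not declared by hand: Kubeflow Pipelines infers them when one step's output is passed as another step's argument.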


> Note: Remember to pass your **role_arn** so that the jobs can run successfully.

In [None]:
# Configure your S3 bucket.
S3_BUCKET='<your_s3_bucket>'
S3_PIPELINE_PATH='s3://{}/mnist_kmeans_example'.format(S3_BUCKET)

# Configure your SageMaker execution role.
SAGEMAKER_ROLE_ARN='<your_sagemaker_role>'


@dsl.pipeline(
    name='MNIST Classification pipeline',
    description='MNIST Classification using KMEANS in SageMaker'
)
def mnist_classification(region='us-west-2',
    image='174872318107.dkr.ecr.us-west-2.amazonaws.com/kmeans:1',
    dataset_path=S3_PIPELINE_PATH + '/data',
    instance_type='ml.c4.8xlarge',
    instance_count='2',
    volume_size='50',
    model_output_path=S3_PIPELINE_PATH + '/model',
    batch_transform_input=S3_PIPELINE_PATH + '/input',
    batch_transform_output=S3_PIPELINE_PATH + '/output',
    role_arn=SAGEMAKER_ROLE_ARN
    ):

    # Train the K-Means model; artifacts are written to model_output_path.
    training = sagemaker_train_op(
        region=region,
        image=image,
        instance_type=instance_type,
        instance_count=instance_count,
        volume_size=volume_size,
        dataset_path=dataset_path,
        model_artifact_path=model_output_path,
        role=role_arn,
    ).apply(use_aws_secret('aws-secret', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'))

    # Create a SageMaker model from the training output.
    create_model = sagemaker_model_op(
        region=region,
        image=image,
        model_artifact_url=training.outputs['model_artifact_url'],
        model_name=training.outputs['job_name'],
        role=role_arn
    ).apply(use_aws_secret('aws-secret', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'))

    # Deploy the model to an endpoint for online prediction.
    prediction = sagemaker_deploy_op(
        region=region,
        model_name=create_model.output
    ).apply(use_aws_secret('aws-secret', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'))

    # Run a batch transform job with the same model.
    batch_transform = sagemaker_batch_transform_op(
        region=region,
        model_name=create_model.output,
        input_location=batch_transform_input,
        output_location=batch_transform_output
    ).apply(use_aws_secret('aws-secret', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'))

4. Compile your pipeline

In [None]:
kfp.compiler.Compiler().compile(mnist_classification, 'mnist-classification-pipeline.zip')

5. Deploy your pipeline

In [None]:
client = kfp.Client()
aws_experiment = client.create_experiment(name='aws')
my_run = client.run_pipeline(aws_experiment.id, 'mnist-classification-pipeline', 
  'mnist-classification-pipeline.zip')

## Prediction

Once your pipeline is done, open the SageMaker console, find the name of the newly created endpoint, and replace the `ENDPOINT_NAME` value below with it.

For how `train_set` is obtained, see the dataset section of the documentation linked at the top of this page.


> Note: Make sure a policy that allows `sagemaker:InvokeEndpoint`, such as the one below, is attached to the node group running this Jupyter notebook.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:InvokeEndpoint"
            ],
            "Resource": "*"
        }
    ]
}

```
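
If you manage IAM policies from code, the same document can be generated in Python. This is a sketch under assumptions: the helper name `invoke_endpoint_policy` is hypothetical, and passing anything other than `"*"` as the resource (for example, the ARN of a single endpoint) is an optional tightening that the sample itself does not use.

```python
import json


def invoke_endpoint_policy(resource: str = "*") -> str:
    """Return an IAM policy document that allows sagemaker:InvokeEndpoint."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["sagemaker:InvokeEndpoint"],
                "Resource": resource,
            }
        ],
    }
    return json.dumps(policy, indent=4)


print(invoke_endpoint_policy())
```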


In [None]:
!pip install boto3 --user

In [None]:
import gzip
import io
import json
import pickle
import urllib.request

import boto3
import numpy

# Replace with the name of the endpoint created by your pipeline run.
ENDPOINT_NAME='Endpoint-20190916223205-Y635'

# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

# Simple function to create a csv from our numpy array
def np2csv(arr):
    csv = io.BytesIO()
    numpy.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()

runtime = boto3.Session(region_name='us-west-2').client('sagemaker-runtime')

payload = np2csv(train_set[0][30:31])

response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='text/csv',
                                   Body=payload)
result = json.loads(response['Body'].read().decode())
print(result)
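
To pull the cluster assignment out of `result`, something like the following may help. It assumes the response follows the K-Means JSON inference format, with a `predictions` list whose entries carry `closest_cluster` and `distance_to_cluster`; the helper name and the sample values are made up for illustration and are not real endpoint output.

```python
def closest_clusters(result: dict) -> list:
    """Extract (cluster, distance) pairs from a K-Means style response."""
    pairs = []
    for prediction in result.get("predictions", []):
        pairs.append(
            (int(prediction["closest_cluster"]), float(prediction["distance_to_cluster"]))
        )
    return pairs


# Illustrative response only; real values depend on the trained model.
sample = {"predictions": [{"closest_cluster": 3.0, "distance_to_cluster": 7.25}]}
print(closest_clusters(sample))
```

Keep in mind that K-Means returns a cluster index, which is not necessarily the digit shown in the image.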

## Clean up

Go to the SageMaker console and delete the endpoint and the model created by this pipeline.
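
The same cleanup can be scripted with boto3 instead of the console. The sketch below is a hedged example, not part of the sample: the helper name is hypothetical, it takes the SageMaker client as an argument, and removing the endpoint configuration is an optional extra step that assumes you know its name.

```python
def delete_sagemaker_resources(client, endpoint_name, model_name,
                               endpoint_config_name=None):
    """Delete the endpoint, optionally its endpoint configuration, then the model."""
    client.delete_endpoint(EndpointName=endpoint_name)
    if endpoint_config_name:
        client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)


# Usage (requires AWS credentials; the names below are placeholders):
# import boto3
# sagemaker = boto3.Session(region_name='us-west-2').client('sagemaker')
# delete_sagemaker_resources(sagemaker, '<endpoint_name>', '<model_name>')
```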