# End-to-end experiment: GitHub Issue Summarization

Currently, this notebook must be run from the Kubeflow JupyterHub installation, as described in the codelab.

In this notebook, we will show how to:

* Interactively define a Kubeflow Pipeline using the Pipelines Python SDK
* Submit and run the pipeline
* Add a step in the pipeline

This example pipeline trains a [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor/) model on GitHub issue data, learning to predict issue titles from issue bodies. It then exports the trained model and deploys the exported model to [TensorFlow Serving](https://github.com/tensorflow/serving).
The final step in the pipeline launches a web app which interacts with the TF-Serving instance in order to get model predictions.

## Environment Setup

Before any experiment can be conducted, we need to set up and initialize the environment: ensure that all required Python modules are installed and configured.

Setting up Python modules

In [None]:
%load_ext extensions

import kfp.dsl as dsl
import kfp.gcp as gcp
import pandas as pd
from ipython_secrets import get_secret
from kfp.compiler import Compiler
from kfp import components
from os import environ
import boto3, kfp

import extensions.kaniko as kaniko
import extensions.pv as pv
import extensions.kubernetes as k8s
import extensions.kaniko.aws as aws
import extensions.seldon as seldon
import extensions.utils as utils

Initialize global namespace variables. It is good practice to place all global namespace variables in one cell, so that the notebook can be configured all at once.

To enhance readability, we advise capitalizing such variables.

In [None]:
USER = environ.get('NB_USER', 'John Doe')
rubbish = 'latest'

BUILD_CONTEXT = f"buildcontext-{rubbish}"
# DOCKER_TAG = 'latest'
DOCKER_TAG = rubbish
TRAINING_IMAGE = f"{DOCKER_REGISTRY}/training:{DOCKER_TAG}"
SERVING_IMAGE = f"{DOCKER_REGISTRY}/seldon:{DOCKER_TAG}"

MOUNT_PATH = "/mnt/s3"
DATASET_FILE = f"{MOUNT_PATH}/training-{rubbish}/dataset.csv"
MODEL_FILE = f"{MOUNT_PATH}/training-{rubbish}/training1.h5"
TITLE_PP_FILE = f"{MOUNT_PATH}/training-{rubbish}/title_preprocessor.dpkl"
BODY_PP_FILE = f"{MOUNT_PATH}/training-{rubbish}/body_preprocessor.dpkl"
TRAIN_DF_FILE = f"{MOUNT_PATH}/training-{rubbish}/traindf.csv"
TEST_DF_FILE = f"{MOUNT_PATH}/training-{rubbish}/testdf.csv"
TRAIN_TITLE_VECS = f"{MOUNT_PATH}/training-{rubbish}/train_title_vecs.npy"
TRAIN_BODY_VECS = f"{MOUNT_PATH}/training-{rubbish}/train_body_vecs.npy"

s3 = boto3.session.Session().client(
    service_name='s3',
    aws_access_key_id=get_secret('aws_access_key_id'),
    aws_secret_access_key=get_secret('aws_secret_access_key'),
    endpoint_url=BUCKET_ENDPOINT
)

client = kfp.Client()
try:
    exp = client.get_experiment(experiment_name=APPLICATION_NAME)
except Exception:
    # the experiment does not exist yet (or the lookup failed), so create it
    exp = client.create_experiment(APPLICATION_NAME)

### Prepare dockerfile templates

Dockerfiles can be rendered via the `%%template` or `%templatefile` magics (source code [here](extensions/magics/templates.py)). They support the mustache `{{placeholder}}` templating syntax: each placeholder is replaced by the value of a variable defined in the user namespace or of a system environment variable.

You can use flags with the magic function:
* `-v` - show the content of the rendered file
* `-h` - list more options
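
The placeholder substitution described above can be sketched in a few lines. This is only an illustration, not the code in `extensions/magics/templates.py`: the `render` function is hypothetical, and the lookup order (namespace first, then environment) and the error on a missing value are assumptions.

```python
import os
import re

# Matches mustache-style placeholders such as {{TRAINING_IMAGE}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, namespace, environ=None):
    """Replace {{NAME}} with a namespace variable, falling back to the environment."""
    environ = os.environ if environ is None else environ

    def substitute(match):
        name = match.group(1)
        if name in namespace:
            return str(namespace[name])
        if name in environ:
            return str(environ[name])
        # Assumption: fail loudly instead of rendering an empty value.
        raise KeyError(f"no value for placeholder {name!r}")

    return PLACEHOLDER.sub(substitute, template)


rendered = render(
    'FROM {{TRAINING_IMAGE}}\nCMD ["{{MODEL_WRAPPER}}", "REST"]',
    namespace={"TRAINING_IMAGE": "registry.example.com/training:latest"},
    environ={"MODEL_WRAPPER": "IssueSummarizationModel"},
)
print(rendered)
```

Passing the environment as an argument keeps the sketch easy to test; in a notebook the namespace would typically be the interactive user namespace.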

### Build a training docker image

Define the build pipeline. Yes, we are using KFP to build the images that will then be used by the final pipeline.

We use [Kaniko](https://github.com/GoogleContainerTools/kaniko) and Kubernetes to handle build operations. Build status can be tracked via the KFP pipeline dashboard.

In fact, the image build job could even be combined with the primary pipeline, since it runs in separate Kubernetes pods anyway. However, for the sake of efficiency we schedule the build process as a separate pipeline.

In [None]:
kaniko_op = components.load_component_from_file('components/kaniko/deploy.yaml')

@dsl.pipeline(
  name='Pipeline images',
  description='Build images that will be used by the pipeline'
)
def build_images(
        image, 
        build_context=None, 
        dockerfile: dsl.PipelineParam=dsl.PipelineParam(name='dockerfile', value='Dockerfile')):
    kaniko_op(
        image=image,
        dockerfile=dockerfile,
        build_context=build_context
    ).apply(
        # docker registry credentials 
        kaniko.use_pull_secret_projection(secret_name=DOCKER_REGISTRY_PULL_SECRET)
    ).apply(
        # the S3 bucket volume claim is injected here
        pv.use_pvc(name=BUCKET_PVC, mount_to=MOUNT_PATH)
    )
        
Compiler().compile(build_images, 'kaniko.tar.gz')

The compiler transforms the Python DSL into an [Argo Workflow](https://argoproj.github.io/docs/argo/readme.html) and stores the generated artifacts in `kaniko.tar.gz`, so the workflow can be executed multiple times, possibly with different parameters.
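
To see what such an archive contains, it can be opened with the standard `tarfile` module. The sketch below is a minimal illustration that builds a stand-in archive so it runs without KFP; the `read_archive` helper, the member name `pipeline.yaml`, and the YAML body are assumptions, not actual compiler output.

```python
import io
import os
import tarfile
import tempfile


def read_archive(archive_path):
    """Return {member name: decoded text} for every regular file in a .tar.gz archive."""
    contents = {}
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():
                contents[member.name] = archive.extractfile(member).read().decode("utf-8")
    return contents


# Stand-in archive: the member name and YAML body are placeholders.
workflow_yaml = "apiVersion: argoproj.io/v1alpha1\nkind: Workflow\n"
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tar.gz")
with tarfile.open(demo_path, mode="w:gz") as archive:
    data = workflow_yaml.encode("utf-8")
    info = tarfile.TarInfo(name="pipeline.yaml")
    info.size = len(data)
    archive.addfile(info, io.BytesIO(data))

files = read_archive(demo_path)
print(list(files))
```

Pointing `read_archive` at `kaniko.tar.gz` should show the generated workflow definition in the same way.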

The next cell uploads all files to S3 so that the pipeline can access them. Files that should be ignored can be listed in [kanikoignore.txt](./kanikoignore.txt). To understand the upload scenario, you can review and modify [aws.py](./extensions/kaniko/aws.py).

In [None]:
aws.upload_to_s3(
    destination=f"s3://{BUCKET_NAME}/{BUILD_CONTEXT}",
    ignorefile='kanikoignore.txt',
    workspace='.',
    s3_client=s3
)

run = client.run_pipeline(
    exp.id, 'Build docker images', 'kaniko.tar.gz', 
    params={
        'image': TRAINING_IMAGE,
        'build-context': f"{MOUNT_PATH}/{BUILD_CONTEXT}/components/training"
    })
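
How an ignore file can narrow down the set of uploaded files is sketched below. This is a hypothetical illustration, not the logic in `extensions/kaniko/aws.py`: the helper names, the shell-style glob semantics, and the sample patterns are all assumptions.

```python
from fnmatch import fnmatchcase


def parse_ignore(text):
    """Return the non-empty, non-comment patterns from the text of an ignore file."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def select_files(paths, patterns):
    """Keep only the paths that match none of the ignore patterns."""
    return [
        path for path in paths
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


# Example ignore-file content (assumed, not the real kanikoignore.txt).
patterns = parse_ignore("# build debris\n*.tar.gz\n\n__pycache__/*\n")
to_upload = select_files(
    [
        "components/training/Dockerfile",
        "kaniko.tar.gz",
        "src/app.py",
        "__pycache__/app.cpython-36.pyc",
    ],
    patterns,
)
print(to_upload)
```

Note that `fnmatchcase` lets `*` match path separators as well, which differs from `.gitignore` semantics.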

The build process can take a long time, because images used for data science tasks are often huge. In this case you might want to adjust the `timeout` parameter.

In [None]:
%%time
# block until job completion
print(f"Waiting for run: {run.id}...")
result = client.wait_for_run_completion(run.id, timeout=720).run.status
print(f"Finished with: {result}")
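
The blocking wait above amounts to polling with a deadline. A minimal, generic sketch of that pattern follows; `wait_until` is a hypothetical helper and does not reflect how the KFP client implements `wait_for_run_completion`.

```python
import time


def wait_until(check, timeout, interval=1.0):
    """Call check() until it returns a non-None value or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        time.sleep(interval)


# Stand-in for a run whose status becomes available on the third poll.
statuses = iter([None, None, "Succeeded"])
status = wait_until(lambda: next(statuses), timeout=5, interval=0.01)
print(status)
```

A larger `timeout` simply moves the deadline; the polling interval controls how often the status is queried.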

# Training

In this chapter we define a pipeline that does two important things. It downloads a data set in CSV file format (we call this operation **data import**) and then prepares the data and trains the model on it, as the steps in the cell below show.

In [None]:
from components.training import *

@dsl.pipeline(
  name='Data preparation',
  description="""Extract validate transform and load data into object storage. 
  So it could be accessible by the actual training
  """
)
def training_pipeline(
    import_from: dsl.PipelineParam, 
    dataset_file: dsl.PipelineParam,
    dataset_md5: dsl.PipelineParam,
    train_df_file: dsl.PipelineParam,
    test_df_file: dsl.PipelineParam,
    title_pp_file: dsl.PipelineParam,
    body_pp_file: dsl.PipelineParam,
    train_title_vecs: dsl.PipelineParam,
    train_body_vecs: dsl.PipelineParam,
    model_file: dsl.PipelineParam,
    sample_size: dsl.PipelineParam=dsl.PipelineParam(name='sample-size', value='200'),
    learning_rate: dsl.PipelineParam=dsl.PipelineParam(name='learning-rate', value=0.001),
):  
    download = http_download_op(
        url=import_from,
        md5sum=dataset_md5,
        download_to=dataset_file
    ).apply(
        pv.use_pvc(name=BUCKET_PVC, mount_to=MOUNT_PATH)
    )
    
    # Generates the training and test set. Only processes "sample-size" rows.
    process_data = training_op(
        driver='process_data.py',
        arguments=[
            '--input_csv', dataset_file,
            '--sample_size', sample_size,
            '--output_traindf_csv', train_df_file, 
            '--output_testdf_csv', test_df_file,
        ]
    ).apply(
        pv.use_pvc(name=BUCKET_PVC, mount_to=MOUNT_PATH)
    ).after(download)
    
    # Preprocess for deep learning
    preproc_for_ml = training_op(
        driver='preproc.py',
        arguments=[
            '--input_traindf_csv', train_df_file,
            '--output_title_preprocessor_dpkl', title_pp_file,
            '--output_body_preprocessor_dpkl', body_pp_file,
            '--output_train_title_vecs_npy', train_title_vecs,
            '--output_train_body_vecs_npy', train_body_vecs,
        ]
    ).apply(
        pv.use_pvc(name=BUCKET_PVC, mount_to=MOUNT_PATH)
    ).after(process_data)
    
    # Training
    training = training_op(
        driver='train.py',
        arguments=[
            '--input_title_preprocessor_dpkl', title_pp_file,
            '--input_body_preprocessor_dpkl', body_pp_file,
            '--input_train_title_vecs_npy', train_title_vecs,
            '--input_train_body_vecs_npy', train_body_vecs,
            '--script_name_base', '/tmp/seq2seq',
            '--output_model_h5', model_file,
            '--learning_rate', learning_rate,
            '--tempfile', "True",
        ]
    ).apply(
        pv.use_pvc(name=BUCKET_PVC, mount_to=MOUNT_PATH)
    ).after(preproc_for_ml)
        

Compiler().compile(training_pipeline, 'preproc.tar.gz')
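
The `.after(...)` calls in the pipeline above declare which step must finish before the next one starts. The resulting execution order can be illustrated with a small topological sort. This is a sketch only: `execution_order` is a hypothetical helper, and the real scheduling is done by the workflow engine.

```python
from collections import deque


def execution_order(dependencies):
    """Order steps so that each one comes after the steps it depends on."""
    remaining = {step: set(after) for step, after in dependencies.items()}
    ready = deque(sorted(step for step, after in remaining.items() if not after))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        # Release every step that was only waiting for the one just finished.
        for other, after in remaining.items():
            if step in after:
                after.discard(step)
                if not after:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("dependency cycle detected")
    return order


# Mirrors the .after(...) chain of the training pipeline.
steps = {
    "download": [],
    "process_data": ["download"],
    "preproc_for_ml": ["process_data"],
    "training": ["preproc_for_ml"],
}
order = execution_order(steps)
print(order)
```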

The code below runs the pipeline and injects the pipeline parameters. We provide two versions of the data set:
* `SAMPLE_DATASET` - A data set of just over 2 megabytes. Not enough for sufficient training, but ideal for development because of the faster feedback.
* `FULL_DATASET` - A precreated data set with all GitHub issues, 3 gigabytes in size. Good enough to train a sufficient model.

Depending on your needs, you can choose one data set or the other and pass it as the pipeline parameter `import-from`, together with its checksum as `dataset-md5`.

In [None]:
# GitHub issues sample: about 2 MB data set (best for dev/test)
SAMPLE_DATASET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-sample.csv'
SAMPLE_DATASET_MD5 = '916af946f2fe1d1779b26205d4d8378f'
# GitHub issues full: about 3 GB data set (best for training)
FULL_DATASET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-full.csv'
FULL_DATASET_MD5 = '57dc987c04d41a94d0d9daf4d0ebf8ba'

run = client.run_pipeline(exp.id, 'Prepare data', 'preproc.tar.gz',
                          params={
                              'import-from': SAMPLE_DATASET,
                              'dataset-md5': SAMPLE_DATASET_MD5,
                              'dataset-file': DATASET_FILE,
                              'title-pp-file': TITLE_PP_FILE,
                              'body-pp-file': BODY_PP_FILE,
                              'train-df-file': TRAIN_DF_FILE,
                              'test-df-file': TEST_DF_FILE,
                              'train-title-vecs': TRAIN_TITLE_VECS,
                              'train-body-vecs': TRAIN_BODY_VECS,
                              'model-file': MODEL_FILE,
                              'learning-rate': 0.001,
                              'sample-size': 100,
                          })
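
The `dataset-md5` parameter lets the download step check the integrity of the imported file. A minimal sketch of such a check is shown below; the `md5sum` and `verify` helpers are hypothetical and are not the code behind `http_download_op`.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Raise ValueError if the file's MD5 digest differs from the expected one."""
    actual = md5sum(path)
    if actual != expected_md5.lower():
        raise ValueError(f"checksum mismatch: expected {expected_md5}, got {actual}")
    return path


# Stand-in for a downloaded data set.
sample = b"issue_url,issue_title,body\n"
sample_path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
with open(sample_path, "wb") as handle:
    handle.write(sample)

verified = verify(sample_path, hashlib.md5(sample).hexdigest())
print(f"verified {verified}")
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte full data set.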

In [None]:
%%time
# block until job completion
print(f"Waiting for run: {run.id}...")
result = client.wait_for_run_completion(run.id, timeout=720).run.status
print(f"Finished with: {result}")

# Serving with Seldon
This section prepares a container for serving.

Here we define all variables that will be needed for our Dockerfile template.

- `MODEL_WRAPPER`: name of the Python class that adapts the Keras model for serving
- `MODEL_NAME`: used in the Seldon deployment
- `MODEL_VERSION`: one model can be served multiple times simultaneously under different versions
- `SELDON_DEPLOYMENT`: name of the Kubernetes resource
- `SELDON_OAUTH_KEY`: part of the shared secret between the `SeldonDeployment` and a client application
- `SELDON_OAUTH_SECRET`: part of the shared secret between the `SeldonDeployment` and a client application
- `SELDON_DEPLOYMENT_REPLICAS`: number of replicas for the `SeldonDeployment` pod

In [None]:
import re

MODEL_WRAPPER = 'IssueSummarizationModel'
MODEL_NAME = re.sub(r'\W+', '-', MODEL_WRAPPER).lower()
MODEL_VERSION = f"{rubbish}"
SELDON_DEPLOYMENT = f"{MODEL_NAME}-{MODEL_VERSION}"
# hash the information about the model here, so the key is predictable
SELDON_OAUTH_KEY = utils.sha1(MODEL_NAME, MODEL_VERSION, NAMESPACE)
# for the secret, use a hash of the user-defined shared secret salted with the OAuth key
SELDON_OAUTH_SECRET = utils.sha1(SELDON_OAUTH_KEY, get_secret('USER_SECRET_FOR_MODEL'))
SELDON_APISERVER_ADDR = f"seldon-seldon-apiserver.{NAMESPACE}:8080"

SELDON_DEPLOYMENT_REPLICAS = 1

## Building a serving application

`SeldonDeployment` needs a Docker image that contains a model wrapper, written here in Python (other languages are also possible).

The following steps build a container image and serve it.

In [None]:
%%template components/serving/Dockerfile
FROM seldonio/seldon-core-s2i-python3

FROM {{TRAINING_IMAGE}}
RUN pip3 install --no-cache-dir -U 'seldon-core'

COPY --from=0 /microservice /microservice
COPY src/serving.py /microservice/{{MODEL_WRAPPER}}.py
COPY src/seq2seq_utils.py /microservice
COPY src/text_utils.py /microservice

WORKDIR /microservice
ENTRYPOINT ["python","-u","microservice.py"]
CMD ["{{MODEL_WRAPPER}}", "REST"]

To be able to serve the trained model, we build an image with our serving microservice. To achieve this, we reuse the Kaniko pipeline defined above.

In [None]:
aws.upload_to_s3(
    destination=f"s3://{BUCKET_NAME}/{BUILD_CONTEXT}",
    ignorefile='kanikoignore.txt',
    workspace='.',
    s3_client=s3,
)

run = client.run_pipeline(exp.id, 'Build a serving image', 'kaniko.tar.gz', 
                          params={
                              'image': SERVING_IMAGE,
                              'build-context': f"{MOUNT_PATH}/{BUILD_CONTEXT}/components/serving"
                          })

In [None]:
%%time
# block until job completion
print(f"Waiting for run: {run.id}...")
result = client.wait_for_run_completion(run.id, timeout=720).run.status
print(f"Finished with: {result}")

Then we render our `SeldonDeployment` template and deploy it with `kubectl`, similar to what we did before with the `pvc` definition. The template defines a reference to the model that will be used for serving.

In [None]:
%templatefile templates/seldon.yaml -o seldon.yaml
!kubectl apply -f seldon.yaml --wait
!kubectl get -f seldon.yaml -o jsonpath='{.status.state}'

Test model serving by accessing the Seldon API server. Because the Seldon API server is protected with OAuth, we first need to obtain a temporary bearer token. We can get this token by providing the OAuth key and secret that were used in our `SeldonDeployment`.

In [None]:
test_payload = {
    "data":{"ndarray": [["try to stop flask from using multiple threads"]]},
}
                         
t = seldon.get_token(
    server=SELDON_APISERVER_ADDR,
    oauth_key=SELDON_OAUTH_KEY,
    oauth_secret=SELDON_OAUTH_SECRET,
)
result = seldon.prediction(
    server=SELDON_APISERVER_ADDR,
    payload=test_payload,
    token=t,
) 
result
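
The two-step exchange above follows the usual OAuth 2.0 client-credentials shape: the key and secret are sent with HTTP Basic authentication to obtain a token, and the token is then sent as a bearer credential with the prediction. The sketch below only constructs the headers and bodies without any network call; the helper names are hypothetical, and the actual `seldon.get_token` and `seldon.prediction` implementations may differ.

```python
import base64
import json


def token_request(oauth_key, oauth_secret):
    """Headers and form body for a client-credentials token request."""
    credentials = f"{oauth_key}:{oauth_secret}".encode("utf-8")
    headers = {"Authorization": "Basic " + base64.b64encode(credentials).decode("ascii")}
    body = {"grant_type": "client_credentials"}
    return headers, body


def prediction_request(access_token, payload):
    """Headers and JSON body for a prediction call authorized with a bearer token."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    return headers, json.dumps(payload)


# Placeholder credentials and token, for illustration only.
token_headers, token_body = token_request("example-key", "example-secret")
pred_headers, pred_body = prediction_request(
    "example-token",
    {"data": {"ndarray": [["try to stop flask from using multiple threads"]]}},
)
print(token_headers["Authorization"].split()[0], pred_headers["Authorization"].split()[0])
```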

In [None]:
pd.DataFrame(data=result['data']['ndarray'], columns=['Predictions'])

# Deploy a client application

This section covers the deployment of the client application.

In [None]:
APPLICATION_NAME = "webapp-github"
APPLICATION_DOCKER_IMAGE = f"{DOCKER_REGISTRY}/library/app:{rubbish}"
APPLICATION_REPLICAS = 1
SAMPLE_DATA = '/data/sample.csv'
GITHUB_TOKEN = get_secret('GITHUB_TOKEN')

The user application is implemented in [src/app.py](src/app.py). We bake this application into a Docker container image and then deploy it.

In [None]:
%%template components/flaskapp/Dockerfile
FROM {{TRAINING_IMAGE}}
RUN pip3 install --no-cache-dir -U 'flask>=0.12.3'
ADD {{SAMPLE_DATA_SET}} /data/sample.csv
WORKDIR /app
COPY src/app.py /app
COPY src/templates /app/templates
COPY src/text_utils.py /app
ENTRYPOINT ["python3", "-u", "app.py"]

In [None]:
aws.upload_to_s3(
    destination=f"s3://{BUCKET_NAME}/{BUILD_CONTEXT}",
    ignorefile='kanikoignore.txt',
    workspace='.',
    s3_client=s3,
)

run = client.run_pipeline(exp.id, 'Build an application image', 'kaniko.tar.gz', 
                          params={
                              'image': APPLICATION_DOCKER_IMAGE,
                              'build-context': f"{MOUNT_PATH}/{BUILD_CONTEXT}/components/flaskapp"
                          })

In [None]:
%%time
# block until job completion
print(f"Waiting for run: {run.id}...")
result = client.wait_for_run_completion(run.id, timeout=720).run.status
print(f"Finished with: {result}")

In [None]:
%templatefile templates/application.yaml -o application.yaml
!kubectl apply -f application.yaml --wait

# Tear down

Upon completion, let's tear everything down.

In [None]:
#!kubectl delete -f bucket-volume.yaml
!kubectl delete -f seldon.yaml
!kubectl delete -f application.yaml