# End-to-end experiment: GitHub Issue Summarization

Currently, this notebook must be run from the Kubeflow JupyterHub installation, as described in the codelab.

In this notebook, we will show how to:

* Interactively define a Kubeflow Pipeline using the Pipelines Python SDK
* Submit and run the pipeline
* Add a step in the pipeline

This example pipeline trains a [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor/) model on GitHub issue data, learning to predict issue titles from issue bodies. It then exports the trained model and deploys the exported model to [TensorFlow Serving](https://github.com/tensorflow/serving).
The final step in the pipeline launches a web app which interacts with the TF-Serving instance in order to get model predictions.

## Environment Setup

Before any experiment can be conducted, we need to set up and initialize the environment: ensure that all required Python modules are installed and configured.

Setting up Python modules:

In [None]:
!pip3 install --upgrade 'pip' > /dev/null
!pip3 install --upgrade 'https://storage.googleapis.com/ml-pipeline/release/0.1.16/kfp.tar.gz' > /dev/null
!pip3 install --upgrade './extensions' > /dev/null
%load_ext extensions

import sys, boto3, re
sys.path.insert(0, 'src')

import kfp
import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.notebook

from ipython_secrets import get_secret
from kfp.compiler import Compiler

import extensions
import extensions.kaniko as kaniko
import extensions.pv as pv
import extensions.kubernetes as k8s
import extensions.kaniko.aws as aws

from os import environ

client = kfp.Client()

Initialize global namespace variables. It is good practice to place all global namespace variables in one cell, so that the notebook can be configured all at once.

To enhance readability, we advise capitalizing such variables.

In [None]:
USER = environ['JUPYTERHUB_USER']
EXPERIMENT_NAME = f'Github Issues {USER}'
DOCKER_REGISTRY = 'kfp-pack-harbor.svc.cluster-red.antoncloud.superhub.io'
DOCKER_REGISTRY_SECRET = 'kfp-pack-harbor-pull-secret'
# DOCKER_TAG = 'latest'
DOCKER_TAG = 'latest'
TRAINING_IMAGE = f"{DOCKER_REGISTRY}/library/training:{DOCKER_TAG}"

S3_ENDPOINT = 'https://kfp-pack-minio.app.cluster-red.antoncloud.superhub.io'

s3 = boto3.session.Session().client(
    service_name='s3',
    aws_access_key_id=get_secret('ACCESS_KEY'),
    aws_secret_access_key=get_secret('SECRET_KEY'),
    endpoint_url=S3_ENDPOINT
)

try:
    exp = client.get_experiment(experiment_name=EXPERIMENT_NAME)
except Exception:
    # the experiment does not exist yet, so create it
    exp = client.create_experiment(EXPERIMENT_NAME)

## Create bucket


Here we will create a new bucket and a `pvc` that exposes it as a file system inside the workflow pod. To do this we need to define a few variables:
- `S3_BUCKET` - name of the bucket to create. By default we derive it from the user name.
- `NAMESPACE` - points to the current namespace.
- `PVC_NAME` - derived from `S3_BUCKET`. This is the Kubernetes `pvc` name. It will be used by the pipeline containers (`ContainerOp` objects).

In [None]:
NAMESPACE = k8s.current_namespace()
# we want to create a unique bucket for each user, so we use a regexp
# to replace every run of non-alphanumeric symbols with a dash;
# this turns the user name into a usable bucket name
S3_BUCKET = re.sub(r'\W+', '-', USER).lower() + "-new"
PVC_NAME = f"{S3_BUCKET}-bucket"
MOUNT_DIR = "/mnt/s3"

try:
    s3.create_bucket(Bucket=S3_BUCKET, ACL='private')
except Exception:
    # best effort: the bucket may already exist from a previous run
    pass

%templatefile templates/bucket-pvc.yaml.template -o bucket-volume.yaml
!kubectl apply -f bucket-volume.yaml
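
The bucket-name derivation above can be tried in isolation. The sketch below wraps the same expression in a helper; `bucket_name_for` and `strict_bucket_name_for` are hypothetical names that do not exist in the notebook. Note that `\W` does not match underscores, which S3 naming rules generally reject, so the stricter variant is shown as one possible alternative rather than as the notebook's behavior.

```python
import re


def bucket_name_for(user, suffix="-new"):
    """Same expression as the notebook: runs of non-word characters become dashes."""
    return re.sub(r'\W+', '-', user).lower() + suffix


def strict_bucket_name_for(user, suffix="-new"):
    """Hypothetical stricter variant: only a-z, 0-9 and dashes survive."""
    return re.sub(r'[^a-z0-9]+', '-', user.lower()).strip('-') + suffix


print(bucket_name_for("Jane.Doe@example.com"))  # jane-doe-example-com-new
print(bucket_name_for("john_smith"))            # john_smith-new (underscore kept)
print(strict_bucket_name_for("john_smith"))     # john-smith-new
```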

### Prepare Dockerfile templates

Dockerfiles can be rendered via the `%%template` or `%templatefile` magics (source code [here](extensions/magics/templates.py)). The magics understand mustache-style `{{placeholder}}` syntax: each placeholder is replaced by the value of a variable defined in the user namespace or of a system environment variable.

You can use flags with the magic function:
* `-v` - to see the content of the rendered file.
* `-h` - for more options
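
To make the placeholder substitution concrete, here is a minimal sketch of how such rendering could work. It is not the code from `extensions/magics/templates.py`: the function name `render_template`, the lookup order (namespace first, then environment), and the choice to leave unknown placeholders untouched are all assumptions of this sketch.

```python
import os
import re

# Matches {{name}} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r'\{\{\s*(\w+)\s*\}\}')


def render_template(text, namespace, env=None):
    """Replace {{name}} with a value from `namespace`, then from `env`.

    Unknown placeholders are left untouched (an assumption of this sketch).
    """
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name in namespace:
            return str(namespace[name])
        if name in env:
            return str(env[name])
        return match.group(0)

    return PLACEHOLDER.sub(substitute, text)


rendered = render_template(
    "FROM {{BASE_IMAGE}}\nLABEL owner={{USER}}",
    namespace={"BASE_IMAGE": "python:3.6"},
    env={"USER": "jane"},
)
print(rendered)
```

Docker's own `${buildfrom}` syntax does not use double braces, so a renderer like this one would pass it through unchanged and leave it for the image build to resolve.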

In [None]:
%%template Dockerfile.keras
ARG buildfrom=python:3.6
FROM ${buildfrom}
ENV PATH "/src:${PATH}"

WORKDIR /tmp
RUN apt-get update && apt-get install -y --no-install-recommends \
    python-pandas git \
    && pip3 install -U scikit-learn \
    && pip3 install -U ktext \
    && pip3 install -U IPython \
    && pip3 install -U annoy \
    && pip3 install -U tqdm \
    && pip3 install -U nltk \
    && pip3 install -U matplotlib \
    && pip3 install -U bernoulli \
    && pip3 install -U h5py \
    && git clone https://github.com/google/seq2seq.git \
    && pip3 install -e ./seq2seq/ \
    && apt-get clean \
    && rm -rf \
    /var/lib/apt/lists/* \
    /tmp/* \
    /var/tmp/* \
    /usr/share/man \
    /usr/share/doc \
    /usr/share/doc-base

COPY src /src
WORKDIR /src

ENTRYPOINT /usr/local/bin/python

### Define pipeline to build images

Define the build pipeline. Yes, we are using KFP itself to build the images that the final pipeline will run.

We use [Kaniko](https://github.com/GoogleContainerTools/kaniko) and Kubernetes to handle build operations. Build status can be tracked via the KFP pipeline dashboard.

In fact, the image build job could even be combined with the primary pipeline, since it runs in separate Kubernetes pods anyway. However, for the sake of efficiency we schedule the build process as a separate pipeline.

In [None]:
@dsl.pipeline(
  name='Pipeline images',
  description='Build images that will be used by the pipeline'
)
def build_images(training_image, build_context, build_from):
    dsl.ContainerOp(
        name='training-image',
        image='gcr.io/kaniko-project/executor:latest',
        arguments=['--cache',
                   '--destination', training_image,
                   '--dockerfile', 'Dockerfile.keras',
                   '--context', build_context,
                   '--build-arg', f"buildfrom={build_from}"]
    ).apply(
        # docker registry credentials
        kaniko.use_pull_secret_projection(secret_name=DOCKER_REGISTRY_SECRET)
    ).apply(
        # the S3 bucket volume claim is mounted here
        extensions.pv.use_pvc(name=PVC_NAME, mount_to=MOUNT_DIR)
    )
        
Compiler().compile(build_images, 'kaniko.tar.gz')
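
The argument list handed to the Kaniko executor is easier to reason about when it is built by a small function. The sketch below reproduces the flags used in the cell above, with plain strings standing in for pipeline parameters; `kaniko_arguments` is a hypothetical helper, not part of the `extensions` package.

```python
def kaniko_arguments(destination, context, dockerfile="Dockerfile.keras", build_args=None):
    """Assemble the argument list passed to the Kaniko executor container."""
    arguments = [
        "--cache",
        "--destination", destination,
        "--dockerfile", dockerfile,
        "--context", context,
    ]
    # Each build argument becomes its own '--build-arg key=value' pair.
    for key, value in (build_args or {}).items():
        arguments += ["--build-arg", f"{key}={value}"]
    return arguments


args = kaniko_arguments(
    destination="registry.example.com/library/training:latest",  # placeholder registry
    context="/mnt/s3/buildcontext",
    build_args={"buildfrom": "python:3.7"},
)
print(args)
```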

The compiler transforms the Python DSL into an [Argo Workflow](https://argoproj.github.io/docs/argo/readme.html) and stores the generated artifacts in `kaniko.tar.gz`, so the pipeline can be executed multiple times, possibly with different parameters.
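
If you are curious what ends up in the compiled package, you can list the archive members with the standard library. The sketch below builds a stand-in archive so that it runs without the KFP compiler; the member name `pipeline.yaml` is an assumption about what the compiler writes, and `list_package_members` is a hypothetical helper.

```python
import io
import os
import tarfile
import tempfile


def list_package_members(package_path):
    """Return the file names stored inside a compiled pipeline package."""
    with tarfile.open(package_path, "r:gz") as package:
        return package.getnames()


# Stand-in archive for demonstration; in the notebook, pass 'kaniko.tar.gz' instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tar.gz")
    payload = b"apiVersion: argoproj.io/v1alpha1\nkind: Workflow\n"
    with tarfile.open(demo_path, "w:gz") as package:
        info = tarfile.TarInfo(name="pipeline.yaml")
        info.size = len(payload)
        package.addfile(info, io.BytesIO(payload))
    members = list_package_members(demo_path)

print(members)
```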

The next cell uploads all files to S3 so that the pipeline can access them. Files that should be ignored can be listed in [kanikoignore.txt](./kanikoignore.txt). To understand the upload scenario, you can review and modify [aws.py](./extensions/kaniko/aws.py).

In [None]:
context_dir = 'buildcontext'

aws.upload_to_s3(
    destination=f"s3://{S3_BUCKET}/{context_dir}",
    ignorefile='kanikoignore.txt',
    workspace='.',
    s3_client=s3
)

run = client.run_pipeline(exp.id, 'Build docker images', 'kaniko.tar.gz', 
                          params={
                              'training-image': TRAINING_IMAGE,
                              'build-context': f"{MOUNT_DIR}/{context_dir}",
                              'build-from': 'python:3.7'})
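
The idea of an ignore file can be illustrated without S3. The sketch below filters a list of paths with shell-style patterns; it is not the logic from `extensions/kaniko/aws.py`. The pattern syntax, the `#` comment handling, and the helper names `read_patterns` and `select_files` are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def read_patterns(text):
    """Parse ignore-file text: skip blank lines and '#' comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def select_files(paths, ignore_patterns):
    """Keep only the paths that match none of the ignore patterns."""
    return [
        path for path in paths
        if not any(fnmatchcase(path, pattern) for pattern in ignore_patterns)
    ]


patterns = read_patterns("# build leftovers\n*.tar.gz\n.ipynb_checkpoints/*\n")
files = ["src/train.py", "kaniko.tar.gz", ".ipynb_checkpoints/nb.ipynb", "Dockerfile.keras"]
selected = select_files(files, patterns)
print(selected)  # ['src/train.py', 'Dockerfile.keras']
```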

The build can take a long time, because images used for data science tasks are often huge. In that case you might want to increase the `timeout` parameter.

In [None]:
# block till completion
client.wait_for_run_completion(run.id, timeout=720).run.status
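
Blocking with a timeout boils down to a polling loop. The sketch below shows the general pattern with a fake status source, so it runs without a cluster; it is not the implementation of `wait_for_run_completion`. The set of terminal states and the helper name `wait_for_status` are assumptions of this sketch.

```python
import time

# Assumption: these are the states the sketch treats as final.
TERMINAL_STATES = {"succeeded", "failed", "skipped", "error"}


def wait_for_status(get_status, timeout, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status()` until it reports a terminal state or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status is not None and status.lower() in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout} seconds")
        sleep(interval)


# Fake run that succeeds on the third poll.
statuses = iter(["Pending", "Running", "Succeeded"])
final_status = wait_for_status(lambda: next(statuses), timeout=10, interval=0)
print(final_status)  # Succeeded
```

A larger `timeout` only moves the deadline; the loop still returns as soon as a terminal state is seen, so a generous value costs nothing when the build finishes early.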

# Data Preparation

In this chapter we will define a pipeline that prepares the data and trains the model. It downloads a data set in CSV format (we call this operation **data import**) and then runs the `process-data`, `preproc-for-ml`, and `training` steps defined in the cell below.

In [None]:
IMPORT_DATASET = True

def training_op(name, arguments=None):
    """A template function to encapsulate similar container ops."""
    return dsl.ContainerOp(
        name=name,
        image=TRAINING_IMAGE,
        command=['/usr/local/bin/python'],
        # avoid a shared mutable default argument
        arguments=arguments or []
    ).apply(
        extensions.pv.use_pvc(name=PVC_NAME, mount_to=MOUNT_DIR)
    )
    

@dsl.pipeline(
  name='Data preparation',
  description="""Extract validate transform and load data into object storage. 
  So it could be accessible by the actual training
  """
)
def prepare_data(
    import_from: dsl.PipelineParam, 
    dataset_file: dsl.PipelineParam, 
    workdir: dsl.PipelineParam,
    model_file: dsl.PipelineParam,
    sample_size: dsl.PipelineParam=dsl.PipelineParam(name='sample-size', value='200'),
    learning_rate: dsl.PipelineParam=dsl.PipelineParam(name='learning-rate', value=0.001),
):  
    
    # Generates the training and test set. Only processes "sample-size" rows.
    process_data = training_op(
        name='process-data',
        arguments=[
            'process_data.py', 
            '--input_csv', f"{workdir}/{dataset_file}",
            '--sample_size', sample_size,
            '--output_traindf_csv', f"{workdir}/traindf.csv", 
            '--output_testdf_csv', f"{workdir}/testdf.csv"
        ]
    )
    
    # Preprocess for deep learning
    preproc_for_ml = training_op(
        name = 'preproc-for-ml',
        arguments=[
            'preproc.py',
            '--input_traindf_csv', f"{workdir}/traindf.csv",
            '--output_body_preprocessor_dpkl', f"{workdir}/body_preprocessor.dpkl",
            '--output_title_preprocessor_dpkl', f"{workdir}/title_preprocessor.dpkl",
            '--output_train_title_vecs_npy', f"{workdir}/train_title_vecs.npy",
            '--output_train_body_vecs_npy', f"{workdir}/train_body_vecs.npy",
        ]
    )
    preproc_for_ml.after(process_data)
    
    # Training
    training = training_op(
        name = 'training',
        arguments=[
            'train.py',
            '--input_body_preprocessor_dpkl', f"{workdir}/body_preprocessor.dpkl",
            '--input_title_preprocessor_dpkl', f"{workdir}/title_preprocessor.dpkl",
            '--input_train_title_vecs_npy', f"{workdir}/train_title_vecs.npy",
            '--input_train_body_vecs_npy', f"{workdir}/train_body_vecs.npy",
            '--script_name_base', "/tmp/seq2seq",
            '--output_model_h5', f"{workdir}/{model_file}",
            '--learning_rate', learning_rate,
            '--tempfile', "True",
        ]
    )
    training.after(preproc_for_ml)
    
    # we put the optional component definition at the bottom
    # so that it can be wired in front of the steps defined above
    if IMPORT_DATASET:
        import_data = dsl.ContainerOp(
            name='import-data',
            image='appropriate/curl',
            arguments=['-#Lv', '--create-dirs', '-o', f"{workdir}/{dataset_file}", import_from]
        ).apply(
            pv.use_pvc(name=PVC_NAME, mount_to=MOUNT_DIR)
        )        
        process_data.after(import_data)

Compiler().compile(prepare_data, 'preproc.tar.gz')
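
The `.after()` calls above define a small dependency graph. The sketch below mirrors that graph with plain step names and derives a valid execution order, including the effect of the `IMPORT_DATASET` switch. It uses `graphlib` from the standard library, which requires Python 3.9 or newer; `step_order` is a hypothetical helper and does not call KFP.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def step_order(import_dataset=True):
    """Return the pipeline steps in an order that respects every .after() call."""
    # Each key maps a step to the set of steps it must run after.
    dependencies = {
        "process-data": set(),
        "preproc-for-ml": {"process-data"},
        "training": {"preproc-for-ml"},
    }
    if import_dataset:
        # The optional import step is wired in front of process-data.
        dependencies["import-data"] = set()
        dependencies["process-data"].add("import-data")
    return list(TopologicalSorter(dependencies).static_order())


print(step_order(import_dataset=True))
print(step_order(import_dataset=False))
```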

The code below runs the pipeline and injects some pipeline parameters. Two versions of the data set are provided:
* `SAMPLE_DATA_SET` - A data set of just over 2 megabytes. Too small to train a useful model, but ideal for development because of the faster feedback.
* `FULL_DATA_SET` - A pre-built data set with all GitHub issues, about 3 gigabytes. Large enough to train a usable model.

Depending on your needs, choose one of the data sets and pass it as the `import-from` pipeline parameter.

In [None]:
# github issues small: 2Mi data set (best for dev/test)
SAMPLE_DATA_SET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-sample.csv'
# github issues full: 3Gi data set (best for training)
FULL_DATA_SET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-full.csv'

run = client.run_pipeline(exp.id, 'Prepare data', 'preproc.tar.gz',
                          params={
                              'import-from': SAMPLE_DATA_SET,
                              'workdir': f"{MOUNT_DIR}/data",
                              'dataset-file': 'dataset.csv',
                              'model-file': 'training1.h5',
                              'learning-rate': 0.001,
                          })

In [None]:
# block till completion
client.wait_for_run_completion(run.id, timeout=720).run.status

# Serving with Seldon
Prepare a container image for serving the model.

In [None]:
%%template Dockerfile.seldon
FROM seldonio/seldon-core-s2i-python3

COPY src/serving.py /microservice/IssueSummarization.py
COPY src/seq2seq.py /microservice/seq2seq.py
COPY src/requirements.txt /requirements.txt

RUN pip3 install --upgrade -r /requirements.txt

ENTRYPOINT ["python","-u","microservice.py","serving","REST","--service-type","MODEL","--persistence","0"]
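
The image above copies `src/serving.py` into place as `IssueSummarization.py`. As a rough idea of what such a module could look like, here is a stand-in class. It assumes the Seldon Python wrapper convention of a model class exposing `predict(X, feature_names)`; the summarization logic is a placeholder (first few words of the body), not the seq2seq model from `src/`, and the `max_words` parameter is hypothetical.

```python
class IssueSummarization:
    """Stand-in model class following the predict(X, feature_names) convention."""

    def __init__(self, max_words=5):
        self.max_words = max_words

    def _summarize(self, body):
        # Placeholder logic: the real model would run seq2seq inference here.
        return " ".join(body.split()[: self.max_words])

    def predict(self, X, feature_names=None):
        # X is a batch of rows; the first column of each row holds the issue body.
        return [[self._summarize(row[0])] for row in X]


model = IssueSummarization(max_words=3)
prediction = model.predict([["Notebook crashes when the kernel restarts"]])
print(prediction)  # [['Notebook crashes when']]
```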

# Tear down

Upon completion, let's tear everything down.

In [None]:
!kubectl delete -f bucket-volume.yaml