# End-to-end experiment: GitHub Issue Summarization

Currently, this notebook must be run from the Kubeflow JupyterHub installation, as described in the codelab.

In this notebook, we will show how to:

* Interactively define a Kubeflow Pipeline using the Pipelines Python SDK
* Submit and run the pipeline
* Add a step to the pipeline

This example pipeline trains a [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor/) model on GitHub issue data, learning to predict issue titles from issue bodies. It then exports the trained model and deploys the exported model to [TensorFlow Serving](https://github.com/tensorflow/serving).
The final step in the pipeline launches a web app which interacts with the TF-Serving instance in order to get model predictions.

## Environment Setup

Before running any experiment, we need to set up and initialize the environment: make sure that all required Python modules are installed and configured.

Set up the Python modules:

In [1]:
!pip3 install --upgrade 'https://storage.googleapis.com/ml-pipeline/release/0.1.10/kfp.tar.gz' > /dev/null
!pip3 install --upgrade './extensions' > /dev/null
%load_ext extensions

import sys
sys.path.insert(0, 'src')

import kfp
import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.notebook

from ipython_secrets import get_secret
from kfp.compiler import Compiler

import extensions
import extensions.kaniko as kaniko
import extensions.pv as pv
import extensions.kaniko.aws as aws

from os import environ

client = kfp.Client()



Initialize the global namespace variables. It is good practice to place all of them in one cell, so that the notebook can be configured in a single place.

To enhance readability, we advise capitalizing the names of such variables.

In [None]:
USER = environ['JUPYTERHUB_USER']
EXPERIMENT_NAME = f'Github issues {USER}'
DOCKER_REGISTRY = get_secret('DOCKER_REGISTRY')
DOCKER_REGISTRY_SECRET = get_secret('DOCKER_REGISTRY_SECRET')
AWS_SECRET = 'jupyter-awscreds'
DOCKER_TAG = 'latest'

AWS_S3_BUCKET = 'files.dev4.demo10.superhub.io'

DATA_FILE = '/home/jovyan/data/data-sample.csv'
try:
    exp = client.get_experiment(experiment_name=EXPERIMENT_NAME)
except Exception:
    # The experiment does not exist yet (or the lookup failed): create it
    exp = client.create_experiment(EXPERIMENT_NAME)


### Prepare dockerfile templates

Dockerfiles can be rendered with the `%%template` or `%templatefile` magics. Both support mustache-style `{{placeholder}}` syntax: each placeholder is replaced with the value of a variable defined in the user namespace or of a system environment variable.

You can use the following flags with the magic function:
* `-v` - to see the content of the rendered file
* `-h` - for more options
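
The substitution rule described above can be sketched in plain Python. This is not the implementation of the `%%template` magic: `render_template` is a hypothetical helper, and both the lookup order (namespace first, then environment) and the choice to leave unknown placeholders untouched are assumptions made for illustration.

```python
import os
import re

# Matches {{name}}, allowing optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_template(text, namespace, env=None):
    """Replace {{name}} with a value from `namespace`, then from `env`.

    Hypothetical helper for illustration only. Unknown placeholders are
    left unchanged (an assumption, not documented behaviour of the magic).
    """
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name in namespace:
            return str(namespace[name])
        if name in env:
            return str(env[name])
        return match.group(0)

    return PLACEHOLDER.sub(substitute, text)


rendered = render_template(
    "FROM {{BASE_IMAGE}}\nLABEL maintainer={{USER}}\n",
    namespace={"BASE_IMAGE": "tensorflow/tensorflow:latest-py3"},
    env={"USER": "jovyan"},
)
print(rendered)
```

Here `BASE_IMAGE` comes from the namespace dictionary and `USER` from the (simulated) environment; both names are made up for the example.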



In [None]:
%%template Dockerfile.keras
FROM tensorflow/tensorflow:latest-py3
COPY src /app
WORKDIR /app
RUN pip3 install --upgrade --no-cache-dir -r requirements.txt
ENTRYPOINT ["python3"]

### Define build pipeline

Next, define the build pipeline. Yes, we use KFP itself to build the images that the final pipeline will use.

We use [Kaniko](https://github.com/GoogleContainerTools/kaniko) and Kubernetes to handle build operations. The build status can be tracked in the KFP pipeline dashboard.

The image build job could even be combined with the primary pipeline, since each step runs in its own Kubernetes pod anyway. For the sake of efficiency, however, we schedule the build process as a separate pipeline.

In [None]:
build_ctx=f"s3://{AWS_S3_BUCKET}/{USER}/{EXPERIMENT_NAME}/dockerbuild.tar.gz"
kaniko.upload_build_context_to_s3(build_ctx)

def kaniko_op(name, destination, dockerfile,
              context=build_ctx, aws_secret=AWS_SECRET, 
              pull_secret=DOCKER_REGISTRY_SECRET):
    """ template function for kaniko build operation
    """
    return dsl.ContainerOp(
        name=name,
        image='gcr.io/kaniko-project/executor:latest',
        arguments=['--destination', destination,
                   '--dockerfile', dockerfile,
                   '--context', context]
    ).apply(
        aws.use_aws_region_envvar()
    ).apply(
        kaniko.use_pull_secret_projection(pull_secret)
    )

@dsl.pipeline(
  name='Pipeline images',
  description='Build images that will be used by the pipeline'
)
def build_images():
    kaniko_op(
        name='keras',
        destination=f"{DOCKER_REGISTRY}/library/keras:{DOCKER_TAG}",
        dockerfile='Dockerfile.keras'
    )
    
Compiler().compile(build_images, 'kaniko.tar.gz')

By default, pipeline steps (`ContainerOp`) run in parallel. If you need a DAG with ordered steps, link the steps with the `after()` method.
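
To make the effect of `after()` concrete, here is a minimal stand-alone sketch of the idea. It does not use KFP: `Step` and `execution_waves` are hypothetical stand-ins, and the step names `build-web` and `train` are invented for the example. Steps without a declared dependency land in the same wave and could run in parallel; a step linked with `after()` is pushed to a later wave.

```python
class Step:
    """Hypothetical stand-in for a pipeline step; not the KFP ContainerOp."""

    def __init__(self, name):
        self.name = name
        self.dependencies = []

    def after(self, other):
        # Record that this step must wait for `other`; return self for chaining.
        self.dependencies.append(other)
        return self


def execution_waves(steps):
    """Group steps into waves; steps within one wave have no mutual ordering."""
    remaining = list(steps)
    done = set()
    waves = []
    while remaining:
        ready = [s for s in remaining
                 if all(d.name in done for d in s.dependencies)]
        if not ready:
            raise ValueError("unsatisfiable dependencies (cycle or missing step)")
        waves.append(sorted(s.name for s in ready))
        done.update(s.name for s in ready)
        remaining = [s for s in remaining if s not in ready]
    return waves


build_keras = Step("build-keras")
build_web = Step("build-web")
train = Step("train").after(build_keras)

waves = execution_waves([train, build_web, build_keras])
print(waves)
```

The two build steps share the first wave, while `train` waits for `build-keras` and ends up in the second.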

The compiler transforms the Python DSL into an [Argo Workflow](https://argoproj.github.io/docs/argo/readme.html) and stores the generated artifacts in `kaniko.tar.gz`, so the pipeline can be executed multiple times, possibly with different parameters.
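
If you are curious about what the compiler produced, the archive can be inspected with the standard library. The helpers below are a hypothetical sketch that makes no assumption about the names of the files inside the package; they simply list and read whatever the archive contains.

```python
import tarfile


def list_package_members(package_path):
    """Return the file names stored in a compiled pipeline archive."""
    with tarfile.open(package_path, "r:gz") as archive:
        return archive.getnames()


def read_package_member(package_path, member_name):
    """Return the text of one file stored in the archive."""
    with tarfile.open(package_path, "r:gz") as archive:
        extracted = archive.extractfile(member_name)
        if extracted is None:
            raise ValueError(f"{member_name} is not a regular file")
        with extracted:
            return extracted.read().decode("utf-8")
```

For example, `list_package_members('kaniko.tar.gz')` shows the generated files, and passing one of the returned names to `read_package_member` prints the workflow definition.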

In [None]:
run = client.run_pipeline(exp.id, 'Build images', 'kaniko.tar.gz')

The build process can take a long time, because images used for data science tasks are often huge. In that case you might want to adjust the `timeout` parameter (in seconds).

In [None]:
# block till completion
client.wait_for_run_completion(run.id, timeout=720).run.status
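
A blocking wait with a timeout boils down to polling a status until it changes or a deadline passes. The sketch below illustrates that pattern only; it is not the KFP client's implementation, and `wait_until` and its parameters are hypothetical names.

```python
import time


def wait_until(is_done, timeout, interval=5.0):
    """Poll `is_done()` until it returns True or `timeout` seconds elapse.

    Hypothetical helper illustrating a blocking wait with a deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not finished after {timeout} seconds")
        time.sleep(interval)
```

A larger `timeout` gives a slow image build more time before the wait gives up; the polling interval only affects how quickly completion is noticed.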

## Data Preparation

In this chapter we will define a pipeline that will do two important steps. It will download a data set in CSV file format (we call this operation **data import**) and 

In [None]:
@dsl.pipeline(
  name='Data preparation',
  description="""Extract, validate, transform and load data into object storage
  so that it is accessible to the actual training.
  """
)
def prepare_data(
    data_set: dsl.PipelineParam,
    sample_size: dsl.PipelineParam=dsl.PipelineParam(name='sample-size', value='200'),
    import_dataset: dsl.PipelineParam=dsl.PipelineParam(name='import-dataset', value=True),
):
    
    with dsl.Condition('{{inputs.parameters.import-dataset}} == true'):
        import_data = dsl.ContainerOp(
            name='import-data',
            image='appropriate/curl',
            arguments=['-#LC', '-o', '/mnt/bucket/data-set.csv', data_set]
        ).apply(
            pv.use_pvc(name='nfs-bucket', mount_to='/mnt/bucket')
        )
        
#     preproc_data = dsl.ContainerOp(
#         name='preproc-data',
#         image=f"{DOCKER_REGISTRY}/library/keras:{DOCKER_TAG}",
#         arguments=['preproc_data.py', 
#                    '--input_csv', '/mnt/bucket/data-set.csv',
#                    '--sample_size', sample_size,
#                    '--output_traindf_csv', '/mnt/bucket/github_issues_medium_train.csv', 
#                    '--output_body_preprocessor_dpkl', '/mnt/bucket/github_issues_medium_test.csv',
#                   ]
#     ).apply(
#         extensions.pv.use_pvc(name='nfs-bucket', mount_to='/mnt/bucket')
#     )
    
#     preproc_data.after(import_data)
    
#     dataprep = dsl.ContainerOp(
#         name='preproc',
#         image=f"{DOCKER_REGISTRY}/library/keras:{DOCKER_TAG}",
#         arguments=['preproc.py', 
#                    '--input_traindf_csv', csv_file,
#                    '--output_body_preprocessor_dpkl', body_dpkl,
                   
#     ).apply(
#         use_aws_region_envvar(S3_REGION)
#     ).apply(
#         use_aws_envvars_from_secret(AWS_SECRET)
#     )

Compiler().compile(prepare_data, 'preproc.tar.gz')

The code below runs the pipeline and injects some pipeline parameters. Two versions of the data set are provided:
* `SAMPLE_DATA_SET` - a data set of just over 2 megabytes. Too small to train a useful model, but ideal for development because of the faster feedback.
* `FULL_DATA_SET` - a pre-created data set with all GitHub issues, 3 gigabytes in size. Large enough to train a usable model.

Depending on your needs, choose one of the data sets and pass it as the `data-set` pipeline parameter.

In [None]:
# Sample of GitHub issues: about 2 MB (best for dev/test)
SAMPLE_DATA_SET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-sample.csv'
# Full data set: 3 GB (best for training)
FULL_DATA_SET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-full.csv'

run = client.run_pipeline(exp.id, 'Prepare data', 'preproc.tar.gz',
                          params={'data-set': SAMPLE_DATA_SET})
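
Switching between the two data sets can be wrapped in a small helper. This is a sketch, not part of the original notebook: `data_prep_params` is a hypothetical function, and the parameter names `data-set` and `sample-size` are taken from the pipeline definition above.

```python
SAMPLE_DATA_SET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-sample.csv'
FULL_DATA_SET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-full.csv'


def data_prep_params(use_full_data_set=False, sample_size=None):
    """Build the `params` dict for a data preparation run.

    Hypothetical helper; parameter names follow the pipeline definition.
    """
    params = {'data-set': FULL_DATA_SET if use_full_data_set else SAMPLE_DATA_SET}
    if sample_size is not None:
        # Kept as a string to match the '200' default in the pipeline definition.
        params['sample-size'] = str(sample_size)
    return params


print(data_prep_params())
print(data_prep_params(use_full_data_set=True, sample_size=500))
```

The result can be passed straight to the call above, for example `client.run_pipeline(exp.id, 'Prepare data', 'preproc.tar.gz', params=data_prep_params(use_full_data_set=True))`.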
