# From Notebook to Kubeflow Pipeline using Flower Classification

In this notebook, we walk through the steps of converting a machine learning model, which you may already have in a Jupyter notebook, into a Kubeflow pipeline. As an example, we use a flower classification use case.

In this example we use:

* **Kubeflow pipelines** - [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/) is a machine learning workflow platform that helps data scientists and ML engineers with the experimentation and productionization of ML workloads. It lets users orchestrate scalable workloads using an SDK directly from a Jupyter notebook.

**Note:** Run this notebook on a notebook server inside the Kubeflow environment.

## Kubeflow pipeline building
We will use the containerized approach provided by Kubeflow so that our model can run on Kubernetes.

### 1. Install the Kubeflow Pipelines SDK

The first step is to install the Kubeflow Pipelines SDK package.

In [1]:
# !pip install --user --upgrade kfp

After the installation, restart the kernel for the changes to take effect.

Check if the install was successful:

In [2]:
# !which dsl-compile

You should see `/usr/local/bin/dsl-compile` above.

### 2. Build Container Components

The following cells define functions that will be transformed into lightweight container components. Compare them with the corresponding Flower Classification notebook to see how each cell maps to the original code.

In [3]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

<table>
  <tr><td>
    <img src="https://www.kubeflow.org/docs/images/pipelines-sdk-lightweight.svg"
         alt="Kubeflow Pipelines SDK lightweight components"  width="600">
  </td></tr>
  <tr><td align="center">
  </td></tr>
</table>

Components are self-contained pieces of code: Python functions.

The function must be completely self-contained: no code (including imports) can be defined outside of the function body, and every imported package must be available in the base image.

Why? Because each component will be packaged as a Docker image. The base image must therefore contain all dependencies. Any dependencies you install manually in the notebook are invisible to the Python function once it is inside the image. The function itself becomes the entrypoint of the image, which is why all auxiliary functions must be defined inside the function. That does cause some unfortunate duplication, but it also means you do not have to worry about the mechanism of packaging.
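
As a minimal illustration of the self-contained rule, the sketch below keeps both its import and its helper inside the function body. The function `count_image_files` and its helper `is_jpeg` are hypothetical names chosen for this example; they are not part of the flower pipeline.

```python
def count_image_files(directory: str) -> int:
    """Count JPEG files under `directory`; everything it needs lives in the body."""
    # The import sits inside the body so it travels with the function.
    from pathlib import Path

    # The helper is nested for the same reason: nothing outside the
    # function body would be available once the function is packaged.
    def is_jpeg(path: Path) -> bool:
        return path.suffix.lower() in {".jpg", ".jpeg"}

    return sum(1 for p in Path(directory).rglob("*") if p.is_file() and is_jpeg(p))
```

Because the function refers to nothing defined at module level, it can be lifted out of the notebook and run on its own, which is the property the component mechanism relies on.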

For this pipeline, we can define three components:

- Download the Flower data set
- Train the TensorFlow model
- Evaluate the trained model

##### Import Kubeflow SDK

In [4]:
from typing import NamedTuple

import kfp
from kfp import dsl, components
from kfp.components import InputBinaryFile, OutputBinaryFile, func_to_container_op, InputPath, OutputPath
import time
from functools import partial
from kfserving import utils



Define a function that converts a Python function into a component and returns a task factory, using `kfp.components.func_to_container_op()` with a fixed base image:

In [5]:
func_to_container_op = partial(
    components.func_to_container_op,
    base_image='zdou001/only_tests:flower-nightly',
)
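
The cell above relies on `functools.partial` to fix the `base_image` argument once. The sketch below shows the same binding pattern with a hypothetical stand-in, `wrap_as_component`, which only records its arguments; it is not the Kubeflow Pipelines API, and the image names are placeholders.

```python
from functools import partial


def wrap_as_component(func, base_image="python:3.7"):
    """Hypothetical stand-in for a component factory; it only records its inputs."""
    return {"function": func.__name__, "base_image": base_image}


# Bind the image once so that every later call uses it by default.
wrap_with_flower_image = partial(wrap_as_component, base_image="example/flower:latest")


def load_task():
    pass


spec = wrap_with_flower_image(load_task)
print(spec)

# A keyword bound by partial can still be overridden at call time.
override = wrap_with_flower_image(load_task, base_image="example/other:latest")
print(override["base_image"])
```

Binding the image this way keeps each `@func_to_container_op` decorator free of repeated arguments while leaving room for a per-component override.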

##### Component 1: Create a standalone Python function - load_task()

In [6]:
@func_to_container_op
def load_task(
    dataset_url: str,
    data_dir: OutputPath(str)
):
    """Download flower data"""
    import os
    from pathlib import Path
    import urllib.request
    import tarfile

    """Download the data from dataset_url"""
    ###### your code start ########


    print(f'data saved to {data_dir}/flower_photos')
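
One possible shape for the download step, using only the standard-library modules that `load_task` already imports, is sketched below. The helper name `download_and_extract` and the archive file name `dataset.tgz` are assumptions made for illustration; the original notebook's solution may differ.

```python
import tarfile
import urllib.request
from pathlib import Path


def download_and_extract(dataset_url: str, data_dir: str) -> Path:
    """Download a tar archive from `dataset_url` and unpack it into `data_dir`."""
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Save the archive next to the extracted files (file name is an assumption).
    archive_path = target / "dataset.tgz"
    urllib.request.urlretrieve(dataset_url, str(archive_path))

    # tarfile detects the compression automatically in the default read mode.
    # Only extract archives from a source you trust.
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=target)

    return target
```

Inside a component, the body of this helper would have to be placed within `load_task` itself, in line with the self-contained rule described earlier.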

##### Component 2: Create a standalone Python function - train_task()
For both training and evaluation, divide the integer-valued pixel values by 255 to scale all values into the [0, 1] (floating-point) range. Because components are self-contained, the normalization function must be copied into both component functions (cf. `normalize_image`).
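
The arithmetic behind the scaling step can be shown with plain Python lists. The helper `scale_pixels` is a hypothetical name for this sketch; inside the components the same division would be applied to image tensors with TensorFlow, for example through a rescaling layer.

```python
def scale_pixels(pixels):
    """Map integer pixel values in [0, 255] to floats in [0, 1]."""
    return [value / 255.0 for value in pixels]


row = [0, 51, 255]
print(scale_pixels(row))  # [0.0, 0.2, 1.0]
```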

In [7]:
@func_to_container_op
def train_task(
    data_dir: InputPath(str),
    batch_size: int,
    epochs: int,
    model_dir: OutputPath(str)):

    from pathlib import Path
    import numpy as np
    import os
    import PIL
    import PIL.Image
    import tensorflow as tf
    import tensorflow_datasets as tfds

    """Load flower data to split to train_ds and val_ds using a Keras Utility"""
    ###### your code start ########
    
    
    

    """Standardize the data"""
    ###### your code start ########
    
    
    
    
    """Configure the dataset for performance"""
    ###### your code start ########
    
    
    
    
    

    """Define the model"""
    ###### your code start ########
    
    
    
    
    
    
    
    
    
    

    Path(model_dir).mkdir(parents=True, exist_ok=True)
    model.save(model_dir)
    print(f'Model exported to: {model_dir}')
    print(os.listdir(model_dir))

##### Component 3: Create a standalone Python function - evaluate_task()
Evaluate the model with the following Python function. The metrics metadata (loss and accuracy) is made available to the Kubeflow Pipelines UI, where it can be visualized automatically with the output viewer(s).

In [8]:
@func_to_container_op
def evaluate_task(
    data_dir: InputPath(str),
    model_dir: InputPath(str),
    batch_size: int,
    metrics_path: OutputPath(str)
) -> NamedTuple("EvaluationOutput", [("mlpipeline_metrics", "Metrics")]):
    """Loads a saved model from file and uses a pre-downloaded dataset for evaluation.
    Model metrics are persisted to `/mlpipeline-metrics.json` for Kubeflow Pipelines
    metadata."""
    import tensorflow as tf
    import tensorflow_hub as hub
    import json
    import os
    from collections import namedtuple

    """Load test flower dataset using a Keras Utility"""
    ###### your code start ########
    
    
    
    

    """Configure the dataset for performance"""
    ###### your code start ########
      


    """Load model and get evaluation metrics and save"""
    ###### your code start ########
    
    
    
    
    

    return out_tuple(json.dumps(metrics_dict))
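
The return statement above expects a `metrics_dict` and an `out_tuple` to exist. A minimal sketch of how they might be built is shown below, assuming the metrics file layout used by the Kubeflow Pipelines v1 UI (a top-level `metrics` list whose entries carry `name`, `numberValue`, and `format`). The wrapper `build_metrics_output` is a hypothetical helper for illustration; in the component, its body would sit directly inside `evaluate_task`.

```python
import json
from collections import namedtuple


def build_metrics_output(loss: float, accuracy: float):
    """Package loss and accuracy as a JSON string in a one-field named tuple."""
    metrics_dict = {
        "metrics": [
            {"name": "loss", "numberValue": float(loss), "format": "RAW"},
            {"name": "accuracy", "numberValue": float(accuracy), "format": "PERCENTAGE"},
        ]
    }
    # The field name matches the output declared in the function signature.
    out_tuple = namedtuple("EvaluationOutput", ["mlpipeline_metrics"])
    return out_tuple(json.dumps(metrics_dict))


result = build_metrics_output(loss=0.5, accuracy=0.9)
print(result.mlpipeline_metrics)
```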

### 3. Combine the Components into a Pipeline

Note that up to this point the Kubeflow Pipelines SDK has only been used to wrap the functions as components; no pipeline has been defined yet.

With the three components (i.e. self-contained functions) defined, wire up the dependencies with Kubeflow Pipelines.

The call `components.func_to_container_op(f, base_image=img)(*args)` has the following ingredients:

- `f` is the Python function that defines a component
- `img` is the base (Docker) image used to package the function
- `*args` lists the arguments to `f`

What the `*args` mean is best explained by going forward through the graph:

- `downloadOp` is the first step and has no dependencies; it therefore has no `InputPath`. It takes `dataset_url` and writes its output (an `OutputPath`) to `data_dir`.
- `trainOp` needs the data downloaded by `downloadOp`, so its signature lists `data_dir` (input) and `model_dir` (output) alongside the `batch_size` and `epochs` parameters. It depends on `downloadOp.output` (i.e., the previous step's output) and stores its own output in `model_dir`, which later steps can use. `downloadOp` is therefore the parent of `trainOp`.
- `evaluateOp`'s function takes `data_dir` (i.e., `downloadOp.output`), `model_dir` (i.e., `trainOp.output`), `batch_size`, and `metrics_path`, which is where the function stores its evaluation metrics. As a result, `evaluateOp` can only run after both `downloadOp` and `trainOp` have completed successfully.
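
The execution order implied by these data dependencies can be checked with a small standard-library sketch. It models only the graph described above using `graphlib` (Python 3.9+); it does not call Kubeflow Pipelines, which derives the same ordering from the outputs passed between tasks.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "downloadOp": set(),
    "trainOp": {"downloadOp"},                 # reads data_dir
    "evaluateOp": {"downloadOp", "trainOp"},   # reads data_dir and model_dir
}

# A topological order lists every step after all of its parents.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['downloadOp', 'trainOp', 'evaluateOp']
```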

##### Build Kubeflow Pipeline

The next step is to assemble the components into a pipeline. Define the pipeline using the `@dsl.pipeline` decorator.

The pipeline function includes a number of parameters that are fed into the components during execution. Kubeflow pipelines are created declaratively. This means that defining the pipeline function does not run any component code; the code runs only after the pipeline is compiled and submitted.

Define the pipeline and the parameters to be fed into it:

In [9]:
@dsl.pipeline(
    ##### fill in the pipeline params ##########
    name=' ',
    description=' ',
)
def flower_classifier_pipeline(
    dataset_url=' ',
    batch_size= ,
    epochs= ,
    namespace=utils.get_default_target_namespace(),
):
    """Orchestrate all the components"""
    ####### fill in the corresponding params in the func_to_container_op #########
    downloadOp = load_task(#params)

    trainOp = train_task(#params)
    trainOp.after(downloadOp)
    # trainOp.container.set_gpu_limit(1)

    evaluateOp = evaluate_task(#params)
    # evaluateOp.after(trainOp)
    # evaluateOp.container.set_gpu_limit(1)


##### Run pipeline

Finally, we feed our pipeline definition into the compiler and run it as an experiment. This produces two links at the bottom that lead to the [Kubeflow Pipelines UI](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/), where you can check logs, artifacts, and inputs/outputs, and follow the progress of your pipeline visually.

Create a client to enable communication with the Pipelines API server.

In [10]:
client = kfp.Client()

[E 220920 04:04:50 _satvolumecredentials:51] Failed to read a token from file '/var/run/secrets/kubeflow/pipelines/token' ([Errno 2] No such file or directory: '/var/run/secrets/kubeflow/pipelines/token').
[W 220920 04:04:50 _client:372] Failed to set up default credentials. Proceeding without credentials...


Compile and run the pipeline:

In [11]:
# Compile the pipeline to a YAML package.
kfp.compiler.Compiler().compile(flower_classifier_pipeline, 'tf_flower_classifier_pipeline.yaml')

# Submit the pipeline as a timestamped run within an experiment.
pipeline_func = flower_classifier_pipeline
experiment_name = 'flower_classifier_pipeline'
run_name = pipeline_func.__name__ + ' run'
run_result = client.create_run_from_pipeline_func(
    pipeline_func,
    experiment_name=experiment_name,
    run_name=run_name + '-' + time.strftime("%Y%m%d-%H%M%S"),
    arguments={},  # empty: the pipeline's default parameter values are used
)

