# Katib: Hyperparameter Tuning on Kubernetes

Katib currently supports the following optimization algorithms:

* Random
* Grid
* Hyperband
* Bayesian optimization
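
To make the difference between the first two options concrete, here is a minimal pure-Python sketch of grid versus random sampling. It is an illustration only and not Katib's implementation; the search space, `grid_trials`, and `random_trials` are hypothetical names made up for this example.

```python
import itertools
import random

# Hypothetical search space, for illustration only.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64],
}


def grid_trials(space):
    """Enumerate every combination of the listed values."""
    names = sorted(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))


def random_trials(space, n_trials, seed=0):
    """Draw n_trials independent samples, one value per parameter."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {name: rng.choice(values) for name, values in space.items()}


grid = list(grid_trials(search_space))
rand = list(random_trials(search_space, n_trials=4))
print(len(grid))  # 6 (3 learning rates x 2 batch sizes)
print(len(rand))  # 4
```

Grid search grows multiplicatively with each added parameter, while random search lets us fix the trial budget up front.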


### Table of Contents
0. Prerequisite
1. Preparation
2. StudyJob
3. Results
4. Cleanup

## 0. Prerequisite
* Docker (only needed if you build your own Docker image) - To install, see the [Docker site](https://docs.docker.com/install/).
* Kubernetes, kubectl, and Kubeflow - See [setup](./setup.ipynb).

Check the Kubernetes cluster with `kubectl`:

In [None]:
!kubectl get nodes
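
If you want to check the node status programmatically, the following sketch parses the text printed by `kubectl get nodes` and lists the nodes that are not `Ready`. The helper `not_ready_nodes` and the sample output are made up for illustration; the sketch assumes the default column layout with `NAME` and `STATUS` headers.

```python
def not_ready_nodes(kubectl_output):
    """Return names of nodes whose STATUS column is not exactly 'Ready'."""
    lines = kubectl_output.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    bad = []
    for line in lines[1:]:
        cols = line.split()
        if cols[status_idx] != "Ready":
            bad.append(cols[name_idx])
    return bad


# Made-up sample output, not taken from a real cluster.
sample = """\
NAME     STATUS     ROLES   AGE   VERSION
node-a   Ready      agent   3d    v1.12.7
node-b   NotReady   agent   3d    v1.12.7
"""

print(not_ready_nodes(sample))  # ['node-b']
```

In a notebook, the text can be captured with `out = !kubectl get nodes` and passed in as `"\n".join(out)`.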

## 1. Preparation

#### Training script
First, prepare the Python training script we will use for hyperparameter tuning:

[tf_mnist.py](./src/tf_mnist.py)

#### Docker image
Prepare a Docker image for training the MNIST model.

To build a new Docker image:
* Prepare a Dockerfile containing:
    ```
    FROM tensorflow/tensorflow:1.12.0-gpu-py3
    ENV PYTHONPATH /app
    COPY ./src /app/src/
    ```
* Build and push the Docker image:
    ```
    sudo docker build -t <DOCKER-USERNAME>/mlads2019-tf-mnist:gpu -f <DOCKER-FILENAME> .
    sudo docker push <DOCKER-USERNAME>/mlads2019-tf-mnist:gpu
    ```

#### Worker template
Our StudyJobs create workers from a worker template. Create a ConfigMap object from `gpuWorkerConfigMap.yaml`, which contains `gpuWorkerTemplate.yaml`.

`gpuWorkerTemplate.yaml` looks like:
```
image: <DOCKER-USERNAME>/mlads2019-tf-mnist:gpu
command:
    - "python"
    - "/app/kube_mnist.py"
    {{- with .HyperParameters}}
    ...
    resources:
      limits:
        nvidia.com/gpu: 1
```
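
On the receiving side, the training script has to read the hyperparameters that the template appends to the command. Below is a minimal sketch of how that could look with `argparse`, assuming each hyperparameter arrives as a `--name=value` argument. The flag names `--learning-rate` and `--batch-size` and the helper `parse_hyperparameters` are hypothetical; the actual script in this repository may use different names.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters passed on the command line by the worker."""
    parser = argparse.ArgumentParser(description="MNIST training (sketch)")
    # Hypothetical flags; adjust to whatever the StudyJob actually defines.
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--batch-size", type=int, default=64)
    # parse_known_args ignores arguments this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(["--learning-rate=0.01", "--batch-size=128"])
print(args.learning_rate, args.batch_size)  # 0.01 128
```

Using `parse_known_args` keeps the script from failing when the template passes extra arguments that the sketch does not know about.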

To deploy the template, run:

`kubectl apply -f ./kubeflow/workerConfigMap.yaml`

To delete an existing template, run:

`kubectl delete configmap worker-template`

## 2. StudyJob

First, set the StudyJob name:

In [None]:
STUDYNAME = None  # Set unique name here
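
One possible way to get a unique name is to append a timestamp to a short prefix. The sketch below does that and checks the result against the conservative Kubernetes DNS-label rule (lowercase alphanumerics and `-`, at most 63 characters). `make_study_name` is a hypothetical helper, and whether StudyJob enforces exactly this rule is an assumption.

```python
import re
import time

# Conservative Kubernetes naming rule (DNS label).
_DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def make_study_name(prefix, suffix=None):
    """Build '<prefix>-<suffix>' and validate it as a DNS label."""
    if suffix is None:
        # Timestamp suffix keeps repeated runs from colliding.
        suffix = time.strftime("%Y%m%d-%H%M%S")
    name = f"{prefix}-{suffix}".lower()
    if len(name) > 63 or not _DNS_LABEL.match(name):
        raise ValueError(f"invalid Kubernetes name: {name!r}")
    return name


print(make_study_name("mnist", "001"))  # mnist-001
```

With this helper, the cell above could be written as `STUDYNAME = make_study_name("mnist")`.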

We use StudyJob YAML files to create hyperparameter tuning jobs.
The helper functions imported below generate the StudyJob YAML files and query the results.

In [None]:
%load_ext autoreload
%autoreload 2

from src.kubeflow.utils import (
    generate_hyperparameter_tuning_yaml,
    generate_model_testing_yaml,
    get_study_metrics,
    get_study_result,
    get_best_model_id,
)

### 2.1 Random sampling hyperparameter search

In [None]:
# Here, we run 2 trials at a time
RANDOM_STUDYNAME, RANDOM_STUDYJOB = generate_hyperparameter_tuning_yaml(STUDYNAME, 'random', 2)

In [None]:
# Delete existing StudyJob
!kubectl delete studyjob {RANDOM_STUDYNAME}

# Create StudyJob
!kubectl create -f {RANDOM_STUDYJOB}

Check our StudyJob:

In [None]:
!kubectl describe studyjob {RANDOM_STUDYNAME}

To see the list of StudyJobs, run:

`!kubectl get studyjob`

To check the status of each TFJob and pod in the StudyJob, run:

`!kubectl describe tfjob <tfjob-id>`

`!kubectl logs <pod-id>`

### 2.2 Bayesian sampling hyperparameter search

In [None]:
# Here, we run 2 trials at a time
BAYESIAN_STUDYNAME, BAYESIAN_STUDYJOB = generate_hyperparameter_tuning_yaml(STUDYNAME, 'bayesian', 2)

In [None]:
# Delete existing StudyJob
!kubectl delete studyjob {BAYESIAN_STUDYNAME}

# Create StudyJob
!kubectl create -f {BAYESIAN_STUDYJOB}

In [None]:
!kubectl describe studyjob {BAYESIAN_STUDYNAME}

## 3. Results

If you run `kubectl` on your local machine, you can browse the Katib dashboard by port-forwarding with

`kubectl port-forward svc/katib-ui 8080:80` and then opening `localhost:8080`.
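
If the page does not load, it can help to confirm that something is actually listening on the forwarded port. The following is a small sketch using only the standard library; `is_port_open` is a hypothetical helper, and it only tests TCP reachability, not whether the service behind the port is Katib.

```python
import socket


def is_port_open(host="localhost", port=8080, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


print("Katib UI reachable:", is_port_open("localhost", 8080))
```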

StudyJob view | Trial view
---|---
<img src="media/katib_01.jpg"/> | <img src="media/katib_02.jpg"/>

In [None]:
study_result = get_study_result(
    RANDOM_STUDYNAME,
    result_dir="results",
    verbose=False,
)

Katib stores the results in `vizier-db`. You can access them through the REST API on port `6790`:
```
kubectl port-forward svc/vizier-core-rest 6790:80
```

Here, we use our helper functions instead.

In [None]:
# Get the best model id
model_id = get_best_model_id(study_result)
model_id
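
Conceptually, picking the best model means choosing the trial with the best objective value. The sketch below shows that idea on a made-up list of trials. The data layout, the `accuracy` key, and `best_trial` are hypothetical and do not reflect what `get_study_result` actually returns or how `get_best_model_id` is implemented.

```python
# Made-up trial records, for illustration only.
trials = [
    {"model_id": "a1", "accuracy": 0.91},
    {"model_id": "b2", "accuracy": 0.97},
    {"model_id": "c3", "accuracy": 0.94},
]


def best_trial(trials, metric="accuracy", maximize=True):
    """Return the trial with the highest (or lowest) metric value."""
    pick = max if maximize else min
    return pick(trials, key=lambda t: t[metric])


print(best_trial(trials)["model_id"])  # b2
```

For a loss-style metric, where lower is better, pass `maximize=False`.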

We use a TFJob to test our model. The job loads the saved model and the test dataset, then runs predictions on the test samples.

In [None]:
TEST_NAME, TEST_TFJOB = generate_model_testing_yaml(
    RANDOM_STUDYNAME,
    study_id=study_result['Status']['Studyid'],
    model_id=model_id
)

In [None]:
# Delete existing TFJob
!kubectl delete tfjob {TEST_NAME}

# Create model testing TFJob
!kubectl create -f {TEST_TFJOB}

To check our TFJob:

In [None]:
!kubectl describe tfjob {TEST_NAME}

In [None]:
!kubectl logs <Your-pod-name>

To see the list of TFJobs, run:

`!kubectl get tfjob`

------
## Further reading

* [Katib example](https://github.com/kubeflow/katib/tree/master/examples/v1alpha1)

