feat: Add doc for implementing new algorithms (kubeflow#769)
* feat: Add doc

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* feat: Update

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* feat: Update

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* fix: Update

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* fix: Address comments

Signed-off-by: Ce Gao <gaoce@caicloud.io>
gaocegege authored and k8s-ci-robot committed Sep 30, 2019
1 parent e0659b4 commit fb865e7
Showing 3 changed files with 247 additions and 137 deletions.
35 changes: 6 additions & 29 deletions Makefile
```diff
@@ -33,13 +33,13 @@ vet: depend generate
 update:
 	hack/update-gofmt.sh
 
-# Deploy katib v1alpha2 manifests into a k8s cluster
+# Deploy katib v1alpha3 manifests into a k8s cluster
 deploy:
-	bash scripts/v1alpha2/deploy.sh
+	bash scripts/v1alpha3/deploy.sh
 
-# Undeploy katib v1alpha2 manifests into a k8s cluster
+# Undeploy katib v1alpha3 manifests into a k8s cluster
 undeploy:
-	bash scripts/v1alpha2/undeploy.sh
+	bash scripts/v1alpha3/undeploy.sh
 
 # Generate code
 generate:
@@ -48,28 +48,5 @@ ifndef GOPATH
 endif
 	go generate ./pkg/... ./cmd/...
 
-############################################################
-# Build docker image section for v1alpha2
-############################################################
-images: katib-controller katib-manager katib-manager-rest metrics-collector katib-ui tfevent-metrics-collector suggestion-random
-
-katib-controller: depend generate
-	docker build -t ${PREFIX}/v1alpha2/katib-controller -f ${CMD_PREFIX}/katib-controller/v1alpha2/Dockerfile .
-
-katib-manager: depend generate
-	docker build -t ${PREFIX}/v1alpha2/katib-manager -f ${CMD_PREFIX}/manager/v1alpha2/Dockerfile .
-
-katib-manager-rest: depend generate
-	docker build -t ${PREFIX}/v1alpha2/katib-manager-rest -f ${CMD_PREFIX}/manager-rest/v1alpha2/Dockerfile .
-
-metrics-collector: depend generate
-	docker build -t ${PREFIX}/v1alpha2/metrics-collector -f ${CMD_PREFIX}/metricscollector/v1alpha2/Dockerfile .
-
-katib-ui: depend generate
-	docker build -t ${PREFIX}/v1alpha2/katib-ui -f ${CMD_PREFIX}/ui/v1alpha2/Dockerfile .
-
-tfevent-metrics-collector: depend generate
-	docker build -t ${PREFIX}/v1alpha2/tfevent-metrics-collector -f ${CMD_PREFIX}/tfevent-metricscollector/v1alpha2/Dockerfile .
-
-suggestion-random: depend generate
-	docker build -t ${PREFIX}/v1alpha2/suggestion-random -f ${CMD_PREFIX}/suggestion/random/v1alpha2/Dockerfile .
+build: depend generate
+	bash scripts/v1alpha3/build.sh
```
258 changes: 241 additions & 17 deletions docs/developer-guide.md

Check source code as follows:

```bash
make check
```

You can build all images from source as follows:

```bash
make build
```

You can deploy katib v1alpha3 manifests into a k8s cluster as follows:

```bash
make deploy
```

You can undeploy katib v1alpha3 manifests from a k8s cluster as follows:

```bash
make undeploy
```

## Implement a new algorithm and use it in katib

### Implement the algorithm

The design of katib follows the [`ask-and-tell` pattern](https://scikit-optimize.github.io/notebooks/ask-and-tell.html):

> They often follow a pattern a bit like this:
> 1. ask for a new set of parameters
> 1. walk to the experiment and program in the new parameters
> 1. observe the outcome of running the experiment
> 1. walk back to your laptop and tell the optimizer about the outcome
> 1. go to step 1

When an experiment is created, one algorithm service is created for it. Katib then asks the service for new sets of parameters via the `GetSuggestions` GRPC call, creates new trials according to those sets, and observes their outcomes. When the trials finish, katib tells the algorithm the metrics of the finished trials and asks for further sets.
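This loop can be sketched in plain Python. The optimizer and trial runner below are hypothetical stand-ins, not katib code; they only illustrate the ask-and-tell flow that `GetSuggestions` participates in:

```python
import random


class RandomOptimizer:
    """Hypothetical stand-in for an algorithm service."""

    def ask(self, batch):
        # Propose `batch` new parameter sets (here: one float in [0, 10]).
        return [{"x": random.uniform(0, 10)} for _ in range(batch)]

    def tell(self, assignment, metric):
        # A real algorithm would update its internal model here.
        pass


def run_experiment(optimizer, run_trial, max_trials, batch=2):
    best = None
    for _ in range(0, max_trials, batch):
        for assignment in optimizer.ask(batch):   # "GetSuggestions"
            metric = run_trial(assignment)        # run the trial
            optimizer.tell(assignment, metric)    # report the outcome
            if best is None or metric > best[1]:  # MAXIMIZE goal
                best = (assignment, metric)
    return best


# Toy objective: peaks at x = 3, so the metric is always <= 0.
best = run_experiment(RandomOptimizer(),
                      run_trial=lambda a: -(a["x"] - 3) ** 2,
                      max_trials=20)
print(best)
```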

The new algorithm needs to implement the `Suggestion` service defined in [api.proto](../pkg/apis/manager/v1alpha3/api.proto). A sample algorithm looks like this:

```python
from pkg.apis.manager.v1alpha3.python import api_pb2
from pkg.apis.manager.v1alpha3.python import api_pb2_grpc
from pkg.suggestion.v1alpha3.internal.search_space import HyperParameter, HyperParameterSearchSpace
from pkg.suggestion.v1alpha3.internal.trial import Trial, Assignment
from pkg.suggestion.v1alpha3.hyperopt.base_hyperopt_service import BaseHyperoptService
from pkg.suggestion.v1alpha3.base_health_service import HealthServicer


# Inherit SuggestionServicer and implement GetSuggestions.
class HyperoptService(api_pb2_grpc.SuggestionServicer, HealthServicer):

    def ValidateAlgorithmSettings(self, request, context):
        # Optional: validate the algorithm settings defined by users.
        pass

    def GetSuggestions(self, request, context):
        # Convert the experiment in the GRPC request to the search space.
        # search_space example:
        #   HyperParameterSearchSpace(
        #       goal: MAXIMIZE,
        #       params: [HyperParameter(name: param-1, type: INTEGER, min: 1, max: 5, step: 0),
        #                HyperParameter(name: param-2, type: CATEGORICAL, list: cat1, cat2, cat3),
        #                HyperParameter(name: param-3, type: DISCRETE, list: 3, 2, 6),
        #                HyperParameter(name: param-4, type: DOUBLE, min: 1, max: 5, step: )]
        #   )
        search_space = HyperParameterSearchSpace.convert(request.experiment)
        # Convert the trials in the GRPC request to trials on the algorithm side.
        # trials example:
        #   [Trial(
        #       assignment: [Assignment(name=param-1, value=2),
        #                    Assignment(name=param-2, value=cat1),
        #                    Assignment(name=param-3, value=2),
        #                    Assignment(name=param-4, value=3.44)],
        #       target_metric: Metric(name="metric-2", value="5643"),
        #       additional_metrics: [Metric(name=metric-1, value=435),
        #                            Metric(name=metric-3, value=5643)]),
        #    Trial(
        #       assignment: [Assignment(name=param-1, value=3),
        #                    Assignment(name=param-2, value=cat2),
        #                    Assignment(name=param-3, value=6),
        #                    Assignment(name=param-4, value=4.44)],
        #       target_metric: Metric(name="metric-2", value="3242"),
        #       additional_metrics: [Metric(name=metric-1, value=123),
        #                            Metric(name=metric-3, value=543)])]
        trials = Trial.convert(request.trials)
        # --------------------------------------------------------------
        # Your code here
        # Implement the logic to generate new assignments for the given
        # request number. For example, if request.request_number is 2,
        # you should return:
        #   [
        #       [Assignment(name=param-1, value=3),
        #        Assignment(name=param-2, value=cat2),
        #        Assignment(name=param-3, value=3),
        #        Assignment(name=param-4, value=3.22)],
        #       [Assignment(name=param-1, value=4),
        #        Assignment(name=param-2, value=cat4),
        #        Assignment(name=param-3, value=2),
        #        Assignment(name=param-4, value=4.32)],
        #   ]
        list_of_assignments = your_logic(search_space, trials, request.request_number)
        # --------------------------------------------------------------
        # Convert list_of_assignments to a GetSuggestionsReply.
        return api_pb2.GetSuggestionsReply(
            trials=Assignment.generate(list_of_assignments)
        )
```
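For illustration, `your_logic` could be as simple as uniform random sampling over the search space. Here is a self-contained sketch; the lightweight `HyperParameter` and `Assignment` classes below are stand-ins for katib's internal types (modeled on the comments above), not the real implementations:

```python
import random


# Stand-ins for katib's internal search-space types (assumptions, not katib code).
class HyperParameter:
    def __init__(self, name, type, min=None, max=None, list=None):
        self.name, self.type = name, type
        self.min, self.max, self.list = min, max, list


class Assignment:
    def __init__(self, name, value):
        self.name, self.value = name, value


def your_logic(params, trials, request_number):
    """Return `request_number` lists of randomly sampled assignments."""
    list_of_assignments = []
    for _ in range(request_number):
        assignments = []
        for p in params:
            if p.type == "INTEGER":
                value = random.randint(p.min, p.max)
            elif p.type == "DOUBLE":
                value = random.uniform(p.min, p.max)
            else:  # CATEGORICAL or DISCRETE: pick one item from the list
                value = random.choice(p.list)
            assignments.append(Assignment(p.name, value))
        list_of_assignments.append(assignments)
    return list_of_assignments


params = [HyperParameter("param-1", "INTEGER", min=1, max=5),
          HyperParameter("param-2", "CATEGORICAL", list=["cat1", "cat2", "cat3"])]
suggestions = your_logic(params, trials=[], request_number=2)
```

A real algorithm would of course use the observed trials to bias the next suggestions instead of ignoring them.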

### Make the algorithm a GRPC server

Create a package under [cmd/suggestion](../cmd/suggestion). Then create the main function and Dockerfile. The new GRPC server should serve on port 6789.

Here is an example: [cmd/suggestion/hyperopt](../cmd/suggestion/hyperopt). Then build the Docker image.
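A minimal sketch of such a main function follows. The registration helper passed in would be the generated `api_pb2_grpc.add_SuggestionServicer_to_server` (that name follows the standard grpcio codegen convention and is an assumption here, not taken from the katib sources):

```python
from concurrent import futures

import grpc


def serve(servicer, add_servicer_to_server, port=6789):
    """Serve a Suggestion servicer on the given port (katib expects 6789).

    `add_servicer_to_server` is the generated registration helper, e.g.
    api_pb2_grpc.add_SuggestionServicer_to_server (an assumption based on
    the standard grpcio codegen naming).
    """
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    add_servicer_to_server(servicer, server)
    server.add_insecure_port("0.0.0.0:{}".format(port))
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    # e.g. serve(HyperoptService(), api_pb2_grpc.add_SuggestionServicer_to_server)
    pass
```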

### Use the algorithm in katib

Update the [katib-config](../manifests/v1alpha3/katib-controller/katib-config.yaml) by adding a new entry:

```yaml
suggestion: |-
  {
    "tpe": {
      "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt"
    },
    "random": {
      "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt"
    },
    "<new-algorithm-name>": {
      "image": "image built in the previous stage"
    }
  }
```
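Once the entry is in place, an experiment can select the algorithm by name. A sketch of the relevant part of an Experiment spec, assuming the v1alpha3 API field layout (the experiment name here is hypothetical):

```yaml
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: new-algorithm-example
spec:
  algorithm:
    algorithmName: <new-algorithm-name>
```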

### Contribute the algorithm to katib

If you want to contribute the algorithm to katib, you can add unit tests and/or e2e tests for it in the CI and submit a PR.

#### Unit Test

Here is an example [test_hyperopt_service.py](../test/suggestion/v1alpha3/test_hyperopt_service.py):

```python
import unittest

import grpc
import grpc_testing

from pkg.apis.manager.v1alpha3.python import api_pb2_grpc
from pkg.apis.manager.v1alpha3.python import api_pb2

from pkg.suggestion.v1alpha3.hyperopt_service import HyperoptService


class TestHyperopt(unittest.TestCase):
    def setUp(self):
        servicers = {
            api_pb2.DESCRIPTOR.services_by_name['Suggestion']: HyperoptService()
        }
        self.test_server = grpc_testing.server_from_dictionary(
            servicers, grpc_testing.strict_real_time())


if __name__ == '__main__':
    unittest.main()
```

You can set up the GRPC server using `grpc_testing`, then define your own test cases.

#### E2E Test (Optional)

E2e tests help katib verify that the algorithm works well. To add an e2e test for the new algorithm, you need to:

Create a new script `run-suggestion-xxx.sh` in [test/scripts/v1alpha3](../test/scripts/v1alpha3). Here is an example [test/scripts/v1alpha3/build-suggestion-hyperopt.sh](../test/scripts/v1alpha3/build-suggestion-hyperopt.sh) (Replace `<name>` with the new algorithm name):

```bash
#!/bin/bash

# Copyright 2018 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This shell script is used to build a cluster and create a namespace from our
# argo workflow

set -o errexit
set -o nounset
set -o pipefail

CLUSTER_NAME="${CLUSTER_NAME}"
ZONE="${GCP_ZONE}"
PROJECT="${GCP_PROJECT}"
NAMESPACE="${DEPLOY_NAMESPACE}"
GO_DIR=${GOPATH}/src/github.com/${REPO_OWNER}/${REPO_NAME}

echo "Activating service-account"
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}

echo "Configuring kubectl"

echo "CLUSTER_NAME: ${CLUSTER_NAME}"
echo "ZONE: ${GCP_ZONE}"
echo "PROJECT: ${GCP_PROJECT}"

gcloud --project ${PROJECT} container clusters get-credentials ${CLUSTER_NAME} \
  --zone ${ZONE}
kubectl config set-context $(kubectl config current-context) --namespace=default
USER=`gcloud config get-value account`

echo "All Katib components are running."
kubectl version
kubectl cluster-info
echo "Katib deployments"
kubectl -n kubeflow get deploy
echo "Katib services"
kubectl -n kubeflow get svc
echo "Katib pods"
kubectl -n kubeflow get pod

mkdir -p ${GO_DIR}
cp -r . ${GO_DIR}/
cp -r pkg/apis/manager/v1alpha3/python/* ${GO_DIR}/test/e2e/v1alpha3
cd ${GO_DIR}/test/e2e/v1alpha3

echo "Running e2e <name> experiment"
export KUBECONFIG=$HOME/.kube/config
go run run-e2e-experiment.go ../../../examples/v1alpha3/<name>-example.yaml
kubectl -n kubeflow describe suggestion
kubectl -n kubeflow delete experiment <name>-example
exit 0
```

Then add a new step in our CI to run the new e2e test case in [test/workflows/components/workflows-v1alpha3.libsonnet](../test/workflows/components/workflows-v1alpha3.libsonnet) (Replace `<name>` with the new algorithm name):

```diff
// ...
{
name: "run-nasrl-e2e-tests",
template: "run-nasrl-e2e-tests",
},
{
name: "run-hyperband-e2e-tests",
template: "run-hyperband-e2e-tests",
},
{
name: "run-tpe-e2e-tests",
template: "run-tpe-e2e-tests",
},
+ {
+ name: "run-<name>-e2e-tests",
+ template: "run-<name>-e2e-tests",
+ },
// ...
$.parts(namespace, name, overrides).e2e(prow_env, bucket).buildTemplate("run-tpe-e2e-tests", testWorkerImage, [
"test/scripts/v1alpha3/run-suggestion-tpe.sh",
]), // run tpe algorithm
$.parts(namespace, name, overrides).e2e(prow_env, bucket).buildTemplate("run-hyperband-e2e-tests", testWorkerImage, [
"test/scripts/v1alpha3/run-suggestion-hyperband.sh",
]), // run hyperband algorithm
+ $.parts(namespace, name, overrides).e2e(prow_env, bucket).buildTemplate("run-<name>-e2e-tests", testWorkerImage, [
+ "test/scripts/v1alpha3/run-suggestion-<name>.sh",
+ ]), // run <name> algorithm
```