# Test Katib Integration

This notebook is loosely based on the upstream [CMA-ES and resume policies notebook](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/sdk/cmaes-and-resume-policies.ipynb). It covers the following steps:

- create Katib Experiment
- monitor its execution
- get optimal HyperParameters
- get Trials
- get Suggestion
- delete Experiment

## Setup

In [1]:
# Please check the requirements.in file for more details
!pip install -r requirements.txt

### Import required packages

In [2]:
from kubeflow.katib import (
    KatibClient,
    V1beta1AlgorithmSpec,
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1FeasibleSpace,
    V1beta1ObjectiveSpec,
    V1beta1ParameterSpec,
    V1beta1TrialTemplate,
    V1beta1TrialParameterSpec,
)
from kubernetes.client import V1ObjectMeta

from tenacity import retry, stop_after_attempt, wait_exponential

### Initialise Katib Client

Every action in this example goes through the Katib SDK client.

In [3]:
client = KatibClient()

## Define a Katib Experiment

Define a Katib Experiment object before deploying it. This Experiment is similar to the upstream [cma-es.yaml](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/hp-tuning/cma-es.yaml) example.

In [4]:
EXPERIMENT_NAME = "cmaes-example"

In [5]:
metadata = V1ObjectMeta(
    name=EXPERIMENT_NAME,
)

algorithm_spec = V1beta1AlgorithmSpec(
    algorithm_name="cmaes"
)

objective_spec = V1beta1ObjectiveSpec(
    type="minimize",
    goal=0.001,
    objective_metric_name="loss",
    additional_metric_names=["Train-accuracy"]
)

# Experiment search space
# in this example we tune the learning rate and the momentum
parameters = [
    V1beta1ParameterSpec(
        name="lr",
        parameter_type="double",
        feasible_space=V1beta1FeasibleSpace(
            min="0.01",
            max="0.06"
        ),
    ),
    V1beta1ParameterSpec(
        name="momentum",
        parameter_type="double",
        feasible_space=V1beta1FeasibleSpace(
            min="0.5",
            max="0.9"
        ),
    ),
]

# JSON template specification for the Trial's Worker Kubernetes Job
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "sidecar.istio.io/inject": "false"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "training-container",
                        "image": "docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0",
                        "command": [
                            "python3",
                            "/opt/pytorch-mnist/mnist.py",
                            "--epochs=1",
                            "--batch-size=64",
                            "--lr=${trialParameters.learningRate}",
                            "--momentum=${trialParameters.momentum}",
                        ]
                    }
                ],
                "restartPolicy": "Never"
            }
        }
    }
}

trial_template = V1beta1TrialTemplate(
    primary_container_name="training-container",
    trial_parameters=[
        V1beta1TrialParameterSpec(
            name="learningRate",
            description="Learning rate for the training model",
            reference="lr"
        ),
        V1beta1TrialParameterSpec(
            name="momentum",
            description="Momentum for the training model",
            reference="momentum"
        ),
    ],
    trial_spec=trial_spec
)

experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=metadata,
    spec=V1beta1ExperimentSpec(
        max_trial_count=3,
        parallel_trial_count=2,
        max_failed_trial_count=1,
        algorithm=algorithm_spec,
        objective=objective_spec,
        parameters=parameters,
        trial_template=trial_template,
    )
)
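To make the link between `trial_parameters` and the `${trialParameters.<name>}` placeholders in the Job command concrete, here is a minimal sketch of the kind of substitution that happens when a Trial is created. It is plain Python with no Katib dependency. The helper `render_command` and the sample assignment values are hypothetical and only illustrate the mapping; this is not Katib's actual implementation.

```python
import re

# Placeholder name -> search-space parameter name, mirroring trial_parameters above.
TRIAL_PARAMETER_REFERENCES = {"learningRate": "lr", "momentum": "momentum"}

PLACEHOLDER = re.compile(r"\$\{trialParameters\.([A-Za-z0-9_]+)\}")


def render_command(command, references, assignments):
    """Replace each ${trialParameters.<name>} with the value assigned to its reference."""

    def substitute(match):
        placeholder_name = match.group(1)
        parameter_name = references[placeholder_name]  # KeyError if undeclared
        return str(assignments[parameter_name])

    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


command = [
    "python3",
    "/opt/pytorch-mnist/mnist.py",
    "--epochs=1",
    "--batch-size=64",
    "--lr=${trialParameters.learningRate}",
    "--momentum=${trialParameters.momentum}",
]

# Hypothetical values that could be assigned to one Trial.
sample_assignments = {"lr": "0.03", "momentum": "0.7"}

rendered = render_command(command, TRIAL_PARAMETER_REFERENCES, sample_assignments)
print(rendered[-2:])  # ['--lr=0.03', '--momentum=0.7']
```

Note that the placeholder uses the trial parameter's `name` (`learningRate`), while the value comes from the search-space parameter named in its `reference` (`lr`).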

Print the Experiment's info to verify it before submission.

In [6]:
print("Name:", experiment.metadata.name)
print("Algorithm:", experiment.spec.algorithm.algorithm_name)
print("Objective:", experiment.spec.objective.objective_metric_name)
print("Trial Parameters:")
for param in experiment.spec.trial_template.trial_parameters:
    print(f"- {param.name}: {param.description}")
print("Max Trial Count:", experiment.spec.max_trial_count)
print("Max Failed Trial Count:", experiment.spec.max_failed_trial_count)
print("Parallel Trial Count:", experiment.spec.parallel_trial_count)

Name: cmaes-example
Algorithm: cmaes
Objective: loss
Trial Parameters:
- learningRate: Learning rate for the training model
- momentum: Momentum for the training model
Max Trial Count: 3
Max Failed Trial Count: 1
Parallel Trial Count: 2
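
The printout shows names and counts but does not check that they are consistent with each other. A minimal sketch of such a pre-submission check follows. It works on plain dictionaries that mirror the values defined above, the helper `check_experiment_config` is hypothetical (not part of the Katib SDK), and the three rules are sanity checks chosen for illustration rather than a statement of what Katib itself validates.

```python
def check_experiment_config(parameters, trial_parameters, max_trials, parallel_trials):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    # Feasible-space bounds are strings in the spec, so convert before comparing.
    for name, space in parameters.items():
        if float(space["min"]) >= float(space["max"]):
            problems.append(f"parameter {name!r}: min must be smaller than max")

    # Every trial parameter should reference a parameter from the search space.
    for placeholder, reference in trial_parameters.items():
        if reference not in parameters:
            problems.append(
                f"trial parameter {placeholder!r} references unknown {reference!r}"
            )

    if parallel_trials > max_trials:
        problems.append("parallel_trial_count exceeds max_trial_count")

    return problems


parameters = {
    "lr": {"min": "0.01", "max": "0.06"},
    "momentum": {"min": "0.5", "max": "0.9"},
}
trial_parameters = {"learningRate": "lr", "momentum": "momentum"}

problems = check_experiment_config(
    parameters, trial_parameters, max_trials=3, parallel_trials=2
)
print(problems)  # []
```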


## List existing Katib Experiments

List Katib Experiments in the current namespace.

In [7]:
[exp.metadata.name for exp in client.list_experiments()]

[]

## Create Katib Experiment

Create a Katib Experiment using the SDK.

In [8]:
client.create_experiment(experiment)

Experiment user/cmaes-example has been created


## Get Katib Experiment

Get the created Katib Experiment by name and check its data.  
Make sure that it completes successfully before proceeding. 

In [9]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(90),
    reraise=True,
)
def assert_experiment_succeeded(client, experiment_name):
    """Wait for the Katib Experiment to complete successfully."""
    assert client.is_experiment_succeeded(name=experiment_name), f"Katib Experiment {experiment_name} was not successful."
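
For a sense of how long this decorator keeps polling, the sketch below adds up the waits between attempts. It assumes the delay after attempt `n` is `multiplier * 2 ** (n - 1)` clamped to `[min, max]`, which matches recent tenacity releases; older releases may differ slightly. Treat the total as an estimate that excludes the time spent in the API calls themselves. The function names are hypothetical.

```python
def exponential_wait(attempt_number, multiplier=2, min_wait=1, max_wait=10):
    """Delay in seconds after the given 1-based attempt, clamped to [min_wait, max_wait]."""
    return max(min_wait, min(multiplier * 2 ** (attempt_number - 1), max_wait))


def total_wait(attempts, **kwargs):
    """Sum of the delays between `attempts` tries (no delay follows the final one)."""
    return sum(exponential_wait(n, **kwargs) for n in range(1, attempts))


first_waits = [exponential_wait(n) for n in range(1, 6)]
print(first_waits)     # [2, 4, 8, 10, 10]
print(total_wait(90))  # 874
print(total_wait(30))  # 274
```

Under that assumption, 90 attempts give up after roughly 15 minutes of waiting, and a 30-attempt variant with the same wait settings after about 4.5 minutes.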

In [10]:
# verify that the Experiment was created successfully
# raises an error if it doesn't exist
client.get_experiment(name=EXPERIMENT_NAME)

# wait for the Experiment to complete successfully
assert_experiment_succeeded(client, EXPERIMENT_NAME)

In [11]:
exp = client.get_experiment(name=EXPERIMENT_NAME)
print("Experiment:", exp.metadata.name, end="\n\n")
print("Experiment Spec:", exp.spec, sep="\n", end="\n\n")
print("Experiment Status:", exp.status, sep="\n", end="\n\n")

Experiment: cmaes-example

Experiment Spec:
{'algorithm': {'algorithm_name': 'cmaes', 'algorithm_settings': None},
 'early_stopping': None,
 'max_failed_trial_count': 1,
 'max_trial_count': 3,
 'metrics_collector_spec': {'collector': {'custom_collector': None,
                                          'kind': 'StdOut'},
                            'source': None},
 'nas_config': None,
 'objective': {'additional_metric_names': ['Train-accuracy'],
               'goal': 0.001,
               'metric_strategies': [{'name': 'loss', 'value': 'min'},
                                     {'name': 'Train-accuracy',
                                      'value': 'min'}],
               'objective_metric_name': 'loss',
               'type': 'minimize'},
 'parallel_trial_count': 2,
 'parameters': [{'feasible_space': {'list': None,
                                    'max': '0.06',
                                    'min': '0.01',
                                    'step': None},
              

### Get Experiment conditions

Check the current Experiment conditions and verify that the last one is "Succeeded".

In [12]:
conditions = client.get_experiment_conditions(name=EXPERIMENT_NAME)
print(conditions)

[{'last_transition_time': datetime.datetime(2024, 3, 25, 14, 53, 57, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2024, 3, 25, 14, 53, 57, tzinfo=tzlocal()),
 'message': 'Experiment is created',
 'reason': 'ExperimentCreated',
 'status': 'True',
 'type': 'Created'}, {'last_transition_time': datetime.datetime(2024, 3, 25, 14, 55, 58, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2024, 3, 25, 14, 55, 58, tzinfo=tzlocal()),
 'message': 'Experiment is running',
 'reason': 'ExperimentRunning',
 'status': 'False',
 'type': 'Running'}, {'last_transition_time': datetime.datetime(2024, 3, 25, 14, 55, 58, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2024, 3, 25, 14, 55, 58, tzinfo=tzlocal()),
 'message': 'Experiment has succeeded because max trial count has reached',
 'reason': 'ExperimentMaxTrialsReached',
 'status': 'True',
 'type': 'Succeeded'}]


In [13]:
assert conditions[-1].type == "Succeeded"
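
The assertion above relies on `Succeeded` being the last entry in the list. A stricter variant, sketched here under the assumption that each condition exposes `type` and `status` as in the printed output, looks the condition up by type and also requires its `status` to be `"True"`. The helper `has_condition` is hypothetical, and `SimpleNamespace` objects stand in for the SDK's condition objects so that the snippet runs without a cluster.

```python
from types import SimpleNamespace


def has_condition(conditions, condition_type):
    """Return True if a condition of the given type is present with status "True"."""
    return any(c.type == condition_type and c.status == "True" for c in conditions)


# Stand-ins mirroring the type/status pairs printed above.
sample_conditions = [
    SimpleNamespace(type="Created", status="True"),
    SimpleNamespace(type="Running", status="False"),
    SimpleNamespace(type="Succeeded", status="True"),
]

print(has_condition(sample_conditions, "Succeeded"))  # True
print(has_condition(sample_conditions, "Running"))    # False
```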

### Get the optimal HyperParameters

Get the optimal HyperParameters at the end of the tuning Experiment.  
Each metric is reported with its max, min, and latest values.

In [14]:
client.get_optimal_hyperparameters(name=EXPERIMENT_NAME)

{'best_trial_name': 'cmaes-example-dphxbch7',
 'observation': {'metrics': [{'latest': '0.3130',
                              'max': '2.2980',
                              'min': '0.2691',
                              'name': 'loss'},
                             {'latest': 'unavailable',
                              'max': 'unavailable',
                              'min': 'unavailable',
                              'name': 'Train-accuracy'}]},
 'parameter_assignments': [{'name': 'lr', 'value': '0.04511033252270099'},
                           {'name': 'momentum', 'value': '0.6980954001565728'}]}
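
The values in this result are strings, and metrics that were not collected are reported as `'unavailable'`. The sketch below converts them to floats for downstream use. It operates on a plain dictionary copied from the output above; with the real SDK object the fields may need to be read as attributes or converted to a dictionary first, which is an assumption here. `to_float` and `parse_optimal` are hypothetical helpers.

```python
def to_float(value):
    """Convert a string value to float; non-numeric values such as 'unavailable' become None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def parse_optimal(optimal):
    """Return (parameters, metrics) with numeric values parsed from the string fields."""
    parameters = {
        item["name"]: to_float(item["value"])
        for item in optimal["parameter_assignments"]
    }
    metrics = {
        metric["name"]: {key: to_float(metric[key]) for key in ("latest", "max", "min")}
        for metric in optimal["observation"]["metrics"]
    }
    return parameters, metrics


# Copied from the output above.
sample = {
    "best_trial_name": "cmaes-example-dphxbch7",
    "observation": {
        "metrics": [
            {"latest": "0.3130", "max": "2.2980", "min": "0.2691", "name": "loss"},
            {
                "latest": "unavailable",
                "max": "unavailable",
                "min": "unavailable",
                "name": "Train-accuracy",
            },
        ]
    },
    "parameter_assignments": [
        {"name": "lr", "value": "0.04511033252270099"},
        {"name": "momentum", "value": "0.6980954001565728"},
    ],
}

best_parameters, best_metrics = parse_optimal(sample)
print(best_parameters)
print(best_metrics["loss"]["min"])               # 0.2691
print(best_metrics["Train-accuracy"]["latest"])  # None
```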

## List Katib Trials

List the Experiment's Trials and print the latest status condition of each.

In [15]:
trial_list = client.list_trials(experiment_name=EXPERIMENT_NAME)
for trial in trial_list:
    print("Trial:", trial.metadata.name)
    print("Trial Status:", trial.status.conditions[-1], sep="\n", end="\n\n")

Trial: cmaes-example-dphxbch7
Trial Status:
{'last_transition_time': datetime.datetime(2024, 3, 25, 14, 55, 25, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2024, 3, 25, 14, 55, 25, tzinfo=tzlocal()),
 'message': 'Trial has succeeded',
 'reason': 'TrialSucceeded',
 'status': 'True',
 'type': 'Succeeded'}

Trial: cmaes-example-9pjzlnzc
Trial Status:
{'last_transition_time': datetime.datetime(2024, 3, 25, 14, 55, 27, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2024, 3, 25, 14, 55, 27, tzinfo=tzlocal()),
 'message': 'Trial has succeeded',
 'reason': 'TrialSucceeded',
 'status': 'True',
 'type': 'Succeeded'}

Trial: cmaes-example-7zhq4s49
Trial Status:
{'last_transition_time': datetime.datetime(2024, 3, 25, 14, 55, 58, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2024, 3, 25, 14, 55, 58, tzinfo=tzlocal()),
 'message': 'Trial has succeeded',
 'reason': 'TrialSucceeded',
 'status': 'True',
 'type': 'Succeeded'}



In [16]:
# verify that the max trial count was reached
assert len(trial_list) == experiment.spec.max_trial_count

# verify that all trials were successful
for trial in trial_list:
    assert trial.status.conditions[-1].type == "Succeeded"

## Get Katib Suggestion

Inspect the Suggestion object, which shares the Experiment's name, to see the algorithm, the number of requested suggestions, and the resume policy.

In [17]:
suggestion = client.get_suggestion(name=EXPERIMENT_NAME)
print("Suggestion:", suggestion.metadata.name, end="\n\n")
print("Suggestion Spec:", suggestion.spec, sep="\n", end="\n\n")
print("Suggestion Status:", suggestion.status, sep="\n", end="\n\n")

Suggestion: cmaes-example

Suggestion Spec:
{'algorithm': {'algorithm_name': 'cmaes', 'algorithm_settings': None},
 'early_stopping': None,
 'requests': 3,
 'resume_policy': 'Never'}

Suggestion Status:
{'algorithm_settings': None,
 'completion_time': None,
 'conditions': [{'last_transition_time': datetime.datetime(2024, 3, 25, 14, 53, 57, tzinfo=tzlocal()),
                 'last_update_time': datetime.datetime(2024, 3, 25, 14, 53, 57, tzinfo=tzlocal()),
                 'message': 'Suggestion is created',
                 'reason': 'SuggestionCreated',
                 'status': 'True',
                 'type': 'Created'},
                {'last_transition_time': datetime.datetime(2024, 3, 25, 14, 55, 58, tzinfo=tzlocal()),
                 'last_update_time': datetime.datetime(2024, 3, 25, 14, 55, 58, tzinfo=tzlocal()),
                 'message': 'Suggestion is not running',
                 'reason': 'Suggestion is succeeded',
                 'status': 'False',
                 '

In [18]:
assert suggestion.status.conditions[-1].type == "Succeeded"

## Delete Katib Experiment

Delete the Experiment and check that its Trials and Suggestion were removed as well.

In [19]:
client.delete_experiment(name=EXPERIMENT_NAME)

Experiment user/cmaes-example has been deleted


In [20]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(30),
    reraise=True,
)
def assert_katib_resources_removed(client, experiment_name):
    """Wait for Katib resources to be removed."""
    # fetch the existing Experiment names
    # verify that the Experiment was deleted successfully
    experiments = {exp.metadata.name for exp in client.list_experiments()}
    assert experiment_name not in experiments, f"Failed to delete Katib Experiment {experiment_name}!"

    # fetch the existing Trials and retrieve the names of the Experiments these belong to
    # verify that the Trials were removed successfully
    trials = {tr.metadata.labels.get("katib.kubeflow.org/experiment") for tr in client.list_trials()}
    assert experiment_name not in trials, f"Katib Trials of Experiment {experiment_name} were not removed!"

    # fetch the existing Suggestion names
    # verify that the Suggestion was removed successfully
    suggestions = {sugg.metadata.name for sugg in client.list_suggestions()}
    assert experiment_name not in suggestions, f"Katib Suggestion {experiment_name} was not removed!"

In [21]:
# wait for Katib resources to be removed successfully
assert_katib_resources_removed(client, EXPERIMENT_NAME)