# Test Training Operator Integration

This example notebook is loosely based on the following upstream examples:
* [TFJob](https://github.com/kubeflow/training-operator/blob/964a6e836eedff11edfe79cc9f4e5b7c623cbe88/examples/tensorflow/image-classification/create-tfjob.ipynb)
* [PyTorchJob](https://github.com/kubeflow/training-operator/blob/964a6e836eedff11edfe79cc9f4e5b7c623cbe88/examples/pytorch/image-classification/create-pytorchjob.ipynb)
* [PaddleJob](https://github.com/kubeflow/training-operator/blob/964a6e836eedff11edfe79cc9f4e5b7c623cbe88/examples/paddlepaddle/simple-cpu.yaml)

Note that these examples can drift from what upstream actually tests, so also check the [upstream E2E tests](https://github.com/kubeflow/training-operator/tree/964a6e836eedff11edfe79cc9f4e5b7c623cbe88/sdk/python/test/e2e) when updating this notebook.

The workflow for each job (TFJob, PyTorchJob, and PaddleJob) is:
- create training job
- monitor its execution
- get training logs
- delete job
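
The four steps above can be sketched without a cluster. This is a minimal illustration only: `FakeTrainingClient` and `run_job_lifecycle` are hypothetical names, and the in-memory client is a stand-in that does not mirror the real SDK's signatures.

```python
class FakeTrainingClient:
    """Hypothetical in-memory stand-in for a training client (not the real SDK)."""

    def __init__(self):
        self.jobs = {}

    def create_job(self, name):
        # The fake job "finishes" immediately so the sketch stays deterministic.
        self.jobs[name] = {"status": "Succeeded", "logs": f"{name}: training finished"}

    def is_job_succeeded(self, name):
        return self.jobs[name]["status"] == "Succeeded"

    def get_job_logs(self, name):
        return self.jobs[name]["logs"]

    def delete_job(self, name):
        del self.jobs[name]

    def list_jobs(self):
        return sorted(self.jobs)


def run_job_lifecycle(client, name):
    """Walk one job through create -> monitor -> get logs -> delete."""
    client.create_job(name)
    assert client.is_job_succeeded(name), f"Job {name} was not successful."
    logs = client.get_job_logs(name)
    client.delete_job(name)
    assert name not in client.list_jobs(), f"Failed to delete job {name}!"
    return logs


print(run_job_lifecycle(FakeTrainingClient(), "mnist"))
```

The rest of the notebook follows the same order for each job kind, with the real `TrainingClient` and with retries around the monitoring and deletion checks.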

## Setup

In [None]:
!pip install kubeflow-training tenacity -q

### Import required packages

In [None]:
from kubeflow.training import (
    KubeflowOrgV1PaddleJob,
    KubeflowOrgV1PaddleJobSpec,
    KubeflowOrgV1PyTorchJob,
    KubeflowOrgV1PyTorchJobSpec,
    KubeflowOrgV1TFJob,
    KubeflowOrgV1TFJobSpec,
    TrainingClient,
    V1ReplicaSpec,
    V1RunPolicy,
)
from kubernetes.client import (
    V1Container,
    V1ContainerPort,
    V1ObjectMeta,
    V1PodSpec,
    V1PodTemplateSpec,
)
from tenacity import retry, stop_after_attempt, wait_exponential

### Initialise Training Client

We will be using the Training SDK for any actions executed as part of this example.

In [None]:
client = TrainingClient()

### Define Helper to print training logs

In [None]:
def print_training_logs(client, job_name: str, container: str, is_master: bool = True):
    logs = client.get_job_logs(name=job_name, container=container, is_master=is_master)
    print(logs)

### Define Helper to check that Job succeeded

In [None]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=30),
    stop=stop_after_attempt(50),
    reraise=True,
)
def assert_job_succeeded(client, job_name, job_kind):
    """Wait for the Job to complete successfully."""
    assert client.is_job_succeeded(
        name=job_name, job_kind=job_kind
    ), f"Job {job_name} was not successful."
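
To get a feel for how long this helper can block, the sketch below models the sleep schedule implied by `wait_exponential(multiplier=2, min=1, max=30)` and `stop_after_attempt(50)`. It assumes that the wait after attempt `n` is `multiplier * 2**(n - 1)` clamped to `[min, max]`; this is an assumption about tenacity's behaviour and may differ between versions. `backoff_waits` is a hypothetical helper, not part of tenacity.

```python
def backoff_waits(attempts, multiplier=2, min_wait=1, max_wait=30):
    """Model the sleeps between `attempts` tries.

    Assumption: the wait after attempt n is multiplier * 2**(n - 1),
    clamped to the range [min_wait, max_wait].
    """
    waits = []
    for attempt in range(1, attempts):  # no sleep after the final attempt
        raw = multiplier * 2 ** (attempt - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


waits = backoff_waits(attempts=50)
print(waits[:6])        # first few sleeps, in seconds
print(sum(waits) / 60)  # total modelled waiting time, in minutes
```

Under that assumption the helper gives up after roughly 23 minutes of waiting, not counting the time spent in the API calls themselves.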

## Test TFJob

### Define a TFJob

Define a TFJob object before deploying it.

In [None]:
TFJOB_NAME = "mnist"
TFJOB_CONTAINER = "tensorflow"
TFJOB_IMAGE = "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0"

In [None]:
container = V1Container(
    name=TFJOB_CONTAINER,
    image=TFJOB_IMAGE,
    command=[
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs",
        "--learning_rate=0.01",
        "--batch_size=150",
    ],
)

worker = V1ReplicaSpec(
    replicas=2,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

chief = V1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

ps = V1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

tfjob = KubeflowOrgV1TFJob(
    api_version="kubeflow.org/v1",
    kind="TFJob",
    metadata=V1ObjectMeta(name=TFJOB_NAME),
    spec=KubeflowOrgV1TFJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        tf_replica_specs={"Worker": worker, "Chief": chief, "PS": ps},
    ),
)
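
For readers more used to manifests, the SDK objects above correspond to a TFJob resource shaped roughly as follows. This is a minimal sketch using plain dictionaries: `build_replica_spec`, `ANNOTATIONS`, `container_manifest` and `tfjob_manifest` are illustrative names, not part of the SDK, and the camelCase keys are the usual Kubernetes manifest form of the snake_case SDK attributes.

```python
import json

ANNOTATIONS = {"sidecar.istio.io/inject": "false"}


def build_replica_spec(replicas, container, restart_policy="Never"):
    """Build one replica spec in manifest (camelCase) form."""
    return {
        "replicas": replicas,
        "restartPolicy": restart_policy,
        "template": {
            "metadata": {"annotations": dict(ANNOTATIONS)},
            "spec": {"containers": [container]},
        },
    }


container_manifest = {
    "name": "tensorflow",
    "image": "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
    "command": [
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs",
        "--learning_rate=0.01",
        "--batch_size=150",
    ],
}

tfjob_manifest = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist"},
    "spec": {
        "runPolicy": {"cleanPodPolicy": "None"},
        "tfReplicaSpecs": {
            "Worker": build_replica_spec(2, container_manifest),
            "Chief": build_replica_spec(1, container_manifest),
            "PS": build_replica_spec(1, container_manifest),
        },
    },
}

print(json.dumps(tfjob_manifest, indent=2))
```

The three replica types differ only in their replica count, which is why a single builder function is enough in this sketch.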

Print the Job's info to verify it before submission.

In [None]:
print("Name:", tfjob.metadata.name)
print("Spec:", tfjob.spec.tf_replica_specs)

### List existing TFJobs

List TFJobs in the current namespace.

In [None]:
[job.metadata.name for job in client.list_tfjobs()]

### Create TFJob

Create a TFJob using the SDK.

In [None]:
client.create_tfjob(tfjob)

### Get TFJob
Get the created TFJob by name and check its data.  
Make sure that it completes successfully before proceeding. 

In [None]:
# verify that the Job was created successfully
# raises an error if it doesn't exist
tfjob = client.get_tfjob(name=TFJOB_NAME)

In [None]:
# wait for the Job to complete successfully
assert_job_succeeded(client, TFJOB_NAME, job_kind="TFJob")

In [None]:
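# re-fetch the Job so that the printed status reflects the completed run
tfjob = client.get_tfjob(name=TFJOB_NAME)
print("Job:", tfjob.metadata.name, end="\n\n")
print("Job Spec:", tfjob.spec, sep="\n", end="\n\n")
print("Job Status:", tfjob.status, sep="\n", end="\n\n")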

### Get TFJob Training logs
Get and print the TFJob training logs, which include the training steps.

In [None]:
print_training_logs(client, TFJOB_NAME, container=TFJOB_CONTAINER)

### Delete TFJob

Delete the created TFJob.

In [None]:
client.delete_tfjob(name=TFJOB_NAME)

In [None]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(30),
    reraise=True,
)
def assert_tfjob_removed(client, job_name):
    """Wait for TFJob to be removed."""
    # fetch the existing TFJob names
    jobs = {job.metadata.name for job in client.list_tfjobs()}
    # verify that the Job was deleted successfully
    assert job_name not in jobs, f"Failed to delete TFJob {job_name}!"

In [None]:
# wait for TFJob resources to be removed successfully
assert_tfjob_removed(client, TFJOB_NAME)

## Test PyTorchJob

### Define a PyTorchJob
Define a PyTorchJob object before deploying it. This PyTorchJob is similar to [this upstream example](https://github.com/kubeflow/training-operator/blob/11b7a115e6538caeab405344af98f0d5b42a4c96/sdk/python/examples/kubeflow-pytorchjob-sdk.ipynb).

In [None]:
PYTORCHJOB_NAME = "pytorch-mnist-gloo"
PYTORCHJOB_CONTAINER = "pytorch"
PYTORCHJOB_IMAGE = "kubeflowkatib/pytorch-mnist-cpu:v1beta1-57ed828"
# Update the image above with each release to the latest tag available in the registry.
# Note that the Katib image
# (https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.cpu)
# is used instead of the one from the training-operator repository
# (https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/Dockerfile)
# because the latter is much larger.

In [None]:
container = V1Container(
    name=PYTORCHJOB_CONTAINER,
    image=PYTORCHJOB_IMAGE,
    args=["--backend", "gloo", "--epochs", "2"],
)

replica_spec = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

pytorchjob = KubeflowOrgV1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=V1ObjectMeta(name=PYTORCHJOB_NAME),
    spec=KubeflowOrgV1PyTorchJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        pytorch_replica_specs={"Master": replica_spec, "Worker": replica_spec},
    ),
)
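
Because `Master` and `Worker` share the same replica spec with `replicas=1`, this job runs two pods in total. The sketch below lists the pod names one would expect to see, assuming the operator's usual `<job name>-<replica type in lowercase>-<index>` naming pattern. `expected_pod_names` is a hypothetical helper for illustration, not an SDK call.

```python
def expected_pod_names(job_name, replica_counts):
    """List pod names, assuming the `<job>-<replica type>-<index>` pattern."""
    return [
        f"{job_name}-{replica_type.lower()}-{index}"
        for replica_type, count in replica_counts.items()
        for index in range(count)
    ]


print(expected_pod_names("pytorch-mnist-gloo", {"Master": 1, "Worker": 1}))
# ['pytorch-mnist-gloo-master-0', 'pytorch-mnist-gloo-worker-0']
```

Such a list can be handy when inspecting the namespace by hand while the job is running.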

Print the Job's info to verify it before submission.

In [None]:
print("Name:", pytorchjob.metadata.name)
print("Spec:", pytorchjob.spec.pytorch_replica_specs)

### List existing PyTorchJobs

List PyTorchJobs in the current namespace.

In [None]:
[job.metadata.name for job in client.list_pytorchjobs()]

### Create PyTorchJob

Create a PyTorchJob using the SDK.

In [None]:
client.create_pytorchjob(pytorchjob)

### Get PyTorchJob
Get the created PyTorchJob by name and check its data.  
Make sure that it completes successfully before proceeding. 

In [None]:
# verify that the Job was created successfully
# raises an error if it doesn't exist
pytorchjob = client.get_pytorchjob(name=PYTORCHJOB_NAME)

In [None]:
# wait for the Job to complete successfully
assert_job_succeeded(client, PYTORCHJOB_NAME, job_kind="PyTorchJob")

In [None]:
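# re-fetch the Job so that the printed status reflects the completed run
pytorchjob = client.get_pytorchjob(name=PYTORCHJOB_NAME)
print("Job:", pytorchjob.metadata.name, end="\n\n")
print("Job Spec:", pytorchjob.spec, sep="\n", end="\n\n")
print("Job Status:", pytorchjob.status, sep="\n", end="\n\n")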

### Get PyTorchJob Training logs
Get and print the PyTorchJob training logs, which include the training steps.

In [None]:
print_training_logs(client, PYTORCHJOB_NAME, container=PYTORCHJOB_CONTAINER)

### Delete PyTorchJob

Delete the created PyTorchJob.

In [None]:
client.delete_pytorchjob(name=PYTORCHJOB_NAME)

In [None]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(30),
    reraise=True,
)
def assert_pytorchjob_removed(client, job_name):
    """Wait for PyTorchJob to be removed."""
    # fetch the existing PyTorchJob names
    jobs = {job.metadata.name for job in client.list_pytorchjobs()}
    # verify that the Job was deleted successfully
    assert job_name not in jobs, f"Failed to delete PyTorchJob {job_name}!"

In [None]:
# wait for PyTorchJob to be removed successfully
assert_pytorchjob_removed(client, PYTORCHJOB_NAME)

## Test PaddleJob

### Define a PaddleJob

Define a PaddleJob object before deploying it.

In [None]:
PADDLEJOB_NAME = "paddle-simple-cpu"
PADDLEJOB_CONTAINER = "paddle"
PADDLEJOB_IMAGE = "docker.io/paddlepaddle/paddle:2.4.0rc0-cpu"

In [None]:
port = V1ContainerPort(container_port=37777, name="master")

container = V1Container(
    name=PADDLEJOB_CONTAINER,
    image=PADDLEJOB_IMAGE,
    command=["python"],
    args=["-m", "paddle.distributed.launch", "run_check"],
    ports=[port],
)

replica_spec = V1ReplicaSpec(
    replicas=2,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

paddlejob = KubeflowOrgV1PaddleJob(
    api_version="kubeflow.org/v1",
    kind="PaddleJob",
    metadata=V1ObjectMeta(name=PADDLEJOB_NAME),
    spec=KubeflowOrgV1PaddleJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        paddle_replica_specs={"Worker": replica_spec},
    ),
)

Print the Job's info to verify it before submission.

In [None]:
print("Name:", paddlejob.metadata.name)
print("Spec:", paddlejob.spec.paddle_replica_specs)

### List existing PaddleJobs

List PaddleJobs in the current namespace.

In [None]:
[job.metadata.name for job in client.list_paddlejobs()]

### Create PaddleJob

Create a PaddleJob using the SDK.

In [None]:
client.create_paddlejob(paddlejob)

### Get PaddleJob
Get the created PaddleJob by name and check its data.  
Make sure that it completes successfully before proceeding. 

In [None]:
# verify that the Job was created successfully
# raises an error if it doesn't exist
paddlejob = client.get_paddlejob(name=PADDLEJOB_NAME)

In [None]:
# wait for the Job to complete successfully
assert_job_succeeded(client, PADDLEJOB_NAME, job_kind="PaddleJob")

In [None]:
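# re-fetch the Job so that the printed status reflects the completed run
paddlejob = client.get_paddlejob(name=PADDLEJOB_NAME)
print("Job:", paddlejob.metadata.name, end="\n\n")
print("Job Spec:", paddlejob.spec, sep="\n", end="\n\n")
print("Job Status:", paddlejob.status, sep="\n", end="\n\n")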

### Get PaddleJob Training logs
Get and print the PaddleJob training logs, which include the training steps.

In [None]:
# set is_master to False because this example does not include a master replica type
print_training_logs(client, PADDLEJOB_NAME, container=PADDLEJOB_CONTAINER, is_master=False)
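
The choice of `is_master` could also be derived from the replica types instead of being hard-coded. The sketch below is a simplification: it assumes that only the `Master` and `Chief` replica types count as a master role, which matches the three jobs in this notebook but is not confirmed here as the operator's full rule. `MASTER_LIKE_TYPES` and `has_master_replica` are hypothetical names.

```python
# Assumption: only these replica types are treated as a master role.
MASTER_LIKE_TYPES = {"Master", "Chief"}


def has_master_replica(replica_specs):
    """Return True if any replica type in the mapping is master-like."""
    return any(replica_type in MASTER_LIKE_TYPES for replica_type in replica_specs)


print(has_master_replica({"Worker": None}))                  # PaddleJob in this notebook
print(has_master_replica({"Master": None, "Worker": None}))  # PyTorchJob in this notebook
```

With the objects defined above, the flag could then be passed as `is_master=has_master_replica(paddlejob.spec.paddle_replica_specs)`.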

### Delete PaddleJob

Delete the created PaddleJob.

In [None]:
client.delete_paddlejob(name=PADDLEJOB_NAME)

In [None]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(30),
    reraise=True,
)
def assert_paddlejob_removed(client, job_name):
    """Wait for PaddleJob to be removed."""
    # fetch the existing PaddleJob names
    jobs = {job.metadata.name for job in client.list_paddlejobs()}
    # verify that the Job was deleted successfully
    assert job_name not in jobs, f"Failed to delete PaddleJob {job_name}!"

In [None]:
# wait for PaddleJob to be removed successfully
assert_paddlejob_removed(client, PADDLEJOB_NAME)
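
The three removal helpers in this notebook differ only in the list call they use, so they could be folded into one function that takes the listing step as a parameter. The following is a minimal sketch using a plain polling loop instead of tenacity so that it runs on its own. `assert_job_removed` and its parameters are hypothetical, and the fake listing function stands in for a real client.

```python
import time


def assert_job_removed(list_job_names, job_name, attempts=30, delay=1.0):
    """Poll `list_job_names()` until `job_name` is absent or attempts run out.

    Returns the number of attempts used; raises AssertionError on timeout.
    """
    for attempt in range(1, attempts + 1):
        if job_name not in set(list_job_names()):
            return attempt
        time.sleep(delay)
    raise AssertionError(f"Failed to delete job {job_name}!")


# Fake listing: the job disappears on the third call.
listings = iter([["mnist"], ["mnist"], []])
attempts_used = assert_job_removed(lambda: next(listings), "mnist", delay=0)
print(attempts_used)  # 3
```

With the real client, the listing argument could be, for example, `lambda: [job.metadata.name for job in client.list_tfjobs()]`.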