# Test Training Operator Integration

This example notebook is loosely based on [this](https://github.com/kubeflow/training-operator/blob/master/sdk/python/examples/kubeflow-tfjob-sdk.ipynb) upstream example.

- create a training job of each type: TFJob, PyTorchJob, and PaddleJob
- monitor its execution
- get the training logs
- delete the job

## Setup

In [1]:
!pip install kubeflow-training tenacity -q

### Import required packages

In [2]:
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container
from kubernetes.client import V1ContainerPort

from kubeflow.training import constants
from kubeflow.training.utils import utils
from kubeflow.training import V1ReplicaSpec
from kubeflow.training import KubeflowOrgV1TFJob
from kubeflow.training import KubeflowOrgV1TFJobSpec
from kubeflow.training import KubeflowOrgV1PyTorchJob
from kubeflow.training import KubeflowOrgV1PyTorchJobSpec
from kubeflow.training import KubeflowOrgV1PaddleJob
from kubeflow.training import KubeflowOrgV1PaddleJobSpec
from kubeflow.training import V1RunPolicy
from kubeflow.training import TrainingClient

from tenacity import retry, stop_after_attempt, wait_exponential

### Initialise Training Client

All actions in this example go through the Training SDK's `TrainingClient`.

In [3]:
client = TrainingClient()

### Define a helper to print training logs

In [4]:
def print_training_logs(client, job_name: str, container: str, is_master: bool = True):
    logs = client.get_job_logs(name=job_name, container=container, is_master=is_master)
    print(logs)

### Define a helper to check that the Job succeeded

In [5]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=30),
    stop=stop_after_attempt(50),
    reraise=True,
)
def assert_job_succeeded(client, job_name, job_kind):
    """Wait for the Job to complete successfully."""
    assert client.is_job_succeeded(
        name=job_name, job_kind=job_kind
    ), f"Job {job_name} was not successful."
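The retry settings above determine how long the helper keeps polling before giving up. The sketch below estimates that budget with the standard library only. It assumes each wait is `multiplier * 2**(attempt - 1)` clamped to `[min, max]`, which is how recent tenacity releases are understood to compute `wait_exponential`; older releases may differ slightly. The function name `backoff_waits` is hypothetical.

```python
def backoff_waits(attempts, multiplier, min_wait, max_wait):
    """Waits between attempts, assuming multiplier * 2**(attempt - 1) clamped to [min_wait, max_wait]."""
    waits = []
    # No wait follows the final attempt, so there are attempts - 1 waits.
    for attempt in range(1, attempts):
        raw = multiplier * 2 ** (attempt - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


# Same parameters as the decorator on assert_job_succeeded.
waits = backoff_waits(attempts=50, multiplier=2, min_wait=1, max_wait=30)
print(waits[:6])   # first few waits, in seconds
print(sum(waits))  # upper bound on total time spent waiting, in seconds
```

Under these assumptions the helper waits at most about 23 minutes in total, not counting the time each status call takes.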

# Test TFJob

## Define a TFJob

Define a TFJob object before deploying it. This TFJob is similar to [this](https://github.com/kubeflow/training-operator/blob/master/sdk/python/examples/kubeflow-tfjob-sdk.ipynb) example.

In [6]:
TFJOB_NAME = "mnist"

In [7]:
container = V1Container(
    name="tensorflow",
    image="gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
    command=[
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs",
        "--learning_rate=0.01",
        "--batch_size=150",
    ],
)

worker = V1ReplicaSpec(
    replicas=2,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

chief = V1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

ps = V1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

tfjob = KubeflowOrgV1TFJob(
    api_version="kubeflow.org/v1",
    kind="TFJob",
    metadata=V1ObjectMeta(name=TFJOB_NAME),
    spec=KubeflowOrgV1TFJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        tf_replica_specs={"Worker": worker, "Chief": chief, "PS": ps},
    ),
)
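The three replica specs above differ only in their replica count. As a rough illustration of the structure being built, the sketch below assembles a plain-dict manifest with a hypothetical `replica` helper. The camelCase keys (`tfReplicaSpecs`, `restartPolicy`, `runPolicy.cleanPodPolicy`) are assumed to mirror the TFJob custom resource; the sketch does not call the SDK and is not what the SDK serializes verbatim.

```python
import json


def replica(replicas, container, restart_policy="Never"):
    """Build one replica spec as a plain dict (hypothetical helper)."""
    return {
        "replicas": replicas,
        "restartPolicy": restart_policy,
        "template": {
            "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
            "spec": {"containers": [container]},
        },
    }


container_manifest = {
    "name": "tensorflow",
    "image": "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
    "command": [
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs",
        "--learning_rate=0.01",
        "--batch_size=150",
    ],
}

tfjob_manifest = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist"},
    "spec": {
        "runPolicy": {"cleanPodPolicy": "None"},
        "tfReplicaSpecs": {
            "Worker": replica(2, container_manifest),
            "Chief": replica(1, container_manifest),
            "PS": replica(1, container_manifest),
        },
    },
}

print(json.dumps(tfjob_manifest["spec"]["tfReplicaSpecs"]["Worker"], indent=2))
```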

Print the Job's info to verify it before submission.

In [8]:
print("Name:", tfjob.metadata.name)
print("Spec:", tfjob.spec.tf_replica_specs)

Name: mnist
Spec: {'Worker': {'replicas': 2,
 'restart_policy': 'Never',
 'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
                           'creation_timestamp': None,
                           'deletion_grace_period_seconds': None,
                           'deletion_timestamp': None,
                           'finalizers': None,
                           'generate_name': None,
                           'generation': None,
                           'labels': None,
                           'managed_fields': None,
                           'name': None,
                           'namespace': None,
                           'owner_references': None,
                           'resource_version': None,
                           'self_link': None,
                           'uid': None},
              'spec': {'active_deadline_seconds': None,
                       'affinity': None,
                       'automount_service_account_token'

## List existing TFJobs

List TFJobs in the current namespace.

In [9]:
[job.metadata.name for job in client.list_tfjobs()]

[]

## Create TFJob

Create a TFJob using the SDK.

In [10]:
client.create_tfjob(tfjob)

TFJob admin/mnist has been created


## Get TFJob
Get the created TFJob by name and check its data.  
Make sure that it completes successfully before proceeding. 

In [11]:
# verify that the Job was created successfully
# raises an error if it doesn't exist
client.get_tfjob(name=TFJOB_NAME);

In [12]:
# wait for the Job to complete successfully
assert_job_succeeded(client, TFJOB_NAME, job_kind="TFJob")
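The wait above relies on the job eventually reporting success. The sketch below shows the kind of check involved, on a plain dict. It assumes the job status carries a `conditions` list whose entries have string `type` and `status` fields, as Training Operator job statuses are understood to do. It is an illustration only, not the SDK's implementation of `is_job_succeeded`; `has_condition` and `example_status` are hypothetical.

```python
def has_condition(status, condition_type):
    """Return True if `status` lists `condition_type` with status "True"."""
    conditions = status.get("conditions") or []
    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in conditions
    )


# Hypothetical status of a finished job.
example_status = {
    "conditions": [
        {"type": "Created", "status": "True"},
        {"type": "Running", "status": "False"},
        {"type": "Succeeded", "status": "True"},
    ]
}

print(has_condition(example_status, "Succeeded"))
print(has_condition(example_status, "Failed"))
```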

In [13]:
job = client.get_tfjob(name=TFJOB_NAME)
print("Job:", job.metadata.name, end="\n\n")
print("Job Spec:", job.spec, sep="\n", end="\n\n")
print("Job Status:", job.status, sep="\n", end="\n\n")

Job: mnist

Job Spec:
{'enable_dynamic_worker': None,
 'run_policy': {'active_deadline_seconds': None,
                'backoff_limit': None,
                'clean_pod_policy': 'None',
                'scheduling_policy': None,
                'ttl_seconds_after_finished': None},
 'success_policy': None,
 'tf_replica_specs': {'Chief': {'replicas': 1,
                                'restart_policy': 'Never',
                                'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
                                                          'creation_timestamp': None,
                                                          'deletion_grace_period_seconds': None,
                                                          'deletion_timestamp': None,
                                                          'finalizers': None,
                                                          'generate_name': None,
                                                

## Get TFJob Training logs
Get and print the training logs of the TFJob, including the training steps.

In [14]:
print_training_logs(client, TFJOB_NAME, container="tensorflow")

The logs of pod mnist-chief-0:
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use urllib or similar directly.
Instructions for updating:
Please use tf.data to implement this functionality.
Instructions for updating:
Please use tf.data to implement this functionality.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2023-08-23 14:50:10.879767: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Succe

None


## Delete TFJob

Delete the created TFJob and check that all created resources were removed as well.

In [15]:
client.delete_tfjob(name=TFJOB_NAME)

TFJob admin/mnist has been deleted


In [16]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(30),
    reraise=True,
)
def assert_tfjob_removed(client, job_name):
    """Wait for the TFJob to be removed."""
    # fetch the existing TFJob names
    # verify that the Job was deleted successfully
    jobs = {job.metadata.name for job in client.list_tfjobs()}
    assert job_name not in jobs, f"Failed to delete TFJob {job_name}!"

In [17]:
# wait for TFJob resources to be removed successfully
assert_tfjob_removed(client, TFJOB_NAME)
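The removal check above is tied to TFJobs through `client.list_tfjobs()`. A generic variant could take the listing function as a parameter instead. The sketch below is one way to do that under stated assumptions: it swaps tenacity's exponential backoff for a plain fixed-delay loop so that it runs with the standard library alone, and it uses a fake listing function in place of a real cluster. `wait_until_removed` and `fake_list_jobs` are hypothetical names.

```python
import time
from types import SimpleNamespace


def wait_until_removed(list_jobs, job_name, attempts=30, delay=0.0):
    """Poll `list_jobs()` until `job_name` is absent; raise if it never disappears."""
    for _ in range(attempts):
        names = {job.metadata.name for job in list_jobs()}
        if job_name not in names:
            return
        time.sleep(delay)
    raise AssertionError(f"Failed to delete job {job_name}!")


# Hypothetical stand-in for a listing call: the job vanishes on the third call.
calls = {"count": 0}


def fake_list_jobs():
    calls["count"] += 1
    if calls["count"] < 3:
        return [SimpleNamespace(metadata=SimpleNamespace(name="mnist"))]
    return []


wait_until_removed(fake_list_jobs, "mnist")
print(calls["count"])  # number of listing calls made before the job was gone
```

In the notebook, the same helper could presumably be called as `wait_until_removed(client.list_tfjobs, TFJOB_NAME, delay=2)`.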

# Test PyTorchJob

## Define a PyTorchJob
Define a PyTorchJob object before deploying it. This PyTorchJob is similar to [this](https://github.com/kubeflow/training-operator/blob/11b7a115e6538caeab405344af98f0d5b42a4c96/sdk/python/examples/kubeflow-pytorchjob-sdk.ipynb) example.

In [18]:
PYTORCHJOB_NAME = "pytorch-dist-mnist-gloo"

In [19]:
container = V1Container(
    name="pytorch",
    image="gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0",
    args=["--backend", "gloo"],
)

replica_spec = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

pytorchjob = KubeflowOrgV1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=V1ObjectMeta(name=PYTORCHJOB_NAME),
    spec=KubeflowOrgV1PyTorchJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        pytorch_replica_specs={"Master": replica_spec, "Worker": replica_spec},
    ),
)

Print the Job's info to verify it before submission.

In [20]:
print("Name:", pytorchjob.metadata.name)
print("Spec:", pytorchjob.spec.pytorch_replica_specs)

Name: pytorch-dist-mnist-gloo
Spec: {'Master': {'replicas': 1,
 'restart_policy': 'OnFailure',
 'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
                           'creation_timestamp': None,
                           'deletion_grace_period_seconds': None,
                           'deletion_timestamp': None,
                           'finalizers': None,
                           'generate_name': None,
                           'generation': None,
                           'labels': None,
                           'managed_fields': None,
                           'name': None,
                           'namespace': None,
                           'owner_references': None,
                           'resource_version': None,
                           'self_link': None,
                           'uid': None},
              'spec': {'active_deadline_seconds': None,
                       'affinity': None,
                       'automount_

## List existing PyTorchJobs

List PyTorchJobs in the current namespace.

In [21]:
[job.metadata.name for job in client.list_pytorchjobs()]

[]

## Create PyTorchJob

Create a PyTorchJob using the SDK.

In [22]:
client.create_pytorchjob(pytorchjob)

PyTorchJob admin/pytorch-dist-mnist-gloo has been created


## Get PyTorchJob
Get the created PyTorchJob by name and check its data.  
Make sure that it completes successfully before proceeding. 

In [23]:
# verify that the Job was created successfully
# raises an error if it doesn't exist
client.get_pytorchjob(name=PYTORCHJOB_NAME);

In [24]:
# wait for the Job to complete successfully
assert_job_succeeded(client, PYTORCHJOB_NAME, job_kind="PyTorchJob")

In [25]:
job = client.get_pytorchjob(name=PYTORCHJOB_NAME)
print("Job:", job.metadata.name, end="\n\n")
print("Job Spec:", job.spec, sep="\n", end="\n\n")
print("Job Status:", job.status, sep="\n", end="\n\n")

Job: pytorch-dist-mnist-gloo

Job Spec:
{'elastic_policy': None,
 'pytorch_replica_specs': {'Master': {'replicas': 1,
                                      'restart_policy': 'OnFailure',
                                      'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
                                                                'creation_timestamp': None,
                                                                'deletion_grace_period_seconds': None,
                                                                'deletion_timestamp': None,
                                                                'finalizers': None,
                                                                'generate_name': None,
                                                                'generation': None,
                                                                'labels': None,
                                                                'managed_f

## Get PyTorchJob Training logs
Get and print the training logs of the PyTorchJob, including the training steps.

In [26]:
print_training_logs(client, PYTORCHJOB_NAME, container="pytorch")

The logs of pod pytorch-dist-mnist-gloo-master-0:
 Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!

accuracy=0.9664




None
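The `accuracy=0.9664` line in the log above can be checked programmatically once the log text is available as a string. The sketch below is a minimal example under that assumption. The trailing `None` in the output suggests the helper's `logs` variable is empty in this run, so `sample_log` here is a hand-copied excerpt and `parse_accuracy` is a hypothetical helper.

```python
import re
from typing import Optional


def parse_accuracy(log_text: str) -> Optional[float]:
    """Extract the value from a line of the form `accuracy=<number>`, if present."""
    match = re.search(r"^accuracy=([0-9]*\.?[0-9]+)\s*$", log_text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None


# Excerpt copied from the PyTorchJob log shown above.
sample_log = """Processing...
Done!

accuracy=0.9664
"""

accuracy = parse_accuracy(sample_log)
print(accuracy)
```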


## Delete PyTorchJob

Delete the created PyTorchJob and check that all created resources were removed as well.

In [27]:
client.delete_pytorchjob(name=PYTORCHJOB_NAME)

PyTorchJob admin/pytorch-dist-mnist-gloo has been deleted


In [28]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(30),
    reraise=True,
)
def assert_pytorchjob_removed(client, job_name):
    """Wait for the PyTorchJob to be removed."""
    # fetch the existing PyTorchJob names
    # verify that the Job was deleted successfully
    jobs = {job.metadata.name for job in client.list_pytorchjobs()}
    assert job_name not in jobs, f"Failed to delete PyTorchJob {job_name}!"

In [29]:
# wait for PyTorchJob to be removed successfully
assert_pytorchjob_removed(client, PYTORCHJOB_NAME)

# Test PaddleJob

## Define a PaddleJob

Define a PaddleJob object before deploying it. This PaddleJob is based on [this](https://github.com/kubeflow/training-operator/blob/11b7a115e6538caeab405344af98f0d5b42a4c96/examples/paddlepaddle/simple-cpu.yaml) example.

In [30]:
PADDLEJOB_NAME = "paddle-simple-cpu"

In [31]:
port = V1ContainerPort(container_port=37777, name="master")

container = V1Container(
    name="paddle",
    image="registry.baidubce.com/paddlepaddle/paddle:2.4.0rc0-cpu",
    command=["python"],
    args=["-m", "paddle.distributed.launch", "run_check"],
    ports=[port],
)

replica_spec = V1ReplicaSpec(
    replicas=2,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(annotations={"sidecar.istio.io/inject": "false"}),
        spec=V1PodSpec(containers=[container]),
    ),
)

paddlejob = KubeflowOrgV1PaddleJob(
    api_version="kubeflow.org/v1",
    kind="PaddleJob",
    metadata=V1ObjectMeta(name=PADDLEJOB_NAME),
    spec=KubeflowOrgV1PaddleJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        paddle_replica_specs={"Worker": replica_spec},
    ),
)

Print the Job's info to verify it before submission.

In [32]:
print("Name:", paddlejob.metadata.name)
print("Spec:", paddlejob.spec.paddle_replica_specs)

Name: paddle-simple-cpu
Spec: {'Worker': {'replicas': 2,
 'restart_policy': 'OnFailure',
 'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
                           'creation_timestamp': None,
                           'deletion_grace_period_seconds': None,
                           'deletion_timestamp': None,
                           'finalizers': None,
                           'generate_name': None,
                           'generation': None,
                           'labels': None,
                           'managed_fields': None,
                           'name': None,
                           'namespace': None,
                           'owner_references': None,
                           'resource_version': None,
                           'self_link': None,
                           'uid': None},
              'spec': {'active_deadline_seconds': None,
                       'affinity': None,
                       'automount_servic

## List existing PaddleJobs

List PaddleJobs in the current namespace.

In [33]:
[job.metadata.name for job in client.list_paddlejobs()]

[]

## Create PaddleJob

Create a PaddleJob using the SDK.

In [34]:
client.create_paddlejob(paddlejob)

PaddleJob admin/paddle-simple-cpu has been created


## Get PaddleJob
Get the created PaddleJob by name and check its data.  
Make sure that it completes successfully before proceeding. 

In [35]:
# verify that the Job was created successfully
# raises an error if it doesn't exist
client.get_paddlejob(name=PADDLEJOB_NAME)

{'api_version': 'kubeflow.org/v1',
 'kind': 'PaddleJob',
 'metadata': {'annotations': None,
              'creation_timestamp': datetime.datetime(2023, 8, 23, 15, 3, 5, tzinfo=tzlocal()),
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': 1,
              'labels': None,
              'managed_fields': [{'api_version': 'kubeflow.org/v1',
                                  'fields_type': 'FieldsV1',
                                  'fields_v1': {'f:spec': {'.': {},
                                                           'f:paddleReplicaSpecs': {'.': {},
                                                                                    'f:Worker': {'.': {},
                                                                                                 'f:replicas': {},
                                                                          

In [36]:
# wait for the Job to complete successfully
assert_job_succeeded(client, PADDLEJOB_NAME, job_kind="PaddleJob")

In [37]:
job = client.get_paddlejob(name=PADDLEJOB_NAME)
print("Job:", job.metadata.name, end="\n\n")
print("Job Spec:", job.spec, sep="\n", end="\n\n")
print("Job Status:", job.status, sep="\n", end="\n\n")

Job: paddle-simple-cpu

Job Spec:
{'elastic_policy': None,
 'paddle_replica_specs': {'Worker': {'replicas': 2,
                                     'restart_policy': 'OnFailure',
                                     'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
                                                               'creation_timestamp': None,
                                                               'deletion_grace_period_seconds': None,
                                                               'deletion_timestamp': None,
                                                               'finalizers': None,
                                                               'generate_name': None,
                                                               'generation': None,
                                                               'labels': None,
                                                               'managed_fields': None,
   

## Get PaddleJob Training logs
Get and print the training logs of the PaddleJob, including the training steps.

In [38]:
print_training_logs(client, PADDLEJOB_NAME, container="paddle", is_master=False)

The logs of pod paddle-simple-cpu-worker-1:
 LAUNCH INFO 2023-08-23 15:03:11,720 Paddle Distributed Test begin...
LAUNCH INFO 2023-08-23 15:03:11,729 -----------  Configuration  ----------------------
LAUNCH INFO 2023-08-23 15:03:11,729 devices: None
LAUNCH INFO 2023-08-23 15:03:11,729 elastic_level: -1
LAUNCH INFO 2023-08-23 15:03:11,729 elastic_timeout: 30
LAUNCH INFO 2023-08-23 15:03:11,729 gloo_port: 6767
LAUNCH INFO 2023-08-23 15:03:11,729 host: None
LAUNCH INFO 2023-08-23 15:03:11,729 ips: None
LAUNCH INFO 2023-08-23 15:03:11,729 job_id: paddle-simple-cpu
LAUNCH INFO 2023-08-23 15:03:11,729 legacy: False
LAUNCH INFO 2023-08-23 15:03:11,729 log_dir: log
LAUNCH INFO 2023-08-23 15:03:11,729 log_level: INFO
LAUNCH INFO 2023-08-23 15:03:11,729 master: paddle-simple-cpu-worker-0:37777
LAUNCH INFO 2023-08-23 15:03:11,729 max_restart: 3
LAUNCH INFO 2023-08-23 15:03:11,729 nnodes: 2
LAUNCH INFO 2023-08-23 15:03:11,729 nproc_per_node: None
LAUNCH INFO 2023-08-23 15:03:11,729 rank: -1
LAUNC

None
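The pod names in the logs of this notebook (`mnist-chief-0`, `pytorch-dist-mnist-gloo-master-0`, `paddle-simple-cpu-worker-1`) appear to follow the pattern `<job name>-<replica type in lower case>-<index>`. The sketch below encodes that observed pattern. It is inferred from these logs only and is not a documented guarantee; `replica_pod_name` is a hypothetical helper.

```python
def replica_pod_name(job_name: str, replica_type: str, index: int) -> str:
    """Build a pod name following the pattern observed in the logs above."""
    return f"{job_name}-{replica_type.lower()}-{index}"


print(replica_pod_name("mnist", "Chief", 0))
print(replica_pod_name("pytorch-dist-mnist-gloo", "Master", 0))
print(replica_pod_name("paddle-simple-cpu", "Worker", 1))
```

This also explains the `master: paddle-simple-cpu-worker-0:37777` entry in the launch configuration: the first worker pod acts as the rendezvous point on the container port defined for the job.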


## Delete PaddleJob

Delete the created PaddleJob and check that all created resources were removed as well.

In [39]:
client.delete_paddlejob(name=PADDLEJOB_NAME)

PaddleJob admin/paddle-simple-cpu has been deleted


In [40]:
@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(30),
    reraise=True,
)
def assert_paddlejob_removed(client, job_name):
    """Wait for the PaddleJob to be removed."""
    # fetch the existing PaddleJob names
    # verify that the Job was deleted successfully
    jobs = {job.metadata.name for job in client.list_paddlejobs()}
    assert job_name not in jobs, f"Failed to delete PaddleJob {job_name}!"

In [41]:
# wait for PaddleJob to be removed successfully
assert_paddlejob_removed(client, PADDLEJOB_NAME)