## Notebook 1 - Experiment Phase

#### Use this notebook as a first step to experiment with distributed training using the PyTorch Training Operator on Kubernetes.

#### This notebook creates the PyTorchJob custom resource manifest using the Kubeflow Training and Kubernetes Python clients. The PyTorch Training Operator makes it easy to run distributed or non-distributed PyTorch jobs on Kubernetes. You can also log in to Cloud9, or any other client connected to your Kubernetes cluster, to run kubectl commands.

In [None]:
# Run the commands below to install the necessary libraries

#!pip install kfp==1.8.4
#!pip install kubeflow-training
#!pip install kubernetes

In [17]:
# Import the Kubernetes client and Kubeflow Training Operator Python libraries. We use them to create the PyTorchJob manifest.

from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container
from kubernetes.client import V1ResourceRequirements
from kubernetes.client import V1VolumeMount
from kubernetes.client import V1Volume
from kubernetes.client import V1PersistentVolumeClaimVolumeSource

from kubeflow.training import constants
from kubeflow.training.utils import utils
from kubeflow.training import V1ReplicaSpec
from kubeflow.training import V1PyTorchJob
from kubeflow.training import V1PyTorchJobSpec
from kubeflow.training import PyTorchJobClient
from kubeflow.training import V1RunPolicy

import kfp
from kfp import components

from kfp import dsl
from kfp import compiler
from pytorch_dist_utility import *
import time

In [18]:
# Initialize global variables 

user_namespace = utils.get_default_target_namespace()

pytorch_distributed_jobname=f'pytorch-cnn-dist-job-{time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())}'

efs_mount_point='efs-sc-claim'

aws_dlc_pytorch_gpu_image='763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu116-ubuntu20.04-e3'
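
The job name embeds a UTC timestamp so that each run gets a unique name. The sketch below reproduces the same format string for a fixed instant and checks the result against the RFC 1123 DNS label rule that Kubernetes applies to many object names (lowercase alphanumerics and hyphens, at most 63 characters). The helper name `build_job_name` and the fixed epoch value are illustrative only and are not part of the notebook.

```python
import re
import time

# RFC 1123 DNS label, the naming rule Kubernetes applies to many object names.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def build_job_name(prefix: str, when: time.struct_time) -> str:
    """Hypothetical helper: same timestamp format as the notebook cell above."""
    return f'{prefix}-{time.strftime("%Y-%m-%d-%H-%M-%S-%j", when)}'


# Fixed instant (2022-08-03 15:59:50 UTC) so the result is reproducible.
name = build_job_name("pytorch-cnn-dist-job", time.gmtime(1659542390))
print(name)

# Pod names append a suffix such as "-master-0", so check the longest one.
longest_pod_name = name + "-master-0"
assert DNS_LABEL.match(longest_pod_name)
assert len(longest_pod_name) <= 63
```

The trailing `%j` field is the day of the year, which is why the names in this notebook end in `-215` for 3 August.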

## Create PyTorchJob Custom Resource YAML File

In [19]:
# Create Volume specification for PyTorchJob to be claimed by master and worker pods 

persistent_vol_claim = V1PersistentVolumeClaimVolumeSource(
    claim_name=efs_mount_point
)

efs_volume = V1Volume(
    name=efs_mount_point,
    persistent_volume_claim=persistent_vol_claim
)

In [20]:
# Create container specification for PyTorchJob master and worker pods 

# Mount volume to container pods
efsvolumemount = V1VolumeMount(
    mount_path="/"+efs_mount_point,
    name=efs_mount_point
)

# Resource configuration for master and worker containers
resource_reqs = V1ResourceRequirements(
    limits={'nvidia.com/gpu':'1'}
)

# Create master and worker container spec 
container = V1Container(
    name="pytorch",
    image=aws_dlc_pytorch_gpu_image,
    args=[
        "python", "./" + efs_mount_point + "/cifar10-distributed-gpu-final.py",
        "--epochs", "3",
        "--seed", "7",
        "--log-interval", "60",
        "--efs-mount-path", efs_mount_point,
        "--efs-dir-path", "cifar10-dataset",
    ],
    volume_mounts=[efsvolumemount],
    resources=resource_reqs
)
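
The script path passed in `args` is relative (`./efs-sc-claim/...`), while the volume is mounted at the absolute path `/efs-sc-claim`. The two coincide only when the container's working directory is `/`. The notebook does not state the image's working directory, so the sketch below only illustrates how the relative path resolves under two assumed working directories; `/workspace` is a made-up example.

```python
import posixpath

efs_mount_point = "efs-sc-claim"
mount_path = "/" + efs_mount_point
script_arg = "./" + efs_mount_point + "/cifar10-distributed-gpu-final.py"


def resolve(workdir: str, relative: str) -> str:
    """Resolve a relative path the way a process started in `workdir` would."""
    return posixpath.normpath(posixpath.join(workdir, relative))


for workdir in ("/", "/workspace"):  # "/workspace" is a hypothetical example
    resolved = resolve(workdir, script_arg)
    on_volume = resolved.startswith(mount_path + "/")
    print(f"workdir={workdir!r:14} -> {resolved} (on EFS volume: {on_volume})")
```

Passing the absolute path instead (`mount_path + "/cifar10-distributed-gpu-final.py"`) would remove the dependency on the working directory.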

In [21]:
# Create master specification 
master = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(
            annotations={'sidecar.istio.io/inject': 'false'}
        ),
        spec=V1PodSpec(
            containers=[container],
            volumes=[efs_volume]
        )
    )
)

# Create worker specification 
worker = V1ReplicaSpec(
    replicas=2,  # Number of worker pods; each pod requests one GPU (see resource_reqs)
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(
            annotations={'sidecar.istio.io/inject': 'false'}
        ),
        spec=V1PodSpec(
            containers=[container],
            volumes=[efs_volume]
        )
    )
)


In [22]:
# Define PyTorchJob custom resource manifest 
pytorchjob = V1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=V1ObjectMeta(name=pytorch_distributed_jobname,namespace=user_namespace),
    spec=V1PyTorchJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        pytorch_replica_specs={"Master": master,
                               "Worker": worker}
    )
)
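
For reference, the object built above corresponds to a PyTorchJob manifest along the following lines. This is a hand-written sketch using plain dictionaries and the camelCase field names of the `kubeflow.org/v1` resource, not the output of the client. The job name and namespace are placeholders, and the container args are omitted for brevity.

```python
import json

efs_mount_point = "efs-sc-claim"
image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu116-ubuntu20.04-e3"


def replica_spec(replicas: int) -> dict:
    """Build one replica spec; master and worker share the same pod template."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
            "spec": {
                "containers": [{
                    "name": "pytorch",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                    "volumeMounts": [{"name": efs_mount_point,
                                      "mountPath": "/" + efs_mount_point}],
                }],
                "volumes": [{
                    "name": efs_mount_point,
                    "persistentVolumeClaim": {"claimName": efs_mount_point},
                }],
            },
        },
    }


manifest = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    # Placeholder name and namespace for illustration only.
    "metadata": {"name": "pytorch-cnn-dist-job-example", "namespace": "example-namespace"},
    "spec": {
        "runPolicy": {"cleanPodPolicy": "None"},
        "pytorchReplicaSpecs": {"Master": replica_spec(1), "Worker": replica_spec(2)},
    },
}

print(json.dumps(manifest, indent=2))
```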

In [23]:
pytorchjob_client = PyTorchJobClient()

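try:
    # Delete a job with the same name, if one exists, before creating a new one
    if pytorchjob_client.get(pytorch_distributed_jobname, namespace=user_namespace):
        pytorchjob_client.delete(pytorch_distributed_jobname, namespace=user_namespace)
        print("Existing job: %s deleted" % pytorch_distributed_jobname)
except Exception:
    # The client raises an error when the job cannot be found
    print("There is no existing job: %s. Please go ahead and create a new one" % pytorch_distributed_jobname)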

There is no existing job: pytorch-cnn-dist-job-2022-08-03-15-59-50-215. Please go ahead and create a new one


In [24]:
# Create the PyTorchJob custom resource and submit it to Kubernetes
pytorch_job_manifest=pytorchjob_client.create(pytorchjob)

In [25]:
# Print the submitted PyTorchJob custom resource file for reference 

# pytorch_job_manifest

In [None]:
# Function definition: save_master_worker_spec(pytorch_client: PyTorchJobClient, pytorch_jobname: str) -> str
#   The function also extracts the master and worker specs, which can be used to create the pipeline

save_master_worker_spec(pytorchjob_client, pytorch_distributed_jobname)

In [28]:
#  Function Definition: read_logs(pyTorchClient: str, jobname: str, namespace: str, log_type: str) -> None:
#    log_type: all, worker:all, master:all, worker:0, worker:1

read_logs(pytorchjob_client, pytorch_distributed_jobname, user_namespace, "master:0")

# Useful Commands to run on Kubernetes control plane or in notebook using !. Substitute your namespace and pod names
#  !kubectl get pods -n <aws-hybrid-training-ns>  
#  !kubectl logs <pod-name> -n <aws-hybrid-training-ns> -f

The logs of Pod pytorch-cnn-dist-job-2022-08-03-15-59-50-215-master-0:
 Starting the script.
Distributed training - True
args.hosts - 3
args.current_host - 0
Initialized the distributed environment: 'gloo' backend on 3 nodes. 
data dir path - /efs-sc-claim/cifar10-dataset
Get train data loader
Get test data loader
Processes 16667/50000 (33%) of train data
Processes 10000/10000 (100%) of test data
Test set: Average loss: -0.0276, Accuracy: 0.19

Test set: Average loss: -0.6475, Accuracy: 0.29




Waiting for Pod condition to be Running
Master and Worker Pods are Running now
**** PyTorchJob status **** 
Running
*************************** 


**** Pod names of the PyTorchJob **** 
{'pytorch-cnn-dist-job-2022-08-03-15-59-50-215-worker-0', 'pytorch-cnn-dist-job-2022-08-03-15-59-50-215-master-0', 'pytorch-cnn-dist-job-2022-08-03-15-59-50-215-worker-1'}
*************************** 
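
The log lines above are consistent with one master plus two workers forming a group of three processes (`args.hosts - 3`), each handling roughly a third of the 50,000 training images. The training script itself is not shown in this notebook, so the following is only a sketch of the arithmetic, assuming the script shards the data the way PyTorch's `DistributedSampler` does by default (rounding up per process).

```python
import math

master_replicas = 1
worker_replicas = 2
world_size = master_replicas + worker_replicas  # matches "args.hosts - 3" in the log

train_samples = 50_000  # size of the training set reported in the log

# Each process gets ceil(N / world_size) samples; a few samples are repeated
# so that every process sees the same number of samples.
samples_per_process = math.ceil(train_samples / world_size)
share = round(100 * samples_per_process / train_samples)

print(f"world size: {world_size}")
print(f"Processes {samples_per_process}/{train_samples} ({share}%) of train data")
```

The test set is reported as 10000/10000 (100%), which suggests that only the training data is sharded in this run.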



In [29]:
# Check if the job succeeded

pytorchjob_client.is_job_succeeded(pytorch_distributed_jobname, user_namespace)

True
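
`is_job_succeeded` reports the job state at the moment it is called, so it returns `False` while training is still running. A minimal polling sketch is shown below. `wait_until` is a hypothetical helper, not part of the Kubeflow SDK, and it accepts any zero-argument callable so that it can be tried here with a stand-in instead of a live cluster.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float = 600.0,
               interval_s: float = 30.0) -> bool:
    """Call `check` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for the real check: reports success on the third call.
calls = {"count": 0}


def fake_job_succeeded() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


result = wait_until(fake_job_succeeded, timeout_s=5.0, interval_s=0.01)
print(result, calls["count"])
```

In this notebook, the callable could be `lambda: pytorchjob_client.is_job_succeeded(pytorch_distributed_jobname, user_namespace)`.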

## Cleanup

In [34]:
# Delete all previously submitted PyTorchJobs with this command. You can run it in the notebook or from the Kubernetes CLI.

#!kubectl get pytorchjob --no-headers=true -A | awk '/pytorch-cnn-dist/{print $2}' | xargs  kubectl delete pytorchjob  -n aws-hybrid-training-ns              
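
The pipeline above lists PyTorchJobs in every namespace (`-A`) but deletes them in one fixed namespace, so a matching job that lives in another namespace would make `kubectl delete` report an error. The sketch below shows one way to make the selection explicit in Python: it filters the listing by namespace and name prefix before anything is deleted. The sample listing is made up for illustration, and the column layout (namespace first, name second) is the same assumption the `awk` command relies on. Unlike the `awk` pattern, which matches anywhere in the line, this version matches only the start of the job name.

```python
from typing import List


def select_jobs(listing: str, namespace: str, name_prefix: str) -> List[str]:
    """Return job names in `namespace` whose name starts with `name_prefix`."""
    selected = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        job_namespace, job_name = fields[0], fields[1]
        if job_namespace == namespace and job_name.startswith(name_prefix):
            selected.append(job_name)
    return selected


# Made-up output of: kubectl get pytorchjob --no-headers=true -A
sample_listing = """\
aws-hybrid-training-ns  pytorch-cnn-dist-job-2022-08-03-15-59-50-215  Succeeded  2h
aws-hybrid-training-ns  some-other-job                                Running    5m
another-ns              pytorch-cnn-dist-job-2022-08-01-10-00-00-213  Succeeded  2d
"""

to_delete = select_jobs(sample_listing, "aws-hybrid-training-ns", "pytorch-cnn-dist")
print(to_delete)
```

Each selected name could then be passed to `pytorchjob_client.delete(name, namespace=...)`, the same call used earlier in this notebook.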