## Reduction Server PyTorch Tutorial

This notebook demonstrates how to use the Reduction Server feature of Vertex AI Training.

First, install the Vertex AI Python SDK.

In [9]:
!pip install -U google-cloud-aiplatform

Collecting google-cloud-aiplatform
  Using cached google_cloud_aiplatform-1.1.1-py2.py3-none-any.whl (1.2 MB)
  Using cached google_cloud_aiplatform-1.1.0-py2.py3-none-any.whl (1.2 MB)


Before running the following cells, run `gcloud auth application-default login` in your terminal to authenticate the SDK.
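As a quick sanity check before continuing, the sketch below looks for an Application Default Credentials file in the places where one is commonly stored. It does not validate the credentials; it only checks that a file exists. The helper names `adc_candidates` and `has_adc` are hypothetical, and the paths are the usual defaults, which may differ on a customized gcloud installation.

```python
import os
from pathlib import Path

ADC_FILENAME = "application_default_credentials.json"


def adc_candidates(env=None, home=None):
    """Return the paths where Application Default Credentials are usually found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    candidates = []
    # An explicit key file set through the environment takes precedence.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        candidates.append(Path(explicit))
    # Usual location on Windows.
    if env.get("APPDATA"):
        candidates.append(Path(env["APPDATA"]) / "gcloud" / ADC_FILENAME)
    # Usual location on Linux and macOS.
    candidates.append(home / ".config" / "gcloud" / ADC_FILENAME)
    return candidates


def has_adc(env=None, home=None):
    """Return True if any candidate credentials file exists on disk."""
    return any(path.is_file() for path in adc_candidates(env, home))
```

If `has_adc()` returns `False`, the login command above has most likely not been run for the current user.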

In [10]:
PROJECT = 'curious-entropy-222019'
REGION = 'us-central1'
API_ENDPOINT = f'{REGION}-aiplatform.googleapis.com'
TRAINING_IMAGE = f'gcr.io/{PROJECT}/rs-test-pytorch:latest'

Build and push the training image. See `Dockerfile` for details about how to prepare your training image for Reduction Server.
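The `Dockerfile` itself is not reproduced in this notebook, but its seven steps are echoed in the build log of the next cell. As a rough sketch, those steps can be collected into a Python string and written to disk for inspection. The line breaks inside the `RUN` step, the helper name `write_dockerfile`, and the file name `Dockerfile.sketch` are assumptions; the real `Dockerfile` may differ in layout.

```python
from pathlib import Path

# Steps copied from the `docker build` log; formatting is approximate.
DOCKERFILE = """\
FROM gcr.io/deeplearning-platform-release/pytorch-gpu.1-7
WORKDIR /root
RUN apt-get update && \\
    apt-get remove -y google-fast-socket && \\
    apt-get install -y libcupti-dev google-reduction-server
ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
ENV NCCL_DEBUG=INFO
COPY pytorch_distributed_trainer.py pytorch_distributed_trainer.py
ENTRYPOINT ["python", "-m", "torch.distributed.launch", "--nproc_per_node", "8", "--nnodes", "2", "pytorch_distributed_trainer.py"]
"""


def write_dockerfile(path="Dockerfile.sketch"):
    """Write the reconstructed steps to `path` and return it as a Path."""
    target = Path(path)
    target.write_text(DOCKERFILE)
    return target
```

The step specific to Reduction Server is the `RUN` line, which removes `google-fast-socket` and installs `google-reduction-server`. `NCCL_DEBUG=INFO` makes NCCL log at info level in the training job logs.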

In [11]:
!docker build -t $TRAINING_IMAGE .
!docker push $TRAINING_IMAGE


Step 1/7 : FROM gcr.io/deeplearning-platform-release/pytorch-gpu.1-7
 ---> 5bc07240b483
Step 2/7 : WORKDIR /root
 ---> Using cache
 ---> cf847519748d
Step 3/7 : RUN apt-get update &&     apt-get remove -y google-fast-socket &&     apt-get install -y libcupti-dev google-reduction-server
 ---> Using cache
 ---> d0bf607b199a
Step 4/7 : ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
 ---> Using cache
 ---> 012b86c7bb23
Step 5/7 : ENV NCCL_DEBUG=INFO
 ---> Using cache
 ---> 70333525c831
Step 6/7 : COPY pytorch_distributed_trainer.py pytorch_distributed_trainer.py
 ---> 48d44ea5e092
Step 7/7 : ENTRYPOINT ["python",   "-m", "torch.distributed.launch",   "--nproc_per_node", "8",   "--nnodes", "2",   "pytorch_distributed_trainer.py" ]
 ---> Running in 65b51b289929
Removing intermediate container 65b51b289929
 ---> b66b18bd73a6
Successfully built b66b18bd73a6
Successfully tagged gcr.io/curious-entropy-222019/rs-test-pytorch:latest
The push refers to repository [gcr.io

Create the training job. In this example, we use two `a2-highgpu-8g` machines as worker nodes and eight `n1-highcpu-16` machines as reducer nodes.
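The pool layout described above can also be generated by a small helper instead of being written out by hand. This is a minimal sketch, not part of the SDK: the name `build_worker_pool_specs` and its parameters are hypothetical. It assumes the usual Vertex AI ordering, in which the first pool holds a single primary replica, the second holds the remaining workers, and the third holds the reducers. The reducer image URI is the one that appears in this notebook's job spec.

```python
REDUCER_IMAGE = 'us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest'


def build_worker_pool_specs(training_image, num_workers=2, num_reducers=8,
                            reducer_image=REDUCER_IMAGE):
    """Return worker_pool_specs for GPU workers plus CPU-only reducers."""
    if num_workers < 2:
        raise ValueError('this sketch expects at least two worker nodes')
    if num_reducers < 1:
        raise ValueError('at least one reducer node is required')

    def gpu_pool(replica_count):
        # One pool of a2-highgpu-8g nodes, each with 8 A100 GPUs.
        return {
            'container_spec': {'image_uri': training_image},
            'machine_spec': {
                'machine_type': 'a2-highgpu-8g',
                'accelerator_type': 'NVIDIA_TESLA_A100',
                'accelerator_count': 8,
            },
            'replica_count': replica_count,
        }

    # Reducers run the Reduction Server image on CPU-only machines.
    reducer_pool = {
        'container_spec': {'image_uri': reducer_image},
        'machine_spec': {'machine_type': 'n1-highcpu-16'},
        'replica_count': num_reducers,
    }
    return [gpu_pool(1), gpu_pool(num_workers - 1), reducer_pool]
```

Calling `build_worker_pool_specs(TRAINING_IMAGE)` yields the three pools used in this example. Because the image's entrypoint fixes `--nnodes` to 2, a different worker count would also require changing the image.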

In [12]:
from google.cloud import aiplatform, aiplatform_v1beta1

aiplatform.init(
    # your Google Cloud project ID or number
    # the environment default is used if not set
    project=PROJECT,

    # the Vertex AI region you will use
    # defaults to us-central1
    location=REGION,
)

custom_job_spec = {
    'display_name': 'reduction-server-job',
    'job_spec': {
        'worker_pool_specs': [
            # Pool 0: first GPU worker node
            {
                'container_spec': {
                    'image_uri': TRAINING_IMAGE
                },
                'machine_spec': {
                    'accelerator_count': 8,
                    'accelerator_type': 'NVIDIA_TESLA_A100',
                    'machine_type': 'a2-highgpu-8g'
                },
                'replica_count': 1
            },
            # Pool 1: second GPU worker node
            {
                'container_spec': {
                    'image_uri': TRAINING_IMAGE
                },
                'machine_spec': {
                    'accelerator_count': 8,
                    'accelerator_type': 'NVIDIA_TESLA_A100',
                    'machine_type': 'a2-highgpu-8g'
                },
                'replica_count': 1
            },
            # Pool 2: CPU-only reducer nodes running the Reduction Server image
            {
                'container_spec': {
                    'image_uri': 'us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest'
                },
                'machine_spec': {
                    'machine_type': 'n1-highcpu-16'
                },
                'replica_count': 8
            },
        ]
    }
}

options = dict(api_endpoint=API_ENDPOINT)
client = aiplatform_v1beta1.services.job_service.JobServiceClient(client_options=options)
parent = f"projects/{PROJECT}/locations/{REGION}"
client.create_custom_job(
   parent=parent, custom_job=custom_job_spec
)


name: "projects/697926852371/locations/us-central1/customJobs/4654624146515296256"
display_name: "reduction-server-job"
job_spec {
  worker_pool_specs {
    machine_spec {
      machine_type: "a2-highgpu-8g"
      accelerator_type: NVIDIA_TESLA_A100
      accelerator_count: 8
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/curious-entropy-222019/rs-test-pytorch:latest"
    }
  }
  worker_pool_specs {
    machine_spec {
      machine_type: "a2-highgpu-8g"
      accelerator_type: NVIDIA_TESLA_A100
      accelerator_count: 8
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/curious-entropy-222019/rs-test-pytorch:latest"
    }
  }
  worker_pool_specs {
    machine_spec {
      machine_type: "n1-highcpu-16"
    }
    replica_count: 8
    disk_spec {
      boot_disk_type: "pd-ssd"
  

You should now be able to see your training job in the Cloud Console.