## Reduction Server PyTorch Tutorial

This notebook demonstrates how to use the Reduction Server feature on Vertex AI Training.

First, install the Vertex AI Python SDK.

In [29]:
!pip install -U google-cloud-aiplatform

Collecting google-cloud-aiplatform
  Using cached google_cloud_aiplatform-1.1.1-py2.py3-none-any.whl (1.2 MB)
  Using cached google_cloud_aiplatform-1.1.0-py2.py3-none-any.whl (1.2 MB)


Before you run the following cells, run `gcloud auth application-default login` in your terminal to authenticate the SDK.

In [30]:
PROJECT = 'curious-entropy-222019'
REGION = 'us-central1'
API_ENDPOINT = f'{REGION}-aiplatform.googleapis.com'
TRAINING_IMAGE = f'gcr.io/{PROJECT}/rs-test-pytorch:latest'

Build and push the training image. See the `Dockerfile` for details on how to prepare your training image for Reduction Server.

In [31]:
!docker build -t $TRAINING_IMAGE .
!docker push $TRAINING_IMAGE


Step 1/8 : FROM gcr.io/deeplearning-platform-release/pytorch-gpu.1-7
 ---> 5bc07240b483
Step 2/8 : WORKDIR /root
 ---> Using cache
 ---> cf847519748d
Step 3/8 : RUN apt-get update &&     apt-get remove -y google-fast-socket &&     apt-get install -y libcupti-dev google-reduction-server
 ---> Using cache
 ---> d0bf607b199a
Step 4/8 : ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
 ---> Using cache
 ---> 012b86c7bb23
Step 5/8 : ENV NCCL_DEBUG=INFO
 ---> Using cache
 ---> 70333525c831
Step 6/8 : COPY mnist_trainer.py mnist_trainer.py
 ---> Using cache
 ---> bc8e4cfd76b1
Step 7/8 : COPY run.sh run.sh
 ---> 793b80322286
Step 8/8 : ENTRYPOINT ["run.sh"]
 ---> Running in d688a0d18086
Removing intermediate container d688a0d18086
 ---> 7fe1943c85e2
Successfully built 7fe1943c85e2
Successfully tagged gcr.io/curious-entropy-222019/rs-test-pytorch:latest
The push refers to repository [gcr.io/curious-entropy-222019/rs-test-pytorch]

Create the training job. This example uses two `a2-highgpu-8g` machines as worker nodes and eight `n1-highcpu-16` machines as reducer nodes.
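The topology described above can also be generated programmatically. The following is a minimal sketch, not part of the original notebook: `make_pool` and `build_worker_pool_specs` are hypothetical helper names, and the machine types, accelerator settings, and reducer image are copied from the job spec used in this notebook. It assumes at least two GPU nodes, so that the second worker pool is never empty.

```python
from typing import Any, Dict, List, Optional

# Reducer image URI copied from the job spec in this notebook.
REDUCER_IMAGE = 'us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest'


def make_pool(image_uri: str, machine_type: str, replica_count: int,
              accelerator_type: Optional[str] = None,
              accelerator_count: int = 0) -> Dict[str, Any]:
    """Build one worker pool spec as a plain dict (hypothetical helper)."""
    machine_spec: Dict[str, Any] = {'machine_type': machine_type}
    if accelerator_type is not None:
        machine_spec['accelerator_type'] = accelerator_type
        machine_spec['accelerator_count'] = accelerator_count
    return {
        'container_spec': {'image_uri': image_uri},
        'machine_spec': machine_spec,
        'replica_count': replica_count,
    }


def build_worker_pool_specs(training_image: str, num_gpu_nodes: int = 2,
                            num_reducers: int = 8) -> List[Dict[str, Any]]:
    """Return three pools: first GPU node, remaining GPU nodes, reducers."""
    if num_gpu_nodes < 2:
        # Assumption of this sketch: the second pool must hold at least one replica.
        raise ValueError('this sketch assumes at least two GPU nodes')
    gpu = dict(machine_type='a2-highgpu-8g',
               accelerator_type='NVIDIA_TESLA_A100',
               accelerator_count=8)
    return [
        make_pool(training_image, replica_count=1, **gpu),
        make_pool(training_image, replica_count=num_gpu_nodes - 1, **gpu),
        make_pool(REDUCER_IMAGE, 'n1-highcpu-16', num_reducers),
    ]


pools = build_worker_pool_specs('gcr.io/my-project/rs-test-pytorch:latest')
total_gpus = sum(p['replica_count'] * p['machine_spec'].get('accelerator_count', 0)
                 for p in pools)
print(len(pools), total_gpus)
```

With the defaults, the returned list has the same shape as the hand-written `worker_pool_specs` in the next cell; the image URI passed in the last lines is a placeholder.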

In [32]:
from google.cloud import aiplatform, aiplatform_v1beta1

aiplatform.init(
    # your Google Cloud project ID or number; the environment default is used if not set
    project=PROJECT,

    # the Vertex AI region you will use; defaults to us-central1
    location=REGION,
)

custom_job_spec = {
    'display_name': 'reduction-server-job',
    'job_spec': {
        'worker_pool_specs': [
            # Pool 0: the first (primary) GPU worker node.
            {
                'container_spec': {
                    'image_uri': TRAINING_IMAGE
                },
                'machine_spec': {
                    'accelerator_count': 8,
                    'accelerator_type': 'NVIDIA_TESLA_A100',
                    'machine_type': 'a2-highgpu-8g'
                },
                'replica_count': 1
            },
            # Pool 1: the second GPU worker node.
            {
                'container_spec': {
                    'image_uri': TRAINING_IMAGE
                },
                'machine_spec': {
                    'accelerator_count': 8,
                    'accelerator_type': 'NVIDIA_TESLA_A100',
                    'machine_type': 'a2-highgpu-8g'
                },
                'replica_count': 1
            },
            # Pool 2: the CPU-only reducer nodes running the Reduction Server image.
            {
                'container_spec': {
                    'image_uri': 'us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest'
                },
                'machine_spec': {
                    'machine_type': 'n1-highcpu-16'
                },
                'replica_count': 8
            },
        ]
    }
}

options = dict(api_endpoint=API_ENDPOINT)
client = aiplatform_v1beta1.services.job_service.JobServiceClient(client_options=options)
parent = f"projects/{PROJECT}/locations/{REGION}"
client.create_custom_job(
   parent=parent, custom_job=custom_job_spec
)


name: "projects/697926852371/locations/us-central1/customJobs/6490966894575616000"
display_name: "reduction-server-job"
job_spec {
  worker_pool_specs {
    machine_spec {
      machine_type: "a2-highgpu-8g"
      accelerator_type: NVIDIA_TESLA_A100
      accelerator_count: 8
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/curious-entropy-222019/rs-test-pytorch:latest"
    }
  }
  worker_pool_specs {
    machine_spec {
      machine_type: "a2-highgpu-8g"
      accelerator_type: NVIDIA_TESLA_A100
      accelerator_count: 8
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/curious-entropy-222019/rs-test-pytorch:latest"
    }
  }
  worker_pool_specs {
    machine_spec {
      machine_type: "n1-highcpu-16"
    }
    replica_count: 8
    disk_spec {
      boot_disk_type: "pd-ssd"
  

You should now be able to see your training job in the Cloud Console.
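Besides the console, the job state can be polled from Python. The sketch below is an illustration rather than part of the original notebook: `wait_for_job` and `TERMINAL_STATES` are hypothetical names, and the state strings assume the `JobState` enum names used by the Vertex AI API. The state lookup is passed in as a callable, so the loop itself does not depend on the SDK.

```python
import time
from typing import Callable

# Assumed terminal JobState names; adjust if your API version reports others.
TERMINAL_STATES = frozenset({
    'JOB_STATE_SUCCEEDED',
    'JOB_STATE_FAILED',
    'JOB_STATE_CANCELLED',
})


def wait_for_job(get_state: Callable[[], str],
                 poll_seconds: float = 60.0,
                 max_polls: int = 120,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Call get_state until it returns a terminal state or max_polls is reached.

    Returns the last state observed, which may be non-terminal on timeout.
    """
    state = get_state()
    for _ in range(max_polls - 1):
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
        state = get_state()
    return state


# Demonstration with a fake state sequence instead of a real job.
fake_states = iter(['JOB_STATE_PENDING', 'JOB_STATE_RUNNING', 'JOB_STATE_SUCCEEDED'])
final_state = wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)
```

In this notebook, one way to supply `get_state` would be to keep the return value of `client.create_custom_job(...)` in a variable, say `job` (the cell above does not assign it), and pass `lambda: client.get_custom_job(name=job.name).state.name`.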