## Reduction Server PyTorch Tutorial

This notebook demonstrates how to use the Reduction Server feature of Vertex AI Training.

First, install the Vertex AI Python SDK.

In [None]:
!pip install -U google-cloud-aiplatform

Before you run the following scripts, make sure you run `gcloud auth application-default login` in your terminal to authenticate the SDK.

In [None]:
PROJECT = 'YOUR PROJECT' # Replace with your GCP project ID (used in the image path and the API parent)
REGION = 'us-central1'
API_ENDPOINT = f'{REGION}-aiplatform.googleapis.com'
TRAINING_IMAGE = f'gcr.io/{PROJECT}/rs-test-pytorch:latest'

Build and push the training image. See `Dockerfile` for details on how to prepare your training image for Reduction Server.

In [None]:
!docker build -t $TRAINING_IMAGE .
!docker push $TRAINING_IMAGE

Create the training job. This example uses two `a2-highgpu-8g` machines as worker nodes and eight `n1-highcpu-16` machines as reducer nodes.

In [None]:
from google.cloud import aiplatform, aiplatform_v1beta1

aiplatform.init(
    # Your Google Cloud project ID or number; the environment default is used if not set.
    project=PROJECT,

    # The Vertex AI region to use; defaults to us-central1.
    location=REGION,
)

custom_job_spec = {
    'display_name': 'reduction-server-job',
    'job_spec': {
        'worker_pool_specs': [
            # Pool 0: the first GPU worker (a single replica).
            {
                'container_spec': {
                    'image_uri': TRAINING_IMAGE
                },
                'machine_spec': {
                    'accelerator_count': 8,
                    'accelerator_type': 'NVIDIA_TESLA_A100',
                    'machine_type': 'a2-highgpu-8g'
                },
                'replica_count': 1
            },
            # Pool 1: the remaining GPU worker(s).
            {
                'container_spec': {
                    'image_uri': TRAINING_IMAGE
                },
                'machine_spec': {
                    'accelerator_count': 8,
                    'accelerator_type': 'NVIDIA_TESLA_A100',
                    'machine_type': 'a2-highgpu-8g'
                },
                'replica_count': 1
            },
            # Pool 2: CPU-only reducer nodes running the Reduction Server image.
            {
                'container_spec': {
                    'image_uri': 'us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest'
                },
                'machine_spec': {
                    'machine_type': 'n1-highcpu-16'
                },
                'replica_count': 8
            },
        ]
    }
}

options = dict(api_endpoint=API_ENDPOINT)
client = aiplatform_v1beta1.services.job_service.JobServiceClient(client_options=options)
parent = f"projects/{PROJECT}/locations/{REGION}"
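# Keep the response so the job's resource name is available for later lookups.
response = client.create_custom_job(
    parent=parent, custom_job=custom_job_spec
)
print(response.name)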
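
The three worker pools above follow a fixed pattern: one GPU replica in the first pool, the remaining GPU workers in the second, and the CPU-only reducers in the third. As a minimal sketch of that pattern, the hypothetical helper below (`build_worker_pool_specs` is not part of the SDK, and the image names in the usage line are placeholders) builds the same list from a worker count and a reducer count. The default machine types and counts are copied from the spec above.

```python
def build_worker_pool_specs(training_image, reducer_image,
                            worker_count=2, reducer_count=8,
                            worker_machine_type='a2-highgpu-8g',
                            accelerator_type='NVIDIA_TESLA_A100',
                            accelerator_count=8,
                            reducer_machine_type='n1-highcpu-16'):
    """Hypothetical helper: return a three-pool worker_pool_specs list."""
    if worker_count < 2:
        raise ValueError('this sketch assumes at least two GPU workers')

    def gpu_pool(replicas):
        # One GPU worker pool entry with the given number of replicas.
        return {
            'container_spec': {'image_uri': training_image},
            'machine_spec': {
                'accelerator_count': accelerator_count,
                'accelerator_type': accelerator_type,
                'machine_type': worker_machine_type,
            },
            'replica_count': replicas,
        }

    reducer_pool = {
        'container_spec': {'image_uri': reducer_image},
        # Reducers are CPU-only, so no accelerator fields are set.
        'machine_spec': {'machine_type': reducer_machine_type},
        'replica_count': reducer_count,
    }
    # Pool 0 holds one worker, pool 1 holds the rest, pool 2 holds the reducers.
    return [gpu_pool(1), gpu_pool(worker_count - 1), reducer_pool]


# Placeholder image names, for illustration only.
specs = build_worker_pool_specs('training-image-placeholder',
                                'reducer-image-placeholder')
print([pool['replica_count'] for pool in specs])
```

With the defaults, the result has the same shape as the hand-written `worker_pool_specs` list, so it could be dropped into `custom_job_spec['job_spec']` when experimenting with different worker or reducer counts.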

Now you should be able to see your training job in the Cloud Console.
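
If you prefer to follow the job from the notebook rather than the console, a polling loop is one option. The sketch below is a hypothetical helper, not part of the SDK: it assumes `client` is the `JobServiceClient` created above, that `job_name` is the `name` field of the `CustomJob` returned by `create_custom_job`, and that the listed state names are the terminal ones you care about.

```python
import time

# Assumption: these JobState names are treated as terminal for this sketch.
TERMINAL_STATES = {
    'JOB_STATE_SUCCEEDED',
    'JOB_STATE_FAILED',
    'JOB_STATE_CANCELLED',
}


def wait_for_custom_job(client, job_name, poll_seconds=60, max_polls=None):
    """Poll a custom job until it reaches a terminal state or max_polls is hit.

    Returns the name of the last observed state.
    """
    polls = 0
    while True:
        job = client.get_custom_job(name=job_name)
        state = job.state.name  # e.g. 'JOB_STATE_RUNNING'
        if state in TERMINAL_STATES:
            return state
        polls += 1
        if max_polls is not None and polls >= max_polls:
            # Give up waiting and report the last state seen.
            return state
        time.sleep(poll_seconds)
```

A call such as `wait_for_custom_job(client, job_name)` blocks until the job finishes. Passing `max_polls` bounds the wait, which is useful for long-running training jobs.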