<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Tips/Python%20Training.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FTips%2FPython%2520Training.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Tips/Python%20Training.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Tips/Python%20Training.ipynb">
      <img width="32px" src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Python Training - Vertex AI Training Custom Jobs

ML Training with Python code as a Vertex AI Training Custom Job

Why?  This notebook is an IDE that also happens to have:
- **compute**: CPU, Memory, GPU
- **software**: container running with Python and loaded packages like TensorFlow, PyTorch, ...
- **code**: user-written instruction for ML training

But scaling this notebook instance to run our ML training code has limitations:
- paying `$$$$` while typing and troubleshooting
- running training code multiple times with different data sources
- running training code with multiple configurations of hyperparameters for tuning
- automating training code execution based on time or events

Rather than scaling this notebook up to larger **compute**, we want to launch a fit-for-purpose job that runs our training **code** using the **software** of choice on the **compute** needed to handle the size of our training data.  Vertex AI Training Custom Jobs make that simple.

Our training code can be in many locations and forms:
- local files
    - single script
    - folders/modules
    - Python Package Distribution
- GCS Bucket
    - single script
    - folders/modules
    - Python Package Distribution
- GitHub
    - single script
    - folders/modules
    - Python Package Distribution
- Repository
    - Python Package hosted on Artifact Registry
    
Vertex AI Training Custom Jobs can use training code from:
- local files: single script
- GCS Bucket: Python Source Distribution
- Custom Container
    - Built with code originating at any of the locations and forms above!
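As a rough sketch of how these options differ at the job-spec level: a worker pool entry carries either a `container_spec` (custom container) or a `python_package_spec` (Python source distribution on GCS, run on a prebuilt image). The container image URI, package URI, and module name below are hypothetical placeholders, not resources created in this notebook.

```python
# Hypothetical placeholders for illustration only
MACHINE_SPEC = {"machine_type": "n1-standard-4"}

def container_worker_pool(image_uri, args):
    """Worker pool that runs a custom container image."""
    return [{
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {"image_uri": image_uri, "args": list(args)},
    }]

def package_worker_pool(executor_image_uri, package_uris, python_module, args):
    """Worker pool that installs a source distribution from GCS on a prebuilt image."""
    return [{
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "python_package_spec": {
            "executor_image_uri": executor_image_uri,
            "package_uris": list(package_uris),
            "python_module": python_module,
            "args": list(args),
        },
    }]

container_pool = container_worker_pool(
    "REGION-docker.pkg.dev/PROJECT/REPO/trainer",  # hypothetical image
    ["--epochs=10"],
)
package_pool = package_worker_pool(
    "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12.py310:latest",
    ["gs://BUCKET/path/trainer-0.1.tar.gz"],  # hypothetical package location
    "trainer.train",                          # hypothetical module name
    ["--epochs=10"],
)
```

Either list could be passed as `worker_pool_specs` to `aiplatform.CustomJob`. For the local-script route, `aiplatform.CustomJob.from_local_script` packages the script and builds a comparable spec on your behalf.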

<p align="center" width="100%">
    <img src="../architectures/overview/training.png" width="45%">
    &nbsp; &nbsp; &nbsp; &nbsp;
    <img src="../architectures/overview/training2.png" width="45%">
</p>

---

**Prerequisites:**

The examples below use:
- the code in various formats created in the [Python Packages](./Python%20Packages.ipynb) notebook
- the custom containers created in multiple workflows by the [Python Custom Containers](./Python%20Custom%20Containers.ipynb) notebook



---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Tips/Python%20Training.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs

The list `packages` contains tuples of package import names and install names.  If the import name is not found, then the install name is used to install the package quietly for the current user.

In [3]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform')
]

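import importlib.util
install = False
for package in packages:
    try:
        found = importlib.util.find_spec(package[0]) is not None
    except ModuleNotFoundError:
        # raised for dotted names when a parent package (like google.cloud) is missing
        found = False
    if not found:
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user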

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [4]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

inputs:

In [5]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [6]:
REGION = 'us-central1'
EXPERIMENT = 'training'
SERIES = 'tips'

# specify a GCS Bucket
GCS_BUCKET = PROJECT_ID

packages:

In [58]:
import os, shutil
import importlib
from datetime import datetime

from google.cloud import aiplatform
from google.cloud import storage

clients:

In [59]:
aiplatform.init(project = PROJECT_ID, location = REGION)
gcs = storage.Client(project = PROJECT_ID)

parameters:

In [9]:
DIR = f'code/{EXPERIMENT}'

In [10]:
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'1026793852137-compute@developer.gserviceaccount.com'

List the service account's current roles:

In [11]:
!gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$SERVICE_ACCOUNT" --format='table(bindings.role)' --flatten="bindings[].members"

ROLE
roles/bigquery.admin
roles/owner
roles/run.admin
roles/secretmanager.secretAccessor
roles/storage.objectAdmin


>Note: If the resulting list is missing [roles/storage.objectAdmin](https://cloud.google.com/storage/docs/access-control/iam-roles) then [revisit the setup notebook](../00%20-%20Setup/00%20-%20Environment%20Setup.ipynb#permissions) and add this permission to the service account with the provided instructions.

environment:

In [12]:
if not os.path.exists(DIR):
    os.makedirs(DIR)
else:
    shutil.rmtree(DIR, ignore_errors = True)
    os.makedirs(DIR)
    
# list contents of directory one level higher than DIR
os.listdir(DIR + '/../')

['training', 'containers', 'packages']

---
## Local Files

Some of the methods covered below either launch Vertex AI Training from local code or actually run the code locally.  If the prior notebooks in the series have been run in this environment, then the code is already present.

If this notebook is being run in isolation (like Colab) then the following cell will copy the results from a prior run of these notebooks to the local drive from GCS.

Make sure training code is in local directory (if not already included in a clone of this repository):

In [69]:
train_dir = 'code/packages'
if not os.path.exists(train_dir):
    print('not found locally, copying from GCS...')
    bucket = gcs.lookup_bucket(GCS_BUCKET)
    for blob in bucket.list_blobs(prefix = f'{SERIES}/{train_dir}'):
        # map the object name to a local path under train_dir
        file_path = f'./{train_dir}' + blob.name.split(f'{SERIES}/{train_dir}')[-1]
        os.makedirs(os.path.dirname(file_path), exist_ok = True)
        blob.download_to_filename(file_path)
    print('Training code downloaded')
else:
    print('Training code found')

Training code found
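The path handling in the download cell can be checked in isolation. The sketch below mirrors that mapping as a small pure function; the helper name `local_path_for` and the example object name are hypothetical and only illustrate how an object under `{SERIES}/{train_dir}` lands under `./{train_dir}`.

```python
def local_path_for(blob_name, series, train_dir):
    """Map a GCS object name under `{series}/{train_dir}` to a local path under `./{train_dir}`."""
    prefix = f"{series}/{train_dir}"
    if not blob_name.startswith(prefix):
        raise ValueError(f"{blob_name!r} is not under {prefix!r}")
    # keep everything after the prefix, including the leading slash
    return f"./{train_dir}" + blob_name[len(prefix):]

# hypothetical object name, for illustration only
example_path = local_path_for("tips/code/packages/trainer/train.py", "tips", "code/packages")
print(example_path)
```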


---
## Vertex AI Training Custom Jobs Example Workflows

Vertex AI Training Custom Jobs can use:
- a local script
- GCS housed Python source distribution
- a custom container
    - all the workflows from the [Python Custom Containers](./Python%20Custom%20Containers.ipynb) notebook

This section shows examples of running Vertex AI Custom Jobs in many different workflows.  It also shows how to use a workflow to test the training script locally, in the notebook instance.

**Example Workflows**
- [Custom Container - Workflow 1 - Copy Script To Container](#workflow1)
- [Custom Container - Workflow 2 - Copy Folder To Container](#workflow2)
- [Custom Container - Workflow 3 - Copy Package To Container](#workflow3)
- [Custom Container - Workflow 4 - pip install package from GCS to container](#workflow4)
- [Custom Container - Workflow 5 - pip install package from GitHub to container](#workflow5)
- [Custom Container - Workflow 6 - pip install package from Artifact Registry to container](#workflow6)
- [Local Script](#script)
- [Python Source Distribution](#source)
- [Running in Notebook](#notebook)

---
### Common Prep for Examples

#### Inputs & Parameters

In [13]:
# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id,splits' # add more variables to the string with comma delimiters
EPOCHS = 10
BATCH_SIZE = 100

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

# Experiment Tracking
FRAMEWORK = 'tf'
TASK = 'classification'
MODEL_TYPE = 'dnn'
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-{FRAMEWORK}-{TASK}-{MODEL_TYPE}'

# Resources
TRAIN_COMPUTE = 'n1-standard-4'
TRAIN_IMAGE = 'us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12.py310:latest'
REPOSITORY = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{PROJECT_ID}-docker"

# parameters
BUCKET = PROJECT_ID
URI = f"gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}"
DIR = f"temp/{EXPERIMENT}"

#### Tensorboard

The example test jobs below are based on jobs in the `05 - TensorFlow` series and take advantage of Vertex AI Experiments and managed TensorBoard.  This section creates a TensorBoard instance and gets other inputs for the jobs:

In [14]:
tb = aiplatform.Tensorboard.list(filter=f"labels.series={SERIES}")
if tb:
    tb = tb[0]
else: 
    tb = aiplatform.Tensorboard.create(display_name = SERIES, labels = {'series' : f'{SERIES}'})

In [15]:
tb.resource_name

'projects/1026793852137/locations/us-central1/tensorboards/8386058747931262976'

#### Vertex AI Experiments

The code in this section initializes the experiment that represents this notebook.  Throughout the notebook, model training and evaluation information is logged to the experiment as an experiment run using:

- [.log_params](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform#google_cloud_aiplatform_log_params)
- [.log_metrics](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform#google_cloud_aiplatform_log_metrics)
- [.log_time_series_metrics](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform#google_cloud_aiplatform_log_time_series_metrics)

In [16]:
aiplatform.init(experiment = EXPERIMENT_NAME, experiment_tensorboard = tb.resource_name)
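A minimal sketch of how the three logging calls listed above might fit together inside one experiment run. It assumes `aiplatform.init(experiment=...)` has already been called with a TensorBoard-backed experiment; the helper names, the `loss` metric, and the parameter values are hypothetical and are not taken from the training code used in this series.

```python
def clean_params(params):
    """Coerce values to the scalar types (int, float, str) that parameter logging accepts."""
    return {k: v if isinstance(v, (int, float, str)) else str(v) for k, v in params.items()}

def log_example_run(run_name, params, metrics, loss_history):
    """Log one experiment run: parameters, summary metrics, and a per-step loss series."""
    # imported here so clean_params can be used without the SDK installed
    from google.cloud import aiplatform

    aiplatform.start_run(run_name)
    aiplatform.log_params(clean_params(params))
    aiplatform.log_metrics(metrics)
    for step, loss in enumerate(loss_history, start=1):
        # time series metrics are written to the experiment's TensorBoard
        aiplatform.log_time_series_metrics({"loss": loss}, step=step)
    aiplatform.end_run()

# hypothetical values, for illustration only
example_params = clean_params({"epochs": 10, "learning_rate": 0.01, "layers": [64, 32]})
print(example_params)
```

A call such as `log_example_run('run-example', {...}, {'auprc': 0.9}, [0.5, 0.3])` would then create one run in the initialized experiment; the values shown are placeholders.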

#### Vertex AI Training Custom Job Parameters

In [54]:
CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" #updated by each workflow below
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": '', # will be filled in below by the workflow
            "command": [],
            "args": [] # will be filled in below by the workflow
        }
    }
]
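On the receiving side, each entry in `CMDARGS` arrives in the container as a command-line argument. The trainer's actual parser lives in the code built by the Python Packages notebook and is not shown here, so the following is only a sketch of how a script might consume a few of these flags with `argparse`. Splitting `--var_omit` on commas is an assumption based on the value used above.

```python
import argparse

def build_parser():
    """Hypothetical parser covering a subset of the flags passed in CMDARGS."""
    parser = argparse.ArgumentParser(description="hypothetical trainer arguments")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=100)
    parser.add_argument("--var_target", type=str)
    parser.add_argument("--var_omit", type=str, default="")
    parser.add_argument("--run_name", type=str)
    return parser

example_args = [
    "--epochs=10",
    "--batch_size=100",
    "--var_target=Class",
    "--var_omit=transaction_id,splits",
    "--run_name=run-example",
]
args = build_parser().parse_args(example_args)

# assumed convention: comma-separated column names to drop before training
omit = [name for name in args.var_omit.split(",") if name]
print(args.epochs, args.batch_size, args.var_target, omit)
```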

---
<a id = 'workflow1'></a>
### Custom Container - Workflow 1 - Copy Script To Container

The custom container used here was created by [Python Custom Containers - Workflow 1](./Python%20Custom%20Containers.ipynb#workflow1).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).


Job Parameters:

In [18]:
WORKFLOW = 'workflow_1'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"

CMDARGS[-1] = "--run_name=" + RUN_NAME
WORKER_POOL_SPEC[0]['container_spec']['image_uri'] = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"
WORKER_POOL_SPEC[0]['container_spec']['args'] = CMDARGS
WORKER_POOL_SPEC

[{'replica_count': 1,
  'machine_spec': {'machine_type': 'n1-standard-4', 'accelerator_count': 0},
  'container_spec': {'image_uri': 'us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-docker/tips_trainer_workflow_1',
   'command': [],
   'args': ['--epochs=10',
    '--batch_size=100',
    '--var_target=Class',
    '--var_omit=transaction_id,splits',
    '--project_id=statmike-mlops-349915',
    '--bq_project=statmike-mlops-349915',
    '--bq_dataset=fraud',
    '--bq_table=fraud_prepped',
    '--region=us-central1',
    '--experiment=training',
    '--series=tips',
    '--experiment_name=experiment-tips-training-tf-classification-dnn',
    '--run_name=run-workflow-1-20231222150338']}}]
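Note that the cell above updates `CMDARGS` and `WORKER_POOL_SPEC` in place, so every workflow in this notebook reuses and overwrites the same objects. One possible alternative, sketched here with a hypothetical helper named `workflow_job_inputs`, is to derive a fresh copy per workflow so the shared templates stay untouched. The `now` parameter exists only to make the example reproducible.

```python
import copy
from datetime import datetime

def workflow_job_inputs(workflow, base_args, base_spec, repository, now=None):
    """Return (timestamp, run_name, worker_pool_spec) without mutating the shared templates."""
    timestamp = (now or datetime.now()).strftime("%Y%m%d%H%M%S")
    run_name = f"run-{workflow.replace('_', '-')}-{timestamp}"
    # drop the placeholder run name and append the real one
    args = [a for a in base_args if not a.startswith("--run_name=")]
    args.append("--run_name=" + run_name)
    spec = copy.deepcopy(base_spec)
    spec[0]["container_spec"]["image_uri"] = f"{repository}/tips_trainer_{workflow}"
    spec[0]["container_spec"]["args"] = args
    return timestamp, run_name, spec

# trimmed-down stand-ins for CMDARGS, WORKER_POOL_SPEC, and REPOSITORY
base_args = ["--epochs=10", "--run_name="]
base_spec = [{
    "replica_count": 1,
    "machine_spec": {"machine_type": "n1-standard-4", "accelerator_count": 0},
    "container_spec": {"image_uri": "", "command": [], "args": []},
}]
timestamp, run_name, spec = workflow_job_inputs(
    "workflow_1", base_args, base_spec, "REPOSITORY",
    now=datetime(2023, 12, 22, 15, 3, 38),
)
print(run_name)
```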

Define the `aiplatform.CustomJob`:

In [19]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)
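According to the SDK documentation, setting `base_output_dir` makes Vertex AI expose output locations to the training code through the environment variables `AIP_MODEL_DIR`, `AIP_CHECKPOINT_DIR`, and `AIP_TENSORBOARD_LOG_DIR`. Whether the trainer in these containers reads them is not shown in this notebook, so the following is only a sketch of how a training script could pick them up, with a hypothetical local fallback for runs outside Vertex AI.

```python
import os

def output_locations(environ=None, fallback="./output"):
    """Resolve output directories from Vertex AI environment variables, else a local fallback."""
    environ = os.environ if environ is None else environ
    return {
        "model": environ.get("AIP_MODEL_DIR", f"{fallback}/model/"),
        "checkpoints": environ.get("AIP_CHECKPOINT_DIR", f"{fallback}/checkpoints/"),
        "logs": environ.get("AIP_TENSORBOARD_LOG_DIR", f"{fallback}/logs/"),
    }

# simulated environment; the bucket path is a hypothetical placeholder
simulated = {
    "AIP_MODEL_DIR": "gs://BUCKET/tips/training/workflow_1/model/",
    "AIP_TENSORBOARD_LOG_DIR": "gs://BUCKET/tips/training/workflow_1/logs/",
}
locations = output_locations(simulated)
print(locations)
```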

Run the job:

In [20]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/6701374593627586560
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/6701374593627586560')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/6701374593627586560?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+6701374593627586560
CustomJob projects/1026793852137/locations/us-central1/customJobs/6701374593627586560 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/6701374593627586560 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/6701374593627586560 current state:
JobState.JOB_STATE_PENDING


Review the Job:

In [21]:
customJob.display_name

'training_tips_workflow_1_20231222150338'

In [22]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/6701374593627586560'

In [23]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/6701374593627586560/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+6701374593627586560
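The same link-building code recurs in each workflow below, so it could be collected into one helper. This sketch uses a hypothetical function name, `review_links`, and copies the URL patterns from the cell above; they are console URL conventions observed in this notebook rather than a documented API.

```python
def review_links(job_resource_name, tensorboard_resource_name, region, project_id):
    """Build console links for a custom job and its TensorBoard experiment."""
    job_id = job_resource_name.split("/")[-1]
    job_link = (
        f"https://console.cloud.google.com/vertex-ai/locations/{region}"
        f"/training/{job_id}/cpu?cloudshell=false&project={project_id}"
    )
    board_link = (
        f"https://{region}.tensorboard.googleusercontent.com/experiment/"
        f"{tensorboard_resource_name.replace('/', '+')}+experiments+{job_id}"
    )
    return job_link, board_link

# resource names taken from the outputs shown above
job_link, board_link = review_links(
    "projects/1026793852137/locations/us-central1/customJobs/6701374593627586560",
    "projects/1026793852137/locations/us-central1/tensorboards/8386058747931262976",
    "us-central1",
    "statmike-mlops-349915",
)
print(f"Review the Job here:\n{job_link}")
print(f"Review the TensorBoard From the Job here:\n{board_link}")
```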


---
<a id = 'workflow2'></a>
### Custom Container - Workflow 2 - Copy Folder To Container

The custom container used here was created by [Python Custom Containers - Workflow 2](./Python%20Custom%20Containers.ipynb#workflow2).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

Job Parameters:

In [24]:
WORKFLOW = 'workflow_2'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"


CMDARGS[-1] = "--run_name=" + RUN_NAME
WORKER_POOL_SPEC[0]['container_spec']['image_uri'] = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"
WORKER_POOL_SPEC[0]['container_spec']['args'] = CMDARGS

WORKER_POOL_SPEC

[{'replica_count': 1,
  'machine_spec': {'machine_type': 'n1-standard-4', 'accelerator_count': 0},
  'container_spec': {'image_uri': 'us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-docker/tips_trainer_workflow_2',
   'command': [],
   'args': ['--epochs=10',
    '--batch_size=100',
    '--var_target=Class',
    '--var_omit=transaction_id,splits',
    '--project_id=statmike-mlops-349915',
    '--bq_project=statmike-mlops-349915',
    '--bq_dataset=fraud',
    '--bq_table=fraud_prepped',
    '--region=us-central1',
    '--experiment=training',
    '--series=tips',
    '--experiment_name=experiment-tips-training-tf-classification-dnn',
    '--run_name=run-workflow-2-20231222151326']}}]

Define the `aiplatform.CustomJob`:

In [25]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [26]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/7323997242111557632
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/7323997242111557632')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/7323997242111557632?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+7323997242111557632
CustomJob projects/1026793852137/locations/us-central1/customJobs/7323997242111557632 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/7323997242111557632 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/7323997242111557632 current state:
JobState.JOB_STATE_PENDING


Review the Job:

In [27]:
customJob.display_name

'training_tips_workflow_2_20231222151326'

In [28]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/7323997242111557632'

In [29]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/7323997242111557632/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+7323997242111557632


---
<a id = 'workflow3'></a>
### Custom Container - Workflow 3 - Copy Package To Container

The custom container used here was created by [Python Custom Containers - Workflow 3](./Python%20Custom%20Containers.ipynb#workflow3).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

Job Parameters:

In [30]:
WORKFLOW = 'workflow_3'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"


CMDARGS[-1] = "--run_name=" + RUN_NAME
WORKER_POOL_SPEC[0]['container_spec']['image_uri'] = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"
WORKER_POOL_SPEC[0]['container_spec']['args'] = CMDARGS

WORKER_POOL_SPEC

[{'replica_count': 1,
  'machine_spec': {'machine_type': 'n1-standard-4', 'accelerator_count': 0},
  'container_spec': {'image_uri': 'us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-docker/tips_trainer_workflow_3',
   'command': [],
   'args': ['--epochs=10',
    '--batch_size=100',
    '--var_target=Class',
    '--var_omit=transaction_id,splits',
    '--project_id=statmike-mlops-349915',
    '--bq_project=statmike-mlops-349915',
    '--bq_dataset=fraud',
    '--bq_table=fraud_prepped',
    '--region=us-central1',
    '--experiment=training',
    '--series=tips',
    '--experiment_name=experiment-tips-training-tf-classification-dnn',
    '--run_name=run-workflow-3-20231222152234']}}]

Define the `aiplatform.CustomJob`:

In [31]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [32]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/5643591631148941312
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/5643591631148941312')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/5643591631148941312?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+5643591631148941312
CustomJob projects/1026793852137/locations/us-central1/customJobs/5643591631148941312 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/5643591631148941312 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/5643591631148941312 current state:
JobState.JOB_STATE_PENDING


Review the Job:

In [33]:
customJob.display_name

'training_tips_workflow_3_20231222152234'

In [34]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/5643591631148941312'

In [35]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/5643591631148941312/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+5643591631148941312


---
<a id = 'workflow4'></a>
### Custom Container - Workflow 4 - pip install package from GCS to container

The custom container used here was created by [Python Custom Containers - Workflow 4](./Python%20Custom%20Containers.ipynb#workflow4).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

In [36]:
WORKFLOW = 'workflow_4'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"


CMDARGS[-1] = "--run_name=" + RUN_NAME
WORKER_POOL_SPEC[0]['container_spec']['image_uri'] = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"
WORKER_POOL_SPEC[0]['container_spec']['args'] = CMDARGS

WORKER_POOL_SPEC

[{'replica_count': 1,
  'machine_spec': {'machine_type': 'n1-standard-4', 'accelerator_count': 0},
  'container_spec': {'image_uri': 'us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-docker/tips_trainer_workflow_4',
   'command': [],
   'args': ['--epochs=10',
    '--batch_size=100',
    '--var_target=Class',
    '--var_omit=transaction_id,splits',
    '--project_id=statmike-mlops-349915',
    '--bq_project=statmike-mlops-349915',
    '--bq_dataset=fraud',
    '--bq_table=fraud_prepped',
    '--region=us-central1',
    '--experiment=training',
    '--series=tips',
    '--experiment_name=experiment-tips-training-tf-classification-dnn',
    '--run_name=run-workflow-4-20231222153207']}}]

Define the `aiplatform.CustomJob`:

In [37]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [38]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/5738730173277143040
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/5738730173277143040')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/5738730173277143040?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+5738730173277143040
CustomJob projects/1026793852137/locations/us-central1/customJobs/5738730173277143040 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/5738730173277143040 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/5738730173277143040 current state:
JobState.JOB_STATE_PENDING


Review the Job:

In [39]:
customJob.display_name

'training_tips_workflow_4_20231222153207'

In [40]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/5738730173277143040'

In [41]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/5738730173277143040/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+5738730173277143040


---
<a id = 'workflow5'></a>
### Custom Container - Workflow 5 - pip install package from GitHub to container

The custom container used here was created by [Python Custom Containers - Workflow 5](./Python%20Custom%20Containers.ipynb#workflow5).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

In [42]:
WORKFLOW = 'workflow_5'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"


CMDARGS[-1] = "--run_name=" + RUN_NAME
WORKER_POOL_SPEC[0]['container_spec']['image_uri'] = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"
WORKER_POOL_SPEC[0]['container_spec']['args'] = CMDARGS

WORKER_POOL_SPEC

[{'replica_count': 1,
  'machine_spec': {'machine_type': 'n1-standard-4', 'accelerator_count': 0},
  'container_spec': {'image_uri': 'us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-docker/tips_trainer_workflow_5',
   'command': [],
   'args': ['--epochs=10',
    '--batch_size=100',
    '--var_target=Class',
    '--var_omit=transaction_id,splits',
    '--project_id=statmike-mlops-349915',
    '--bq_project=statmike-mlops-349915',
    '--bq_dataset=fraud',
    '--bq_table=fraud_prepped',
    '--region=us-central1',
    '--experiment=training',
    '--series=tips',
    '--experiment_name=experiment-tips-training-tf-classification-dnn',
    '--run_name=run-workflow-5-20231222154055']}}]

Define the `aiplatform.CustomJob`:

In [43]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [44]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/6620872750288338944
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/6620872750288338944')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/6620872750288338944?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+6620872750288338944
CustomJob projects/1026793852137/locations/us-central1/customJobs/6620872750288338944 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/6620872750288338944 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/6620872750288338944 current state:
JobState.JOB_STATE_PENDING


Review the Job:

In [45]:
customJob.display_name

'training_tips_workflow_5_20231222154055'

In [46]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/6620872750288338944'

In [47]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/6620872750288338944/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+6620872750288338944


---
<a id = 'workflow6'></a>
### Custom Container - Workflow 6 - pip install package from Artifact Registry to container

The custom container used here was created by [Python Custom Containers - Workflow 6](./Python%20Custom%20Containers.ipynb#workflow6).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

In [48]:
WORKFLOW = 'workflow_6'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"


CMDARGS[-1] = "--run_name=" + RUN_NAME
WORKER_POOL_SPEC[0]['container_spec']['image_uri'] = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"
WORKER_POOL_SPEC[0]['container_spec']['args'] = CMDARGS

WORKER_POOL_SPEC

[{'replica_count': 1,
  'machine_spec': {'machine_type': 'n1-standard-4', 'accelerator_count': 0},
  'container_spec': {'image_uri': 'us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-docker/tips_trainer_workflow_6',
   'command': [],
   'args': ['--epochs=10',
    '--batch_size=100',
    '--var_target=Class',
    '--var_omit=transaction_id,splits',
    '--project_id=statmike-mlops-349915',
    '--bq_project=statmike-mlops-349915',
    '--bq_dataset=fraud',
    '--bq_table=fraud_prepped',
    '--region=us-central1',
    '--experiment=training',
    '--series=tips',
    '--experiment_name=experiment-tips-training-tf-classification-dnn',
    '--run_name=run-workflow-6-20231222155043']}}]
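
The cell above updates the run name by position (`CMDARGS[-1]`), which works only while `--run_name` stays the last argument. A minimal sketch of replacing an argument by name instead; `set_arg` is a hypothetical helper, not part of this notebook or the Vertex AI SDK, and the argument values are shortened examples:

```python
def set_arg(args, name, value):
    """Return a copy of args with --name=value replaced in place, or appended if absent."""
    prefix = f"--{name}="
    new_item = f"{prefix}{value}"
    result = list(args)  # copy so the caller's list is not modified
    for i, item in enumerate(result):
        if item.startswith(prefix):
            result[i] = new_item
            return result
    result.append(new_item)
    return result


# Example values only; the order is deliberately different from CMDARGS.
example_args = ["--epochs=10", "--run_name=run-workflow-5-20231222154055", "--batch_size=100"]
updated = set_arg(example_args, "run_name", "run-workflow-6-20231222155043")
print(updated)
```

Because the helper returns a new list, each workflow can keep its own argument list rather than sharing one mutable `CMDARGS`.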

Define the `aiplatform.CustomJob`:

In [49]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [50]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/3664822544873029632
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/3664822544873029632')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/3664822544873029632?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+3664822544873029632
CustomJob projects/1026793852137/locations/us-central1/customJobs/3664822544873029632 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3664822544873029632 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3664822544873029632 current state:
JobState.JOB_STATE_PENDING


Review the Job:

In [51]:
customJob.display_name

'training_tips_workflow_6_20231222155043'

In [52]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/3664822544873029632'

In [53]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/3664822544873029632/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+3664822544873029632
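
The two links are built from the last segment of the job's resource name and from the TensorBoard resource name with `/` replaced by `+`. A small sketch that wraps the same string logic in functions; the function names are hypothetical, and the URL patterns are copied from the cell above rather than from any documented API:

```python
def job_id_from_resource_name(resource_name):
    """Return the trailing ID of a resource name such as projects/.../customJobs/123."""
    return resource_name.rsplit("/", 1)[-1]


def console_job_link(region, project_id, job_resource_name):
    """Console link for a custom job, following the pattern used in this notebook."""
    job_id = job_id_from_resource_name(job_resource_name)
    return (
        f"https://console.cloud.google.com/vertex-ai/locations/{region}"
        f"/training/{job_id}/cpu?cloudshell=false&project={project_id}"
    )


def tensorboard_job_link(region, tb_resource_name, job_resource_name):
    """TensorBoard link for a custom job, following the pattern used in this notebook."""
    job_id = job_id_from_resource_name(job_resource_name)
    tb_part = tb_resource_name.replace("/", "+")
    return (
        f"https://{region}.tensorboard.googleusercontent.com/experiment/"
        f"{tb_part}+experiments+{job_id}"
    )


# Values taken from the outputs shown above.
job_name = "projects/1026793852137/locations/us-central1/customJobs/3664822544873029632"
tb_name = "projects/1026793852137/locations/us-central1/tensorboards/8386058747931262976"
print(console_job_link("us-central1", "statmike-mlops-349915", job_name))
print(tensorboard_job_link("us-central1", tb_name, job_name))
```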


---
<a id = 'script'></a>
### Local Script

Run a single-file training script with `aiplatform.CustomJob.from_local_script()`.

Notes:
- This uses a single file `filename.py` from the local directory, not a GCS URI
- When you run `aiplatform.CustomJob.from_local_script()`, it responds with a message confirming that the local script was copied to the GCS URI provided in the `staging_bucket` parameter.

This is a modified version of notebook [05a - Vertex AI Custom Model - TensorFlow - Custom Job With Python File](../05%20-%20TensorFlow/05a%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Python%20File.ipynb) that uses the local script for this project.

In [70]:
WORKFLOW = 'workflow_script'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"

CMDARGS[-1] = "--run_name=" + RUN_NAME
CMDARGS

['--epochs=10',
 '--batch_size=100',
 '--var_target=Class',
 '--var_omit=transaction_id,splits',
 '--project_id=statmike-mlops-349915',
 '--bq_project=statmike-mlops-349915',
 '--bq_dataset=fraud',
 '--bq_table=fraud_prepped',
 '--region=us-central1',
 '--experiment=training',
 '--series=tips',
 '--experiment_name=experiment-tips-training-tf-classification-dnn',
 '--run_name=run-workflow-script-20231222164919']

In [71]:
customJob = aiplatform.CustomJob.from_local_script(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    script_path = "./code/packages/tips_trainer/src/tips_trainer/train.py",
    container_uri = TRAIN_IMAGE,
    args = CMDARGS,
    requirements = ['tensorflow_io', f'google-cloud-aiplatform>={aiplatform.__version__}', 'db-dtypes', f"protobuf>={importlib.metadata.version('protobuf')}"],
    replica_count = 1,
    machine_type = TRAIN_COMPUTE,
    accelerator_count = 0,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Training script copied to:
gs://statmike-mlops-349915/tips/training/workflow_script/20231222164919/aiplatform-2023-12-22-16:51:05.462-aiplatform_custom_trainer_script-0.1.tar.gz.


In [72]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/6200067660105908224
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/6200067660105908224')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/6200067660105908224?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+6200067660105908224
CustomJob projects/1026793852137/locations/us-central1/customJobs/6200067660105908224 current state:
JobState.JOB_STATE_QUEUED
CustomJob projects/1026793852137/locations/us-central1/customJobs/6200067660105908224 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/6200067660105908224 current state:
JobState.JOB_STATE_PENDING

In [73]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
print(f'Review the Job here:\n{job_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/6200067660105908224/cpu?cloudshell=false&project=statmike-mlops-349915


In [74]:
print(f'Review the model output here:\nhttps://console.cloud.google.com/storage/browser/{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/{WORKFLOW}/{TIMESTAMP}?project={PROJECT_ID}')

Review the model output here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/tips/training/workflow_script/20231222164919?project=statmike-mlops-349915


---
<a id = 'source'></a>
### Python Source Distribution

Run a Python Source Distribution with `aiplatform.CustomJob(..., worker_pool_specs = )` by specifying `python_package_spec = ` in the `worker_pool_specs`.

Notes:
- This uses a Python Source Distribution, which is a compressed tarball in the `.tar.gz` format
- This project has a prepared Python Source Distribution in `./code/packages/tips_trainer/dist/`, which was created by [Python Packages](./Python%20Packages.ipynb)
- The `python_package_spec` parameter of the `worker_pool_specs` has a subparameter `package_uris`, which accepts a list of up to 100 source distributions.  These must be provided as GCS URIs like `gs://bucketname/path_to_file.tar.gz`
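
The two constraints in the last note (GCS URIs only, at most 100 entries) can be checked before a job is submitted. This is a sketch only: `validate_package_uris` is a hypothetical helper, the bucket name is a placeholder, and the service may enforce further rules not covered here:

```python
from urllib.parse import urlparse

MAX_PACKAGE_URIS = 100  # limit stated in the notes above


def validate_package_uris(package_uris):
    """Return a copy of the list if every entry looks like gs://bucket/object."""
    if len(package_uris) > MAX_PACKAGE_URIS:
        raise ValueError(
            f"{len(package_uris)} package URIs given; the limit is {MAX_PACKAGE_URIS}"
        )
    for uri in package_uris:
        parsed = urlparse(uri)
        # Require the gs scheme, a bucket name, and a non-empty object path.
        if parsed.scheme != "gs" or not parsed.netloc or not parsed.path.strip("/"):
            raise ValueError(f"not a GCS object URI: {uri!r}")
    return list(package_uris)


# Placeholder bucket; the object path mirrors the layout used in this project.
uris = validate_package_uris(
    ["gs://example-bucket/tips/code/packages/tips_trainer/dist/tips_trainer-0.1.tar.gz"]
)
print(uris)
```

A local path such as `./dist/tips_trainer-0.1.tar.gz` raises `ValueError` here, which surfaces the mistake before the job is created.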

This is a modified version of notebook [05b - Vertex AI Custom Model - TensorFlow - Custom Job With Python Source Distribution](../05%20-%20TensorFlow/05b%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Python%20Source%20Distribution.ipynb) that uses the source distribution stored in GCS for this project.

In [75]:
WORKFLOW = 'workflow_source'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"

CMDARGS[-1] = "--run_name=" + RUN_NAME

# copy the spec (so WORKER_POOL_SPEC is left unchanged), then remove container_spec and replace with python_package_spec:
import copy
SOURCE_WORKER_POOL_SPEC = copy.deepcopy(WORKER_POOL_SPEC)
SOURCE_WORKER_POOL_SPEC[0].pop('container_spec', None)
SOURCE_WORKER_POOL_SPEC[0]['python_package_spec'] = {
            "executor_image_uri": TRAIN_IMAGE,
            "package_uris": [f"gs://{GCS_BUCKET}/{SERIES}/code/packages/tips_trainer/dist/tips_trainer-0.1.tar.gz"],
            "python_module": "tips_trainer.train",
            "args": CMDARGS
}

SOURCE_WORKER_POOL_SPEC

[{'replica_count': 1,
  'machine_spec': {'machine_type': 'n1-standard-4', 'accelerator_count': 0},
  'python_package_spec': {'executor_image_uri': 'us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12.py310:latest',
   'package_uris': ['gs://statmike-mlops-349915/tips/code/packages/tips_trainer/dist/tips_trainer-0.1.tar.gz'],
   'python_module': 'tips_trainer.train',
   'args': ['--epochs=10',
    '--batch_size=100',
    '--var_target=Class',
    '--var_omit=transaction_id,splits',
    '--project_id=statmike-mlops-349915',
    '--bq_project=statmike-mlops-349915',
    '--bq_dataset=fraud',
    '--bq_table=fraud_prepped',
    '--region=us-central1',
    '--experiment=training',
    '--series=tips',
    '--experiment_name=experiment-tips-training-tf-classification-dnn',
    '--run_name=run-workflow-source-20231222170120']}}]
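
Deriving one worker pool spec from another raises the question of whether the two names share the same underlying list. A minimal sketch, using placeholder values rather than this notebook's actual spec, of the difference between plain assignment and `copy.deepcopy`:

```python
import copy


def make_spec():
    """Build a small placeholder worker pool spec (values are illustrative only)."""
    return [{
        "replica_count": 1,
        "machine_spec": {"machine_type": "n1-standard-4", "accelerator_count": 0},
        "container_spec": {"image_uri": "example-image"},
    }]


# Plain assignment: both names refer to the same list and the same dicts.
base_spec = make_spec()
alias = base_spec
alias[0].pop("container_spec", None)
alias_removed_from_base = "container_spec" not in base_spec[0]
print(alias_removed_from_base)  # True: the original lost its container_spec too

# Deep copy: the nested dicts are duplicated, so the original is untouched.
base_spec = make_spec()
independent = copy.deepcopy(base_spec)
independent[0].pop("container_spec", None)
independent[0]["python_package_spec"] = {"python_module": "example.module"}
copy_left_base_intact = "container_spec" in base_spec[0]
print(copy_left_base_intact)  # True: the original still has its container_spec
```

A shallow copy (`list(base_spec)`) would not be enough here, because the inner dictionary would still be shared.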

In [76]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = SOURCE_WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

In [77]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/1635669437765910528
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/1635669437765910528')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1635669437765910528?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+1635669437765910528
CustomJob projects/1026793852137/locations/us-central1/customJobs/1635669437765910528 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/1635669437765910528 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/1635669437765910528 current state:
JobState.JOB_STATE_PENDING


In [78]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
print(f'Review the Job here:\n{job_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/1635669437765910528/cpu?cloudshell=false&project=statmike-mlops-349915


In [79]:
print(f'Review the model output here:\nhttps://console.cloud.google.com/storage/browser/{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/{WORKFLOW}/{TIMESTAMP}?project={PROJECT_ID}')

Review the model output here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/tips/training/workflow_source/20231222170120?project=statmike-mlops-349915


---
<a id = 'notebook'></a>
## Notebook - local code


Use the training script in the local notebook environment.  While the script is authored and packaged to run in a Vertex AI Training Custom Job, it can also be run locally, which is helpful for testing.

The job will be launched like any Python job with `python -m tips_trainer.train <list_of_args_here>`.  Choices for making the `tips_trainer.train` module/file/script available are:
- pip install from Artifact Registry: `pip install --index-url https://{REGION}-python.pkg.dev/{PROJECT_ID}/{PROJECT_ID}-python/simple tips-trainer`
- pip install from local directory: `pip install ./code/packages/tips_trainer/dist/*.whl`
- pip install from GitHub: `pip install https://github.com/statmike/vertex-ai-mlops/blob/main/Tips/code/packages/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl?raw=true`
- run from local directory: `./code/packages/tips_trainer/src` or copy to directory of choice

Notes:
- Vertex AI Training Jobs set [environment variables for Cloud Storage locations](https://cloud.google.com/vertex-ai/docs/training/code-requirements#environment-variables).  Since this example runs the training code in the local notebook instance rather than in a custom job, these need to be set manually.
    - `AIP_MODEL_DIR` - this extends `base_output_dir` with `/model`
    - `AIP_TENSORBOARD_LOG_DIR` - this extends `base_output_dir` with `/logs`
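
Training code that should work both in a custom job and locally can read these variables with a fallback. The sketch below is an assumption about how such code might look, not the contents of `tips_trainer.train`; `resolve_output_dirs`, the `./local_output` default, and the bucket name are all hypothetical:

```python
import os


def resolve_output_dirs(environ=None, default_base="./local_output"):
    """Return (model_dir, log_dir), preferring the AIP_* variables when they are set."""
    env = os.environ if environ is None else environ
    model_dir = env.get("AIP_MODEL_DIR", f"{default_base}/model")
    log_dir = env.get("AIP_TENSORBOARD_LOG_DIR", f"{default_base}/logs")
    return model_dir, log_dir


# Simulate a notebook session that set the variables manually (placeholder bucket).
base_output_dir = "gs://example-bucket/tips/training/workflow_nb_local/20231222171048"
simulated_env = {
    "AIP_MODEL_DIR": base_output_dir + "/model",
    "AIP_TENSORBOARD_LOG_DIR": base_output_dir + "/logs",
}
print(resolve_output_dirs(simulated_env))

# With neither variable set, the local defaults are used.
print(resolve_output_dirs({}))
```

Passing the environment as a parameter keeps the function easy to test without touching the real `os.environ`.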

In [80]:
WORKFLOW = 'workflow_nb_local'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"

CMDARGS[-1] = "--run_name=" + RUN_NAME
CMDARGS

['--epochs=10',
 '--batch_size=100',
 '--var_target=Class',
 '--var_omit=transaction_id,splits',
 '--project_id=statmike-mlops-349915',
 '--bq_project=statmike-mlops-349915',
 '--bq_dataset=fraud',
 '--bq_table=fraud_prepped',
 '--region=us-central1',
 '--experiment=training',
 '--series=tips',
 '--experiment_name=experiment-tips-training-tf-classification-dnn',
 '--run_name=run-workflow-nb-local-20231222171048']

Set the environment variables the code expects:

In [81]:
base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}"

os.environ["AIP_MODEL_DIR"] = base_output_dir + '/model'
os.environ["AIP_TENSORBOARD_LOG_DIR"] = base_output_dir + '/logs'

In [82]:
%%bash
echo $AIP_MODEL_DIR
echo $AIP_TENSORBOARD_LOG_DIR

gs://statmike-mlops-349915/tips/training/workflow_nb_local/20231222171048/model
gs://statmike-mlops-349915/tips/training/workflow_nb_local/20231222171048/logs


Run the training code locally:

In [83]:
!cd ./code/packages/tips_trainer/src && python -m tips_trainer.train {(' ').join(CMDARGS)}

2023-12-22 17:10:56.202103: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-22 17:10:58.884932: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-22 17:10:58.885089: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-22 17:10:59.118144: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-22 17:10:59.924073: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-22 17:10:59.926912: I tensorflow/core/platform/cpu_feature_guard.cc:1
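
The shell line above joins the arguments into one string, which is fine for the simple `--key=value` arguments used here. Where arguments might contain spaces or shell metacharacters, passing them as a list to `subprocess.run` avoids quoting issues. A sketch of that alternative; `build_module_command` and `run_module` are hypothetical helpers, and the demonstration runs the standard library's `timeit` module so that it works without `tips_trainer` installed:

```python
import subprocess
import sys


def build_module_command(module, args):
    """Build the argv list for running `python -m module` with the current interpreter."""
    return [sys.executable, "-m", module, *args]


def run_module(module, args, cwd=None):
    """Run the module in a child process and capture its output as text."""
    return subprocess.run(
        build_module_command(module, args),
        cwd=cwd,              # e.g. the package's src directory
        capture_output=True,
        text=True,
        check=False,          # inspect returncode instead of raising
    )


# What the command for this notebook's trainer would look like (not executed here).
print(build_module_command("tips_trainer.train", ["--epochs=10", "--batch_size=100"]))

# Runnable demonstration with a standard library module.
result = run_module("timeit", ["-n", "1", "-r", "1", "pass"])
print(result.returncode)
```

For the trainer itself, the call would pass the argument list and `cwd="./code/packages/tips_trainer/src"`, mirroring the `cd` in the cell above.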

In [86]:
print(f"Review the Cloud Storage contents for this job here:\nhttps://console.cloud.google.com/storage/browser/{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/{WORKFLOW}/{TIMESTAMP}?project={PROJECT_ID}")

Review the Cloud Storage contents for this job here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/tips/training/workflow_nb_local/20231222171048?project=statmike-mlops-349915


In [85]:
print(f"Review the TensorBoard for this Experiment here:\nhttps://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{EXPERIMENT_NAME}")

Review the TensorBoard for this Experiment here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+8386058747931262976+experiments+experiment-tips-training-tf-classification-dnn
