<a href="https://colab.research.google.com/github/deep-diver/Continuous-Adaptation-for-Machine-Learning-System-to-Data-Changes/blob/main/notebooks/%2002_TFX_Training_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook assumes you are familiar with the basics of Vertex AI, TFX (especially custom components), and TensorFlow. 

## References

This notebook draws on the following resources and reuses parts of their code:
* [Simple TFX Pipeline for Vertex Pipelines](https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/gcp/vertex_pipelines_simple.ipynb)
* [Vertex AI Training with TFX and Vertex Pipelines](https://www.tensorflow.org/tfx/tutorials/tfx/gcp/vertex_pipelines_vertex_training)
* [Importing models to Vertex AI](https://cloud.google.com/vertex-ai/docs/general/import-model)
* [Deploying a model using the Vertex AI API](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api)
* [MLOps with Vertex AI](https://github.com/GoogleCloudPlatform/mlops-with-vertex-ai)
* [Custom components TFX](https://www.tensorflow.org/tfx/tutorials/tfx/python_function_component)

## Setup

In [38]:
%%capture
# Install pinned versions of TFX and KFP, then the latest Vertex AI SDK.
!pip install --upgrade tfx==1.2.0 kfp==1.6.1
!pip install -q --upgrade google-cloud-aiplatform

### ***Please restart runtime before continuing.*** 

In [1]:
!gcloud init

Welcome! This command will take you through the configuration of gcloud.

Settings from your current configuration [default] are:
component_manager:
  disable_update_check: 'True'
compute:
  gce_metadata_read_timeout_sec: '0'

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration
Please enter your numeric choice:  2

Enter configuration name. Names start with a lower case letter and 
contain only lower case letters a-z, digits 0-9, and hyphens '-':  gde
Your current configuration has been set to: [gde]

You can skip diagnostics next time by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes local network connection issues.
Reachability Check passed.
Network diagnostic passed (1/1 checks passed).

You must log in to continue. Would you like to log in (Y/n)?  Y

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?response_type=code&cli

In [2]:
from google.colab import auth
auth.authenticate_user()

## Imports

In [39]:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))
import kfp
print('KFP version: {}'.format(kfp.__version__))

from google.cloud import aiplatform as vertex_ai
import os

TensorFlow version: 2.5.1
TFX version: 1.2.0
KFP version: 1.6.1


## Environment setup

In [40]:
GOOGLE_CLOUD_PROJECT = 'central-hangar-321813'    #@param {type:"string"}
GOOGLE_CLOUD_REGION = 'us-central1'             #@param {type:"string"}
GCS_BUCKET_NAME = 'cifar10-experimental-csp'            #@param {type:"string"}

if not (GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION and GCS_BUCKET_NAME):
    from absl import logging
    logging.error('Please set all required parameters.')

The bucket must be located in a single region, and that region must be one where [Vertex AI services are available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions).
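As a quick sanity check on the single-region requirement, the sketch below tests whether a location string looks like a single region. It is a naming heuristic only: the helper `looks_like_single_region` and its regular expression are assumptions made for illustration, not an official rule, and they do not confirm that Vertex AI is available in that region.

```python
import re

# Heuristic (an assumption, not an official rule): single-region names look
# like "us-central1" or "europe-west4", while multi-region and dual-region
# names such as "US", "EU", or "NAM4" contain no hyphen.
_SINGLE_REGION = re.compile(r"^[a-z]+-[a-z]+\d+$")


def looks_like_single_region(location: str) -> bool:
    """Returns True if `location` resembles a single-region name."""
    return bool(_SINGLE_REGION.match(location.strip().lower()))


for candidate in ["us-central1", "europe-west4", "US", "NAM4"]:
    print(candidate, looks_like_single_region(candidate))
```

A check like this only catches obvious mistakes, such as creating the bucket in the `US` multi-region. The linked locations page remains the authority on where Vertex AI is offered.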

In [41]:
PIPELINE_NAME = 'continuous-adaptation-for-data-changes'

# Path to various pipeline artifacts.
PIPELINE_ROOT = 'gs://{}/pipeline_root/{}'.format(
    GCS_BUCKET_NAME, PIPELINE_NAME)

# Paths for users' Python module.
MODULE_ROOT = 'gs://{}/pipeline_module/{}'.format(
    GCS_BUCKET_NAME, PIPELINE_NAME)

# Paths for input data.
DATA_ROOT = 'gs://cifar10-csp-public'

# This is the path where your model will be pushed for serving.
SERVING_MODEL_DIR = 'gs://{}/serving_model/{}'.format(
    GCS_BUCKET_NAME, PIPELINE_NAME)

print('PIPELINE_ROOT: {}'.format(PIPELINE_ROOT))

PIPELINE_ROOT: gs://cifar10-experimental-csp/pipeline_root/continuous-adaptation-for-data-changes


## Create training modules

In [42]:
_trainer_module_file = 'trainer.py'

In [43]:
%%writefile {_trainer_module_file}

from typing import List
from absl import logging
from tensorflow import keras
from tfx import v1 as tfx
import tensorflow as tf

_IMAGE_FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

_CONCRETE_INPUT = "numpy_inputs"
_TRAIN_BATCH_SIZE = 64
_EVAL_BATCH_SIZE = 64
_INPUT_SHAPE = (32, 32, 3)
_EPOCHS = 2

def _parse_fn(example):
    example = tf.io.parse_single_example(example, _IMAGE_FEATURES)
    image = tf.image.decode_jpeg(example["image"], channels=3)
    class_label = tf.cast(example["label"], tf.int32)
    return image, class_label

def _input_fn(file_pattern: List[str], batch_size: int) -> tf.data.Dataset:
  print(f"Reading data from: {file_pattern}")
  tfrecord_filenames = tf.io.gfile.glob(file_pattern[0] + ".gz")
  print(tfrecord_filenames)
  dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type="GZIP")
  dataset = dataset.map(_parse_fn).batch(batch_size)
  return dataset.repeat()

def _make_keras_model() -> tf.keras.Model:
  """Creates a ResNet50-based model for classifying CIFAR-10 images.

  Returns:
    A Keras Model.
  """
  inputs = keras.Input(shape=_INPUT_SHAPE)
  base_model = keras.applications.ResNet50(
      include_top=False, input_shape=_INPUT_SHAPE, pooling="avg"
  )
  base_model.trainable = False
  # Use the preprocessing function that matches the ResNet50 backbone.
  x = keras.applications.resnet50.preprocess_input(inputs)
  x = base_model(
      x, training=False
  )  # Ensures BatchNorm layers run in inference mode.
  outputs = keras.layers.Dense(10, activation="softmax")(x)
  model = keras.Model(inputs, outputs)

  model.compile(
      optimizer=keras.optimizers.Adam(),
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      metrics=[keras.metrics.SparseCategoricalAccuracy()],
  )

  model.summary(print_fn=logging.info)
  return model

def _preprocess(bytes_input):
    decoded = tf.io.decode_jpeg(bytes_input, channels=3)
    resized = tf.image.resize(decoded, size=(32, 32))
    return resized


@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def preprocess_fn(bytes_inputs):
    decoded_images = tf.map_fn(
        _preprocess, bytes_inputs, dtype=tf.float32, back_prop=False
    )
    return {_CONCRETE_INPUT: decoded_images}


def _model_exporter(model: tf.keras.Model):
  m_call = tf.function(model.call).get_concrete_function(
      [
          tf.TensorSpec(
              shape=[None, 32, 32, 3], dtype=tf.float32, name=_CONCRETE_INPUT
          )
      ]
  )

  @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
  def serving_fn(bytes_inputs):
    # This function comes from the Computer Vision book from O'Reilly.
    labels = tf.constant(
        ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"], dtype=tf.string
    )
    images = preprocess_fn(bytes_inputs)

    probs = m_call(**images)
    indices = tf.argmax(probs, axis=1)
    pred_source = tf.gather(params=labels, indices=indices)
    pred_confidence = tf.reduce_max(probs, axis=1)
    return {"label": pred_source, "confidence": pred_confidence}

  return serving_fn

def run_fn(fn_args: tfx.components.FnArgs):
  print(fn_args)

  train_dataset = _input_fn(fn_args.train_files, batch_size=_TRAIN_BATCH_SIZE)
  eval_dataset = _input_fn(fn_args.eval_files, batch_size=_EVAL_BATCH_SIZE)

  model = _make_keras_model()
  model.fit(
      train_dataset,
      steps_per_epoch=fn_args.train_steps,
      validation_data=eval_dataset,
      validation_steps=fn_args.eval_steps,
      epochs=_EPOCHS,
  )  

  _, acc = model.evaluate(eval_dataset, steps=fn_args.eval_steps)
  logging.info(f"Validation accuracy: {round(acc * 100, 2)}%")
  # The result of the training should be saved in `fn_args.serving_model_dir`
  # directory.
  tf.saved_model.save(
      model,
      fn_args.serving_model_dir,
      signatures={"serving_default": _model_exporter(model)},
  )  

Overwriting trainer.py
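Because `_input_fn` ends with `dataset.repeat()`, the dataset never runs out on its own, so `model.fit` relies on `steps_per_epoch` and `validation_steps` to know where an epoch ends. The pipeline later passes `50000//64` and `10000//64` as `num_steps`, which the trainer reads as `fn_args.train_steps` and `fn_args.eval_steps`. The sketch below works through that arithmetic. The dataset sizes are the standard CIFAR-10 split sizes, and the helper names are hypothetical.

```python
import math

# Assumed dataset sizes: CIFAR-10 ships 50,000 training and 10,000 test images.
NUM_TRAIN_EXAMPLES = 50_000
NUM_EVAL_EXAMPLES = 10_000
BATCH_SIZE = 64  # Same value as _TRAIN_BATCH_SIZE and _EVAL_BATCH_SIZE.


def full_batches(num_examples: int, batch_size: int) -> int:
    """Number of full batches in one pass over the data (floor division)."""
    return num_examples // batch_size


def all_batches(num_examples: int, batch_size: int) -> int:
    """Number of batches in one pass, counting a final partial batch."""
    return math.ceil(num_examples / batch_size)


train_steps = full_batches(NUM_TRAIN_EXAMPLES, BATCH_SIZE)
eval_steps = full_batches(NUM_EVAL_EXAMPLES, BATCH_SIZE)

print(train_steps, eval_steps)                        # 781 156
print(all_batches(NUM_TRAIN_EXAMPLES, BATCH_SIZE))    # 782
print(NUM_TRAIN_EXAMPLES - train_steps * BATCH_SIZE)  # 16
```

With floor division, 781 steps cover 49,984 of the 50,000 training images, so each epoch is a close approximation of one full pass rather than an exact one.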


In [44]:
!gsutil cp {_trainer_module_file} {MODULE_ROOT}/
!gsutil ls -lh {MODULE_ROOT}/

Copying file://trainer.py [Content-Type=text/x-python]...
/ [1 files][  3.8 KiB/  3.8 KiB]                                                
Operation completed over 1 objects/3.8 KiB.                                      
   3.8 KiB  2021-09-16T03:55:55Z  gs://cifar10-experimental-csp/pipeline_module/continuous-adaptation-for-data-changes/trainer.py
TOTAL: 1 objects, 3892 bytes (3.8 KiB)


In [45]:
os.path.join(MODULE_ROOT, _trainer_module_file)

'gs://cifar10-experimental-csp/pipeline_module/continuous-adaptation-for-data-changes/trainer.py'

## Pipeline

In [46]:
# Specify the training worker configuration. To minimize costs, we could even specify two
# different configurations: a beefier machine for the Endpoint model and a slightly less
# powerful machine for the mobile model.
TRAINING_JOB_SPEC = {
    'project': GOOGLE_CLOUD_PROJECT,
    'worker_pool_specs': [{
        'machine_spec': {
            'machine_type': 'n1-standard-4',
            'accelerator_type': 'NVIDIA_TESLA_K80',
            'accelerator_count': 1
        },
        'replica_count': 1,
        'container_spec': {
            'image_uri': 'gcr.io/tfx-oss-public/tfx:{}'.format(tfx.__version__),
        },
    }],
}

In [47]:
SERVING_JOB_SPEC = {
    'model_name': PIPELINE_NAME.replace('-','_'),  # '-' is not allowed.
    'project_id': GOOGLE_CLOUD_PROJECT,
    # The region to use when serving the model. See available regions here:
    # https://cloud.google.com/ml-engine/docs/regions
    # Note that serving currently only supports a single region:
    # https://cloud.google.com/ml-engine/reference/rest/v1/projects.models#Model  # pylint: disable=line-too-long
    'regions': [GOOGLE_CLOUD_REGION],
}
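The `model_name` above is derived from the pipeline name by replacing hyphens with underscores. A minimal sketch of that conversion with a guard is shown below. The helper `to_model_name` is hypothetical, and the pattern it validates against (a leading letter followed by letters, digits, or underscores) is an assumed naming rule used for illustration, so check the service documentation for the exact constraints.

```python
import re

# Assumed naming rule, for illustration only: a leading letter followed by
# letters, digits, or underscores.
_VALID_MODEL_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def to_model_name(pipeline_name: str) -> str:
    """Derives a model name from a pipeline name by replacing hyphens."""
    candidate = pipeline_name.replace("-", "_")
    if not _VALID_MODEL_NAME.match(candidate):
        raise ValueError(f"Cannot derive a valid model name from {pipeline_name!r}")
    return candidate


print(to_model_name("continuous-adaptation-for-data-changes"))
# continuous_adaptation_for_data_changes
```

Failing early in the notebook is cheaper than discovering an invalid name when the Pusher component runs at the end of the pipeline.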

In [50]:
from tfx.proto import example_gen_pb2
from tfx.components.example_gen import utils

def _create_pipeline(
    pipeline_name: str,
    pipeline_root: str,
    data_root: str,
    trainer_module: str,
    project_id: str,
    region: str,
) -> tfx.dsl.Pipeline:
    """Creates a three-component CIFAR-10 pipeline with TFX."""
    splits = [
        example_gen_pb2.Input.Split(name='train', pattern='span-{SPAN}/train/*'),
        example_gen_pb2.Input.Split(name='val', pattern='span-{SPAN}/test/*'),
    ]
    # Resolve the {SPAN} placeholder to the latest span available under data_root.
    _, span, version = utils.calculate_splits_fingerprint_span_and_version(data_root, splits)

    # Rebuild the split patterns with the resolved span number.
    input_config = example_gen_pb2.Input(splits=[
        example_gen_pb2.Input.Split(name='train', pattern=f'span-{span}/train/*'),
        example_gen_pb2.Input.Split(name='val', pattern=f'span-{span}/test/*'),
    ])
    example_gen = tfx.components.ImportExampleGen(input_base=data_root,
                                                  input_config=input_config)

    # Trainer
    trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
        module_file=trainer_module,
        examples=example_gen.outputs["examples"],
        train_args=tfx.proto.TrainArgs(splits=['train'], num_steps=50000//64),
        eval_args=tfx.proto.EvalArgs(splits=['val'], num_steps=10000//64),
        custom_config={
            tfx.extensions.google_cloud_ai_platform.ENABLE_UCAIP_KEY: True,
            tfx.extensions.google_cloud_ai_platform.UCAIP_REGION_KEY: region,
            tfx.extensions.google_cloud_ai_platform.TRAINING_ARGS_KEY: TRAINING_JOB_SPEC,
            "use_gpu": True,
        },        
    ).with_id("trainer")

    pusher = tfx.extensions.google_cloud_ai_platform.Pusher(
        model=trainer.outputs['model'],
        custom_config={
            tfx.extensions.google_cloud_ai_platform.experimental.PUSHER_SERVING_ARGS_KEY: SERVING_JOB_SPEC
        }
    ).with_id('pusher')

    components = [
        example_gen,
        trainer,
        pusher,
    ]

    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name, 
        pipeline_root=pipeline_root,
        components=components
    )
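The pipeline above uses `calculate_splits_fingerprint_span_and_version` to resolve the `{SPAN}` placeholder to a concrete span number, then rebuilds the split patterns with that number so that only the latest span is ingested. The sketch below is a rough illustration of the idea, not TFX's actual implementation. The helper `latest_span`, the bucket name, and the file names are all made up for the example.

```python
import re
from typing import Iterable

# Hypothetical helper for illustration only; TFX resolves spans internally.
_SPAN_RE = re.compile(r"span-(\d+)/")


def latest_span(paths: Iterable[str]) -> int:
    """Returns the largest span number found in the given paths."""
    spans = []
    for path in paths:
        match = _SPAN_RE.search(path)
        if match:
            spans.append(int(match.group(1)))
    if not spans:
        raise ValueError("No path contains a 'span-<number>/' directory.")
    # Compare numerically so that span-10 ranks above span-2.
    return max(spans)


# Made-up object paths under a hypothetical bucket.
example_paths = [
    "gs://example-bucket/span-1/train/data.tfrecord",
    "gs://example-bucket/span-2/train/data.tfrecord",
    "gs://example-bucket/span-2/test/data.tfrecord",
    "gs://example-bucket/span-10/train/data.tfrecord",
    "gs://example-bucket/span-10/test/data.tfrecord",
]

span = latest_span(example_paths)
train_pattern = f"span-{span}/train/*"
test_pattern = f"span-{span}/test/*"
print(train_pattern, test_pattern)  # span-10/train/* span-10/test/*
```

Dropping a new `span-N` directory into the data bucket is therefore enough for the next pipeline run to train on the newest data without any code changes.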


## Compile the pipeline

In [51]:
import os

PIPELINE_DEFINITION_FILE = PIPELINE_NAME + '_pipeline.json'

# Note: a custom Docker image, if used, must be passed to
# `KubeflowV2DagRunnerConfig` (via `default_image`) to take effect.
# None is set here, so the default TFX image is used.
runner = tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(display_name=PIPELINE_NAME),
    output_filename=PIPELINE_DEFINITION_FILE)

_ = runner.run(
    _create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_root=DATA_ROOT,
        trainer_module=os.path.join(MODULE_ROOT, _trainer_module_file),
        project_id=GOOGLE_CLOUD_PROJECT,
        region=GOOGLE_CLOUD_REGION
    )
)

## Submit the pipeline for execution to Vertex AI

Generally, it's a good idea to run the end-to-end pipeline locally before submitting it to an online orchestrator. We could use `tfx.orchestration.LocalDagRunner()` for that, but for the purposes of this notebook we skip that step.

In [52]:
from kfp.v2.google import client

pipelines_client = client.AIPlatformClient(
    project_id=GOOGLE_CLOUD_PROJECT,
    region=GOOGLE_CLOUD_REGION,
)

_ = pipelines_client.create_run_from_job_spec(PIPELINE_DEFINITION_FILE, enable_caching=True)

