# Adding TFX data validation, feature engineering and model analysis

***WARNING:absl:Pusher is going to push the model without validation. Consider using Evaluator or InfraValidator in your pipeline.***

Parts copied from https://www.tensorflow.org/tfx/guide#tfx_standard_components: 
1. **Basics:** https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/penguin_simple.ipynb
  * ExampleGen is the initial input component of a pipeline that ingests and optionally splits the input dataset: https://www.tensorflow.org/tfx/guide/examplegen
  * Trainer trains the model: https://www.tensorflow.org/tfx/guide/trainer
  * Pusher deploys the model on a serving infrastructure: https://www.tensorflow.org/tfx/guide/pusher
1. **Data validation:** https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/penguin_tfdv.ipynb
  * https://www.tensorflow.org/tfx/guide/tfdv
  * StatisticsGen calculates statistics for the dataset: https://www.tensorflow.org/tfx/guide/statsgen
  * SchemaGen examines the statistics and creates a data schema: https://www.tensorflow.org/tfx/guide/schemagen
1. **Feature Engineering:** https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/penguin_tft.ipynb
  * https://www.tensorflow.org/tfx/guide/tft
  * Transform performs feature engineering on the dataset: https://www.tensorflow.org/tfx/guide/transform
1. **Model Analysis:** https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/penguin_tfma.ipynb
  * https://www.tensorflow.org/tfx/guide/tfma
  * Evaluator performs deep analysis of the training results and helps you validate your exported models, ensuring that they are "good enough" to be pushed to production: https://www.tensorflow.org/tfx/guide/evaluator

https://www.tensorflow.org/tfx/tutorials
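
The components listed above form a directed acyclic graph: each one consumes artifacts produced by earlier ones. As a rough sketch of how a runner could derive a valid execution order, the snippet below sorts a hand-written dependency map with the standard library. The edges in `DEPENDENCIES` are an illustrative assumption based on the component descriptions, not something read from a real TFX pipeline.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists the components whose
# outputs it consumes. The edges are an illustrative assumption.
DEPENDENCIES = {
    "ExampleGen": set(),
    "StatisticsGen": {"ExampleGen"},
    "SchemaGen": {"StatisticsGen"},
    "Transform": {"ExampleGen", "SchemaGen"},
    "Trainer": {"Transform", "SchemaGen"},
    "Evaluator": {"ExampleGen", "Trainer"},
    "Pusher": {"Trainer", "Evaluator"},
}


def execution_order(dependencies):
    """Returns one run order in which every input exists before it is used."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

Several orders can be valid (for example, Evaluator only has to run after Trainer), so the sketch prints one of them rather than a unique answer.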

In [1]:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))

TensorFlow version: 2.6.0
TFX version: 1.3.0


In [2]:
from absl import logging
logging.set_verbosity(logging.WARNING)

In [3]:
DATA_DIR = "data"

In [4]:
!mkdir {DATA_DIR}

mkdir: cannot create directory ‘data’: File exists


In [5]:
!curl -o {DATA_DIR}/data.csv https://raw.githubusercontent.com/embarced/notebooks/master/mlops/insurance-customers-risk-1500.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 54421  100 54421    0     0  64864      0 --:--:-- --:--:-- --:--:-- 64786


In [6]:
!ls -l data

total 56
-rw-r--r-- 1 olli olli 54421 Oct 26 17:45 data.csv


In [7]:
!head {DATA_DIR}/data.csv

speed,age,miles,group,risk
97.0,44.0,30.0,1,0.5976112279191053
135.0,63.0,29.0,1,0.4527103520003165
111.0,26.0,34.0,0,0.750233962021037
97.0,25.0,10.0,1,0.32524900971290915
114.0,38.0,22.0,2,0.26973096398543817
130.0,55.0,34.0,0,0.5871633471963134
118.0,40.0,51.0,0,0.8753213424169751
143.0,42.0,34.0,2,0.23665405507569381
110.0,43.0,31.0,2,0.0
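
To get a feel for what a schema inferred from this file might contain, here is a minimal standard-library sketch that guesses a type per column from the first rows printed above. The rule used (INT if every value parses as an integer, otherwise FLOAT, otherwise BYTES) is a simplifying assumption for illustration and is not how TFDV actually infers a schema.

```python
import csv
import io

# First rows of data.csv, copied from the output above.
SAMPLE = """speed,age,miles,group,risk
97.0,44.0,30.0,1,0.5976112279191053
135.0,63.0,29.0,1,0.4527103520003165
111.0,26.0,34.0,0,0.750233962021037
"""


def infer_type(values):
    """Returns 'INT', 'FLOAT' or 'BYTES' for a list of raw CSV strings."""
    for name, cast in (("INT", int), ("FLOAT", float)):
        try:
            for value in values:
                cast(value)  # raises ValueError if the string does not parse
            return name
        except ValueError:
            continue
    return "BYTES"


def infer_column_types(text):
    """Maps each CSV column name to its guessed type."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}


types = infer_column_types(SAMPLE)
print(types)
```

Only `group` is written without a decimal point in the file, which is why it is the one column that comes out as an integer here.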


In [16]:
def _create_schema_pipeline(pipeline_name: str,
                            pipeline_root: str,
                            data_root: str,
                            metadata_path: str) -> tfx.dsl.Pipeline:
  """Creates a pipeline for schema generation."""
  # Brings data into the pipeline.
  example_gen = tfx.components.CsvExampleGen(input_base=data_root)

  # NEW: Computes statistics over data for visualization and schema generation.
  statistics_gen = tfx.components.StatisticsGen(
      examples=example_gen.outputs['examples'])

  # NEW: Generates schema based on the generated statistics.
  schema_gen = tfx.components.SchemaGen(
      statistics=statistics_gen.outputs['statistics'], infer_feature_shape=True)

  components = [
      example_gen,
      statistics_gen,
      schema_gen,
  ]

  return tfx.dsl.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      metadata_connection_config=tfx.orchestration.metadata
      .sqlite_metadata_connection_config(metadata_path),
      components=components)

In [31]:
%%time 

pipeline = _create_schema_pipeline(
      pipeline_name='insurance-schema',
      pipeline_root='schema-pipeline',
      data_root=DATA_DIR,
      metadata_path='metadata.db')
tfx.orchestration.LocalDagRunner().run(pipeline)

<tfx.orchestration.pipeline.Pipeline at 0x7f2bf40ed550>

In [26]:
!ls -l schema-pipeline

total 12
drwxr-xr-x 4 olli olli 4096 Oct 26 18:19 CsvExampleGen
drwxr-xr-x 4 olli olli 4096 Oct 26 18:19 SchemaGen
drwxr-xr-x 4 olli olli 4096 Oct 26 18:19 StatisticsGen


In [33]:
from ml_metadata.proto import metadata_store_pb2
# Non-public APIs, just for showcase.
from tfx.orchestration.portable.mlmd import execution_lib

# TODO(b/171447278): Move these functions into the TFX library.

def get_latest_artifacts(metadata, pipeline_name, component_id):
  """Output artifacts of the latest run of the component."""
  context = metadata.store.get_context_by_type_and_name(
      'node', f'{pipeline_name}.{component_id}')
  executions = metadata.store.get_executions_by_context(context.id)
  latest_execution = max(executions,
                         key=lambda e:e.last_update_time_since_epoch)
  return execution_lib.get_artifacts_dict(metadata, latest_execution.id, 
                                          metadata_store_pb2.Event.OUTPUT)

# Non-public APIs, just for showcase.
from tfx.orchestration.experimental.interactive import visualizations

def visualize_artifacts(artifacts):
  """Visualizes artifacts using standard visualization modules."""
  for artifact in artifacts:
    visualization = visualizations.get_registry().get_visualization(
        artifact.type_name)
    if visualization:
      visualization.display(artifact)

from tfx.orchestration.experimental.interactive import standard_visualizations
standard_visualizations.register_standard_visualizations()
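
The core of `get_latest_artifacts` is picking the execution with the largest `last_update_time_since_epoch`. The sketch below isolates that step with a hypothetical `Execution` dataclass standing in for the MLMD record, and adds a `default` so that a component with no executions yields `None` instead of raising `ValueError`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Execution:
    """Hypothetical stand-in for an MLMD execution record."""
    id: int
    last_update_time_since_epoch: int


def latest_execution(executions: Iterable[Execution]) -> Optional[Execution]:
    """Returns the most recently updated execution, or None if there is none."""
    return max(
        executions,
        key=lambda e: e.last_update_time_since_epoch,
        default=None,
    )


runs = [Execution(1, 1635268740), Execution(2, 1635272340), Execution(3, 1635270000)]
latest = latest_execution(runs)
print(latest.id)
print(latest_execution([]))
```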

In [35]:
# Non-public APIs, just for showcase.
from tfx.orchestration.metadata import Metadata
from tfx.types import standard_component_specs

metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config("metadata.db")

with Metadata(metadata_connection_config) as metadata_handler:
  # Find output artifacts from MLMD.
  stat_gen_output = get_latest_artifacts(metadata_handler, "insurance-schema", 'StatisticsGen')
  stats_artifacts = stat_gen_output[standard_component_specs.STATISTICS_KEY]

  schema_gen_output = get_latest_artifacts(metadata_handler, "insurance-schema", 'SchemaGen')
  schema_artifacts = schema_gen_output[standard_component_specs.SCHEMA_KEY]

In [37]:
visualize_artifacts(schema_artifacts)

Feature name,Type,Presence,Valency,Domain
'age',FLOAT,required,,-
'group',INT,required,,-
'miles',FLOAT,required,,-
'risk',FLOAT,required,,-
'speed',FLOAT,required,,-
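
A schema like the one above is what later data gets checked against. As a minimal sketch of that idea, the snippet below hard-codes the five features and their types from the table and reports records that deviate. The function name `find_anomalies` and the message wording are hypothetical. In a real pipeline TFDV performs these checks.

```python
# Feature types taken from the schema table above; all features are required.
SCHEMA = {"age": float, "group": int, "miles": float, "risk": float, "speed": float}


def find_anomalies(record, schema=SCHEMA):
    """Returns a list of human-readable schema violations for one record."""
    anomalies = []
    for name, expected in schema.items():
        if name not in record:
            anomalies.append(f"{name}: missing required feature")
        elif type(record[name]) is not expected:
            actual = type(record[name]).__name__
            anomalies.append(f"{name}: expected {expected.__name__}, got {actual}")
    for name in record:
        if name not in schema:
            anomalies.append(f"{name}: not in schema")
    return anomalies


good = {"speed": 97.0, "age": 44.0, "miles": 30.0, "group": 1, "risk": 0.5976112279191053}
bad = {"speed": 97.0, "age": "44", "group": 1, "risk": 0.5, "color": "red"}

print(find_anomalies(good))
print(find_anomalies(bad))
```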


In [36]:
visualize_artifacts(stats_artifacts)
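
No output is shown for this cell here. As a rough idea of the kind of per-feature numbers such statistics include, the sketch below computes count, min, max, mean and number of zeros with the standard library, using only the nine rows printed by `head` earlier. It is an illustration on a tiny sample, not a reproduction of the StatisticsGen output.

```python
import csv
import io
import statistics

# The nine data rows printed by `head` earlier in this notebook.
SAMPLE = """speed,age,miles,group,risk
97.0,44.0,30.0,1,0.5976112279191053
135.0,63.0,29.0,1,0.4527103520003165
111.0,26.0,34.0,0,0.750233962021037
97.0,25.0,10.0,1,0.32524900971290915
114.0,38.0,22.0,2,0.26973096398543817
130.0,55.0,34.0,0,0.5871633471963134
118.0,40.0,51.0,0,0.8753213424169751
143.0,42.0,34.0,2,0.23665405507569381
110.0,43.0,31.0,2,0.0
"""


def summarize(text):
    """Returns simple numeric statistics for every column of a CSV string."""
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {}
    for col in rows[0]:
        values = [float(row[col]) for row in rows]
        summary[col] = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.fmean(values),
            "zeros": sum(v == 0.0 for v in values),
        }
    return summary


summary = summarize(SAMPLE)
for feature, stats in summary.items():
    print(feature, stats)
```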

In [8]:
!ls -l

total 12864
-rw-r--r-- 1 olli olli 1431909 Oct 24 18:13 1_mlops_train.ipynb
-rw-r--r-- 1 olli olli  175341 Oct 26 17:29 2_mlops_serve.ipynb
-rw-r--r-- 1 olli olli 1686560 Oct 26 17:09 3_mlops_shift.ipynb
-rw-r--r-- 1 olli olli   22900 Oct 26 17:34 4_mlops_tfx.ipynb
-rw-r--r-- 1 olli olli   19484 Oct 26 17:32 5_mlops_tfx_ext.ipynb
drwxr-xr-x 2 olli olli    4096 Oct 26 17:17 __pycache__
drwxr-xr-x 4 olli olli    4096 Oct 24 17:34 classifier
-rw-r--r-- 1 olli olli 3130024 Oct 26 17:09 classifier.h5
-rw-r--r-- 1 olli olli 2892775 Oct 24 17:34 classifier.tgz
drwxr-xr-x 2 olli olli    4096 Oct 26 17:11 data
-rw-r--r-- 1 olli olli 2110237 Aug 14 20:29 generate.ipynb
drwxr-xr-x 3 olli olli    4096 Oct 24 18:42 insurance
-rw-r--r-- 1 olli olli   54500 Aug 14 20:29 insurance-customers-risk-1500-shift.csv
-rw-r--r-- 1 olli olli   54435 Aug 14 20:29 insurance-customers-risk-1500-test.csv
-rw-r--r-- 1 olli olli   54421 Aug 14 20:29 insurance-customers-risk-1500.csv
-rw-r--r-- 1 olli

In [9]:
_trainer_module_file = 'trainer.py'

In [10]:
%%writefile {_trainer_module_file}

from tensorflow.keras.layers import (Activation, BatchNormalization, Dense,
                                     Dropout)
from typing import List
from absl import logging
import tensorflow as tf
from tensorflow import keras
from tensorflow_transform.tf_metadata import schema_utils

from tfx import v1 as tfx
from tfx_bsl.public import tfxio
from tensorflow_metadata.proto.v0 import schema_pb2

_FEATURE_KEYS = ['age', 'speed']
_LABEL_KEY = 'group'

_TRAIN_BATCH_SIZE = 32
_EVAL_BATCH_SIZE = 32

# The Trainer in this pipeline does not consume a SchemaGen output, so we
# define a feature spec by hand. With only a few features this is manageable
# for this dataset.
_FEATURE_SPEC = {
    **{
        feature: tf.io.FixedLenFeature(shape=[1], dtype=tf.float32)
           for feature in _FEATURE_KEYS
       },
    _LABEL_KEY: tf.io.FixedLenFeature(shape=[1], dtype=tf.int64)
}

num_features = len(_FEATURE_KEYS)
dropout = 0.5

def _input_fn(file_pattern: List[str],
              data_accessor: tfx.components.DataAccessor,
              schema: schema_pb2.Schema,
              batch_size: int = 200) -> tf.data.Dataset:
  """Generates features and label for training.

  Args:
    file_pattern: List of paths or patterns of input tfrecord files.
    data_accessor: DataAccessor for converting input to RecordBatch.
    schema: schema of the input data.
    batch_size: representing the number of consecutive elements of returned
      dataset to combine in a single batch

  Returns:
    A dataset of (features, indices) tuples, where features is a dictionary
      of Tensors and indices is a single Tensor of label indices.
  """
  return data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY),
      schema=schema).repeat()


def _build_keras_model() -> tf.keras.Model:
  """Creates a DNN Keras model that predicts the insurance customer group.

  Returns:
    A Keras Model.
  """

  inputs = [keras.layers.Input(shape=(1,), name=f) for f in _FEATURE_KEYS]
  d = keras.layers.concatenate(inputs)
  for _ in range(2):
    d = Dense(500)(d)
    d = Activation('relu')(d)
    d = BatchNormalization()(d)
    d = Dropout(dropout)(d)

  outputs = Dense(name='output', units=3, activation='softmax')(d)

  model = keras.Model(inputs=inputs, outputs=outputs)

  model.compile(
      optimizer=keras.optimizers.Adam(),
      # The output layer already applies softmax, so the loss receives
      # probabilities, not logits.
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
      metrics=[keras.metrics.SparseCategoricalAccuracy()])

  model.summary(print_fn=logging.info)
  return model


# TFX Trainer will call this function.
def run_fn(fn_args: tfx.components.FnArgs):
  """Train the model based on given args.

  Args:
    fn_args: Holds args used to train the model as name/value pairs.
  """

  # This schema is usually either an output of SchemaGen or a manually-curated
  # version provided by pipeline author. A schema can also be derived from TFT
  # graph if a Transform component is used. In the case when either is missing,
  # `schema_from_feature_spec` could be used to generate schema from very simple
  # feature_spec, but the schema returned would be very primitive.
  schema = schema_utils.schema_from_feature_spec(_FEATURE_SPEC)

  train_dataset = _input_fn(
      fn_args.train_files,
      fn_args.data_accessor,
      schema,
      batch_size=_TRAIN_BATCH_SIZE)
  eval_dataset = _input_fn(
      fn_args.eval_files,
      fn_args.data_accessor,
      schema,
      batch_size=_EVAL_BATCH_SIZE)

  model = _build_keras_model()

  model.fit(
      train_dataset,
      steps_per_epoch=fn_args.train_steps,
      validation_data=eval_dataset,
      validation_steps=fn_args.eval_steps)

  # The result of the training should be saved in `fn_args.serving_model_dir`
  # directory.
  model.save(fn_args.serving_model_dir, save_format='tf')

Overwriting trainer.py
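
The output layer in `_build_keras_model` uses a softmax activation, so the model emits probabilities rather than logits. The pure-Python sketch below shows, under that assumption, what happens to a cross-entropy loss when probabilities are treated as logits and softmax is applied a second time. The example logits are made up for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(probs, label):
    """Negative log-probability of the true class index."""
    return -math.log(probs[label])


logits = [2.0, 0.5, -1.0]   # made-up pre-softmax scores for three classes
label = 0                   # index of the true class
probs = softmax(logits)     # what a softmax output layer would emit

loss_on_probs = sparse_cross_entropy(probs, label)
loss_double_softmax = sparse_cross_entropy(softmax(probs), label)

# Even a perfect prediction [1, 0, 0] cannot drive the second loss to zero.
loss_floor = sparse_cross_entropy(softmax([1.0, 0.0, 0.0]), 0)

print(loss_on_probs, loss_double_softmax, loss_floor)
```

With a second softmax the loss is inflated and bounded away from zero, which tends to flatten gradients once the model is already fairly confident.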


In [11]:
# !cat trainer.py

In [12]:
# train:eval 2:1
# epoch = 1
# examples = 500
epoch = 100
examples = 1000
steps = (epoch * examples)/128
steps

781.25
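
The cell above divides by 128, while `trainer.py` sets `_TRAIN_BATCH_SIZE = 32`. Because `_input_fn` calls `.repeat()`, the dataset never runs out, so the number of training steps decides how many passes over the data are made. A small sketch of that arithmetic, assuming 1000 training examples as in the cell above, with the result rounded up to a whole step:

```python
import math


def training_steps(epochs: int, examples: int, batch_size: int) -> int:
    """Number of batches needed to see `examples` items `epochs` times."""
    return math.ceil(epochs * examples / batch_size)


def epochs_covered(steps: int, examples: int, batch_size: int) -> float:
    """How many passes over the data a given number of steps amounts to."""
    return steps * batch_size / examples


print(training_steps(100, 1000, 128))   # 782
print(training_steps(100, 1000, 32))    # 3125
print(epochs_covered(5000, 1000, 32))   # 160.0
```

Under these assumptions, 5000 steps at batch size 32 correspond to about 160 passes over 1000 examples.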

In [13]:
def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,
                     module_file: str, serving_model_dir: str,
                     metadata_path: str) -> tfx.dsl.Pipeline:
  """Creates a three-component insurance pipeline with TFX."""
  # Brings data into the pipeline.
  example_gen = tfx.components.CsvExampleGen(input_base=data_root)

  # Uses user-provided Python function that trains a model.
  trainer = tfx.components.Trainer(
      module_file=module_file,
      examples=example_gen.outputs['examples'],
      train_args=tfx.proto.TrainArgs(num_steps=5000),
      eval_args=tfx.proto.EvalArgs(num_steps=15))

  # Pushes the model to a filesystem destination.
  pusher = tfx.components.Pusher(
      model=trainer.outputs['model'],
      push_destination=tfx.proto.PushDestination(
          filesystem=tfx.proto.PushDestination.Filesystem(
              base_directory=serving_model_dir)))

  # The following three components will be included in the pipeline.
  components = [
      example_gen,
      trainer,
      pusher,
  ]

  return tfx.dsl.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      metadata_connection_config=tfx.orchestration.metadata
      .sqlite_metadata_connection_config(metadata_path),
      components=components)

In [14]:
pipeline = _create_pipeline(
      pipeline_name='insurance-basic',
      pipeline_root='pipeline',
      data_root=DATA_DIR,
      module_file=_trainer_module_file,
      serving_model_dir='model',
      metadata_path='metadata.db')
pipeline

<tfx.orchestration.pipeline.Pipeline at 0x7f2d9ca2e2b0>

In [15]:
%%time 

tfx.orchestration.LocalDagRunner().run(pipeline)



INFO:tensorflow:Assets written to: pipeline/Trainer/model/8/Format-Serving/assets


INFO:tensorflow:Assets written to: pipeline/Trainer/model/8/Format-Serving/assets


CPU times: user 48.5 s, sys: 23.9 s, total: 1min 12s
Wall time: 59.1 s
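
After the run, the pushed model should be available under the `model` directory passed as `serving_model_dir`. The sketch below picks the newest version from such a directory, assuming that each push creates a subdirectory with a purely numeric name (such as a timestamp). That layout is an assumption to verify against your own `model` folder, and `latest_model_version` is a hypothetical helper, not part of TFX.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_model_version(serving_model_dir: str) -> Optional[Path]:
    """Returns the numerically largest version directory, or None if empty."""
    base = Path(serving_model_dir)
    versions = [p for p in base.iterdir() if p.is_dir() and p.name.isdigit()]
    return max(versions, key=lambda p: int(p.name), default=None)


# Demo on a temporary directory with made-up version names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1635268000", "1635269000", "notes"):
        (Path(tmp) / name).mkdir()
    latest_name = latest_model_version(tmp).name
    print(latest_name)
```

Comparing the names as integers rather than strings keeps the ordering correct when version numbers have different lengths.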
