# Adding TFX data validation, feature engineering and model analysis

***WARNING:absl:Pusher is going to push the model without validation. Consider using Evaluator or InfraValidator in your pipeline.***

Parts copied from https://www.tensorflow.org/tfx/guide#tfx_standard_components: 
1. **Basics:** https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/penguin_simple.ipynb
  * ExampleGen is the initial input component of a pipeline that ingests and optionally splits the input dataset: https://www.tensorflow.org/tfx/guide/examplegen
  * Trainer trains the model: https://www.tensorflow.org/tfx/guide/trainer
  * Pusher deploys the model on a serving infrastructure: https://www.tensorflow.org/tfx/guide/pusher
1. **Data validation:** https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/penguin_tfdv.ipynb
  * TensorFlow Data Validation (TFDV) guide: https://www.tensorflow.org/tfx/guide/tfdv
  * StatisticsGen calculates statistics for the dataset: https://www.tensorflow.org/tfx/guide/statsgen
  * SchemaGen examines the statistics and creates a data schema: https://www.tensorflow.org/tfx/guide/schemagen
  * ExampleValidator looks for anomalies and missing values in the dataset: https://www.tensorflow.org/tfx/guide/exampleval
1. **Feature Engineering:** https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/penguin_tft.ipynb
  * TensorFlow Transform (TFT) guide: https://www.tensorflow.org/tfx/guide/tft
  * Transform performs feature engineering on the dataset: https://www.tensorflow.org/tfx/guide/transform
1. **Model Analysis:** https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/penguin_tfma.ipynb
  * TensorFlow Model Analysis (TFMA) guide: https://www.tensorflow.org/tfx/guide/tfma
  * Evaluator performs deep analysis of the training results and helps you validate your exported models, ensuring that they are "good enough" to be pushed to production: https://www.tensorflow.org/tfx/guide/evaluator

Further TFX tutorials: https://www.tensorflow.org/tfx/tutorials

In [1]:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))

TensorFlow version: 2.6.0
TFX version: 1.3.0


In [2]:
from absl import logging
logging.set_verbosity(logging.WARNING)

In [81]:
!mkdir -p original_data


In [82]:
!mkdir -p drifted_data

In [83]:
!curl -o original_data/data.csv https://raw.githubusercontent.com/embarced/notebooks/master/mlops/insurance-customers-risk-1500.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 54421  100 54421    0     0   281k      0 --:--:-- --:--:-- --:--:--  279k


In [84]:
!curl -o drifted_data/data.csv https://raw.githubusercontent.com/embarced/notebooks/master/mlops/insurance-customers-risk-1500-shift.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 54500  100 54500    0     0   260k      0 --:--:-- --:--:-- --:--:--  260k


In [85]:
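# Put every example into a single split named 'all'. With only one split,
# the hash_buckets value has no effect on where an example lands.
output = tfx.proto.Output(
    split_config=tfx.proto.SplitConfig(splits=[
        tfx.proto.SplitConfig.Split(name='all', hash_buckets=3),
        # Alternative: a 3:1 train/eval split.
        # tfx.proto.SplitConfig.Split(name='train', hash_buckets=3),
        # tfx.proto.SplitConfig.Split(name='eval', hash_buckets=1),
    ]))

# Standalone component; the pipelines below build their own ExampleGen.
example_gen = tfx.components.CsvExampleGen(input_base='original_data', output_config=output)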
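
The split configuration above assigns each example to a split by hashing it into buckets. The following is a minimal sketch of that idea in plain Python, not the TFX implementation: `assign_split` is a hypothetical helper, and `md5` stands in for whatever fingerprint function ExampleGen uses internally. It shows why a single `all` split always receives every record, and how the commented-out `train`/`eval` alternative would divide records roughly 3:1.

```python
import hashlib
from collections import Counter
from typing import List, Tuple


def assign_split(record: bytes, splits: List[Tuple[str, int]]) -> str:
    """Maps a record to a split name in proportion to the split's bucket count."""
    total_buckets = sum(buckets for _, buckets in splits)
    # Assumption: md5 is only a stand-in for the real fingerprint function.
    bucket = int.from_bytes(hashlib.md5(record).digest()[:8], "big") % total_buckets
    upper = 0
    for name, buckets in splits:
        upper += buckets
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always below total_buckets")


# Made-up records, one per row of a 1500-row file.
records = [f"row-{i}".encode() for i in range(1500)]

single = Counter(assign_split(r, [("all", 3)]) for r in records)
train_eval = Counter(assign_split(r, [("train", 3), ("eval", 1)]) for r in records)

print(single)      # every record lands in 'all'
print(train_eval)  # roughly three 'train' records per 'eval' record
```

Because the assignment depends only on the record's content, rerunning the split over the same data gives the same result.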

In [86]:
def _create_schema_pipeline(pipeline_name: str,
                            pipeline_root: str,
                            data_root: str,
                            metadata_path: str) -> tfx.dsl.Pipeline:
  """Creates a pipeline for schema generation."""
  # Brings data into the pipeline.
  example_gen = tfx.components.CsvExampleGen(input_base=data_root, output_config=output)

  # NEW: Computes statistics over data for visualization and schema generation.
  statistics_gen = tfx.components.StatisticsGen(
      examples=example_gen.outputs['examples'])

  # NEW: Generates schema based on the generated statistics.
  schema_gen = tfx.components.SchemaGen(
      statistics=statistics_gen.outputs['statistics'], infer_feature_shape=True)

  components = [
      example_gen,
      statistics_gen,
      schema_gen,
  ]

  return tfx.dsl.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      metadata_connection_config=tfx.orchestration.metadata
      .sqlite_metadata_connection_config(metadata_path),
      components=components)

In [87]:
%%time 

pipeline = _create_schema_pipeline(
      pipeline_name='insurance-schema',
      pipeline_root='schema-pipeline',
      data_root='original_data',
      metadata_path='metadata.db')
tfx.orchestration.LocalDagRunner().run(pipeline)



CPU times: user 1.38 s, sys: 4.5 ms, total: 1.38 s
Wall time: 1.53 s


In [88]:
!ls -l schema-pipeline

total 12
drwxr-xr-x 4 olli olli 4096 Oct 26 18:19 CsvExampleGen
drwxr-xr-x 4 olli olli 4096 Oct 26 18:19 SchemaGen
drwxr-xr-x 4 olli olli 4096 Oct 26 18:19 StatisticsGen


In [89]:
from ml_metadata.proto import metadata_store_pb2
# Non-public APIs, just for showcase.
from tfx.orchestration.portable.mlmd import execution_lib

# TODO(b/171447278): Move these functions into the TFX library.

def get_latest_artifacts(metadata, pipeline_name, component_id):
  """Output artifacts of the latest run of the component."""
  context = metadata.store.get_context_by_type_and_name(
      'node', f'{pipeline_name}.{component_id}')
  executions = metadata.store.get_executions_by_context(context.id)
  latest_execution = max(executions,
                         key=lambda e:e.last_update_time_since_epoch)
  return execution_lib.get_artifacts_dict(metadata, latest_execution.id, 
                                          metadata_store_pb2.Event.OUTPUT)

# Non-public APIs, just for showcase.
from tfx.orchestration.experimental.interactive import visualizations

def visualize_artifacts(artifacts):
  """Visualizes artifacts using standard visualization modules."""
  for artifact in artifacts:
    visualization = visualizations.get_registry().get_visualization(
        artifact.type_name)
    if visualization:
      visualization.display(artifact)

from tfx.orchestration.experimental.interactive import standard_visualizations
standard_visualizations.register_standard_visualizations()
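
The lookup in `get_latest_artifacts` has two parts: it builds the node context name from the pipeline name and component id, then picks the execution with the largest `last_update_time_since_epoch`. The sketch below reproduces only that selection logic with a hypothetical `FakeExecution` dataclass in place of real MLMD records; the ids and timestamps are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeExecution:
    """Hypothetical stand-in for an MLMD execution record."""
    id: int
    last_update_time_since_epoch: int


def node_context_name(pipeline_name: str, component_id: str) -> str:
    """Builds the context name the same way get_latest_artifacts does."""
    return f"{pipeline_name}.{component_id}"


def pick_latest(executions: List[FakeExecution]) -> FakeExecution:
    """Returns the most recently updated execution."""
    if not executions:
        # max() on an empty list raises ValueError; fail with a clearer message.
        raise LookupError("no executions recorded for this component")
    return max(executions, key=lambda e: e.last_update_time_since_epoch)


runs = [FakeExecution(58, 1_000), FakeExecution(61, 3_000), FakeExecution(60, 2_000)]
print(node_context_name("insurance-schema", "SchemaGen"), pick_latest(runs).id)
```

The explicit empty check is a design choice for this sketch: the helper above would raise a bare `ValueError` from `max()` if a component had never run.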

In [90]:
# Non-public APIs, just for showcase.
from tfx.orchestration.metadata import Metadata
from tfx.types import standard_component_specs

metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config("metadata.db")

with Metadata(metadata_connection_config) as metadata_handler:
  # Find output artifacts from MLMD.
  example_gen_output = get_latest_artifacts(metadata_handler, "insurance-schema", 'CsvExampleGen')
  example_artifacts = example_gen_output[standard_component_specs.EXAMPLES_KEY]

  stat_gen_output = get_latest_artifacts(metadata_handler, "insurance-schema", 'StatisticsGen')
  stats_artifacts = stat_gen_output[standard_component_specs.STATISTICS_KEY]

  schema_gen_output = get_latest_artifacts(metadata_handler, "insurance-schema", 'SchemaGen')
  schema_artifacts = schema_gen_output[standard_component_specs.SCHEMA_KEY]

In [91]:
example_artifacts[0].uri

'schema-pipeline/CsvExampleGen/examples/58'

In [92]:
!ls -l {example_artifacts[0].uri}/Split-all

total 32
-rw-r--r-- 1 olli olli 29805 Oct 27 16:12 data_tfrecord-00000-of-00001.gz


In [93]:
schema_artifacts[0].uri

'schema-pipeline/SchemaGen/schema/60'

In [94]:
!ls -l {schema_artifacts[0].uri}

total 4
-rw-r--r-- 1 olli olli 705 Oct 27 16:12 schema.pbtxt


In [113]:
# text file
!cat {schema_artifacts[0].uri}/schema.pbtxt

feature {
  name: "age"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "group"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "miles"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "risk"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "speed"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
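
To make the schema's constraints concrete, here is a simplified sketch of the kind of check a validator can run against it. It is not how ExampleValidator works internally (that component compares statistics, not rows): `SCHEMA` is copied by hand from the `schema.pbtxt` output above, and `find_anomalies` and the sample rows are hypothetical.

```python
from typing import Any, Dict, List

# Copied by hand from schema.pbtxt: INT maps to int, FLOAT to float.
# Every feature has presence.min_fraction 1.0, so it must occur in every row.
SCHEMA = {"age": float, "group": int, "miles": float, "risk": float, "speed": float}


def find_anomalies(rows: List[Dict[str, Any]]) -> Dict[str, str]:
    """Returns a feature -> description mapping for every violated constraint."""
    anomalies = {}
    for name, expected in SCHEMA.items():
        values = [row.get(name) for row in rows]
        if any(v is None for v in values):
            anomalies[name] = "missing in some examples (min_fraction 1.0 violated)"
        elif not all(type(v) is expected for v in values):
            anomalies[name] = f"expected {expected.__name__} values"
    seen = {key for row in rows for key in row}
    for name in sorted(seen - set(SCHEMA)):
        anomalies[name] = "column not present in schema"
    return anomalies


# Made-up rows for illustration.
good = {"age": 41.0, "group": 1, "miles": 10.0, "risk": 0.3, "speed": 120.0}
no_speed = {"age": 30.0, "group": 0, "miles": 25.0, "risk": 0.7}

print(find_anomalies([good]))            # {}
print(find_anomalies([good, no_speed]))  # reports 'speed' as missing
```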


In [99]:
!mkdir -p schema


In [98]:
# version for comparison with future data
!cp {schema_artifacts[0].uri}/schema.pbtxt schema

In [97]:
stats_artifacts[0].uri

'schema-pipeline/StatisticsGen/statistics/59'

In [100]:
!ls -l {stats_artifacts[0].uri}

total 4
drwxr-xr-x 2 olli olli 4096 Oct 27 16:12 Split-all


In [101]:
!ls -l {stats_artifacts[0].uri}/Split-all

total 8
-rw-r--r-- 1 olli olli 4713 Oct 27 16:12 FeatureStats.pb


In [102]:
# binary file
# !cat {stats_artifacts[0].uri}/Split-all/FeatureStats.pb

In [103]:
!cp {stats_artifacts[0].uri}/Split-all/FeatureStats.pb FeatureStats.pb 

In [104]:
visualize_artifacts(schema_artifacts)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'age' | FLOAT | required | | - |
| 'group' | INT | required | | - |
| 'miles' | FLOAT | required | | - |
| 'risk' | FLOAT | required | | - |
| 'speed' | FLOAT | required | | - |


In [105]:
visualize_artifacts(stats_artifacts)

In [106]:
_trainer_module_file = 'trainer.py'

In [107]:
%%writefile {_trainer_module_file}

from tensorflow.keras.layers import Activation, BatchNormalization, Dense, Dropout
from typing import List
from absl import logging
import tensorflow as tf
from tensorflow import keras
from tensorflow_transform.tf_metadata import schema_utils

from tfx import v1 as tfx
from tfx_bsl.public import tfxio
from tensorflow_metadata.proto.v0 import schema_pb2

_FEATURE_KEYS = ['age', 'speed']
_LABEL_KEY = 'group'

_TRAIN_BATCH_SIZE = 32
_EVAL_BATCH_SIZE = 32

# Feature spec kept for reference only: run_fn below reads the schema passed to
# the Trainer (fn_args.schema_path), so _FEATURE_SPEC is not used for parsing.
_FEATURE_SPEC = {
    **{
        feature: tf.io.FixedLenFeature(shape=[1], dtype=tf.float32)
        for feature in _FEATURE_KEYS
    },
    _LABEL_KEY: tf.io.FixedLenFeature(shape=[1], dtype=tf.int64)
}

num_features = len(_FEATURE_KEYS)
dropout = 0.5

def _input_fn(file_pattern: List[str],
              data_accessor: tfx.components.DataAccessor,
              schema: schema_pb2.Schema,
              batch_size: int = 200) -> tf.data.Dataset:
  """Generates features and label for training.

  Args:
    file_pattern: List of paths or patterns of input tfrecord files.
    data_accessor: DataAccessor for converting input to RecordBatch.
    schema: schema of the input data.
    batch_size: representing the number of consecutive elements of returned
      dataset to combine in a single batch

  Returns:
    A dataset that contains (features, indices) tuple where features is a
      dictionary of Tensors, and indices is a single Tensor of label indices.
  """
  return data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY),
      schema=schema).repeat()


def _build_keras_model() -> tf.keras.Model:
  """Creates a DNN Keras model that predicts the 3-class `group` label.

  Returns:
    A Keras Model.
  """

  inputs = [keras.layers.Input(shape=(1,), name=f) for f in _FEATURE_KEYS]
  d = keras.layers.concatenate(inputs)
  for _ in range(2):
    d = Dense(500)(d)
    d = Activation('relu')(d)
    d = BatchNormalization()(d)
    d = Dropout(dropout)(d)

  outputs = Dense(name='output', units=3, activation='softmax')(d)

  model = keras.Model(inputs=inputs, outputs=outputs)

  model.compile(
      optimizer=keras.optimizers.Adam(),
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      metrics=[keras.metrics.SparseCategoricalAccuracy()])

  model.summary(print_fn=logging.info)
  return model


# TFX Trainer will call this function.
def run_fn(fn_args: tfx.components.FnArgs):
  """Train the model based on given args.

  Args:
    fn_args: Holds args used to train the model as name/value pairs.
  """

  # ++ Changed code: Reads in schema file passed to the Trainer component.
  schema = tfx.utils.parse_pbtxt_file(fn_args.schema_path, schema_pb2.Schema())
  # ++ End of the changed code.

  train_dataset = _input_fn(
      fn_args.train_files,
      fn_args.data_accessor,
      schema,
      batch_size=_TRAIN_BATCH_SIZE)
  eval_dataset = _input_fn(
      fn_args.eval_files,
      fn_args.data_accessor,
      schema,
      batch_size=_EVAL_BATCH_SIZE)

  model = _build_keras_model()

  model.fit(
      train_dataset,
      steps_per_epoch=fn_args.train_steps,
      validation_data=eval_dataset,
      validation_steps=fn_args.eval_steps)

  # The result of the training should be saved in `fn_args.serving_model_dir`
  # directory.
  model.save(fn_args.serving_model_dir, save_format='tf')

Overwriting trainer.py
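
The model in `trainer.py` ends in a 3-unit softmax layer and is trained with `SparseCategoricalCrossentropy`, which takes the integer `group` label directly instead of a one-hot vector. As a worked illustration of that loss, the plain-Python sketch below computes it for a single example. The logits are made up, and Keras additionally clips probabilities for numerical stability, which this sketch omits.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turns raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label: int, probs: List[float]) -> float:
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[label])


probs = softmax([2.0, 1.0, 0.1])   # made-up scores for the three groups
confident = sparse_categorical_crossentropy(0, probs)
wrong = sparse_categorical_crossentropy(2, probs)
uniform = sparse_categorical_crossentropy(1, softmax([0.0, 0.0, 0.0]))

print(confident < wrong)   # True: the likeliest class gives the smaller loss
print(round(uniform, 4))   # 1.0986, i.e. ln(3) for a uniform guess
```

A loss near ln(3) during training therefore means the model is doing no better than guessing uniformly among the three groups.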


In [108]:
# !cat trainer.py

In [110]:
def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,
                     schema_path: str, module_file: str, serving_model_dir: str,
                     metadata_path: str) -> tfx.dsl.Pipeline:
  """Creates the insurance pipeline: data validation, training, and pushing."""
  # Brings data into the pipeline.
  example_gen = tfx.components.CsvExampleGen(input_base=data_root)

  # Computes statistics over data for visualization and example validation.
  statistics_gen = tfx.components.StatisticsGen(
      examples=example_gen.outputs['examples'])

  # NEW: Import the schema.
  schema_importer = tfx.dsl.Importer(
      source_uri=schema_path,
      artifact_type=tfx.types.standard_artifacts.Schema).with_id(
          'schema_importer')

  # NEW: Performs anomaly detection based on statistics and data schema.
  example_validator = tfx.components.ExampleValidator(
      statistics=statistics_gen.outputs['statistics'],
      schema=schema_importer.outputs['result'])

  # Uses user-provided Python function that trains a model.
  trainer = tfx.components.Trainer(
      module_file=module_file,
      examples=example_gen.outputs['examples'],
      # NEW
      schema=schema_importer.outputs['result'],  # Pass the imported schema.
      train_args=tfx.proto.TrainArgs(num_steps=5000),
      eval_args=tfx.proto.EvalArgs(num_steps=15))

  # Pushes the model to a filesystem destination.
  pusher = tfx.components.Pusher(
      model=trainer.outputs['model'],
      push_destination=tfx.proto.PushDestination(
          filesystem=tfx.proto.PushDestination.Filesystem(
              base_directory=serving_model_dir)))

  # All six components of the pipeline.
  components = [
      example_gen,
      # NEW: the following three components were added to the pipeline.
      statistics_gen,
      schema_importer,
      example_validator,
      trainer,
      pusher,
  ]

  return tfx.dsl.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      metadata_connection_config=tfx.orchestration.metadata
      .sqlite_metadata_connection_config(metadata_path),
      components=components)
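
The order of the `components` list does not decide execution order; the runner derives it from which outputs each component consumes. The sketch below reads those dependencies off `_create_pipeline` and sorts them with the standard library's `graphlib` (Python 3.9+). It is an illustration of the dependency graph only, not of how `LocalDagRunner` schedules work internally.

```python
from graphlib import TopologicalSorter

# component -> components whose outputs it consumes, as wired in _create_pipeline.
UPSTREAM = {
    "example_gen": set(),
    "schema_importer": set(),
    "statistics_gen": {"example_gen"},
    "example_validator": {"statistics_gen", "schema_importer"},
    "trainer": {"example_gen", "schema_importer"},
    "pusher": {"trainer"},
}

# One valid execution order; several orders can satisfy the same graph.
order = list(TopologicalSorter(UPSTREAM).static_order())
print(order)

# The trainer does not consume the validator's output in this wiring.
print("example_validator" in UPSTREAM["trainer"])  # False
```

One consequence visible in the graph: `trainer` has no edge from `example_validator`, so in this wiring reported anomalies do not by themselves stop training or pushing.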

In [111]:
pipeline = _create_pipeline(
      pipeline_name='insurance-basic',
      pipeline_root='pipeline',
      data_root='drifted_data',
    
      # NEW
      schema_path='schema',
      
      module_file=_trainer_module_file,
      serving_model_dir='model',
      metadata_path='metadata.db')
pipeline

<tfx.orchestration.pipeline.Pipeline at 0x7f342f856220>

In [112]:
%%time 

tfx.orchestration.LocalDagRunner().run(pipeline)



INFO:tensorflow:Assets written to: pipeline/Trainer/model/64/Format-Serving/assets


CPU times: user 24.3 s, sys: 11.6 s, total: 36 s
Wall time: 19.7 s
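
Because `_input_fn` calls `.repeat()`, the number of training steps, not the dataset size, determines how much data the model sees. A back-of-the-envelope calculation for this run, under two assumptions that the notebook does not state explicitly: the CSV holds 1500 rows (taken from its file name), and `CsvExampleGen` without an output config uses a 2:1 train/eval split.

```python
def passes_over_data(num_steps: int, batch_size: int, num_examples: int) -> float:
    """How many times a repeated dataset is traversed in num_steps batches."""
    return num_steps * batch_size / num_examples


TOTAL_ROWS = 1500                  # assumption: taken from the CSV file name
TRAIN_ROWS = TOTAL_ROWS * 2 // 3   # assumption: 2:1 train/eval split
EVAL_ROWS = TOTAL_ROWS - TRAIN_ROWS

# num_steps from TrainArgs/EvalArgs, batch sizes from trainer.py.
train_passes = passes_over_data(5000, 32, TRAIN_ROWS)
eval_passes = passes_over_data(15, 32, EVAL_ROWS)

print(f"train: {train_passes:.0f} passes, eval: {eval_passes:.2f} passes")
# train: 160 passes, eval: 0.96 passes
```

Under these assumptions the 5000 training steps amount to roughly 160 passes over the training split, while the 15 evaluation steps cover just under one pass of the eval split.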


In [114]:
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(
    'metadata.db')

with Metadata(metadata_connection_config) as metadata_handler:
  ev_output = get_latest_artifacts(metadata_handler, 'insurance-basic',
                                   'ExampleValidator')
  anomalies_artifacts = ev_output[standard_component_specs.ANOMALIES_KEY]

In [115]:
visualize_artifacts(anomalies_artifacts)
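
As wired above, ExampleValidator compares the new statistics only against the imported schema, so data can satisfy the schema while its distribution has moved. One common way to quantify such a shift for a numeric feature is the Jensen-Shannon divergence between two histograms. The sketch below is a plain-Python illustration with made-up values, not TFDV's implementation; `histogram` and `js_divergence` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], low: float, high: float, bins: int) -> List[float]:
    """Normalized histogram; values outside [low, high) go to the edge bins."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        idx = min(max(int((v - low) / width), 0), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up 'age' values: the second sample is shifted towards older customers.
original = [20, 25, 30, 35, 40, 45, 50, 55]
drifted = [40, 45, 50, 55, 60, 65, 70, 75]

p = histogram(original, low=0, high=100, bins=10)
q = histogram(drifted, low=0, high=100, bins=10)

print(js_divergence(p, p))        # 0.0 for identical distributions
print(js_divergence(p, q) > 0.0)  # True: the shift is detected
```

A threshold on this value would be a tuning decision for the dataset at hand; the sketch only shows how the score behaves at its two extremes and for a partial shift.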

