Alexander S. Lundervold, 03.04.22

# Introduction

After ingesting some data, the next step is to **validate** it. We don't want to pass data on to the next step in our ML pipeline unless it passes some checks. Among other things, we need to make sure that there are no _anomalies_ (i.e., deviations from what's expected), that the data's statistics are similar to those we expect, and that the data conforms to our data schema.

In TensorFlow Extended, we can add various validation components from the [TensorFlow Data Validation (TFDV) library](https://www.tensorflow.org/tfx/data_validation/get_started).

We'll build on the pipeline we constructed in `1.0-data_ingestion.ipynb`:

<img width=60% src="assets/pipeline_1.png">

# Setup

In [None]:
%matplotlib inline
import os
from pathlib import Path

In [None]:
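# Check whether we're running on Colab.
# The Colab runtime ships the `google.colab` package; elsewhere the import fails.
try:
    import google.colab  # noqa: F401
    colab = True
except ImportError:
    colab = False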

In [None]:
if colab:
    !pip install -U tfx

> If you're on Colab, restart the runtime after running the above cell.

In [None]:
import tensorflow as tf
import tfx

In [None]:
if colab:
    from google.colab import drive
    drive.mount('./gdrive')
    DATA = Path('./gdrive/MyDrive/ColabData/petfinder-mini/csv')
else:
    NB_DIR = Path.cwd()
    DATA = NB_DIR/'..'/'data'/'petfinder-mini'/'csv'
    
SPLIT_DATA = DATA/'..'/'split_csv'

In [None]:
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

In [None]:
context = InteractiveContext()

# Ingest data and generate statistics

The code below is copied from `1.0-data_ingestion.ipynb`.

In [None]:
from tfx.components import CsvExampleGen

In [None]:
from tfx.proto import example_gen_pb2

The output config below puts all of the data in a single split named `test` (one hash bucket):

In [None]:
output_config = example_gen_pb2.Output(
                split_config=example_gen_pb2.SplitConfig(splits=[
                    example_gen_pb2.SplitConfig.Split(name='test', hash_buckets=1)
                        ]))
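To make the role of `hash_buckets` more concrete, here is a minimal pure-Python sketch of ratio-based splitting by hashing. It is only an illustration under our own assumptions: the helper `assign_split`, the use of MD5, and the `train`/`eval` names are hypothetical, and this is not how TFX implements splitting. Each record is hashed to a bucket, and consecutive ranges of buckets are assigned to the named splits. With a single split of one bucket, as in the config above, every record ends up in `test`.

```python
import hashlib


def assign_split(record, splits):
    """Map a record to a split name, proportionally to the bucket counts.

    `splits` is a list of (name, number_of_buckets) pairs.
    """
    total = sum(n_buckets for _, n_buckets in splits)
    digest = hashlib.md5(record.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total
    upper = 0
    for name, n_buckets in splits:
        upper += n_buckets
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always smaller than total")


records = [f"pet-{i}" for i in range(1000)]

# One split with one bucket: everything lands in 'test'.
single = [assign_split(r, [("test", 1)]) for r in records]

# Hypothetical 8:2 configuration: roughly 80% 'train' and 20% 'eval'.
ratio = [assign_split(r, [("train", 8), ("eval", 2)]) for r in records]
train_fraction = ratio.count("train") / len(records)
print(set(single), train_fraction)
```

Because the assignment depends only on the hash of the record, the same record always lands in the same split, no matter how often the split is recomputed.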

In [None]:
example_gen = CsvExampleGen(input_base=str(DATA)+'/', 
                            input_config=None, output_config=output_config, 
                            range_config=None)

In [None]:
from tfx.components import StatisticsGen

In [None]:
statistics_gen = StatisticsGen(
        examples=example_gen.outputs['examples'],
        schema=None,
        stats_options=None,
        exclude_splits=None
      )

# Execute the components

As we're playing the role of orchestrator, we need to run the components in some order.

First, we run our `ExampleGen`:

In [None]:
context.run(example_gen)

Then our `StatisticsGen`:

In [None]:
context.run(statistics_gen)

This gives us some statistics that we can visualize: 

In [None]:
context.show(statistics_gen.outputs['statistics'])

Nothing new yet. The above simply reproduced the pipeline we constructed in `1.0-data_ingestion.ipynb`:

<img width=30% src="assets/pipeline_1.png">

What can we do with the statistics we've computed?

# Generate a data schema

A data schema lists all the features in the dataset and their corresponding data types. It can also define the expected bounds and other properties of the features. Having a schema is important for reading and interpreting the data, for applying the correct feature transformations, and, importantly, for detecting anomalies in the input data.

The `SchemaGen` component can infer a data schema automatically from the generated statistics.
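As a rough illustration of what inferring a schema from data involves, the following pure-Python sketch derives a type and a "required" flag for each column of a few made-up rows. The helper `infer_schema` and the sample rows are hypothetical. `SchemaGen` works from the computed statistics rather than from raw rows, and it infers considerably more (for example, domains for categorical features).

```python
def infer_schema(rows):
    """Infer a toy schema: per feature, its value type and whether it is always present."""
    features = {}
    for row in rows:
        for name, value in row.items():
            info = features.setdefault(name, {"types": set(), "count": 0})
            if value is not None:
                info["types"].add(type(value).__name__)
                info["count"] += 1

    schema = {}
    for name, info in features.items():
        schema[name] = {
            # A feature with several (or no) observed types needs manual attention.
            "type": info["types"].pop() if len(info["types"]) == 1 else "mixed",
            # "Required" here simply means: present in every row we have seen.
            "required": info["count"] == len(rows),
        }
    return schema


rows = [
    {"Type": "Cat", "Age": 3, "FurLength": "Short"},
    {"Type": "Dog", "Age": 1, "FurLength": None},
    {"Type": "Dog", "Age": 12, "FurLength": "Long"},
]
schema = infer_schema(rows)
print(schema)
```

Note that the "required" flag only reflects the rows that were seen, which is exactly why an inferred schema usually needs a manual review.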

## Automatically generated schema

The output of our `StatisticsGen` component is used as input, and a data schema proto is produced:

In [None]:
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=False)

In [None]:
context.run(schema_gen, enable_cache=True)

In [None]:
context.show(schema_gen.outputs['schema'])

Note that the automatically generated schema isn't necessarily correct; it's a starting point for defining a data schema. It's important that the schema is correct (you'll see why when we use the schema below), so manual modifications are typically needed.

## Manual modifications of the data schema

In the above schema, all the features are marked as "Required" because all of them were available in the training data. That's not necessarily what we want, as we may expect certain features to be missing once we put our pipeline in production. Also, we may want to restrict the range of values for some of our numerical features.

For example, let's say that we know that the fur length will sometimes be missing. Then we'd like to make it optional. However, as we would want to keep training a model after it's been put into production, a natural thing to ask for is that certain features are present for at least a given percentage of the training data. For example, maybe we need to know the fur length for 90% of the training instances. 

Also, perhaps we want to make sure that there are no negative values entered for "Age", and no ages above 30 (as those would probably be mistyped). 

Let's edit the data schema to reflect this.

The schema is an artifact of our `schema_gen`. It is stored on disk as a protobuf text file:

In [None]:
schema_gen.outputs

In [None]:
schema_gen.outputs['schema']

In [None]:
schema_gen.outputs['schema'].get()[0]

In [None]:
URI = schema_gen.outputs['schema'].get()[0].uri
URI

In [None]:
!ls $URI

In [None]:
schema_uri = URI + '/schema.pbtxt'

In [None]:
!cat $schema_uri

Rather than updating the file manually, we can use `TensorFlow Data Validation`:

In [None]:
import tensorflow_data_validation as tfdv

In [None]:
schema = tfdv.load_schema_text(schema_uri)

### Updating the features

**Fur length**

In [None]:
fur_feature = tfdv.get_feature(schema, "FurLength")
fur_feature

In [None]:
fur_feature.presence.min_fraction = 0.9

In [None]:
fur_feature

**Age**

In [None]:
age_feature = tfdv.get_feature(schema, "Age")
age_feature

Update the domain:

In [None]:
from tensorflow_metadata.proto.v0 import schema_pb2

In [None]:
tfdv.set_domain(schema, "Age", schema_pb2.IntDomain(min=0, max=30))

In [None]:
age_feature
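To spell out what these two constraints mean, here is a small pure-Python sketch that checks them against a handful of made-up values. It is not how TFDV evaluates a schema (TFDV works on precomputed statistics). The helper names `presence_fraction` and `out_of_range` and the sample data are our own, chosen only for illustration.

```python
def presence_fraction(values):
    """Fraction of examples in which the feature has a value."""
    return sum(v is not None for v in values) / len(values)


def out_of_range(values, minimum, maximum):
    """Return the present values that fall outside [minimum, maximum]."""
    return [v for v in values if v is not None and not minimum <= v <= maximum]


# Made-up feature columns; None marks a missing value.
fur_length = ["Short", None, "Long", "Medium", "Short",
              "Long", "Short", "Medium", "Long", "Short"]
age = [3, 1, 12, 0, 45, 7, -2, 30]

# FurLength must be present in at least 90% of the examples.
fur_ok = presence_fraction(fur_length) >= 0.9

# Age must lie in the closed interval [0, 30].
bad_ages = out_of_range(age, 0, 30)

print(fur_ok, bad_ages)
```

Here the fur length is present in 9 of 10 examples, so the presence constraint is just satisfied, while the ages 45 and -2 violate the domain.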

### Our updated schema

In [None]:
tfdv.display_schema(schema)

### Saving the updated data schema

Now we can write the updated schema to disk for later use. We replace the artifact generated by the above `SchemaGen` with our modified version:

In [None]:
tfdv.write_schema_text(schema, schema_uri)

In [None]:
context.show(schema_gen.outputs['schema'])

# Identify anomalies

Using the data schema, we can detect anomalies in our data by comparing the statistics of a dataset to the schema. The `ExampleValidator` component can be used to achieve this. It stops the pipeline if anomalies are detected and produces an artifact in the metadata store indicating that validation failed.

We use the schema we edited above:

In [None]:
example_validator = tfx.components.ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'])

In [None]:
context.run(example_validator)

It produces an artifact that lists the anomalies found, if any. Let's have a look:

In [None]:
context.show(example_validator.outputs['anomalies'])

Here's how it would look if there were more anomalies. First we need more instances that don't conform to the data schema:

## Anomalous instances

Let's see what happens if we feed in instances that don't conform to the data schema. Back in `0.0-prepare_data.ipynb`, we made some instances with feature values that, in different ways, fall outside the feature domains in the above schema (for example, the value "Bird" for the feature "Type").

We can load these and run them through the `example_gen`. 

In [None]:
input_config = example_gen_pb2.Input(
    splits=[
        example_gen_pb2.Input.Split(name='test', pattern='span3*')
    ])

In [None]:
output_config = example_gen_pb2.Output(
                split_config=example_gen_pb2.SplitConfig(splits=[
                    example_gen_pb2.SplitConfig.Split(name='test', hash_buckets=1)
                        ]))

In [None]:
example_gen_anomalous = CsvExampleGen(input_base=str(SPLIT_DATA)+'/',
                                 input_config=input_config, output_config=output_config)

We can check the data in `test` for anomalies:

In [None]:
statistics_gen_anomalous = StatisticsGen(
        examples=example_gen_anomalous.outputs['examples'],
        schema=None,
        stats_options=None,
        exclude_splits=None
      )

In [None]:
example_validator_anomalous = tfx.components.ExampleValidator(
        statistics=statistics_gen_anomalous.outputs['statistics'],
        schema=schema_gen.outputs['schema'])

In [None]:
context.run(example_gen_anomalous)

In [None]:
context.run(statistics_gen_anomalous)

In [None]:
context.run(example_validator_anomalous)

In [None]:
context.show(example_validator_anomalous.outputs['anomalies'])
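The kind of check behind such a report can be sketched in a few lines of pure Python. This is an illustration only: the schema dictionary, the function `find_anomalies`, the message wording, and the value counts are all made up, and TFDV's actual anomaly detection covers many more cases.

```python
def find_anomalies(value_counts, schema):
    """Compare per-feature value counts with the allowed string domains.

    Returns a dict mapping feature name to a short description of the problem.
    """
    anomalies = {}
    for feature, counts in value_counts.items():
        if feature not in schema:
            anomalies[feature] = "New column (not in schema)"
            continue
        unexpected = sorted(set(counts) - set(schema[feature]))
        if unexpected:
            anomalies[feature] = f"Unexpected string values: {', '.join(unexpected)}"
    return anomalies


# Hypothetical string domains, as a schema might define them.
schema = {"Type": ["Cat", "Dog"], "Gender": ["Female", "Male"]}

# Hypothetical statistics: how often each value occurs in the new data.
value_counts = {
    "Type": {"Cat": 40, "Dog": 55, "Bird": 5},
    "Gender": {"Female": 48, "Male": 52},
}

anomalies = find_anomalies(value_counts, schema)
print(anomalies)  # {'Type': 'Unexpected string values: Bird'}
```

The check needs only the value counts, not the raw examples, which is why validation can run on the output of a statistics step.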

Once we've added further components to our ML pipeline, the anomalies in the above dataset would make the `ExampleValidator` component stop the downstream components from running. This lets us catch the problem before doing any time-consuming data preprocessing or model training and, importantly, before producing predictions on data our ML pipeline isn't built to handle.
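The idea of gating downstream work on the validation result can be sketched as follows. This is a hypothetical pure-Python stand-in, not TFX's orchestration logic: `run_downstream` and `DataValidationError` are names we made up for the example.

```python
class DataValidationError(Exception):
    """Raised when the data does not conform to the schema."""


def run_downstream(anomalies, steps):
    """Run the downstream steps only if no anomalies were reported."""
    if anomalies:
        raise DataValidationError(f"Stopping pipeline, anomalies found: {anomalies}")
    return [step() for step in steps]


# Stand-ins for expensive downstream components.
steps = [lambda: "preprocessed", lambda: "trained"]

# Clean data: all downstream steps run.
results = run_downstream({}, steps)

# Anomalous data: nothing downstream is executed.
try:
    run_downstream({"Type": "Unexpected string values: Bird"}, steps)
    stopped = False
except DataValidationError:
    stopped = True

print(results, stopped)
```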

# Summary

<img width=60% src="assets/pipeline_2.png">

# Other forms of data validation

## Comparing datasets

In practice, one often needs to compare datasets. Do their statistics differ? How similar is my validation or test data to the training data? Does my new dataset conform to the data schema?

We can do this by directly using the TensorFlow Data Validation library, on which the data validation components of TFX are based.

In [None]:
import tensorflow_data_validation as tfdv

We load the datasets that we want to compare from the splits we set up in notebook `0.0`. 

How different are the two data sets?

In [None]:
span1_stats = tfdv.generate_statistics_from_csv(data_location=str(SPLIT_DATA/'span1.csv'))

In [None]:
span2_stats = tfdv.generate_statistics_from_csv(data_location=str(SPLIT_DATA/'span2.csv'))

In [None]:
tfdv.visualize_statistics(lhs_statistics=span1_stats, rhs_statistics=span2_stats)

This is a great way to catch problems caused by differences between the training, validation, and test sets, for example by checking whether the test set is representative of the data the model is trained on.
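One simple way to put a number on how different two datasets are is to compare the relative frequencies of a categorical feature and take the largest absolute difference (the L-infinity distance). The sketch below does this in pure Python for made-up values. The function names, the sample data, and the threshold of 0.1 are our assumptions, chosen only for illustration.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of the dataset."""
    counts = Counter(values)
    return {value: n / len(values) for value, n in counts.items()}


def l_infinity_distance(lhs, rhs):
    """Largest absolute difference in relative frequency over all values."""
    p, q = relative_frequencies(lhs), relative_frequencies(rhs)
    return max(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))


# Made-up values of the "Type" feature in two datasets.
span1_types = ["Cat"] * 45 + ["Dog"] * 55
span2_types = ["Cat"] * 70 + ["Dog"] * 30

distance = l_infinity_distance(span1_types, span2_types)
THRESHOLD = 0.1  # hypothetical tolerance
print(f"distance={distance:.2f}, differs={distance > THRESHOLD}")
```

A value that appears in only one of the two datasets contributes its full relative frequency to the distance, so unseen categories show up clearly.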

## More data validation

Have a look at the TensorFlow Data Validation guide for additional data validation functionality (for example, detecting data drift and bias, or comparing slices of the datasets): https://www.tensorflow.org/tfx/guide/tfdv

# What's next?

We now have a pipeline that ingests data, computes statistics, generates a data schema, and applies the schema to validate examples. Next, we'll look at how to do **data preprocessing**: encoding features, preprocessing features, feature engineering, and more. All the things we need to transform our raw data into a form suitable for our machine learning models. 

In TensorFlow Extended, this is done using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started), which is built on [Apache Beam](https://beam.apache.org/).