This notebook shows how to use TensorFlow Data Validation (TFDV, https://www.tensorflow.org/tfx/data_validation/get_started) to detect changes in tabular datasets.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_data_validation as tfdv


print(tf.__version__)
print(tfdv.__version__)

2.9.1
1.8.0


First we read a tabular dataset of numerical data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/banknote+authentication. The dataset contains features calculated from pixel values of images of authentic and fake banknotes. The file has no header row, so we supply the column names ourselves.

In [2]:
data_df = pd.read_csv(
    "data_banknote_authentication.txt",
    names=["variance", "skewness", "kurtosis", "entropy", "class"]
)
data_df.head()

,variance,skewness,kurtosis,entropy,class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


We split the dataset into a "baseline" (1,000 randomly sampled rows) and an "update" (the remaining rows). We use the baseline to generate a schema, which records the columns, their data types, and other expectations.

In [3]:
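# Sample 1000 rows for the baseline; the fixed seed makes the split reproducible.
baseline_df = data_df.sample(1000, random_state=12345)
# The update is every row that was not sampled into the baseline.
update_df = data_df.drop(index=baseline_df.index)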

print(baseline_df.shape)
print(update_df.shape)

(1000, 5)
(372, 5)
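
As a quick sanity check on the split described above, the sketch below confirms that two subsets cover every row of the original frame exactly once. It runs on a small synthetic frame rather than the banknote file, and the helper name `is_partition` is hypothetical; it also assumes the index values are unique.

```python
import pandas as pd


def is_partition(full_df, part_a, part_b):
    """Return True if part_a and part_b split full_df's rows with no overlap."""
    # No row label may appear in both parts.
    no_overlap = part_a.index.intersection(part_b.index).empty
    # Together the two parts must contain exactly the labels of full_df.
    covers_all = (set(part_a.index) | set(part_b.index)) == set(full_df.index)
    return bool(no_overlap and covers_all)


# Synthetic stand-in for data_df (assumption: default integer index).
demo_df = pd.DataFrame({"x": range(10)})
demo_baseline = demo_df.sample(7, random_state=12345)
demo_update = demo_df.drop(index=demo_baseline.index)

print(is_partition(demo_df, demo_baseline, demo_update))
```

With the notebook's own frames, the corresponding call would be `is_partition(data_df, baseline_df, update_df)`.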


In [18]:
# Compute per-feature summary statistics for the baseline.
baseline_stats = tfdv.generate_statistics_from_dataframe(baseline_df)
baseline_stats

datasets {
  num_examples: 1000
  features {
    type: FLOAT
    num_stats {
      common_stats {
        num_non_missing: 1000
        min_num_values: 1
        max_num_values: 1
        avg_num_values: 1.0
        num_values_histogram {
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 100.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 100.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 100.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 100.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 100.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 100.0
          }
          buckets {
            low_value: 1.0


In [17]:
# Infer a schema (feature types and presence) from the baseline statistics.
schema = tfdv.infer_schema(statistics=baseline_stats)
tfdv.display_schema(schema)

Feature name,Type,Presence,Valency,Domain
'variance',FLOAT,required,,-
'skewness',FLOAT,required,,-
'kurtosis',FLOAT,required,,-
'entropy',FLOAT,required,,-
'class',INT,required,,-
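
To make concrete the kind of information the schema above records, the following pandas-only sketch builds a comparable mapping of column name to dtype and reports columns that are missing or whose dtype changed. It illustrates the idea only and is not how TFDV infers or checks a schema; `simple_schema` and `compare_to_schema` are hypothetical helpers, and the two-column frame is made-up demo data.

```python
import pandas as pd


def simple_schema(df):
    """Map each column name to the name of its pandas dtype."""
    return {name: str(dtype) for name, dtype in df.dtypes.items()}


def compare_to_schema(expected, df):
    """List (column, problem) pairs where df deviates from the expected mapping."""
    actual = simple_schema(df)
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append((name, "missing"))
        elif actual[name] != dtype:
            problems.append((name, f"dtype changed: {dtype} -> {actual[name]}"))
    return problems


# Made-up demo data (assumption), not rows from the banknote dataset.
demo_df = pd.DataFrame({"variance": [3.6, 4.5], "class": [0, 1]})
expected = simple_schema(demo_df)

print(compare_to_schema(expected, demo_df))
print(compare_to_schema(expected, demo_df.drop(columns=["variance"])))
```

The first call reports no problems, while the second reports the dropped column.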


Next, we validate the "update" part of the dataset against the schema inferred from the baseline.

In [9]:
def validate_update(schema, update_df):
    """Compute statistics for update_df and check them against schema."""
    update_stats = tfdv.generate_statistics_from_dataframe(update_df)
    validation_result = tfdv.validate_statistics(
        statistics=update_stats,
        schema=schema,
    )
    return validation_result

In [19]:
# Update with no anomalies
no_anomaly = validate_update(schema=schema, update_df=update_df)
tfdv.display_anomalies(no_anomaly)

In [20]:
# Update with the "variance" column missing
missing_variance = validate_update(
    schema=schema,
    update_df=update_df.drop(columns=["variance"]),
)
tfdv.display_anomalies(missing_variance)

Feature name,Anomaly short description,Anomaly long description
'variance',Column dropped,Column is completely missing
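
The anomalies displayed above can also be read programmatically, which is useful when validation runs in a script rather than a notebook. The sketch below assumes the validation result exposes an `anomaly_info` mapping from feature name to an entry with a `short_description` field, as TFDV's anomalies message does. The helper `summarize_anomalies` is hypothetical, and the demo uses stand-in objects so that it runs without TFDV installed.

```python
from types import SimpleNamespace


def summarize_anomalies(validation_result):
    """Return {feature name: short description} from an Anomalies-like object."""
    return {
        name: info.short_description
        for name, info in validation_result.anomaly_info.items()
    }


# Stand-in objects (assumption) that mimic the shape of a TFDV validation
# result; they are not real TFDV objects.
fake_result = SimpleNamespace(
    anomaly_info={"variance": SimpleNamespace(short_description="Column dropped")}
)

print(summarize_anomalies(fake_result))
```

In the notebook, the corresponding call would be `summarize_anomalies(missing_variance)`; an empty dictionary would indicate that no anomalies were reported.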
