# Detecting Drift

* https://www.tensorflow.org/tfx/guide/tfdv
* https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic#check_for_drift_and_skew
  * https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/data_validation/tfdv_basic.ipynb
* Stats visualized using https://pair-code.github.io/facets/
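* Metrics for drift and skew: L-infinity distance (https://en.wikipedia.org/wiki/Chebyshev_distance) for categorical features, approximate Jensen-Shannon divergence for numeric features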
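
The Chebyshev (L-infinity) distance linked above can be illustrated with a small, library-free sketch: it is the largest absolute difference in relative frequency over all categories. The helper names and the category labels below are made up for illustration; this is not TFDV's implementation.

```python
from collections import Counter


def normalized_counts(values):
    """Turn a list of categorical values into relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(p, q):
    """Largest absolute frequency difference over all categories in p or q."""
    categories = set(p) | set(q)
    return max((abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories),
               default=0.0)


# Hypothetical category labels, chosen only to make the arithmetic easy.
baseline = normalized_counts(["a", "a", "b", "c"])
current = normalized_counts(["a", "b", "b", "b"])
distance = l_infinity_distance(baseline, current)
print(distance)
```

A drift check would then compare `distance` against a threshold chosen per feature.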

In [1]:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))

import tensorflow_data_validation as tfdv
print('TFDV version: {}'.format(tfdv.version.__version__))

TensorFlow version: 2.6.0
TFDV version: 1.3.0


In [2]:
from absl import logging

# logging.set_verbosity(logging.INFO)
logging.set_verbosity(logging.WARNING)
# logging.set_verbosity(logging.ERROR)

In [3]:
!mkdir -p data


In [4]:
!curl -o data/data.csv https://raw.githubusercontent.com/embarced/notebooks/master/mlops/insurance-customers-risk-1500.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 54421  100 54421    0     0  58018      0 --:--:-- --:--:-- --:--:-- 57956


In [5]:
!curl -o data/drifted-data.csv https://raw.githubusercontent.com/embarced/notebooks/master/mlops/insurance-customers-risk-1500-shift.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 54500  100 54500    0     0  55669      0 --:--:-- --:--:-- --:--:-- 55612


In [6]:
!ls -l data

total 112
-rw-r--r-- 1 olli olli 54421 Nov  7 12:45 data.csv
-rw-r--r-- 1 olli olli 54500 Nov  7 12:45 drifted-data.csv


# Stats for Training Data

In [7]:
train_stats = tfdv.generate_statistics_from_csv(data_location='data/data.csv')
tfdv.visualize_statistics(train_stats)



Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
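
To give a rough idea of the per-feature numbers that such a statistics step summarizes, here is a minimal standard-library sketch that computes count, mean, standard deviation, minimum and maximum for each numeric column of a CSV. The function name and the sample rows are assumptions made for illustration; TFDV computes considerably more (histograms, missing-value counts and so on) and does so differently.

```python
import csv
import io
import statistics


def summarize_numeric_csv(text):
    """Return {column: {count, mean, std_dev, min, max}} for a numeric CSV."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return {}
    summary = {}
    for column in rows[0].keys():
        values = [float(row[column]) for row in rows]
        summary[column] = {
            "count": len(values),
            "mean": statistics.fmean(values),
            "std_dev": statistics.pstdev(values),  # population std deviation
            "min": min(values),
            "max": max(values),
        }
    return summary


# Made-up rows for illustration; they are not taken from data/data.csv.
sample = "speed,age,miles\n100.0,40.0,10.0\n120.0,50.0,30.0\n140.0,60.0,20.0\n"
summary = summarize_numeric_csv(sample)
for name, stats in summary.items():
    print(name, stats)
```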


# Inferring Schema from training data

In [8]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'speed',FLOAT,required,,-
'age',FLOAT,required,,-
'miles',FLOAT,required,,-
'group',INT,required,,-
'risk',FLOAT,required,,-


# Stats for Serving Data

In [9]:
serving_stats = tfdv.generate_statistics_from_csv(data_location='data/drifted-data.csv')
tfdv.visualize_statistics(serving_stats)



In [10]:
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)
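
Conceptually, validating against a schema means checking that each expected feature is present and has the expected type, and that no unexpected features appear. The sketch below shows that idea on plain dictionaries. The `toy_schema` format, the function name and the anomaly messages are hypothetical; they are not TFDV's Schema proto or its anomaly descriptions.

```python
def find_schema_anomalies(schema, rows):
    """Return {feature: description} for violations of a simple schema.

    `schema` maps a feature name to the expected Python type
    (a hypothetical format used only for this sketch).
    """
    anomalies = {}
    seen = set()
    for row in rows:
        seen.update(row)
        for name, expected_type in schema.items():
            if name not in row or row[name] is None:
                # Keep the first problem reported for each feature.
                anomalies.setdefault(
                    name, "required feature is missing in some rows")
            elif not isinstance(row[name], expected_type):
                anomalies.setdefault(
                    name, "unexpected type " + type(row[name]).__name__)
    for name in sorted(seen - set(schema)):
        anomalies[name] = "new feature not in schema"
    return anomalies


# Feature names mirror the dataset's columns; the type mapping is an assumption.
toy_schema = {"speed": float, "age": float, "miles": float,
              "group": int, "risk": float}

# Made-up row: 'speed' has the wrong type, 'risk' is missing, 'extra' is new.
bad_rows = [{"speed": "fast", "age": 40.0, "miles": 10.0,
             "group": 1, "extra": 1}]
print(find_schema_anomalies(toy_schema, bad_rows))
```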



# Comparing Differences

In [11]:
# Compare serving data with training data
tfdv.visualize_statistics(lhs_statistics=serving_stats, rhs_statistics=train_stats,
                          lhs_name='SERVING_DATASET', rhs_name='TRAIN_DATASET')

# Detecting skew anomalies
* Skew is a difference between the training and serving distributions; drift is the analogous change between successive spans of data over time

In [12]:
# Flag skew when the approximate Jensen-Shannon divergence exceeds 0.02
tfdv.get_feature(schema, 'speed').skew_comparator.jensen_shannon_divergence.threshold = 0.02
tfdv.get_feature(schema, 'age').skew_comparator.jensen_shannon_divergence.threshold = 0.02
tfdv.get_feature(schema, 'miles').skew_comparator.jensen_shannon_divergence.threshold = 0.02

skew_anomalies = tfdv.validate_statistics(statistics=train_stats, schema=schema,
                                          # previous_statistics=eval_stats,  # not used: eval_stats is never defined in this notebook
                                          serving_statistics=serving_stats)

tfdv.display_anomalies(skew_anomalies)

Feature name,Anomaly short description,Anomaly long description
'speed',High approximate Jensen-Shannon divergence between training and serving,"The approximate Jensen-Shannon divergence between training and serving is 0.0462642 (up to six significant digits), above the threshold 0.02."
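
To make the reported metric more concrete, here is a minimal sketch of the Jensen-Shannon divergence between two histograms, followed by a comparison against the 0.02 threshold used above. It assumes both histograms share the same bins and uses logarithms to base 2 so the result lies between 0 and 1. The histogram values are invented, and TFDV's approximation may bin and compute differently.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence of p and q from their midpoint distribution m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Invented relative frequencies over four shared bins (each sums to 1).
train_hist = [0.1, 0.4, 0.4, 0.1]
serving_hist = [0.1, 0.2, 0.4, 0.3]

threshold = 0.02  # same value as the skew comparator threshold above
jsd = jensen_shannon_divergence(train_hist, serving_hist)
is_skewed = jsd > threshold
print(jsd, is_skewed)
```

Because the midpoint distribution is nonzero wherever either input is nonzero, the divergence stays finite even when one histogram has empty bins.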
