# Data Validation

## Dataset validator

With the statistics and schema in place, we can now validate our new dataset. The ExampleValidator pipeline component identifies anomalies in training and serving data. It can detect different classes of anomalies in the data.

If the `ExampleValidator` component detects a misalignment in the dataset statistics or schema between the new and the previous dataset, it sets the status to failed in the metadata store, and the pipeline ultimately stops. Otherwise, the pipeline moves on to the next step, data preprocessing.

TensorFlow Data Validation (TFDV) can be used to investigate and visualize your dataset. That includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in your dataset. It's important to understand your dataset's characteristics, including how it might change over time in your production pipeline. It's also important to look for anomalies in your data, and to compare your training, evaluation, and serving datasets to make sure that they're consistent.

TensorFlow Data Validation identifies anomalies in training and serving data, and can automatically create a schema by examining the data. The component can be configured to detect different classes of anomalies in the data. It can:

- Perform validity checks by comparing data statistics against a schema that codifies expectations of the user.
- Detect training-serving skew by comparing examples in training and serving data.
- Detect data drift by looking at a series of data.



In [None]:
import tensorflow as tf
import tensorflow_data_validation as tfdv

## Basic descriptive statistics

TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.

Internally, TFDV uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets.
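To make "descriptive statistics" concrete, here is a minimal pure-Python sketch of the kind of per-feature summary described above. It is not how TFDV computes statistics (TFDV runs on Apache Beam and reports much more); the `describe` helper and the `toy_records` data are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def describe(records):
    """Compute simple per-feature statistics for a list of dict records."""
    features = sorted({name for record in records for name in record})
    stats = {}
    for name in features:
        # Treat both absent keys and None as missing values.
        values = [r[name] for r in records if r.get(name) is not None]
        entry = {"count": len(values), "missing": len(records) - len(values)}
        if values and all(isinstance(v, (int, float)) for v in values):
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            entry.update(type="numeric", mean=mean, std=math.sqrt(variance),
                         min=min(values), max=max(values))
        else:
            entry.update(type="categorical",
                         top_values=Counter(values).most_common(3))
        stats[name] = entry
    return stats


# Hypothetical toy records; the real pipeline reads TFRecord files instead.
toy_records = [
    {"price": 10.0, "color": "red"},
    {"price": 14.0, "color": "blue"},
    {"price": None, "color": "red"},
]
summary = describe(toy_records)
```

The numeric summary captures the shape of the value distribution (mean, spread, range), while the categorical summary captures which values occur and how often.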

In [None]:
# Data generated by the first notebook
train_path = "./output_notebook_pipeline/CsvExampleGen/examples/1/Split-train/"
test_path = "./output_notebook_pipeline/CsvExampleGen/examples/1/Split-test/"
val_path = "./output_notebook_pipeline/CsvExampleGen/examples/1/Split-validation/"

In [None]:
train_stats = tfdv.generate_statistics_from_tfrecord(data_location=train_path)
val_stats = tfdv.generate_statistics_from_tfrecord(data_location=val_path)
tfdv.visualize_statistics(train_stats)

## Data schema generation

A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data. For categorical features the schema also defines the domain - the list of acceptable values. Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

Getting the schema right is important because the rest of our production pipeline relies on the schema that TFDV generates being correct.

The schema codifies properties which the input data is expected to satisfy, such as data types or categorical values, and can be modified or replaced by the user.
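As a rough illustration of what "inferring a schema from the data" means, the sketch below derives a toy schema (type, whether the feature is always present, and a domain for string features) from a handful of records. The `infer_toy_schema` helper, its dictionary layout, and the `max_domain_size` parameter are assumptions made for this example; they are not TFDV's schema format or inference rules.

```python
def infer_toy_schema(records, max_domain_size=10):
    """Infer a toy schema: type, required flag, and string domain per feature."""
    features = sorted({name for record in records for name in record})
    toy_schema = {}
    for name in features:
        values = [r[name] for r in records if r.get(name) is not None]
        is_string = bool(values) and all(isinstance(v, str) for v in values)
        spec = {
            "type": "STRING" if is_string else "NUMERIC",
            # A feature is required only if every record provides it.
            "required": len(values) == len(records),
        }
        if is_string:
            unique = sorted(set(values))
            # Only small value sets are treated as a categorical domain.
            if len(unique) <= max_domain_size:
                spec["domain"] = unique
        toy_schema[name] = spec
    return toy_schema


# Hypothetical toy records.
toy_records = [
    {"price": 10.0, "color": "red"},
    {"price": 14.0, "color": "blue"},
    {"price": None, "color": "red"},
]
toy_schema = infer_toy_schema(toy_records)

# An inferred schema is only a starting point: here we extend the domain by hand.
toy_schema["color"]["domain"].append("green")
```

The last line mirrors the point made above: an inferred schema reflects only the data it was built from, so it usually needs to be reviewed and adjusted by someone who knows the domain.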

In [None]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

## Check evaluation data for errors

- It's important that our evaluation data is consistent with our training data, including that it uses the same schema. 
- It's also important that the evaluation data includes examples of roughly the same ranges of values for our numerical features as our training data, so that our coverage of the loss surface during evaluation is roughly the same as during training. 
- The same is true for categorical features. Otherwise, we may have training issues that are not identified during evaluation, because we didn't evaluate part of our loss surface.
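The range comparison described in the points above can be sketched in plain Python. This is only an illustration of the idea, not something TFDV provides in this form; the `range_gaps` helper, the `tolerance` parameter, and the sample values are hypothetical.

```python
def range_gaps(train_values, eval_values, tolerance=0.1):
    """Report features whose evaluation range falls short of the training range.

    A gap is reported when the evaluation minimum or maximum is further than
    `tolerance` times the training span away from the training minimum or maximum.
    """
    report = {}
    for name, train_col in train_values.items():
        eval_col = eval_values[name]
        slack = tolerance * (max(train_col) - min(train_col))
        low_gap = min(eval_col) - min(train_col) > slack
        high_gap = max(train_col) - max(eval_col) > slack
        if low_gap or high_gap:
            report[name] = (min(eval_col), max(eval_col))
    return report


# Hypothetical numeric columns for two features.
train_values = {"price": [1.0, 5.0, 9.0, 20.0], "volume": [100, 200, 300]}
eval_values = {"price": [2.0, 6.0, 8.0], "volume": [110, 290]}
gaps = range_gaps(train_values, eval_values)
```

Here `price` is flagged because the evaluation data never goes above 8.0 while training reaches 20.0, so that part of the input space would go unevaluated.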

To detect unbalanced features in a Facets Overview, choose "Non-uniformity" from the "Sort by" dropdown.

In [None]:
eval_stats = tfdv.generate_statistics_from_tfrecord(data_location=test_path)

# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

## Check for evaluation anomalies

Does our evaluation dataset match the schema from our training dataset? This is especially important for categorical features, where we want to identify the range of acceptable values.

In [None]:
# Check eval data for errors by validating the eval data stats using the previously inferred schema.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
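Conceptually, this check compares what the evaluation data contains with what the schema allows. The sketch below shows the idea on a hand-written toy schema. The `find_anomalies` helper, the schema layout, and the sample records are hypothetical, and the real `tfdv.validate_statistics` works on statistics protos and covers many more anomaly types.

```python
def find_anomalies(records, toy_schema):
    """Return a dict mapping feature name to a list of detected problems."""
    found = {}
    for name, spec in toy_schema.items():
        present = [r[name] for r in records if r.get(name) is not None]
        problems = []
        if spec.get("required") and len(present) < len(records):
            problems.append("missing values in a required feature")
        domain = spec.get("domain")
        if domain is not None:
            # Values outside the domain are the typical categorical anomaly.
            unexpected = sorted(set(present) - set(domain))
            if unexpected:
                problems.append(f"unexpected values: {unexpected}")
        if problems:
            found[name] = problems
    return found


# Hypothetical schema and evaluation records.
toy_schema = {
    "payment": {"required": True, "domain": ["card", "cash"]},
    "fare": {"required": True},
}
toy_eval_records = [
    {"payment": "card", "fare": 12.5},
    {"payment": "voucher", "fare": 7.0},
    {"payment": "cash", "fare": None},
]
toy_anomalies = find_anomalies(toy_eval_records, toy_schema)
```

When an anomaly like the unexpected `voucher` value shows up, there are two options: fix the data, or, if the value is legitimate, extend the schema's domain so that it is accepted from then on.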

## Check for drift and skew

In addition to checking whether a dataset conforms to the expectations set in the schema, TFDV also provides functionality to detect drift and skew. TFDV performs this check by comparing the statistics of the different datasets based on the drift/skew comparators specified in the schema.

* Drift

Drift detection is supported between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. We express drift in terms of L-infinity distance for categorical features and approximate Jensen-Shannon divergence for numeric features. You can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.
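To give an intuition for the two distance measures, here is a small pure-Python sketch. The L-infinity distance is the largest absolute change in relative frequency of any single category; the Jensen-Shannon divergence is computed here on fixed-width histograms with base-2 logarithms, so it lies between 0 and 1. Both helpers, the `bins` parameter, and the sample data are assumptions for illustration; TFDV's own approximate computation differs in its details.

```python
import math
from collections import Counter


def l_infinity_distance(previous, current):
    """Largest absolute difference in relative frequency over all categories."""
    p_counts, q_counts = Counter(previous), Counter(current)
    categories = set(p_counts) | set(q_counts)
    return max(
        abs(p_counts[c] / len(previous) - q_counts[c] / len(current))
        for c in categories
    )


def jensen_shannon_divergence(previous, current, bins=10):
    """Base-2 JS divergence between histograms of two numeric samples."""
    low = min(min(previous), min(current))
    high = max(max(previous), max(current))
    width = (high - low) / bins or 1.0  # avoid a zero width for constant data

    def histogram(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - low) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    p, q = histogram(previous), histogram(current)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical categorical feature in two consecutive spans.
span_n = ["cash"] * 50 + ["card"] * 50
span_n_plus_1 = ["cash"] * 30 + ["card"] * 60 + ["voucher"] * 10
categorical_drift = l_infinity_distance(span_n, span_n_plus_1)

# Hypothetical numeric feature: identical spans versus fully disjoint spans.
no_drift = jensen_shannon_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
full_drift = jensen_shannon_divergence([0.0, 0.1], [0.9, 1.0])
```

In the categorical example the share of `cash` drops from 0.5 to 0.3, so the L-infinity distance is 0.2. Whether that counts as drift depends entirely on the threshold you choose.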


* Skew

TFDV can detect three different kinds of skew in your data: schema skew, feature skew, and distribution skew.

- Schema Skew

Schema skew occurs when the training and serving data do not conform to the same schema. Both training and serving data are expected to adhere to the same schema. Any expected deviations between the two (such as the label feature being present in the training data but not in serving) should be specified through the environments field in the schema.
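The idea behind environments can be sketched without TFDV: each feature may be excluded from some environments, and validation only expects the features that belong to the environment being checked. The dictionary layout below is a toy stand-in loosely modelled on the schema's `not_in_environment` field; the helper names and data are hypothetical.

```python
def features_for(toy_schema, environment):
    """Names of the features expected in the given environment."""
    return [
        name for name, spec in toy_schema.items()
        if environment not in spec.get("not_in_environment", [])
    ]


def missing_features(record, toy_schema, environment):
    """Expected features that the record does not provide."""
    return [n for n in features_for(toy_schema, environment) if n not in record]


# Hypothetical schema: the label exists during training but not at serving time.
toy_schema = {
    "col_t_1": {},
    "label": {"not_in_environment": ["SERVING"]},
}
serving_record = {"col_t_1": 0.42}

missing_in_serving = missing_features(serving_record, toy_schema, "SERVING")
missing_in_training = missing_features(serving_record, toy_schema, "TRAINING")
```

Without the environment declaration, every serving record would be reported as missing `label`, which is an expected difference rather than a real schema skew.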

- Feature Skew

Feature skew occurs when the feature values that a model trains on are different from the feature values that it sees at serving time. For example, this can happen when a data source that provides some feature values is modified between training and serving time, or when the logic for generating features differs between training and serving (for instance, a transformation applied in only one of the two code paths).
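Feature skew is a per-example comparison: the same example should carry the same feature values in both code paths. A minimal sketch, assuming both sides log their examples with a shared identifier, is shown below. The `feature_skew` helper, the `id` key, and the sample rows are hypothetical.

```python
def feature_skew(training_rows, serving_rows, key="id"):
    """Count, per feature, how many joined examples differ between the two sides."""
    serving_by_key = {row[key]: row for row in serving_rows}
    mismatches = {}
    for row in training_rows:
        other = serving_by_key.get(row[key])
        if other is None:
            continue  # example was never served, nothing to compare
        for name, value in row.items():
            if name != key and other.get(name) != value:
                mismatches[name] = mismatches.get(name, 0) + 1
    return mismatches


# Hypothetical logs: training applied a transformation to `amount`, serving did not.
training_rows = [
    {"id": 1, "amount": 2.0, "country": "ES"},
    {"id": 2, "amount": 3.0, "country": "FR"},
]
serving_rows = [
    {"id": 1, "amount": 100.0, "country": "ES"},
    {"id": 2, "amount": 1000.0, "country": "FR"},
]
skewed_features = feature_skew(training_rows, serving_rows)
```

Every joined example disagrees on `amount` while `country` matches, which points at a transformation applied in only one of the two code paths.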

- Distribution Skew

Distribution skew occurs when the distribution of the training dataset is significantly different from the distribution of the serving dataset. One of the key causes for distribution skew is using different code or different data sources to generate the training dataset. Another reason is a faulty sampling mechanism that chooses a non-representative subsample of the serving data to train on.

In [None]:
# Add a skew comparator for the 'label' feature.
label_feature = tfdv.get_feature(schema, 'label')
label_feature.skew_comparator.infinity_norm.threshold = 0.001

# Add a drift comparator for the 'col_t_1' feature.
col_t_1_feature = tfdv.get_feature(schema, 'col_t_1')
col_t_1_feature.drift_comparator.infinity_norm.threshold = 0.001

skew_anomalies = tfdv.validate_statistics(
    train_stats, schema, 
    previous_statistics=val_stats,
    serving_statistics=eval_stats)

tfdv.display_anomalies(skew_anomalies)


### When to use TFDV?

It's easy to think of TFDV as only applying to the start of your training pipeline, as we did here, but in fact it has many uses. Here are a few more:

- Validating new data for inference to make sure that we haven't suddenly started receiving bad features
- Validating new data for inference to make sure that our model has trained on that part of the decision surface
- Validating our data after we've transformed it and done feature engineering (probably using TensorFlow Transform) to make sure we haven't done something wrong
