# Package installation and imports

In [1]:
# Before launching the notebook, run 'pip install -r ../requirements.txt --quiet' from a terminal

import tensorflow as tf
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
print('TFDV Version: {}'.format(tfdv.__version__))
print('Tensorflow Version: {}'.format(tf.__version__))

TFDV Version: 1.3.0
Tensorflow Version: 2.6.2


# Download the dataset

Abstract: Two datasets are included, related to red and white vinho verde wine samples from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009]). This notebook downloads only the red wine file.

## Attribute information

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

Output variable (based on sensory data):
- quality (score between 0 and 10)

source: http://archive.ics.uci.edu/ml/datasets/Wine+Quality

In [32]:
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv -O ../data/raw/winequality-red.csv

--2022-01-24 22:47:02--  http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84199 (82K) [application/x-httpd-php]
Saving to: ‘../data/raw/winequality-red.csv’


2022-01-24 22:47:02 (624 KB/s) - ‘../data/raw/winequality-red.csv’ saved [84199/84199]



## Read in the training and evaluation data

In [51]:
# Read in the data
csv_path = "../data/raw/winequality-red.csv"
df = pd.read_csv(csv_path, sep=";")

In [44]:
# Split the data into training and evaluation sets, keeping the original row order (shuffle=False)
train_df, eval_df = train_test_split(df, test_size=0.2, shuffle=False)
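
Because shuffle=False is passed, the split above keeps the original row order: the leading rows become the training set and the trailing rows the evaluation set. The sketch below mimics that behavior in plain Python to make the consequence visible. `ordered_split` is a hypothetical helper, not part of scikit-learn, and rounding the evaluation size up is an assumption about how a fractional test_size is resolved.

```python
import math

def ordered_split(rows, test_size=0.2):
    """Split rows into (train, eval) without shuffling."""
    # Assumption: the evaluation size is the fraction rounded up.
    n_eval = math.ceil(len(rows) * test_size)
    n_train = len(rows) - n_eval
    # Leading rows go to training, trailing rows to evaluation.
    return rows[:n_train], rows[n_train:]

rows = list(range(10))
train_rows, eval_rows = ordered_split(rows, test_size=0.2)
print(train_rows)
print(eval_rows)
```

If the file happens to be ordered in a meaningful way, the evaluation rows may not follow the same distribution as the training rows, which is one reason to compare the two sets of statistics.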

In [45]:
# Preview the training data
train_df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


# Visualize training dataset statistics

TFDV can compute statistics from three input formats: TFRecord files, pandas DataFrames, and CSV files. This notebook starts from a DataFrame.

In [36]:
train_stats = tfdv.generate_statistics_from_dataframe(train_df)

Once the statistics are generated, you can visualize them with visualize_statistics(), which renders them in the Facets interface so you can review each feature's distribution at a glance.

In [37]:
tfdv.visualize_statistics(train_stats)
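
To make concrete what kind of numbers such a view reports for a numeric feature, here is a minimal plain-Python sketch of a per-feature summary. It is only an illustration, not TFDV's implementation; `summarize` is a hypothetical helper, and the values are the five fixed acidity entries from the training preview above.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for one numeric feature."""
    # Treat None as a missing entry and exclude it from the numeric summary.
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }

fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4]
summary = summarize(fixed_acidity)
for name, value in summary.items():
    print(f"{name}: {value}")
```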

# Infer the data schema

Create a data schema that describes the characteristics of your training set. For each feature, the schema records:

- expected type of each feature
- expected presence of each feature, in terms of a minimum count and fraction of examples that must contain the feature.
- expected valency of the feature in each example, i.e., minimum and maximum number of values.
- expected domain of a feature, i.e., the small universe of values for a string feature, or range for an integer feature.
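
The four properties above can be pictured as one small record per feature. The sketch below is a plain-Python illustration of that idea, not TFDV's schema format (TFDV stores the schema as a protocol buffer). `FeatureSpec`, its field names, and `in_domain` are hypothetical; the 0 to 10 range for quality comes from the attribute description earlier in this notebook.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class FeatureSpec:
    name: str
    dtype: type                                   # expected type
    min_fraction: float = 1.0                     # presence: fraction of examples with the feature
    min_count: int = 1                            # presence: minimum number of examples
    valency: Tuple[int, int] = (1, 1)             # min and max number of values per example
    domain: Optional[Tuple[float, float]] = None  # allowed range, if any

def in_domain(spec, value):
    """Check a single value against the expected type and range."""
    if not isinstance(value, spec.dtype):
        return False
    if spec.domain is None:
        return True
    low, high = spec.domain
    return low <= value <= high

quality_spec = FeatureSpec(name="quality", dtype=int, domain=(0, 10))
ph_spec = FeatureSpec(name="pH", dtype=float)

print(in_domain(quality_spec, 5))
print(in_domain(quality_spec, 11))
print(in_domain(ph_spec, 3.51))
```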

In [38]:
# Infer schema from computed stats
schema = tfdv.infer_schema(statistics=train_stats)

In [39]:
# Display the schema 
tfdv.display_schema(schema)

Feature name,Type,Presence,Valency,Domain
'fixed acidity',FLOAT,required,,-
'volatile acidity',FLOAT,required,,-
'citric acid',FLOAT,required,,-
'residual sugar',FLOAT,required,,-
'chlorides',FLOAT,required,,-
'free sulfur dioxide',FLOAT,required,,-
'total sulfur dioxide',FLOAT,required,,-
'density',FLOAT,required,,-
'pH',FLOAT,required,,-
'sulphates',FLOAT,required,,-



# Generate and visualize evaluation dataset statistics

With the schema inferred from the training data, compute statistics for the evaluation dataset and compare them with the training statistics. visualize_statistics() can display two sets of statistics side by side.

The comparison is controlled by left-hand-side (lhs) and right-hand-side (rhs) parameters:
- lhs_statistics: Required parameter. Expects an instance of DatasetFeatureStatisticsList.
- rhs_statistics: Expects an instance of DatasetFeatureStatisticsList to compare with lhs_statistics.
- lhs_name: Name of the lhs_statistics dataset.
- rhs_name: Name of the rhs_statistics dataset.



In [30]:
# Generate evaluation statistics

eval_stats = tfdv.generate_statistics_from_dataframe(eval_df)

In [43]:
# Compare training with evaluation
tfdv.visualize_statistics(
    lhs_statistics=eval_stats, 
    rhs_statistics=train_stats, 
    lhs_name='EVAL_DATASET', 
    rhs_name='TRAIN_DATASET'
)

If you sort by "Amount missing/zeros", you'll see a high percentage for citric acid; several of its values are 0.0, as the training preview above already shows.
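
As a rough illustration of what the Amount missing/zeros ordering measures, the sketch below computes the share of missing and zero entries for one feature in plain Python. `missing_and_zero_share` is a hypothetical helper, and the values are the five citric acid entries from the training preview, so the percentages describe that tiny sample only, not the full dataset.

```python
def missing_and_zero_share(values):
    """Return (share of missing entries, share of zero entries)."""
    total = len(values)
    # None stands in for a missing entry in this sketch.
    missing = sum(1 for v in values if v is None)
    zeros = sum(1 for v in values if v is not None and v == 0)
    return missing / total, zeros / total

citric_acid = [0.0, 0.0, 0.04, 0.56, 0.0]
missing_share, zero_share = missing_and_zero_share(citric_acid)
print(f"missing: {missing_share:.0%}, zeros: {zero_share:.0%}")
```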

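# Check the evaluation dataset for anomalies

Validate the evaluation statistics against the schema to detect anomalies, such as values for a feature that did not appear in the training data. The anomaly types that TFDV can report are described at https://www.tensorflow.org/tfx/data_validation/anomalies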

In [48]:
# Check the evaluation statistics against the schema inferred from the training data
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

# Visualize anomalies
tfdv.display_anomalies(anomalies)
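
Conceptually, validation compares what the evaluation data contains with what the schema expects. The sketch below illustrates that comparison in plain Python for two kinds of problems: a feature missing from some examples and a value outside the expected range. It does not reflect how TFDV is implemented; `find_anomalies`, the dictionary-based schema, and the sample rows are all hypothetical.

```python
def find_anomalies(rows, expected):
    """expected maps a feature name to an inclusive (low, high) range, or None for no range."""
    anomalies = {}
    for name, bounds in expected.items():
        # Collect the values of this feature from the rows that contain it.
        values = [row[name] for row in rows if name in row]
        if len(values) < len(rows):
            anomalies[name] = "missing from some examples"
        elif bounds is not None and any(not bounds[0] <= v <= bounds[1] for v in values):
            anomalies[name] = f"value outside expected range {bounds}"
    return anomalies

expected = {"quality": (0, 10), "pH": None, "alcohol": None}
sample_rows = [
    {"quality": 5, "pH": 3.51},
    {"quality": 12, "pH": 3.20},  # 12 is deliberately out of range
]
found = find_anomalies(sample_rows, expected)
print(found)
```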



# Examine dataset slices

TFDV can also compute statistics for slices of the dataset, which you define with the get_feature_value_slicer method. To slice on the entire domain of a feature, map the feature name to None, as in the cell below. Here, that produces one slice for each distinct value of fixed acidity.

In [57]:
# First, you will use the get_feature_value_slicer method from the slicing_util to get the features you want to examine. You can specify that by passing a dictionary to the features argument. 
from tensorflow_data_validation.utils import slicing_util

slice_fn = slicing_util.get_feature_value_slicer(features={'fixed acidity': None})
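
To see what slicing on every value of a feature means, the plain-Python sketch below groups rows by the value of one feature, producing one slice per distinct value. `slice_by_feature` is a hypothetical helper for illustration only and does not reproduce TFDV's slicing code; the rows reuse fixed acidity and pH values from the training preview, and the feature_value naming only loosely imitates the dataset names TFDV generates.

```python
from collections import defaultdict

def slice_by_feature(rows, feature):
    """Group rows into slices keyed by '<feature>_<value>'."""
    slices = defaultdict(list)
    for row in rows:
        slices[f"{feature}_{row[feature]}"].append(row)
    return dict(slices)

rows = [
    {"fixed acidity": 7.4, "pH": 3.51},
    {"fixed acidity": 7.8, "pH": 3.2},
    {"fixed acidity": 7.8, "pH": 3.26},
    {"fixed acidity": 11.2, "pH": 3.16},
    {"fixed acidity": 7.4, "pH": 3.51},
]
slices = slice_by_feature(rows, "fixed acidity")
for name, members in slices.items():
    print(name, len(members))
```

A continuous feature such as fixed acidity yields many small slices, so slicing tends to be easier to read on features with only a few distinct values.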

In [58]:
# With the slice function ready, you can now generate the statistics. You need to tell TFDV that you need statistics for the features you set and you can do that through the slice_functions argument of tfdv.StatsOptions
slice_stats_options = tfdv.StatsOptions(schema=schema,
                                        slice_functions=[slice_fn],
                                        infer_type_from_schema=True)

In [61]:
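# Convert the dataframe to CSV, since `slice_functions` works only with `tfdv.generate_statistics_from_csv`
CSV_PATH = "../data/processed/slice_sample.csv"
# index=False keeps the dataframe index from being written as an extra, unnamed column
train_df.to_csv(CSV_PATH, index=False)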

sliced_stats = tfdv.generate_statistics_from_csv(CSV_PATH, stats_options=slice_stats_options)





Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`



In [62]:
# You now have the statistics for the slices, packed into a DatasetFeatureStatisticsList protocol buffer. The dataset names are printed below. The first element in the list (index 0) is named 'All Examples' and contains the statistics for the entire dataset.
print(f'Datasets generated: {[sliced.name for sliced in sliced_stats.datasets]}')

print(f'Type of sliced_stats elements: {type(sliced_stats.datasets[0])}')

Datasets generated: ['All Examples', 'fixed acidity_7.400000095367432', 'fixed acidity_7.800000190734863', 'fixed acidity_11.199999809265137', 'fixed acidity_7.900000095367432', 'fixed acidity_7.300000190734863', 'fixed acidity_7.5', 'fixed acidity_6.699999809265137', 'fixed acidity_5.599999904632568', 'fixed acidity_8.899999618530273', 'fixed acidity_8.5', 'fixed acidity_8.100000381469727', 'fixed acidity_7.599999904632568', 'fixed acidity_6.900000095367432', 'fixed acidity_6.300000190734863', 'fixed acidity_7.099999904632568', 'fixed acidity_8.300000190734863', 'fixed acidity_5.199999809265137', 'fixed acidity_5.699999809265137', 'fixed acidity_8.800000190734863', 'fixed acidity_6.800000190734863', 'fixed acidity_4.599999904632568', 'fixed acidity_7.699999809265137', 'fixed acidity_8.699999809265137', 'fixed acidity_6.400000095367432', 'fixed acidity_6.599999904632568', 'fixed acidity_8.600000381469727', 'fixed acidity_10.199999809265137', 'fixed acidity_7.0', 'fixed acidity_7.199999

You can then visualize the statistics as before to examine the slices. One important caveat: visualize_statistics() accepts a DatasetFeatureStatisticsList, not a DatasetFeatureStatistics, which is the type of each element in sliced_stats.datasets. Thus, at least for this version of TFDV, you need to convert a slice to the correct type before visualizing it.