# Data Validation for Diabetes 130-US Hospitals Using Python and TensorFlow Data Validation
### David Lowe
### June 16, 2021

SUMMARY: The project aims to construct a data validation flow using TensorFlow Data Validation (TFDV) and document the end-to-end steps using a template. The Diabetes 130-US Hospitals dataset is a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: The dataset is the Diabetes 130-US Hospitals for years 1999-2008 dataset, donated to the University of California, Irvine (UCI) Machine Learning Repository. It represents ten years of clinical care at 130 US hospitals and integrated delivery networks and includes over 50 features representing patient and hospital outcomes.

Additional Notes: I adapted this workflow from the TensorFlow Data Validation tutorial on TensorFlow.org (https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic). I also plan to build a TFDV script for validating future datasets and building machine learning models.

CONCLUSION: In this iteration, the data validation workflow helped to validate the features and structures of the training, validation, and test datasets. The workflow also generated statistics over different slices of data which can help track model and anomaly metrics.

Dataset Used: Diabetes 130-US Hospitals for years 1999-2008 Dataset

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Data validation for a machine learning project generally can be broken down into the following tasks:

1. Prepare Environment
2. Generate and Visualize Training Data Statistics
3. Check Anomalies in Validation Dataset
4. Check Anomalies in Test Dataset
5. Check for Data Drift and Skew
6. Display Stats for Data Slices
7. Finalize the Schema

## Task 1 - Prepare Environment

### 1.a) Load libraries and modules

In [1]:
# Set the random seed number for reproducible results
RNG_SEED = 8

In [2]:
# Import packages
import os
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split

import tensorflow as tf
import tempfile, urllib, zipfile
import tensorflow_data_validation as tfdv
from tensorflow.python.lib.io import file_io
from tensorflow_data_validation.utils import slicing_util
from tensorflow_metadata.proto.v0.statistics_pb2 import DatasetFeatureStatisticsList, DatasetFeatureStatistics
from tensorflow_metadata.proto.v0 import schema_pb2

### 1.b) Set up the controlling parameters and functions

In [3]:
# Begin the timer for the script processing
start_time_script = datetime.now()

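# Set the split ratios: VAL_SET_RATIO is the share of all records held out for validation and test combined,
# and TEST_SET_RATIO is the share of that held-out portion assigned to the test set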
TEST_SET_RATIO = 0.5
VAL_SET_RATIO = 0.3

# Set TF's logger to only display errors to avoid internal warnings being shown
tf.get_logger().setLevel('ERROR')

### 1.c) Load dataset

In [4]:
# Read CSV data into a dataframe and mark the missing data that is encoded with '?' string as NaN
dataset_path = 'https://dainesanalytics.com/datasets/ucirvine-diabetes-130-us-hospitals/diabetic_data.csv'
df_dataset_import = pd.read_csv(dataset_path, header=0, na_values = '?')

# Take a peek at the dataframe after import
print(df_dataset_import.head())


   encounter_id  patient_nbr             race  gender      age weight  \
0       2278392      8222157        Caucasian  Female   [0-10)    NaN   
1        149190     55629189        Caucasian  Female  [10-20)    NaN   
2         64410     86047875  AfricanAmerican  Female  [20-30)    NaN   
3        500364     82442376        Caucasian    Male  [30-40)    NaN   
4         16680     42519267        Caucasian    Male  [40-50)    NaN   

   admission_type_id  discharge_disposition_id  admission_source_id  \
0                  6                        25                    1   
1                  1                         1                    7   
2                  1                         1                    7   
3                  1                         1                    7   
4                  1                         1                    7   

   time_in_hospital  ... citoglipton insulin  glyburide-metformin  \
0                 1  ...          No      No                   No

In [5]:
df_dataset_import.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      99493 non-null   object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    3197 non-null    object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                61510 non-null   object
 11  medical_specialty         51817 non-null   object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

### 1.d) Splitting Data into Sets

In [6]:
# Rename the class column to the standard name 'target'
df_dataset_import = df_dataset_import.rename(columns={'readmitted':'target'})

In [7]:
# Split the data further into training, validation, and test datasets
df_train, df_val_test = train_test_split(df_dataset_import, test_size=VAL_SET_RATIO, random_state=RNG_SEED)
df_val, df_test = train_test_split(df_val_test, test_size=TEST_SET_RATIO, random_state=RNG_SEED)

# Test data emulates the data that would be submitted for predictions, so it should not have the label column.
df_test = df_test.drop(['target'], axis=1)
    
print("Training dataset has {} records and {} columns.".format(df_train.shape[0], df_train.shape[1]))
print("Validation dataset has {} records and {} columns.".format(df_val.shape[0], df_val.shape[1]))
print("Test dataset has {} records and {} columns".format(df_test.shape[0], df_test.shape[1]))

Training dataset has 71236 records and 50 columns.
Validation dataset has 15265 records and 50 columns.
Test dataset has 15265 records and 49 columns
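
The record counts above follow from how the two ratios compose: `VAL_SET_RATIO` is applied to the full dataset, and `TEST_SET_RATIO` is applied to the held-out remainder. The sketch below reproduces that arithmetic in plain Python. The helper name `split_sizes` is hypothetical, and rounding the held-out count up is an assumption about how scikit-learn sizes the test portion; it happens to match the counts printed above.

```python
import math

def split_sizes(n_rows, holdout_ratio, test_ratio):
    """Return (train, validation, test) row counts for a two-stage split."""
    # Stage 1: hold out a share of all rows (assumed to be rounded up).
    n_holdout = math.ceil(n_rows * holdout_ratio)
    n_train = n_rows - n_holdout
    # Stage 2: divide the held-out rows between validation and test.
    n_test = math.ceil(n_holdout * test_ratio)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

print(split_sizes(101766, 0.3, 0.5))  # (71236, 15265, 15265)
```

In other words, the settings used here yield roughly a 70/15/15 split of the 101,766 records.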


## Task 2 - Generate and Visualize Training Data Statistics

### 2.a) Removing Irrelevant Features

In [8]:
# Define features to remove
features_to_remove = {'encounter_id', 'patient_nbr'}

# Collect the features to keep (the allowlist) while computing the statistics
features_to_keep = [col for col in df_dataset_import.columns if col not in features_to_remove]

# Instantiate a StatsOptions class and define the feature_allowlist property
options_train = tfdv.StatsOptions(feature_allowlist=features_to_keep)

# Review the features used to generate the statistics
print(options_train.feature_allowlist)

['race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'target']


### 2.b) Generate Statistics for Training Data

In [9]:
train_stats = tfdv.generate_statistics_from_dataframe(df_train, stats_options=options_train)

# get the number of features used to compute statistics
print(f"Number of features used: {len(train_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples used: {train_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {train_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {train_stats.datasets[0].features[-1].path.step[0]}")

Number of features used: 48
Number of examples used: 71236
First feature: race
Last feature: target


### 2.c) Visualize Training Statistics

In [10]:
tfdv.visualize_statistics(train_stats)

### 2.d) Infer Training Schema

In [11]:
# Infer the data schema by using the training statistics previously generated
dataset_schema = tfdv.infer_schema(train_stats)

# Display the data schema
tfdv.display_schema(dataset_schema)

Feature name,Type,Presence,Valency,Domain
'race',STRING,optional,single,'race'
'gender',STRING,required,,'gender'
'age',STRING,required,,'age'
'weight',STRING,optional,single,'weight'
'admission_type_id',INT,required,,-
'discharge_disposition_id',INT,required,,-
'admission_source_id',INT,required,,-
'time_in_hospital',INT,required,,-
'payer_code',STRING,optional,single,'payer_code'
'medical_specialty',STRING,optional,single,'medical_specialty'



Domain,Values
'race',"'AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other'"
'gender',"'Female', 'Male', 'Unknown/Invalid'"
'age',"'[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)'"
'weight',"'>200', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)'"
'payer_code',"'BC', 'CH', 'CM', 'CP', 'DM', 'FR', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC'"
'medical_specialty',"'AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'DCPTEAM', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Endocrinology-Metabolism', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Neurophysiology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Resident', 'Rheumatology', 'Speech', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology'"
'max_glu_serum',"'>200', '>300', 'None', 'Norm'"
'A1Cresult',"'>7', '>8', 'None', 'Norm'"
'metformin',"'Down', 'No', 'Steady', 'Up'"
'repaglinide',"'Down', 'No', 'Steady', 'Up'"


## Task 3 - Check Anomalies in Validation Dataset

### 3.a) Generate Statistics for Validation Data

In [12]:
val_stats = tfdv.generate_statistics_from_dataframe(df_val, stats_options=options_train)

# get the number of features used to compute statistics
print(f"Number of features used: {len(val_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples used: {val_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {val_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {val_stats.datasets[0].features[-1].path.step[0]}")

Number of features used: 48
Number of examples used: 15265
First feature: race
Last feature: target


### 3.b) Compare Validation with Training Statistics

In [13]:
tfdv.visualize_statistics(lhs_statistics=val_stats, rhs_statistics=train_stats,
                          lhs_name='VAL_DATASET', rhs_name='TRAIN_DATASET')

### 3.c) Detect Anomalies

In [14]:
val_anomalies = tfdv.validate_statistics(statistics=val_stats, schema=dataset_schema)
tfdv.display_anomalies(val_anomalies)


Feature name,Anomaly short description,Anomaly long description
'medical_specialty',Unexpected string values,"Examples contain values missing from the schema: SportsMedicine (<1%), Surgery-PlasticwithinHeadandNeck (<1%)."
'tolazamide',Unexpected string values,Examples contain values missing from the schema: Up (<1%).
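
Conceptually, an "Unexpected string values" anomaly means that some observed values fall outside the domain recorded in the schema. The following is a minimal plain-Python sketch of that idea, not TFDV's implementation; the function name `values_outside_domain` and the sample data are made up for illustration.

```python
from collections import Counter

def values_outside_domain(values, domain):
    """Map each value not in `domain` to its share of all non-missing values."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    total = len(observed)
    return {v: c / total for v, c in counts.items() if v not in domain}

# Hypothetical schema domain inferred from training data
domain = {'Down', 'No', 'Steady'}
# Hypothetical validation column containing one unseen value and one missing value
column = ['No', 'No', 'Steady', 'Up', None, 'No', 'Down', 'No', 'No', 'No', 'No']

print(values_outside_domain(column, domain))  # {'Up': 0.1}
```

Whether such a value is a data error or a legitimate category that the training sample missed is a judgment call; in this notebook the new values are treated as legitimate and added to the schema.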


### 3.d) Fix Validation Data Anomalies in Schema

In [15]:
# Get the domain associated with the input feature, tolazamide, from the schema
tolazamide_domain = tfdv.get_domain(dataset_schema, 'tolazamide') 
# Append the missing value to the domain
tolazamide_domain.value.append('Up')

# Get the domain associated with the input feature, medical_specialty, from the schema
medical_specialty_domain = tfdv.get_domain(dataset_schema, 'medical_specialty') 
# Append the missing values to the domain
medical_specialty_domain.value.append('SportsMedicine')
medical_specialty_domain.value.append('Surgery-PlasticwithinHeadandNeck')

# Re-calculate and re-display anomalies with the new schema
val_anomalies = tfdv.validate_statistics(statistics=val_stats, schema=dataset_schema)
tfdv.display_anomalies(val_anomalies)


## Task 4 - Check Anomalies in Test Dataset

### 4.a) Generate Statistics for Test Data

In [16]:
# Define new statistics options for the test (serving) data with tfdv.StatsOptions, passing the previously inferred schema
options_test = tfdv.StatsOptions(schema=dataset_schema, infer_type_from_schema=True, feature_allowlist=features_to_keep)

In [17]:
# Generate serving dataset statistics
test_stats = tfdv.generate_statistics_from_dataframe(df_test, stats_options=options_test)

# get the number of features used to compute statistics
print(f"Number of features used: {len(test_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples used: {test_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {test_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {test_stats.datasets[0].features[-1].path.step[0]}")

Number of features used: 47
Number of examples used: 15265
First feature: race
Last feature: diabetesMed


### 4.b) Compare Test with Training Statistics

In [18]:
test_anomalies = tfdv.validate_statistics(statistics=test_stats, schema=dataset_schema)
tfdv.display_anomalies(test_anomalies)

Feature name,Anomaly short description,Anomaly long description
'target',Column dropped,Column is completely missing
'medical_specialty',Unexpected string values,Examples contain values missing from the schema: Perinatology (<1%).
'miglitol',Unexpected string values,Examples contain values missing from the schema: Up (<1%).


### 4.c) Fix Test Data Anomalies in Schema

In [19]:
# Standardize categories for some features
domain_change_features = ['repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 
                          'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 
                          'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 
                          'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 
                          'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone']

for feature in domain_change_features:
    tfdv.set_domain(dataset_schema, feature, schema_pb2.StringDomain(value=['Down', 'No', 'Steady', 'Up']))

# Display new schema
tfdv.display_schema(dataset_schema)



Feature name,Type,Presence,Valency,Domain
'race',STRING,optional,single,'race'
'gender',STRING,required,,'gender'
'age',STRING,required,,'age'
'weight',STRING,optional,single,'weight'
'admission_type_id',INT,required,,-
'discharge_disposition_id',INT,required,,-
'admission_source_id',INT,required,,-
'time_in_hospital',INT,required,,-
'payer_code',STRING,optional,single,'payer_code'
'medical_specialty',STRING,optional,single,'medical_specialty'



Domain,Values
'race',"'AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other'"
'gender',"'Female', 'Male', 'Unknown/Invalid'"
'age',"'[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)'"
'weight',"'>200', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)'"
'payer_code',"'BC', 'CH', 'CM', 'CP', 'DM', 'FR', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC'"
'medical_specialty',"'AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'DCPTEAM', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Endocrinology-Metabolism', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Neurophysiology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Resident', 'Rheumatology', 'Speech', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology', 'SportsMedicine', 'Surgery-PlasticwithinHeadandNeck'"
'max_glu_serum',"'>200', '>300', 'None', 'Norm'"
'A1Cresult',"'>7', '>8', 'None', 'Norm'"
'metformin',"'Down', 'No', 'Steady', 'Up'"
'repaglinide',"'Down', 'No', 'Steady', 'Up'"


In [20]:
# Get the domain associated with the input feature, medical_specialty, from the schema
medical_specialty_domain = tfdv.get_domain(dataset_schema, 'medical_specialty') 
# Append the missing value to the domain
medical_specialty_domain.value.append('Perinatology')

# Re-calculate and re-display anomalies with the new schema
test_anomalies = tfdv.validate_statistics(statistics=test_stats, schema=dataset_schema)
tfdv.display_anomalies(test_anomalies)


Feature name,Anomaly short description,Anomaly long description
'target',Column dropped,Column is completely missing


### 4.d) Organize Schema based on Environment

In [21]:
# By default, all features belong to both the TRAINING and TESTING environments.
dataset_schema.default_environment.append('TRAINING')
dataset_schema.default_environment.append('TESTING')

In [22]:
# Specify that the 'target' feature is not in the TESTING environment.
tfdv.get_feature(dataset_schema, 'target').not_in_environment.append('TESTING')

# Re-calculate and re-display anomalies with the new environment parameters
test_anomalies = tfdv.validate_statistics(statistics=test_stats, schema=dataset_schema, environment='TESTING')
tfdv.display_anomalies(test_anomalies)
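
The effect of environments can be pictured as follows: the set of features expected in an environment is every schema feature except those that opt out of it. The plain-Python sketch below illustrates the idea only; `expected_features` and the dictionary layout are hypothetical and do not mirror TFDV's schema protobuf.

```python
def expected_features(schema, environment):
    """Return the feature names that must be present in `environment`."""
    return [
        name
        for name, spec in schema.items()
        if environment not in spec.get('not_in_environment', [])
    ]

# Hypothetical, simplified schema: only 'target' opts out of TESTING
schema = {
    'race': {},
    'diabetesMed': {},
    'target': {'not_in_environment': ['TESTING']},
}

print(expected_features(schema, 'TRAINING'))  # ['race', 'diabetesMed', 'target']
print(expected_features(schema, 'TESTING'))   # ['race', 'diabetesMed']
```

With `target` excluded from the TESTING environment, its absence from the test data is no longer reported as a dropped column.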


## Task 5 - Check for Data Drift and Skew

In [23]:
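# Set the skew threshold for the diabetesMed feature (training vs. serving data)
diabetes_med = tfdv.get_feature(dataset_schema, 'diabetesMed')
diabetes_med.skew_comparator.infinity_norm.threshold = 0.0 # 0.0 flags any difference; domain knowledge helps to determine this threshold

# Set the drift threshold for the payer_code feature (current vs. previous data)
payer_code = tfdv.get_feature(dataset_schema, 'payer_code')
payer_code.drift_comparator.infinity_norm.threshold = 0.0 # 0.0 flags any difference; domain knowledge helps to determine this threshold

# Validate the training statistics, using the validation data as the previous
# dataset (drift) and the test data as the serving dataset (skew)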
skew_drift_anomalies = tfdv.validate_statistics(train_stats, dataset_schema,
                                          previous_statistics=val_stats,
                                          serving_statistics=test_stats)

# Display anomalies
tfdv.display_anomalies(skew_drift_anomalies)

Feature name,Anomaly short description,Anomaly long description
'diabetesMed',High Linfty distance between training and serving,"The Linfty distance between training and serving is 0.00572019 (up to six significant digits), above the threshold 0. The feature value with maximum difference is: No"
'payer_code',High Linfty distance between current and previous,"The Linfty distance between current and previous is 0.00499293 (up to six significant digits), above the threshold 0. The feature value with maximum difference is: MC"
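
The "Linfty distance" reported above is the L-infinity norm of the difference between two categorical distributions: the largest absolute gap in relative frequency for any single value. Below is a minimal sketch in plain Python, assuming that definition; `linf_distance` and the sample data are illustrative and not part of TFDV.

```python
from collections import Counter

def linf_distance(values_a, values_b):
    """Return (gap, value) for the largest difference in relative frequency.

    Assumes both inputs are non-empty sequences of categorical values.
    """
    freq_a, freq_b = Counter(values_a), Counter(values_b)
    n_a, n_b = len(values_a), len(values_b)
    best_gap, best_value = 0.0, None
    for value in sorted(set(freq_a) | set(freq_b)):
        gap = abs(freq_a[value] / n_a - freq_b[value] / n_b)
        if gap > best_gap:
            best_gap, best_value = gap, value
    return best_gap, best_value

# Made-up payer codes for two batches of 100 records each
current = ['MC'] * 50 + ['HM'] * 30 + ['SP'] * 20
previous = ['MC'] * 44 + ['HM'] * 32 + ['SP'] * 24

distance, value = linf_distance(current, previous)
print(round(distance, 4), value)  # 0.06 MC
```

Because the thresholds in this notebook are set to 0.0, any nonzero gap is reported, which is why differences well under one percentage point appear as anomalies above.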


## Task 6 - Display Stats for Data Slices

In [24]:
def split_datasets(dataset_list):
    '''
    split datasets.

            Parameters:
                    dataset_list: List of datasets to split

            Returns:
                    datasets: sliced data
    '''
    datasets = []
    for dataset in dataset_list.datasets:
        proto_list = DatasetFeatureStatisticsList()
        proto_list.datasets.extend([dataset])
        datasets.append(proto_list)
    return datasets

In [25]:
def display_stats_at_index(index, datasets):
    '''
    display statistics at the specified data index

            Parameters:
                    index : index to show the anomalies
                    datasets: split data

            Returns:
                    display of generated sliced data statistics at the specified index
    '''
    if index < len(datasets):
        print(datasets[index].datasets[0].name)
        tfdv.visualize_statistics(datasets[index])

In [26]:
def sliced_stats_for_slice_fn(slice_fn, approved_cols, dataframe, schema):
    '''
    generate statistics for the sliced data.

            Parameters:
                    slice_fn : slicing definition
                    approved_cols: list of features to pass to the statistics options
                    dataframe: pandas dataframe to slice
                    schema: the schema

            Returns:
                    slice_info_datasets: statistics for the sliced dataset
    '''
    # Set the StatsOptions
    slice_stats_options = tfdv.StatsOptions(schema=schema,
                                            slice_functions=[slice_fn],
                                            infer_type_from_schema=True,
                                            feature_allowlist=approved_cols)
    
    # Convert the DataFrame to CSV since `slice_functions` works only with `tfdv.generate_statistics_from_csv`
    CSV_PATH = 'slice_sample.csv'
    dataframe.to_csv(CSV_PATH, index=False)  # omit the row index so it is not written as an extra unnamed column
    
    # Calculate statistics for the sliced dataset
    sliced_stats = tfdv.generate_statistics_from_csv(CSV_PATH, stats_options=slice_stats_options)
    
    # Split the dataset using the previously defined split_datasets function
    slice_info_datasets = split_datasets(sliced_stats)
    
    return slice_info_datasets

In [27]:
# Generate slice function for the `medical_specialty` feature
slice_fn = slicing_util.get_feature_value_slicer(features={'medical_specialty': None})

# Generate stats for the sliced dataset
slice_datasets = sliced_stats_for_slice_fn(slice_fn, features_to_keep, dataframe=df_train, schema=dataset_schema)

# Print the names of the slices for reference
print('Statistics generated for:\n')
print('\n'.join([sliced.datasets[0].name for sliced in slice_datasets]))

# Display the slice at index 10, which in the list printed below is `medical_specialty_Surgery-Neuro`
display_stats_at_index(10, slice_datasets)





Statistics generated for:

All Examples
medical_specialty_Family/GeneralPractice
medical_specialty_Cardiology
medical_specialty_Gynecology
medical_specialty_InternalMedicine
medical_specialty_Pulmonology
medical_specialty_Orthopedics-Reconstructive
medical_specialty_Orthopedics
medical_specialty_Emergency/Trauma
medical_specialty_Urology
medical_specialty_Surgery-Neuro
medical_specialty_Surgery-General
medical_specialty_Hematology/Oncology
medical_specialty_Nephrology
medical_specialty_Pediatrics
medical_specialty_Surgery-Cardiovascular/Thoracic
medical_specialty_Psychiatry
medical_specialty_ObstetricsandGynecology
medical_specialty_Oncology
medical_specialty_Radiologist
medical_specialty_Surgery-Vascular
medical_specialty_Obstetrics
medical_specialty_PhysicalMedicineandRehabilitation
medical_specialty_Neurology
medical_specialty_Gastroenterology
medical_specialty_Ophthalmology
medical_specialty_Endocrinology
medical_specialty_Pathology
medical_specialty_Anesthesiology
medical_specialt
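
As the slice names above suggest, a feature-value slicer with `None` as the value produces one slice per distinct value of the feature, in addition to the overall "All Examples" slice. The sketch below mimics that grouping in plain Python on made-up rows. `slice_by_feature` is a hypothetical helper, not TFDV's implementation, and leaving rows with a missing value out of the per-value slices is an assumption.

```python
from collections import defaultdict

def slice_by_feature(rows, feature):
    """Group rows into named slices, one per distinct value of `feature`."""
    slices = defaultdict(list)
    for row in rows:
        slices['All Examples'].append(row)
        value = row.get(feature)
        if value is not None:  # assumption: missing values join no value slice
            slices[f'{feature}_{value}'].append(row)
    return dict(slices)

# Made-up rows for illustration
rows = [
    {'medical_specialty': 'Cardiology', 'time_in_hospital': 3},
    {'medical_specialty': 'Urology', 'time_in_hospital': 1},
    {'medical_specialty': 'Cardiology', 'time_in_hospital': 5},
    {'medical_specialty': None, 'time_in_hospital': 2},
]

for name, members in slice_by_feature(rows, 'medical_specialty').items():
    print(name, len(members))
# All Examples 4
# medical_specialty_Cardiology 2
# medical_specialty_Urology 1
```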

## Task 7 - Finalize the Schema

In [28]:
# Create output directory
OUTPUT_DIR = "output"
file_io.recursive_create_dir(OUTPUT_DIR)

# Use TensorFlow text output format pbtxt to store the schema
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')

# The write_schema_text function expects the schema and the output path as parameters
tfdv.write_schema_text(dataset_schema, schema_file) 

In [29]:
print('Total time for the script:', (datetime.now() - start_time_script))

Total time for the script: 0:01:51.723062
