# Data Pipeline Components for Production ML

This notebook covers the first three steps of a production machine learning project: data ingestion, data validation, and data transformation.

Specifically, you will build the production data pipeline by:

*   Performing feature selection
*   Ingesting the dataset
*   Generating the statistics of the dataset
*   Creating a schema as per the domain knowledge
*   Creating schema environments
*   Visualizing the dataset anomalies
*   Preprocessing, transforming and engineering your features
*   Tracking the provenance of your data pipeline using ML Metadata

## 1 - Package Installation and Imports

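We used TensorFlow 2.10.1 together with version 1.11.0 of TFX, TensorFlow Data Validation, and TensorFlow Transform. The cell below imports these packages and prints their versions.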

In [1]:
import tensorflow as tf
from tfx import v1 as tfx

# TFX libraries
import tensorflow_data_validation as tfdv
import tensorflow_transform as tft
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

# For performing feature selection
#from sklearn.feature_selection import SelectKBest, f_classif

# For feature visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Utilities
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2
from google.protobuf.json_format import MessageToDict
from tfx.proto import example_gen_pb2
from tfx.types import standard_artifacts
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils
import tensorflow_transform.beam as tft_beam
import os
import pprint
import tempfile
import pandas as pd

# To ignore warnings from TF
tf.get_logger().setLevel('ERROR')

# For formatting print statements
pp = pprint.PrettyPrinter()

# Display versions of TF and TFX related packages
print('TensorFlow version: {}'.format(tf.__version__))
print('TFX version: {}'.format(tfx.__version__))
print('TensorFlow Data Validation version: {}'.format(tfdv.__version__))
print('TensorFlow Transform version: {}'.format(tft.__version__))

TensorFlow version: 2.10.1
TFX version: 1.11.0
TensorFlow Data Validation version: 1.11.0
TensorFlow Transform version: 1.11.0


### 1.1 - Define paths

You will define a few global variables to indicate paths in the local workspace.

In [9]:
# In case you need to restart the workspace
#import shutil

#shutil.rmtree(os.path.join('.', 'pipeline'), ignore_errors=True)
#shutil.rmtree(os.path.join('.', 'data'), ignore_errors=True)

In [17]:
# Declare paths to the data (os.path.join avoids invalid backslash escapes such as '\d')
DATA_DIR = os.path.join('.', 'data')

# Path to the raw training data
TRAINING_DATA = os.path.join(DATA_DIR, 'A_E_Fire_Dataset.csv')

### 1.2 - Preview the dataset

In [12]:
# Load the dataset to a dataframe
df = pd.read_csv(TRAINING_DATA)

# Preview the dataset
df.head()

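|   | SIZE | FUEL | DISTANCE | DESIBEL | AIRFLOW | FREQUENCY | STATUS |
|---|------|------|----------|---------|---------|-----------|--------|
| 0 | 1 | gasoline | 10 | 96 | 0.0 | 75 | 0 |
| 1 | 1 | gasoline | 10 | 96 | 0.0 | 72 | 1 |
| 2 | 1 | gasoline | 10 | 96 | 2.6 | 70 | 1 |
| 3 | 1 | gasoline | 10 | 96 | 3.2 | 68 | 1 |
| 4 | 1 | gasoline | 10 | 109 | 4.5 | 67 | 1 |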


In [13]:
# Show the data type of each column
df.dtypes

SIZE           int64
FUEL          object
DISTANCE       int64
DESIBEL        int64
AIRFLOW      float64
FREQUENCY      int64
STATUS         int64
dtype: object

## 2 - Data Pipeline

With the selected subset of features prepared, you can now start building the data pipeline. This involves ingesting, validating, and transforming your data. You will use TFX components for each of these steps.

### 2.2 - Create the Interactive Context

You will first set up the `InteractiveContext` so you can manually execute the pipeline components from the notebook. You will save the sqlite database in a pre-defined directory in your workspace. Please do not modify this path because you will need it in a later exercise involving ML Metadata.

In [18]:
# Location of the pipeline metadata store
PIPELINE_DIR = os.path.join('.', 'pipeline')

# Initialize the InteractiveContext with a local sqlite file.
# If you leave `pipeline_root` blank, then the db will be created in a temporary directory.
# You can safely ignore the warning about the missing config file.
context = InteractiveContext(pipeline_root=PIPELINE_DIR)



### 2.3 - Generating Examples

The first step in the pipeline is to ingest the data. Using [ExampleGen](https://www.tensorflow.org/tfx/guide/examplegen), you can convert raw data to TFRecords for faster computation in the later stages of the pipeline.

#### ExampleGen

You will start the pipeline with the [ExampleGen](https://www.tensorflow.org/tfx/guide/examplegen) component. This will:

*   split the data into training and evaluation sets (by default: 2/3 train, 1/3 eval).
*   convert each data row into `tf.train.Example` format. This [protocol buffer](https://developers.google.com/protocol-buffers) is designed for TensorFlow operations and is used by the TFX components.
*   compress and save the data collection under the pipeline root directory (`PIPELINE_DIR` in this notebook) for other components to access. These examples are stored in `TFRecord` format. This optimizes read and write operations within TensorFlow, especially if you have a large collection of data.
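The 2/3 to 1/3 default in the first bullet comes from hash buckets: each record is hashed into one of three buckets, two of which feed `train` and one of which feeds `eval`. The sketch below imitates that idea in plain Python. It is only an illustration: `assign_split` is a hypothetical helper, the `row-N` strings stand in for CSV lines, and the MD5-based hash is an assumption chosen for reproducibility, not the hash function TFX uses internally.

```python
import hashlib

def assign_split(record: str, buckets=(("train", 2), ("eval", 1))) -> str:
    """Map a record to a split name using cumulative hash buckets."""
    total = sum(n for _, n in buckets)
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.md5(record.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total
    upper = 0
    for name, n in buckets:
        upper += n
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always below the total")

# Hypothetical rows standing in for lines of the CSV file.
rows = [f"row-{i}" for i in range(3000)]
counts = {"train": 0, "eval": 0}
for row in rows:
    counts[assign_split(row)] += 1

# The counts should land close to a 2:1 ratio, but not exactly on it.
print(counts)
```

Because the assignment depends only on the record's content, the same row always lands in the same split, which is why the resulting ratio is approximate rather than exact.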

Its constructor takes the path to your data source/directory. In our case, this is the `DATA_DIR` path. The component supports several data sources such as CSV, TFRecord, and BigQuery. Since our data is a CSV file, we will use [CsvExampleGen](https://www.tensorflow.org/tfx/api_docs/python/tfx/components/CsvExampleGen) to ingest the data.

Run the cell below to instantiate `CsvExampleGen`.

In [None]:
# NOTE: Uncomment and run this if you get an error saying there are different 
# headers in the dataset. This is usually because of the notebook checkpoints saved in 
# that folder.

#import shutil
#shutil.rmtree(os.path.join(DATA_DIR, '.ipynb_checkpoints'), ignore_errors=True)

In [19]:
# Instantiate ExampleGen with the input CSV dataset
example_gen = tfx.components.CsvExampleGen(input_base=DATA_DIR)

# Execute the component
context.run(example_gen)

.execution_id: 2
.component: CsvExampleGen
.component.inputs: {}
.component.outputs > ['examples'] > Channel of type 'Examples' (1 artifact) > ._artifacts > [0]: Artifact of type 'Examples'
.type: <class 'tfx.types.standard_artifacts.Examples'>
.uri: .\pipeline\CsvExampleGen\examples\2
.span: 0
.split_names: ["train", "eval"]
.version: 0
.exec_properties:
['input_base']: .\data
['input_config']: { "splits": [ { "name": "single_split", "pattern": "*" } ] }
['output_config']: { "split_config": { "splits": [ { "hash_buckets": 2, "name": "train" }, { "hash_buckets": 1, "name": "eval" } ] } }
['output_data_format']: 6
['output_file_format']: 5
['custom_config']: None
['range_config']: None
['span']: 0
['version']: None
['input_fingerprint']: split:single_split,num_files:1,total_bytes:467055,xor_checksum:1670948423,sum_checksum:1670948423


Running the component automatically displays an output cell with the execution results. This metadata is recorded in the database created earlier, which lets you keep track of your project runs. For example, if you run the component again, you will see the `.execution_id` increment.
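If you are curious about what the metadata database contains, one option is to open the sqlite file with Python's standard `sqlite3` module and list its tables. The sketch below is a minimal illustration: `list_tables` is a hypothetical helper, and the file name `metadata.sqlite` under the pipeline directory is an assumption about where the `InteractiveContext` stores its database, so check your workspace for the actual file.

```python
import os
import sqlite3

def list_tables(db_path: str) -> list:
    """Return the names of all tables in a SQLite database file."""
    if not os.path.exists(db_path):
        # sqlite3.connect() would silently create an empty file otherwise.
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# Assumed location of the metadata store; adjust to your workspace.
db_path = os.path.join(".", "pipeline", "metadata.sqlite")
if os.path.exists(db_path):
    print(list_tables(db_path))
```

This only reads the table names and does not modify the store.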

The outputs of the components are called *artifacts*, and you can see an example by navigating through `.component.outputs > ['examples'] > Channel > ._artifacts > [0]` above. It shows information such as where the converted data is stored (`.uri`) and the splits generated (`.split_names`).

You can also examine the output artifacts programmatically with the code below.

In [20]:
# Get the artifact object
artifact = example_gen.outputs['examples'].get()[0]

# Print split names and uri
print(f'split names: {artifact.split_names}')
print(f'artifact uri: {artifact.uri}')

split names: ["train", "eval"]
artifact uri: .\pipeline\CsvExampleGen\examples\2


As mentioned, the ingested data is stored in the directory shown in the `uri` field. It is also compressed using `gzip`, which you can verify by inspecting the files of the training split located by the cell below.

In [22]:
# Get the URI of the output artifact representing the training examples
train_uri = os.path.join(artifact.uri, 'Split-train')
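One simple way to check the gzip claim without TensorFlow is to look at the first two bytes of a file: every gzip stream starts with the bytes `0x1f 0x8b`. The sketch below shows this on a temporary demo file. `is_gzip_file` is a hypothetical helper; in the notebook you would call it on each file in the train split directory instead.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def is_gzip_file(path: str) -> bool:
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

# Self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.gz")
    with gzip.open(demo_path, "wb") as f:
        f.write(b"hello")
    print(is_gzip_file(demo_path))  # True
```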

In a notebook environment, it may be useful to examine a few examples of the data, especially if you're still experimenting. Since the data collection is saved in [TFRecord format](https://www.tensorflow.org/tutorials/load_data/tfrecord), you will need to use methods that work with that data type. You will need to unpack the individual examples from the `TFRecord` file and format them for printing.

In [24]:
# Get the list of files in this directory (all compressed TFRecord files)
tfrecord_filenames = [os.path.join(train_uri, name)
                      for name in os.listdir(train_uri)]

# Create a 'TFRecordDataset' to read these files
dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type="GZIP")
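To make the `TFRecord` container less opaque, here is a rough pure-Python sketch of its framing: each record is an 8-byte little-endian length, a 4-byte checksum of that length, the payload, and a 4-byte checksum of the payload. The hypothetical `iter_tfrecord_payloads` helper reads and discards both checksum fields, so unlike `tf.data.TFRecordDataset` it will not detect corruption. The demo file is framed by hand with zeroed checksums and is therefore only readable by this sketch.

```python
import gzip
import os
import struct
import tempfile

def iter_tfrecord_payloads(path: str, compressed: bool = True):
    """Yield the raw payload bytes of each record in a TFRecord file."""
    opener = gzip.open if compressed else open
    with opener(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                return  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header)
            f.read(4)  # checksum of the length field (ignored here)
            payload = f.read(length)
            if len(payload) < length:
                raise ValueError("truncated record payload")
            f.read(4)  # checksum of the payload (ignored here)
            yield payload

def frame(payload: bytes) -> bytes:
    """Frame a payload with zeroed checksum fields (demo only)."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.tfrecord.gz")
    with gzip.open(demo_path, "wb") as f:
        f.write(frame(b"first") + frame(b"second"))
    print(list(iter_tfrecord_payloads(demo_path)))  # [b'first', b'second']
```

In the files written by ExampleGen, each payload is a serialized `tf.train.Example`, which is what `tf.data.TFRecordDataset` hands back as a string tensor.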

In [None]:
# Define a helper function to get individual examples
def get_records(dataset, num_records):
    '''Extracts records from the given dataset.
    Args:
        dataset (TFRecordDataset): dataset saved by ExampleGen
        num_records (int): number of records to preview
    Returns:
        list of dict: the requested records as Python dictionaries
    '''

    # Initialize an empty list
    records = []

    # Use the 'take()' method to specify how many records to get
    for tfrecord in dataset.take(num_records):

        # Get the numpy property of the tensor
        serialized_example = tfrecord.numpy()

        # Initialize a `tf.train.Example()` to read the serialized data
        example = tf.train.Example()

        # Read the example data (output is a protocol buffer message)
        example.ParseFromString(serialized_example)

        # Convert the protocol buffer message to a Python dictionary
        example_dict = MessageToDict(example)

        # Append to the records list
        records.append(example_dict)

    return records


In [None]:
# Get 3 records from the dataset
sample_records = get_records(dataset, 3)

# Print the output
pp.pprint(sample_records)

### 2.4 - Computing Statistics

Next, you will compute the statistics of your data. This will allow you to observe and analyze characteristics of your data through visualizations provided by the integrated [FACETS](https://pair-code.github.io/facets/) library.

#### StatisticsGen

In [25]:
# Instantiate StatisticsGen with the ExampleGen ingested dataset
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])

# Execute the component
context.run(statistics_gen)

.execution_id: 3
.component: StatisticsGen
.component.inputs > ['examples'] > Channel of type 'Examples' (1 artifact) > ._artifacts > [0]: Artifact of type 'Examples'
.type: <class 'tfx.types.standard_artifacts.Examples'>
.uri: .\pipeline\CsvExampleGen\examples\2
.span: 0
.split_names: ["train", "eval"]
.version: 0
.component.outputs > ['statistics'] > Channel of type 'ExampleStatistics' (1 artifact) > ._artifacts > [0]: Artifact of type 'ExampleStatistics'
.type: <class 'tfx.types.standard_artifacts.ExampleStatistics'>
.uri: .\pipeline\StatisticsGen\statistics\3
.span: 0
.split_names: ["train", "eval"]
.exec_properties:
['stats_options_json']: None
['exclude_splits']: []

In [26]:
# Show the output statistics
context.show(statistics_gen.outputs['statistics'])

### 2.5 - SchemaGen
You will need to create a schema to validate incoming datasets during training and serving. Fortunately, TFX allows you to infer a first draft of this schema with the [SchemaGen](https://www.tensorflow.org/tfx/guide/schemagen) component.

Like `StatisticsGen`, the [SchemaGen](https://www.tensorflow.org/tfx/guide/schemagen) component uses TFDV under the hood, here to infer a schema from your data statistics. A schema defines the expected bounds, types, and properties of the features in your dataset.

`SchemaGen` will take as input the statistics that we generated with `StatisticsGen`, looking at the training split by default.

In [27]:
# Instantiate SchemaGen with the statistics produced by StatisticsGen
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'])

# Run the component
context.run(schema_gen)

0,1
.execution_id,4
.component,SchemaGen
.component.inputs,['statistics'] Channel of type 'ExampleStatistics' (1 artifact)
.component.outputs,['schema'] Channel of type 'Schema' (1 artifact)
.exec_properties,"['infer_feature_shape'] 1, ['exclude_splits'] []"

0,1
.type,<class 'tfx.types.standard_artifacts.ExampleStatistics'>
.uri,.\pipeline\StatisticsGen\statistics\3
.span,0
.split_names,"[""train"", ""eval""]"

0,1
.type,<class 'tfx.types.standard_artifacts.Schema'>
.uri,.\pipeline\SchemaGen\schema\4

In [28]:
# Visualize the schema
context.show(schema_gen.outputs['schema'])

Feature name,Type,Presence,Valency,Domain
'AIRFLOW',FLOAT,required,,-
'DESIBEL',INT,required,,-
'DISTANCE',INT,required,,-
'FREQUENCY',INT,required,,-
'FUEL',STRING,required,,'FUEL'
'SIZE',INT,required,,-
'STATUS',INT,required,,-

Domain,Values
'FUEL',"'gasoline', 'kerosene', 'lpg', 'thinner'"
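
To make the idea of an inferred schema concrete, here is a toy, pure-Python sketch that derives a type for each feature and a value domain for string features from a handful of records. It is an illustration only and not TFDV's inference algorithm; the function name `infer_toy_schema` and the sample rows are made up for this example.

```python
# Toy illustration only: this is NOT how TFDV infers a schema.
def infer_toy_schema(rows):
    """Infer a type per feature and a value domain for string features."""
    schema = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, str) for v in values):
            # String features get a domain listing every observed value.
            schema[name] = {"type": "STRING", "domain": sorted(set(values))}
        elif all(isinstance(v, int) for v in values):
            schema[name] = {"type": "INT", "domain": None}
        else:
            schema[name] = {"type": "FLOAT", "domain": None}
    return schema


# Made-up records that reuse some feature names from the schema above.
sample_rows = [
    {"AIRFLOW": 0.0, "SIZE": 1, "FUEL": "gasoline"},
    {"AIRFLOW": 2.6, "SIZE": 3, "FUEL": "lpg"},
    {"AIRFLOW": 11.4, "SIZE": 7, "FUEL": "kerosene"},
]

toy_schema = infer_toy_schema(sample_rows)
for feature, spec in toy_schema.items():
    print(feature, spec)
```

Note that, as in the table above, numeric features end up with a type but no domain, which is why the ranges have to be added by hand.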


### 2.6 - Curating the schema

The inferred schema captures the data types correctly and lists the expected values for the qualitative (i.e. string) feature `FUEL`. It does not, however, constrain the numeric features. Restrict the following features to the ranges below so the pipeline can detect invalid values before they are fed to the model:

* `SIZE`: 1 to 7
* `STATUS`: 0 to 1

Use [TFDV](https://www.tensorflow.org/tfx/data_validation/get_started) to update the inferred schema with these value ranges.


In [29]:
# Get the schema uri
schema_uri = schema_gen.outputs['schema'].get()[0].uri

# Get the schema pbtxt file from the SchemaGen output
schema = tfdv.load_schema_text(os.path.join(schema_uri, 'schema.pbtxt'))

In [30]:
# Set `SIZE` to categorical having minimum value of 1 and maximum value of 7
tfdv.set_domain(schema, 'SIZE', schema_pb2.IntDomain(name='SIZE', min=1, max=7, is_categorical=True))

# Set `STATUS` to categorical having minimum value of 0 and maximum value of 1
tfdv.set_domain(schema, 'STATUS', schema_pb2.IntDomain(name='STATUS', min=0, max=1, is_categorical=True))

tfdv.display_schema(schema=schema)



Feature name,Type,Presence,Valency,Domain
'AIRFLOW',FLOAT,required,,-
'DESIBEL',INT,required,,-
'DISTANCE',INT,required,,-
'FREQUENCY',INT,required,,-
'FUEL',STRING,required,,'FUEL'
'SIZE',INT,required,,min: 1; max: 7
'STATUS',INT,required,,min: 0; max: 1

Domain,Values
'FUEL',"'gasoline', 'kerosene', 'lpg', 'thinner'"
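
As a rough picture of what a `min`/`max` domain buys you, the sketch below checks made-up records against the two ranges set above and reports every value that falls outside them. The helper `out_of_range` and the sample records are hypothetical; TFDV performs its own check on dataset statistics rather than on individual rows like this.

```python
# Hypothetical helper for illustration; not part of TFDV.
INT_DOMAINS = {"SIZE": (1, 7), "STATUS": (0, 1)}


def out_of_range(records, domains=INT_DOMAINS):
    """Return (row index, feature, value) for each value outside its domain."""
    violations = []
    for index, record in enumerate(records):
        for feature, (low, high) in domains.items():
            value = record[feature]
            if not low <= value <= high:
                violations.append((index, feature, value))
    return violations


# Made-up records: the second and third each break one constraint.
records = [
    {"SIZE": 3, "STATUS": 1},
    {"SIZE": 9, "STATUS": 0},  # SIZE above the maximum of 7
    {"SIZE": 1, "STATUS": 2},  # STATUS above the maximum of 1
]

violations = out_of_range(records)
print(violations)
```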


### ExampleValidator

The [ExampleValidator](https://www.tensorflow.org/tfx/guide/exampleval) component detects anomalies in your data based on the generated schema from the previous step. Like the previous two components, it also uses TFDV under the hood. 

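`ExampleValidator` takes as input the statistics from `StatisticsGen` and a schema artifact, here the one produced by `SchemaGen`. By default, it compares the statistics from the evaluation split to the schema from the training split. Note that edits made in memory with `tfdv.set_domain` are not part of that artifact; they take effect here only if the curated schema is saved and passed to the component instead.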

In [None]:
# Instantiate ExampleValidator with the statistics from StatisticsGen and the schema from SchemaGen
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])

# Run the component.
context.run(example_validator)

In [None]:
# Visualize the results
context.show(example_validator.outputs['anomalies'])

If no anomalies are reported, you can proceed to the next step in the pipeline.
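
To illustrate the kind of comparison described in this section, the following pure-Python sketch summarizes a made-up evaluation split and checks the summary against a small hand-written schema. It is a simplified stand-in, not the logic `ExampleValidator` or TFDV actually runs; the functions `summarize` and `find_anomalies`, the schema dictionary layout, and the sample rows are all assumptions for this example.

```python
def summarize(rows):
    """Collect, per feature, how many rows contain it and which values occur."""
    stats = {}
    for row in rows:
        for name, value in row.items():
            entry = stats.setdefault(name, {"count": 0, "values": set()})
            entry["count"] += 1
            entry["values"].add(value)
    return stats


def find_anomalies(schema, rows):
    """Compare summary statistics of `rows` against a toy schema."""
    num_rows = len(rows)
    stats = summarize(rows)
    anomalies = {}
    for name, spec in schema.items():
        entry = stats.get(name, {"count": 0, "values": set()})
        # A required feature must be present in every example.
        if spec["required"] and entry["count"] < num_rows:
            missing = num_rows - entry["count"]
            anomalies[name] = f"missing in {missing} of {num_rows} examples"
            continue
        # Features with a domain may only take the listed values.
        if spec["domain"] is not None:
            unexpected = sorted(entry["values"] - set(spec["domain"]))
            if unexpected:
                anomalies[name] = f"unexpected values: {unexpected}"
    return anomalies


# Toy schema and evaluation rows (made up for this sketch).
schema = {
    "FUEL": {"required": True,
             "domain": ["gasoline", "kerosene", "lpg", "thinner"]},
    "SIZE": {"required": True, "domain": None},
}
eval_rows = [
    {"FUEL": "lpg", "SIZE": 2},
    {"FUEL": "diesel", "SIZE": 4},  # value outside the FUEL domain
    {"FUEL": "thinner"},            # SIZE is missing
]

anomalies = find_anomalies(schema, eval_rows)
for feature, description in anomalies.items():
    print(feature, "->", description)
```

An empty result from a check like this corresponds to the "no anomalies" case in which the pipeline can move on.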

### Transform
The [Transform](https://www.tensorflow.org/tfx/guide/transform) component performs feature engineering for both training and serving datasets. It uses the [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) library introduced in the first ungraded lab of this week.

`Transform` will take as input the data from `ExampleGen`, the schema from `SchemaGen`, as well as a module containing the preprocessing function.

In this section, you will work through an example of user-defined Transform code. The pipeline needs to load this code as a module, so you will use the magic command `%%writefile` to save it to disk. Let's first define a few constants that group the data's attributes according to the transforms we will perform later. This file will also be saved locally.
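
The reason the code has to live in a file can be sketched with the standard library alone: once a module is on disk, any process can load it from its path. The example below writes a small stand-in module to a temporary directory and imports it with `importlib`. The module name `demo_constants` and its contents are made up, and this is only an analogy for how a pipeline component might load a module file, not a description of TFX internals.

```python
import importlib.util
import os
import tempfile

# Hypothetical module contents, standing in for a file saved with %%writefile.
module_source = '''
LABEL_KEY = 'STATUS'

def transformed_name(key):
    return key + '_xf'
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = os.path.join(tmp_dir, "demo_constants.py")
    with open(module_path, "w") as handle:
        handle.write(module_source)

    # Load the module from its file path rather than from notebook memory.
    spec = importlib.util.spec_from_file_location("demo_constants", module_path)
    demo_constants = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(demo_constants)

print(demo_constants.transformed_name(demo_constants.LABEL_KEY))
```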

In [None]:
# Set the constants module filename
_aefire_constants_module_file = 'aefire_constants.py'

In [None]:
%%writefile {_aefire_constants_module_file}

# Features to be scaled to the z-score
DENSE_FLOAT_FEATURE_KEYS = ['AIRFLOW']

# Features to bucketize
BUCKET_FEATURE_KEYS = ['DISTANCE']

# Number of buckets used by tf.transform for encoding each feature.
FEATURE_BUCKET_COUNT = {'DISTANCE': 5}

# Features to scale from 0 to 1
RANGE_FEATURE_KEYS = ['DESIBEL', 'FREQUENCY', 'SIZE']

# Number of vocabulary terms used for encoding VOCAB_FEATURES by tf.transform
VOCAB_SIZE = 15

# Count of out-of-vocab buckets in which unrecognized VOCAB_FEATURES are hashed.
OOV_SIZE = 5

# Features with string data types that will be converted to indices
VOCAB_FEATURE_KEYS = ['FUEL']

# Feature that the model will predict
STATUS_KEY = 'STATUS'

# Utility function for renaming the feature
def transformed_name(key):
    return key + '_xf'
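
The constants above assign each numeric feature to one of three transforms. As a rough, pure-Python sketch of what those transforms compute (not TensorFlow Transform's implementation), the snippet below fits each one on made-up training values and returns a function that can then be applied to any value. The helper names, the sample values, the use of the population standard deviation, and the quantile method are all assumptions made for illustration.

```python
import bisect
import statistics


def fit_z_score(train_values):
    """Return a function that standardizes a value using training mean/std."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population std (an assumption)
    return lambda x: (x - mean) / std


def fit_min_max(train_values):
    """Return a function that maps the training range onto [0, 1]."""
    low, high = min(train_values), max(train_values)
    return lambda x: (x - low) / (high - low)


def fit_buckets(train_values, num_buckets):
    """Return a function that maps a value to a quantile bucket index."""
    # quantiles() yields num_buckets - 1 cut points between the buckets.
    boundaries = statistics.quantiles(train_values, n=num_buckets)
    return lambda x: bisect.bisect_right(boundaries, x)


# Made-up training values for three features.
airflow_train = [0.0, 2.0, 4.0, 6.0, 8.0]
decibel_train = [72, 90, 108]
distance_train = list(range(10, 200, 10))  # 10, 20, ..., 190

to_z = fit_z_score(airflow_train)
to_unit = fit_min_max(decibel_train)
to_bucket = fit_buckets(distance_train, num_buckets=5)

print(to_z(4.0), to_unit(90), to_bucket(10), to_bucket(190))
```

Fitting on the training values and reusing the fitted function elsewhere mirrors the motivation for a transform step: the same constants are applied consistently to every split and at serving time. A value outside the training range would fall outside [0, 1] under this simple min-max sketch.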

Next, you will work on the module that contains `preprocessing_fn()`. This function defines how you will transform the raw data into features that your model can train on (i.e. the next step in the pipeline). You will use the [tft module functions](https://www.tensorflow.org/tfx/transform/api_docs/python/tft) to make these transformations.

In [None]:
# Set the transform module filename
_aefire_transform_module_file = 'aefire_transform.py'

In [None]:
%%writefile {_aefire_transform_module_file}

import tensorflow as tf
import tensorflow_transform as tft

import aefire_constants

# Unpack the contents of the constants module
_DENSE_FLOAT_FEATURE_KEYS = aefire_constants.DENSE_FLOAT_FEATURE_KEYS
_BUCKET_FEATURE_KEYS = aefire_constants.BUCKET_FEATURE_KEYS
_FEATURE_BUCKET_COUNT = aefire_constants.FEATURE_BUCKET_COUNT
_RANGE_FEATURE_KEYS = aefire_constants.RANGE_FEATURE_KEYS
_VOCAB_SIZE = aefire_constants.VOCAB_SIZE
_OOV_SIZE = aefire_constants.OOV_SIZE
_VOCAB_FEATURE_KEYS = aefire_constants.VOCAB_FEATURE_KEYS
_STATUS_KEY = aefire_constants.STATUS_KEY
_transformed_name = aefire_constants.transformed_name

# Define the transformations
def preprocessing_fn(inputs):
    """tf.transform's callback function for preprocessing inputs.
    Args:
        inputs: map from feature keys to raw not-yet-transformed features.
    Returns:
        Map from string feature key to transformed feature operations.
    """
    outputs = {}

    # Scale these features to the z-score.
    for key in _DENSE_FLOAT_FEATURE_KEYS:
        outputs[_transformed_name(key)] = tft.scale_to_z_score(inputs[key])

    # Bucketize the feature
    for key in _BUCKET_FEATURE_KEYS:
        outputs[_transformed_name(key)] = tft.bucketize(inputs[key], _FEATURE_BUCKET_COUNT[key])

    # Scale these features to the range [0, 1]
    for key in _RANGE_FEATURE_KEYS:
        outputs[_transformed_name(key)] = tft.scale_to_0_1(inputs[key])
    
    # Convert strings to indices in a vocabulary
    for key in _VOCAB_FEATURE_KEYS:
        outputs[_transformed_name(key)] = tft.compute_and_apply_vocabulary(
            inputs[key], 
            top_k=(_VOCAB_SIZE),
            num_oov_buckets=(_OOV_SIZE))
    
    # Since the label has integer values, no need to convert
    outputs[_transformed_name(_STATUS_KEY)] = inputs[_STATUS_KEY]
    
    return outputs
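
The vocabulary step in `preprocessing_fn()` is the least obvious of the four, so here is a small pure-Python sketch of the general idea: keep the `top_k` most frequent terms, give each an index, and hash anything else into one of a fixed number of out-of-vocabulary buckets placed after the vocabulary. This is not TensorFlow Transform's implementation; the function names, the tie-breaking rule, the CRC32 hash, and the sample data are assumptions chosen to keep the example deterministic.

```python
import zlib
from collections import Counter


def build_vocabulary(values, top_k):
    """Map the top_k most frequent terms to indices 0..top_k-1."""
    counts = Counter(values)
    # Most frequent first; ties broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {term: index for index, (term, _) in enumerate(ordered[:top_k])}


def apply_vocabulary(value, vocabulary, num_oov_buckets):
    """Return the vocabulary index, or a hashed out-of-vocabulary bucket."""
    if value in vocabulary:
        return vocabulary[value]
    # CRC32 is used only because it is stable across runs.
    bucket = zlib.crc32(value.encode("utf-8")) % num_oov_buckets
    return len(vocabulary) + bucket


# Made-up training values for the FUEL feature.
fuel_train = ["gasoline", "gasoline", "gasoline",
              "kerosene", "kerosene", "lpg", "lpg", "thinner"]

vocabulary = build_vocabulary(fuel_train, top_k=15)
known_index = apply_vocabulary("lpg", vocabulary, num_oov_buckets=5)
unknown_index = apply_vocabulary("diesel", vocabulary, num_oov_buckets=5)

print(vocabulary)
print(known_index, unknown_index)
```

With four known terms and five out-of-vocabulary buckets, every input maps to an index between 0 and 8, so an unseen fuel type at serving time still produces a valid integer feature.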

You can now pass the examples, schema, and transform module to the `Transform` component. You can ignore the warning messages generated by Apache Beam regarding type hints.

In [None]:
# Instantiate the Transform component
transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(_aefire_transform_module_file))
    
# Run the component. Caching is disabled so that edits to the transform module file are picked up on rerun
context.run(transform, enable_cache=False)