<a id='top'></a><a name='top'></a>
# Preprocessing data with TensorFlow Transform (Intro)

[Source](https://www.tensorflow.org/tfx/tutorials/transform/simple)

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/gbih/ml-notes/blob/main/tf_transform/tft-01-simple.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

1. [Setup](#1.0)
2. [Introduction](#2.0)
3. [Data: Create some dummy data](#3.0)
4. [Transform: Create a preprocessing function](#4.0)
5. [Syntax](#5.0)
6. [Putting it all together](#6.0)
7. [Is this the right answer?](#7.0)
8. [Use the resulting transform_fn](#8.0)
9. [Export](#9.0)
    * [9.1 An example training model](#9.1)
    * [9.2 An example export wrapper](#9.2)

---
<a id='1.0'></a><a name='1.0'></a>
# 1. Setup
<a href="#top">[back to top]</a>

In [None]:
import sys

IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    print("Running on Colab. Installing TFT.")
    !pip install tfx &> /dev/null
    !apt-get install tree &> /dev/null
    !pip install protobuf==3.19.0
else:
    print("Running locally.")

In [48]:
import sys
import pathlib
import pprint
pp = pprint.PrettyPrinter(indent=2)
import tempfile

import tensorflow as tf
import tensorflow_transform as tft

# Module level imports for tensorflow_transform.beam.
# https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/__init__.py
# https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam
import tensorflow_transform.beam as tft_beam

# In-memory representation of all metadata associated with a dataset.
# https://www.tensorflow.org/tfx/transform/api_docs/python/tft/DatasetMetadata
# https://github.com/tensorflow/transform/blob/master/tensorflow_transform/tf_metadata/dataset_metadata.py
from tensorflow_transform.tf_metadata import dataset_metadata

# Utilities for using the tf.Metadata Schema within TensorFlow
# https://github.com/tensorflow/transform/blob/master/tensorflow_transform/tf_metadata/schema_utils.py
from tensorflow_transform.tf_metadata import schema_utils

import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.set_printoptions(precision=3, suppress=True)

# global seed
tf.random.set_seed(42)

tf.get_logger().propagate = False
tf.autograph.set_verbosity(0)
tf.get_logger().setLevel('ERROR') # DEBUG, INFO, WARN, ERROR, or FATAL

def HR():
    print("-"*40)
    
def dir_ex(obj):
    result = [x for x in dir(obj) if not x.startswith('_')]
    print(type(obj))
    print()
    for x in result:
        print(f'{x:<40}', end="")
        
print("Loaded libraries.")

Loaded libraries.


---
<a id='2.0'></a><a name='2.0'></a>
# 2. Introduction
<a href="#top">[back to top]</a>

**The Feature Engineering Component of TensorFlow Extended (TFX)**

This example notebook provides a very simple example of how TensorFlow Transform (`tf.Transform`) can be used to preprocess data using exactly the same code for both training a model and serving inferences in production.

TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset. For example, using `tf.Transform` you could:

* Normalize an input value by using the mean and standard deviation (`tft.scale_to_z_score()`)

* Convert strings to integers by generating a vocabulary over all of the input values (`tft.compute_and_apply_vocabulary()`)

* Convert floats to integers by assigning them to buckets, based on the observed data distribution (`tft.bucketize`)

TensorFlow has built-in support for manipulations on a single example or a batch of examples **(GB: Show examples of this)**. `tf.Transform` extends these capabilities to support full passes over the entire training dataset.

The output of `tf.Transform` is exported as a TensorFlow graph which you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.
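
To make the distinction between per-example operations and full-pass operations concrete, here is a minimal plain-Python sketch. It does not use TensorFlow or `tf.Transform`; the function names `per_example_op` and `full_pass_z_score` and the input values are assumptions chosen only for illustration.

```python
import statistics

# Hypothetical feature values, for illustration only.
values = [2.0, 4.0, 6.0]

def per_example_op(v):
    """Needs only the single value it is given."""
    return v * 2.0

def full_pass_z_score(dataset):
    """Needs the mean and standard deviation of the whole dataset first."""
    mean = statistics.fmean(dataset)
    std = statistics.pstdev(dataset)  # population standard deviation
    return [(v - mean) / std for v in dataset]

doubled = [per_example_op(v) for v in values]
z_scores = full_pass_z_score(values)
print(doubled)
print(z_scores)
```

The first function can be applied to any example in isolation. The second cannot produce a single output until every value has been seen, which is the kind of computation that requires a full pass over the training dataset.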

## Transform library for TFX and non-TFX users 

The `tft` module is the only module that is relevant to TFX users. The `tft_beam` module is relevant only when using Transform as a standalone library. Typically, a TFX user constructs a `preprocessing_fn`, and the rest of the Transform library calls are made by the [TFX Transform component](https://www.tensorflow.org/tfx/guide/transform).

---

<sub>


[Quick cheatsheet for TensorFlow Transform](https://www.tensorflow.org/tfx/transform/api_docs/python/tft)

    
### Classes:
    
    
```
tft.DatasetMetadata:  Metadata about a dataset used for the "instance dict" format.
tft.TFTransformOutput:  A wrapper around the output of the tf.Transform.
tft.TransformFeaturesLayer:  A Keras layer for applying a tf.Transform output to input layers.
```

### Transform Functions:


```
tft.apply_buckets():  Returns a bucketized column, with a bucket index assigned to each input.
tft.apply_buckets_with_interpolation():  Interpolates within the provided buckets and then normalizes to 0 to 1.
tft.apply_pyfunc():  Applies a python function to some Tensors.
tft.apply_vocabulary():  Maps x to a vocabulary specified by the deferred tensor.
tft.bag_of_words():  Computes a bag of "words" based on the specified ngram configuration.
tft.bucketize():  Returns a bucketized column, with a bucket index assigned to each input.
tft.bucketize_per_key():  Returns a bucketized column, with a bucket index assigned to each input.
tft.compute_and_apply_vocabulary():  Generates a vocabulary for x and maps it to an integer with this vocab.
tft.count_per_key():  Computes the count of each element of a Tensor.
tft.covariance():  Computes the covariance matrix over the whole dataset.
tft.deduplicate_tensor_per_row():  Deduplicates each row (0-th dimension) of the provided tensor.
tft.estimated_probability_density():  Computes an approximate probability density at each x, given the bins.
tft.get_analyze_input_columns():  Return columns that are required inputs of AnalyzeDataset.
tft.get_num_buckets_for_transformed_feature():  Provides the number of buckets for a transformed feature if annotated.
tft.get_transform_input_columns():  Return columns that are required inputs of TransformDataset.
tft.hash_strings():  Hash strings into buckets.
tft.histogram():  Computes a histogram over x, given the bin boundaries or bin count.
tft.make_and_track_object():  Keeps track of the object created by invoking trackable_factory_callable.
tft.max():  Computes the maximum of the values of a Tensor over the whole dataset.
tft.mean():  Computes the mean of the values of a Tensor over the whole dataset.
tft.min():  Computes the minimum of the values of a Tensor over the whole dataset.
tft.ngrams():  Create a SparseTensor of n-grams.
tft.pca():  Computes PCA on the dataset using biased covariance.
tft.quantiles():  Computes the quantile boundaries of a Tensor over the whole dataset.
tft.scale_by_min_max():  Scale a numerical column into the range [output_min, output_max].
tft.scale_by_min_max_per_key():  Scale a numerical column into a predefined range on a per-key basis.
tft.scale_to_0_1():  Returns a column which is the input column scaled to have range [0,1].
tft.scale_to_0_1_per_key():  Returns a column which is the input column scaled to have range [0,1].
tft.scale_to_gaussian():  Returns an (approximately) normal column with mean to 0 and variance 1.
tft.scale_to_z_score():  Returns a standardized column with mean 0 and variance 1.
tft.scale_to_z_score_per_key():  Returns a standardized column with mean 0 and variance 1, grouped per key.
tft.segment_indices():  Returns a Tensor of indices within each segment.
tft.size():  Computes the total size of instances in a Tensor over the whole dataset.
tft.sparse_tensor_left_align():  Re-arranges a tf.SparseTensor and returns a left-aligned version of it.
tft.sparse_tensor_to_dense_with_shape():  Converts a SparseTensor into a dense tensor and sets its shape.
tft.sum():  Computes the sum of the values of a Tensor over the whole dataset.
tft.tfidf():  Maps the terms in x to their term frequency * inverse document frequency.
tft.tukey_h_params():  Computes the h parameters of the values of a Tensor over the dataset.
tft.tukey_location():  Computes the location of the values of a Tensor over the whole dataset.
tft.tukey_scale():  Computes the scale of the values of a Tensor over the whole dataset.
tft.var():  Computes the variance of the values of a Tensor over the whole dataset.
tft.vocabulary():  Computes the unique values of a Tensor over the whole dataset.
tft.word_count():  Find the token count of each document/row.
```
    
</sub>

---
<a id='3.0'></a><a name='3.0'></a>
# 3. Data: Create some dummy data
<a href="#top">[back to top]</a>

Create some simple dummy data for our simple exercise:

* `raw_data` is the initial raw data that we're going to preprocess
* `raw_data_metadata` contains the schema that tells us the types of each of the columns in `raw_data`. It is very simple in this example.

---

**NOTES:**

https://www.tensorflow.org/tfx/transform/get_started

The TFT Beam implementation accepts two different input data formats. The "instance dict" format (as seen here) is intuitive and suitable for small datasets, while the TFXIO (Apache Arrow) format provides improved performance and is suitable for large datasets.

The "instance dict" format:

This notebook uses this format. The **metadata** contains the schema that defines the layout of the data and how it is read from and written to various formats. This in-memory format is not self-describing and requires a schema in order to be interpreted as tensors.

The Schema proto contains the information needed to parse the data, from on-disk or in-memory format, into tensors. It is typically constructed by calling `schema_utils.schema_from_feature_spec` (from `tensorflow_transform.tf_metadata`) with a dict mapping feature keys to `tf.io.FixedLenFeature`, `tf.io.VarLenFeature`, and `tf.io.SparseFeature` values (see the documentation for `tf.io.parse_example` for more details).

Below, we use `tf.io.FixedLenFeature` to indicate that each feature contains a fixed number of values (in this case a single scalar value). Because `tensorflow_transform` batches instances, the actual Tensor representing the feature will have shape `(None,)` where the unknown dimension is the batch dimension.
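
As a rough analogy for why a schema is needed, the plain-Python sketch below turns a list of row dicts into one typed column per feature. The names `toy_spec` and `batch_instances` are hypothetical, and the type mapping is not the real `tf.Transform` schema format; it only illustrates that the rows alone do not say whether `1` should be read as an integer or a float.

```python
# Toy stand-in for a feature spec: feature name -> Python type.
# This is NOT the tf.Transform schema format, only an analogy.
toy_spec = {'x': float, 'y': float, 's': bytes}

instances = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
]

def batch_instances(rows, spec):
    """Turn a list of row dicts into one typed column per feature."""
    columns = {}
    for name, py_type in spec.items():
        if py_type is bytes:
            columns[name] = [row[name].encode('utf-8') for row in rows]
        else:
            columns[name] = [py_type(row[name]) for row in rows]
    return columns

batch = batch_instances(instances, toy_spec)
print(batch)
```

Each column has one entry per row, which loosely mirrors the batch dimension described above.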

---




### tft.DatasetMetadata

`from tensorflow_transform.tf_metadata import dataset_metadata`

    tft.DatasetMetadata(
        schema: schema_pb2.Schema
    )

* This is an in-memory representation that may be serialized and deserialized to and from a variety of disk representations.
* Caution: The "instance dict" format used with `DatasetMetadata` is much less efficient than TFXIO. For any serious workloads you should use TFXIO with a `tfxio.TensorAdapterConfig` instance as the metadata.

https://www.tensorflow.org/tfx/transform/api_docs/python/tft/DatasetMetadata

---


In [49]:
raw_data = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
    {'x': 3, 'y': 3, 's': 'hello'},
]

# Simple "instance dict" format for TFT Beam input data formats

raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'y': tf.io.FixedLenFeature([], tf.float32),
        'x': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string)
    })
)

print(raw_data_metadata)

{'_schema': feature {
  name: "s"
  type: BYTES
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "x"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "y"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
}


---
<a id='4.0'></a><a name='4.0'></a>
# 4. Transform: Create a preprocessing function
<a href="#top">[back to top]</a>

The *preprocessing function* is the most important concept of `tf.Transform`. A preprocessing function is where the transformation of the dataset happens. It accepts and returns a dictionary of tensors, where a tensor means a `Tensor` or `SparseTensor`. There are two main groups of API calls that typically form the heart of a preprocessing function:

1. **TensorFlow Ops**: Any function that accepts and returns tensors, which usually means TensorFlow ops. These add TensorFlow operations to the graph that transforms raw data into transformed data, one feature at a time. These will run for every example, during both training and serving.

2. **TensorFlow Transform Analyzers/Mappers**: Any of the analyzers/mappers provided by `tf.Transform`. These also accept and return tensors, and typically contain a combination of TensorFlow ops and Beam computation. Unlike TensorFlow ops, the Beam computation runs only once, in the Beam pipeline during analysis (prior to training), and typically makes a full pass over the entire training dataset. Analyzers create `tf.constant` tensors, which are added to your graph. For example, `tft.min` computes the minimum of a tensor over the training dataset.

Caution: When you apply your preprocessing function to serving inferences, the constants that were created by analyzers during training do not change. If your data has trend or seasonality components, plan accordingly.

Note: `preprocessing_fn` is not directly callable. Instead, it must be passed to the Transform Beam API, as shown in the following cells.
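
The caution about frozen constants can be illustrated with a minimal plain-Python sketch. It does not use `tf.Transform`; the functions `analyze` and `transform` are hypothetical stand-ins for the analysis phase (one full pass that computes constants) and the per-example phase (which only reads those constants).

```python
def analyze(training_rows):
    """Full pass over the training data: compute constants once."""
    xs = [row['x'] for row in training_rows]
    ys = [row['y'] for row in training_rows]
    return {'x_mean': sum(xs) / len(xs), 'y_min': min(ys), 'y_max': max(ys)}

def transform(row, constants):
    """Per-example step: uses the frozen constants, never recomputes them."""
    y_range = constants['y_max'] - constants['y_min']
    return {
        'x_centered': row['x'] - constants['x_mean'],
        'y_normalized': (row['y'] - constants['y_min']) / y_range,
    }

training_rows = [{'x': 1.0, 'y': 1.0}, {'x': 2.0, 'y': 2.0}, {'x': 3.0, 'y': 3.0}]
constants = analyze(training_rows)

# A serving-time example whose y lies outside the range seen in training.
serving_row = {'x': 2.0, 'y': 5.0}
served = transform(serving_row, constants)
print(constants)
print(served)
```

In this sketch the serving value `y = 5.0` is scaled with the training minimum and maximum, so the result falls outside the range 0 to 1. This is the kind of effect to plan for when the data drifts after training.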

In [50]:
def preprocessing_fn(inputs) -> dict:
    """tf.Transform's callback function for preprocessing inputs.

    Preprocess input columns into transformed columns.

    Args:
        inputs: Map from feature keys to raw, not-yet-transformed features.

    Returns:
        Map from string feature key to transformed feature operations.
    """

    # Return a frame object from the call stack.
    print(f"---> preprocessing_fn CALLED FROM: {sys._getframe().f_back.f_code.co_name}")
        
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = (x_centered * y_normalized)
    
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        's_integerized': s_integerized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized
    }

---
<a id='5.0'></a><a name='5.0'></a>
# 5. Syntax
<a href="#top">[back to top]</a>

We are ready to put everything together and use Apache Beam to run it.

Apache Beam uses a special syntax to define and invoke transforms.
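
In Beam, a step is usually written in the form `result = pass_this | 'name this step' >> to_this_call`. The sketch below is a toy stand-in, not Beam itself: the class `ToyTransform` and the function `double_all` are hypothetical, and they only mimic how Python operator overloading lets an object respond to `|` and `>>`.

```python
class ToyTransform:
    """Toy stand-in that mimics how a transform can react to `|` and `>>`."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label or fn.__name__

    def __rrshift__(self, label):
        # Handles `'some label' >> transform`: attach a name to the step.
        return ToyTransform(self.fn, label)

    def __ror__(self, data):
        # Handles `data | transform`: apply the step to the data on the left.
        print(f"running step: {self.label}")
        return self.fn(data)

def double_all(values):
    return [v * 2 for v in values]

# `>>` binds tighter than `|`, so the label is attached before the pipe runs.
labeled = 'DoubleAll' >> ToyTransform(double_all)
result = [1, 2, 3] | 'DoubleAll' >> ToyTransform(double_all)
print(result)
```

Because the left-hand operands (a string and a list) do not define these operators themselves, Python falls back to the reflected methods on the right-hand object, which is what makes the pipe-style syntax possible.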

---
<a id='6.0'></a><a name='6.0'></a>
# 6. Putting it all together
<a href="#top">[back to top]</a>

We are ready to transform our data. We will use Apache Beam with a direct runner, and supply three inputs:

1. `raw_data`: The raw input data that we created above.
2. `raw_data_metadata`: The schema for the raw data.
3. `preprocessing_fn`: The function that we created to do our transformation.

---

We use three `tft_beam` classes here:

* [tft_beam.Context](https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam/Context): Context manager for tensorflow-transform. All the attributes in this context are kept on a thread local state. 

* [tft_beam.AnalyzeAndTransformDataset](https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam/AnalyzeAndTransformDataset): Combination of AnalyzeDataset and TransformDataset. This may be more efficient since it avoids multiple passes over the data.

* [tft_beam.WriteTransformFn](https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam/WriteTransformFn): Writes a TransformFn to disk. The internal structure is a directory containing two subdirectories. The first is 'transformed_metadata' and contains metadata of the transformed data. The second is 'transform_fn' and contains a SavedModel representing the transformed data.

In [51]:
def main(output_dir, DEBUG=False):
    
    # Context manager for tensorflow-transform.
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        
        # resulting dataset and function is returned as transformed_dataset, transform_fn
        transformed_dataset, transform_fn = (
            (raw_data, raw_data_metadata) 
            # Combination of AnalyzeDataset and TransformDataset, equivalent to:
            # transform_fn = AnalyzeDataset(preprocessing_fn).expand(dataset)
            # transformed = TransformDataset().expand((dataset, transform_fn))
            | tft_beam.AnalyzeAndTransformDataset(
                # A function that accepts and returns a dict from strings to `Tensor` or `SparseTensor`s.
                preprocessing_fn
            )
        )
    
        # For higher performance, could use tfxio (not executed here)
        """
        csv_tfxio = tfxio.BeamRecordCsvTFXIO(
            physical_format='text',
            column_names=CSV_COLUMNS,
            schema=_SCHEMA
        )
        raw_data = (
            pipeline
            | 'ReadTrainData' >> beam.io.ReadFromText(
                file_pattern=train_file_path,
                coder=beam.coders.BytesCoder(),
                skip_header_lines=1)
            | 'DecodeTrainData' >> csv_tfxio.BeamSource()
        )
        raw_dataset = (raw_data, csv_tfxio.TensorAdapterConfig())

        transformed_dataset, transform_fn = (
            raw_dataset | tft_beam.AnalyzeAndTransformDataset(
                preprocessing_fn, output_record_batches=True))
        """
    

    if DEBUG:
        HR()
        print(f"******* transformed_dataset:\n\n{transformed_dataset}")
        HR()
        HR()
        print(f"======= transform_fn:\n\n{transform_fn}")
        HR()
    
    transformed_data, transformed_metadata = transformed_dataset
    
    # Save the transform_fn to the output_dir
    _ = (
        transform_fn | 'WriteTransformFn' >> tft_beam.WriteTransformFn(output_dir)
    )
    
    # transformed_data is a list; transformed_metadata is a
    # tensorflow_transform.beam.tft_beam_io.beam_metadata_io.BeamDatasetMetadata.
    # transform_fn is not returned because it was saved to disk above.
    return transformed_data, transformed_metadata

In [52]:
from datetime import datetime

ts = datetime.timestamp(datetime.now())
output_dir = pathlib.Path(f"tft_01_simple/output_{int(ts)}")
print("** Start of calling function main()")
transformed_data, transformed_metadata = main(str(output_dir), DEBUG=False)

HR() 
print("** Exited function main()")
print(f"TYPE: transformed_data {type(transformed_data)}") # list
print(f"TYPE: transformed_metadata: {type(transformed_metadata)}") # tensorflow_transform.beam.tft_beam_io.beam_metadata_io.BeamDatasetMetadata

HR()

print(f"Raw data:\n{pp.pformat(raw_data)}")
HR()
print(f"Transformed data:\n{pp.pformat(transformed_data)}")

** Start of calling function main()
---> preprocessing_fn CALLED FROM: transform_fn
---> preprocessing_fn CALLED FROM: metadata_fn




---> preprocessing_fn CALLED FROM: transform_fn
---> preprocessing_fn CALLED FROM: transform_fn
---> preprocessing_fn CALLED FROM: metadata_fn




----------------------------------------
** Exited function main()
TYPE: transformed_data <class 'list'>
TYPE: transformed_metadata: <class 'tensorflow_transform.beam.tft_beam_io.beam_metadata_io.BeamDatasetMetadata'>
----------------------------------------
Raw data:
[ {'s': 'hello', 'x': 1, 'y': 1},
  {'s': 'world', 'x': 2, 'y': 2},
  {'s': 'hello', 'x': 3, 'y': 3}]
----------------------------------------
Transformed data:
[ { 's_integerized': 0,
    'x_centered': -1.0,
    'x_centered_times_y_normalized': -0.0,
    'y_normalized': 0.0},
  { 's_integerized': 1,
    'x_centered': 0.0,
    'x_centered_times_y_normalized': 0.0,
    'y_normalized': 0.5},
  { 's_integerized': 0,
    'x_centered': 1.0,
    'x_centered_times_y_normalized': 1.0,
    'y_normalized': 1.0}]


In [53]:
dir_ex(tft.beam.tft_beam_io.beam_metadata_io.BeamDatasetMetadata)

<class 'type'>

asset_map                               count                                   dataset_metadata                        deferred_metadata                       index                                   schema                                  

In [54]:
tft.beam.tft_beam_io.beam_metadata_io.BeamDatasetMetadata?

Init signature:
tft.beam.tft_beam_io.beam_metadata_io.BeamDatasetMetadata(
    dataset_metadata,
    deferred_metadata,
    asset_map,
)
Docstring:
A class like DatasetMetadata also holding `PCollection`s and an asset_map.

`deferred_metadata` is a PCollection containing a single DatasetMetadata.
`asset_map` is a Dictionary mapping asset keys to filenames.
File:           ~/Desktop/python-3.8.12/env/lib/python3.8/site-packages/tensorflow_transform/beam/tft_beam_io/beam_metadata_io.py
Type:           type
Subclasses:


---
<a id='7.0'></a><a name='7.0'></a>
# 7. Is this the right answer?
<a href="#top">[back to top]</a>
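
The expected values can be derived by hand from the three input rows. The plain-Python sketch below redoes the arithmetic without `tf.Transform`. It assumes that vocabulary ids are assigned in order of decreasing token frequency, which matches the transformed output shown earlier (`hello` appears twice and maps to 0).

```python
from collections import Counter

raw = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
    {'x': 3, 'y': 3, 's': 'hello'},
]
xs = [float(r['x']) for r in raw]
ys = [float(r['y']) for r in raw]

x_mean = sum(xs) / len(xs)  # (1 + 2 + 3) / 3 = 2.0
x_centered = [x - x_mean for x in xs]
y_normalized = [(y - min(ys)) / (max(ys) - min(ys)) for y in ys]
product = [a * b for a, b in zip(x_centered, y_normalized)]

# Assumption: ids follow decreasing frequency ('hello' x2, 'world' x1).
counts = Counter(r['s'] for r in raw)
vocab = {token: i for i, (token, _) in enumerate(counts.most_common())}
s_integerized = [vocab[r['s']] for r in raw]

print(x_centered, y_normalized, product, s_integerized)
```

Note that the first product is `-1.0 * 0.0`, which is `-0.0` in floating point. It compares equal to `0.0`, so the sign does not affect the assertions.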

In [55]:
keys_test = ['x_centered', 's_integerized', 'x_centered_times_y_normalized', 'y_normalized']

results = {
    'x_centered' :[-1.0, 0.0, 1.0],
    's_integerized':[0, 1, 0],
    'x_centered_times_y_normalized':[-0.0, 0.0, 1.0],
    'y_normalized':[0.0, 0.5, 1.0]
}

for x in keys_test:
    value = list(map(lambda key: key[x], transformed_data))
    #print(f"{x}: transformed: {value}, should be:{results[x]}")
    assert value == results[x]

In [56]:
# Alternative style of extracting data and running assert
assert [x['x_centered'] for x in transformed_data] == results['x_centered']
assert [x['s_integerized'] for x in transformed_data] == results['s_integerized']
assert [x['x_centered_times_y_normalized'] for x in transformed_data] == results['x_centered_times_y_normalized']
assert [x['y_normalized'] for x in transformed_data] == results['y_normalized']

---
<a id='8.0'></a><a name='8.0'></a>
# 8. Use the resulting transform_fn
<a href="#top">[back to top]</a>

In [57]:
!du -h {output_dir}/*

8.0K	tft_01_simple/output_1658933910/transform_fn/variables
4.0K	tft_01_simple/output_1658933910/transform_fn/assets
 40K	tft_01_simple/output_1658933910/transform_fn
8.0K	tft_01_simple/output_1658933910/transformed_metadata


---
The `transform_fn/` directory contains a `tf.saved_model` with all the constants computed by the `tf.Transform` analysis built into the graph.

It is possible to load this directly with `tf.saved_model.load`, but the result is awkward to use.

---
https://www.tensorflow.org/api_docs/python/tf/saved_model/load

Keras models are trackable, so they can be saved to SavedModel. 

However, the object returned by `tf.saved_model.load` is not a Keras object (i.e. it doesn't have `.fit`, `.predict`, etc. methods). A few attributes and functions are still available: `.variables`, `.trainable_variables` and `.__call__`.

At the minimum, `tf.saved_model.load` returns a trackable object with a signatures attribute mapping from signature keys to functions.

In [58]:
# Create a `trackable` object.
# This is a trackable object with a signatures attribute mapping from signature keys to functions
loaded = tf.saved_model.load(str(output_dir/'transform_fn'))

dir_ex(loaded)

<class 'tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject'>

assets                                  created_variables                       graph_debug_info                        initializers                            resources                               signatures                              tensorflow_git_version                  tensorflow_version                      trackable_objects                       transform_fn                            

In [59]:
# Check the signature map. We use the signature key 'serving_default'
loaded.signatures

_SignatureMap({'serving_default': <ConcreteFunction signature_wrapper(*, inputs_1, inputs, inputs_2) at 0x159C34640>})

In [60]:
# Create a wrapped concrete function
wrapped_concrete_fn = loaded.signatures['serving_default']
dir_ex(wrapped_concrete_fn)

<class 'tensorflow.python.saved_model.load._WrapperFunction'>

add_gradient_functions_to_graph         add_to_graph                            captured_inputs                         function_def                            graph                                   inputs                                  name                                    output_dtypes                           output_shapes                           outputs                                 pretty_printed_signature                replace_capture_with_deferred_capture   set_external_captures                   structured_input_signature              structured_outputs                      trainable_variables                     variables                               

In [61]:
inputs = tf.constant("This is string")
inputs_1 = tf.constant([1.1, 2.2, 3.3], dtype=tf.float32)
inputs_2 = tf.constant(1.3, dtype=tf.float32)

wrapped_concrete_fn(inputs=inputs, inputs_1=inputs_1, inputs_2=inputs_2)

# Returns tensors in `self.graph` corresponding to returned tensors.
print("***** wrapped_concrete_fn.inputs:")
pp.pprint(wrapped_concrete_fn.inputs)
HR()

print("***** wrapped_concrete_fn.outputs:")
pp.pprint(wrapped_concrete_fn.outputs)

***** wrapped_concrete_fn.inputs:
[ <tf.Tensor 'inputs:0' shape=(None,) dtype=string>,
  <tf.Tensor 'inputs_1:0' shape=(None,) dtype=float32>,
  <tf.Tensor 'inputs_2:0' shape=(None,) dtype=float32>,
  <tf.Tensor 'unknown:0' shape=() dtype=float32>,
  <tf.Tensor 'unknown_0:0' shape=() dtype=float32>,
  <tf.Tensor 'unknown_1:0' shape=() dtype=float32>,
  <tf.Tensor 'unknown_2:0' shape=() dtype=float32>,
  <tf.Tensor 'unknown_3:0' shape=() dtype=int64>,
  <tf.Tensor 'unknown_4:0' shape=() dtype=int64>,
  <tf.Tensor 'unknown_5:0' shape=() dtype=resource>,
  <tf.Tensor 'unknown_6:0' shape=() dtype=int64>,
  <tf.Tensor 'unknown_7:0' shape=() dtype=int64>]
----------------------------------------
***** wrapped_concrete_fn.outputs:
[ <tf.Tensor 'Identity:0' shape=<unknown> dtype=int64>,
  <tf.Tensor 'Identity_1:0' shape=(None,) dtype=float32>,
  <tf.Tensor 'Identity_2:0' shape=(None,) dtype=float32>,
  <tf.Tensor 'Identity_3:0' shape=(None,) dtype=float32>]


---
A better approach is to load transform_fn using `tft.TFTransformOutput`.

The `TFTransformOutput.transform_features_layer` method returns a `tft.TransformFeaturesLayer` object that can be used to apply the transformation.

In [62]:
# Signature: tft.TFTransformOutput.transform_features_layer(self) -> keras.engine.training.Model
# Creates a `TransformFeaturesLayer` from this transform output.

tft_layer: tft.output_wrapper.TransformFeaturesLayer = tft.TFTransformOutput(output_dir).transform_features_layer()

In [63]:
dir_ex(tft_layer)

<class 'tensorflow_transform.output_wrapper.TransformFeaturesLayer'>

activity_regularizer                    add_loss                                add_metric                              add_update                              add_variable                            add_weight                              apply                                   build                                   built                                   call                                    compile                                 compiled_loss                           compiled_metrics                        compute_dtype                           compute_loss                            compute_mask                            compute_metrics                         compute_output_shape                    compute_output_signature                count_params                            distribute_strategy                     dtype                                   dtype_policy                            dynamic   

---
This `tft.TransformFeaturesLayer` expects a dictionary of batched features. 

Accordingly, we create a `Dict[str, tf.Tensor]` from the `List[Dict[str, Any]]` in `raw_data`.

In [64]:
# Check what raw_data contains
raw_data

[{'x': 1, 'y': 1, 's': 'hello'},
 {'x': 2, 'y': 2, 's': 'world'},
 {'x': 3, 'y': 3, 's': 'hello'}]

In [65]:
# Extract individual columns via list-comprehension
raw_data_batch = {
    's': tf.constant([ex['s'] for ex in raw_data]),
    'x': tf.constant([ex['x'] for ex in raw_data], dtype=tf.float32),
    'y': tf.constant([ex['y'] for ex in raw_data], dtype=tf.float32),
}

raw_data_batch

{'s': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'hello', b'world', b'hello'], dtype=object)>,
 'x': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([1., 2., 3.], dtype=float32)>,
 'y': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([1., 2., 3.], dtype=float32)>}

---
You can use `tft.TransformFeaturesLayer` on its own.

In [66]:
transformed_batch = tft_layer(raw_data_batch)

# Examine the structure of transformed_batch
transformed_batch

{'x_centered': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([-1.,  0.,  1.], dtype=float32)>,
 'y_normalized': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0. , 0.5, 1. ], dtype=float32)>,
 's_integerized': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 0])>,
 'x_centered_times_y_normalized': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([-0.,  0.,  1.], dtype=float32)>}

In [67]:
# Since transformed_batch is a dict, use .items() to iterate over it
{key: value.numpy() for key, value in transformed_batch.items()}

{'x_centered': array([-1.,  0.,  1.], dtype=float32),
 'y_normalized': array([0. , 0.5, 1. ], dtype=float32),
 's_integerized': array([0, 1, 0]),
 'x_centered_times_y_normalized': array([-0.,  0.,  1.], dtype=float32)}
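
As a sanity check, the values above can be reproduced by hand. The sketch below is plain Python, not `tf.Transform`. It assumes the preprocessing function defined earlier centers `x` on its mean, scales `y` to the range [0, 1], and assigns vocabulary indices to `s` by descending frequency. All variable names are illustrative.

```python
from collections import Counter

# Columns taken from raw_data above.
x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0]
s = ['hello', 'world', 'hello']

# Assumed: x_centered subtracts the column mean.
x_mean = sum(x) / len(x)
x_centered = [v - x_mean for v in x]

# Assumed: y_normalized is min-max scaling to [0, 1].
y_min, y_max = min(y), max(y)
y_normalized = [(v - y_min) / (y_max - y_min) for v in y]

# Assumed: the most frequent string gets index 0, the next gets 1, and so on.
vocab = [token for token, _ in Counter(s).most_common()]
s_integerized = [vocab.index(token) for token in s]

# Element-wise product of the two derived columns.
x_centered_times_y_normalized = [a * b for a, b in zip(x_centered, y_normalized)]

print(x_centered)
print(y_normalized)
print(s_integerized)
print(x_centered_times_y_normalized)
```

The printed lists agree with the `transformed_batch` values shown above, including the `-0.` that results from multiplying `-1.0` by `0.0`.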

---
<a id='9.0'></a><a name='9.0'></a>
# 9. Export
<a href="#top">[back to top]</a>

A more typical use case would use `tf.Transform` to apply the transformation to the training and evaluation datasets (see the [**Census tutorial**](https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/transform/census.ipynb) for an example). 

After training, before exporting the model, attach the `tft.TransformFeaturesLayer` as the first layer so that you can export it as part of your `tf.saved_model`.

<a id='9.1'></a><a name='9.1'></a>
## 9.1 An example training model
<a href="#top">[back to top]</a>

Below is a model that:

1. Takes the transformed batch.
2. Stacks the batch together into a simple `(batch, features)` matrix.
3. Runs it through a few dense layers.
4. Produces 10 linear outputs.

In a real use case, you would apply a one-hot encoding to the `s_integerized` feature.

You could train this model on a dataset transformed by `tf.Transform`.
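
To make the one-hot encoding mentioned above concrete, here is a minimal, framework-free sketch of what it does to the `s_integerized` column. The helper `one_hot` is hypothetical, and the vocabulary size of 2 is an assumption based on the two distinct strings in `raw_data`. In a TensorFlow model, `tf.one_hot` plays this role.

```python
def one_hot(indices, depth):
    """Return one row per index, with 1.0 at that index and 0.0 elsewhere."""
    rows = []
    for index in indices:
        if not 0 <= index < depth:
            raise ValueError(f"index {index} is outside [0, {depth})")
        rows.append([1.0 if position == index else 0.0 for position in range(depth)])
    return rows


s_integerized = [0, 1, 0]  # values of the s_integerized feature shown earlier
vocab_size = 2             # assumed: 'hello' and 'world'

encoded = one_hot(s_integerized, vocab_size)
print(encoded)
```

Each integer becomes its own row, so the model no longer sees the vocabulary index as an ordered numeric quantity.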

# GB: Try to redo this using only functions (eg create a model_builder function)

---

### Notes on tf.keras.layers.Layer 

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer

A layer is a callable object that takes one or more tensors as input and outputs one or more tensors. It involves computation, defined in the `call()` method, and state (weight variables). State can be created in various places, at the convenience of the subclass implementer:

* in `__init__()`

* in the optional `build()` method, which is invoked by the first `__call__()` to the layer, and supplies the shape(s) of the input(s), which may not have been known at initialization time

* in the first invocation of `call()`, with some caveats

Users will just instantiate a layer and then treat it as a callable.
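
The relationship between `__call__()`, `build()`, and `call()` described above can be sketched in plain Python. This is a simplified illustration of the pattern, not the Keras implementation. `MiniLayer` and `AddBias` are hypothetical names.

```python
class MiniLayer:
    """Simplified stand-in for a layer base class (not the Keras implementation)."""

    def __init__(self):
        self.built = False

    def build(self, inputs):
        """Optional hook for creating state once the first input is known."""

    def call(self, inputs):
        raise NotImplementedError

    def __call__(self, inputs):
        # Create state lazily on the first call, then run the layer's logic.
        if not self.built:
            self.build(inputs)
            self.built = True
        return self.call(inputs)


class AddBias(MiniLayer):
    def build(self, inputs):
        # State whose size depends on the input, so it is created in build().
        self.bias = [0.5] * len(inputs)

    def call(self, inputs):
        return [value + b for value, b in zip(inputs, self.bias)]


layer = AddBias()          # instantiate
print(layer([1.0, 2.0]))   # treat as a callable: __call__ -> build() -> call()
```

The subclass only overrides `build()` and `call()`. The inherited `__call__()` decides when each one runs.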


In [68]:
# StackDict subclasses tf.keras.layers.Layer, the base class of all Keras layers.
# Because StackDict does not define __init__, Python uses the parent's __init__,
# so all of the parent's methods and attributes are inherited as-is.

# The layer's logic goes in call(). The inherited __call__ invokes call(),
# which is what lets us use an instance like a function. We are not creating
# any new attributes or customizing inherited ones, hence no __init__ here.

# If we did define __init__, we would need to call super().__init__() so that
# the attributes set up in the parent's __init__ still exist.

class StackDict(tf.keras.layers.Layer):
    
    # This is where the layer's logic lives.
    # The call() method may not create state (except in its first invocation, 
    # wrapping the creation of variables or other resources in tf.init_scope()). 
    # It is recommended to create state in __init__(), or the build() method that 
    # is called automatically before call() executes the first time.
    # Returns a tensor or list/tuple of tensors. 
    # https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer#call
    # https://github.com/keras-team/keras/blob/v2.9.0/keras/engine/base_layer.py#L485-L528
    def call(self, inputs):
        values = [
            tf.cast(v, tf.float32)
            for k,v in sorted(inputs.items(), key=lambda kv: kv[0])
        ]
        return tf.stack(values, axis=1)
    
    
# Test
stack_dict_method = StackDict()
print(stack_dict_method(transformed_batch))

tf.Tensor(
[[ 0.  -1.  -0.   0. ]
 [ 1.   0.   0.   0.5]
 [ 0.   1.   1.   1. ]], shape=(3, 4), dtype=float32)
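
The column order in this matrix follows from sorting the feature names alphabetically. A plain-Python sketch of the same stacking logic, using the values of `transformed_batch` from above (the helper `stack_columns` is hypothetical):

```python
transformed = {
    'x_centered': [-1.0, 0.0, 1.0],
    'y_normalized': [0.0, 0.5, 1.0],
    's_integerized': [0, 1, 0],
    'x_centered_times_y_normalized': [-0.0, 0.0, 1.0],
}


def stack_columns(features):
    """Sort columns by key, cast to float, and stack them along axis 1."""
    keys = sorted(features)
    columns = [[float(v) for v in features[k]] for k in keys]
    # zip(*columns) turns per-feature columns into per-example rows.
    return keys, [list(row) for row in zip(*columns)]


keys, rows = stack_columns(transformed)
print(keys)
for row in rows:
    print(row)
```

The sorted keys are `s_integerized`, `x_centered`, `x_centered_times_y_normalized`, `y_normalized`, which is the left-to-right column order of the tensor printed above.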


In [69]:
print(dir_ex(stack_dict_method))

<class '__main__.StackDict'>

activity_regularizer                    add_loss                                add_metric                              add_update                              add_variable                            add_weight                              apply                                   build                                   built                                   call                                    compute_dtype                           compute_mask                            compute_output_shape                    compute_output_signature                count_params                            dtype                                   dtype_policy                            dynamic                                 finalize_state                          from_config                             get_config                              get_input_at                            get_input_mask_at                       get_input_shape_at                      get_losses

## Notes (tf.keras.Model):

There are two ways to instantiate a `tf.keras.Model`:

1. With the "Functional API". This is not used here.

2. By subclassing the `Model` class (used here). In this case, you should define your
layers in `__init__()` and you should implement the model's forward pass
in `call()`.

The `call()` method should not be called directly. It is only meant to be
overridden when subclassing `tf.keras.Model`.
To call a model on an input, always use the `__call__()` method,
i.e. `model(inputs)`, which relies on the underlying `call()` method.


In [70]:
class TrainedModel(tf.keras.Model):
    
    def __init__(self):
        # super().__init__() runs the parent tf.keras.Model initializer, which
        # sets up the state Keras needs before sublayers are assigned.
        super().__init__()
        self.concat = StackDict()
        self.body = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(10),
        ])
    
    # GB: This is NOT the Python __call__ mechanism.
    def call(self, inputs, training=None):
    #def __call__(self, inputs, training=None):
        x = self.concat(inputs)
        print("===== class TrainedModel, function call() =====\n")
        print("THIS IS input x:\n", x)
        print("THIS IS input training:\n", training)
        HR()
        return self.body(x, training)

In [71]:
# Instantiating TrainedModel runs only __init__, not call().
# call() runs later, when the instance is invoked as trained_model(inputs).
trained_model = TrainedModel()

dir_ex(trained_model)

<class '__main__.TrainedModel'>

activity_regularizer                    add_loss                                add_metric                              add_update                              add_variable                            add_weight                              apply                                   body                                    build                                   built                                   call                                    compile                                 compiled_loss                           compiled_metrics                        compute_dtype                           compute_loss                            compute_mask                            compute_metrics                         compute_output_shape                    compute_output_signature                concat                                  count_params                            distribute_strategy                     dtype                                   dtype_p

In [72]:
# Need to call the model on a batch of data first:
try:
    trained_model.summary()
except Exception as e:
    print(f"Error: {e}")

Error: This model has not yet been built. Build the model first by calling `build()` or by calling the model on a batch of data.


---
Imagine we trained the model:

```
trained_model.compile(loss=..., optimizer='adam')
trained_model.fit(...)
```

---
This model runs on the transformed inputs.

In [73]:
# Invoke trained_model like a function; the inherited __call__ wraps our call() method.
trained_model_output = trained_model(transformed_batch)

===== class TrainedModel, function call() =====

THIS IS input x:
 tf.Tensor(
[[ 0.  -1.  -0.   0. ]
 [ 1.   0.   0.   0.5]
 [ 0.   1.   1.   1. ]], shape=(3, 4), dtype=float32)
THIS IS input training:
 None
----------------------------------------


In [74]:
trained_model_output.shape

TensorShape([3, 10])

In [75]:
# Model was called on a batch of data, so we can now run model.summary()
trained_model.summary()

Model: "trained_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 stack_dict_4 (StackDict)    multiple                  0         
                                                                 
 sequential_1 (Sequential)   (3, 10)                   5130      
                                                                 
Total params: 5,130
Trainable params: 5,130
Non-trainable params: 0
_________________________________________________________________
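
The parameter count in the summary can be checked by hand. A dense layer with a bias has `inputs * units + units` parameters, and the stacked input has 4 features. The helper `dense_params` below is hypothetical and only does this arithmetic.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


# (input width, units) for each Dense layer in the Sequential body.
layer_shapes = [(4, 64), (64, 64), (64, 10)]

per_layer = [dense_params(i, u) for i, u in layer_shapes]
total = sum(per_layer)
print(per_layer, total)
```

The three layers contribute 320, 4,160, and 650 parameters, which sum to the 5,130 reported above. `StackDict` has no weights, so it contributes 0.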


<a id='9.2'></a><a name='9.2'></a>
## 9.2 An example export wrapper
<a href="#top">[back to top]</a>

Imagine you've trained the above model and want to export it.

You'll want to include the transform function in the exported model.

In [76]:
# ExportModel subclasses tf.Module.
class ExportModel(tf.Module):
    # Store the trained model and the input transform as attributes.
    def __init__(self, trained_model, input_transform):
        super().__init__()
        self.trained_model = trained_model
        self.input_transform = input_transform
        print("INSIDE __init__")
        print(f"self.trained_model: {self.trained_model}")
        print(f"self.input_transform: {self.input_transform}")
        HR()

    # Defining __call__ makes instances callable like a function.
    # Note: print() inside a tf.function runs only while tracing, not on every call.
    @tf.function
    def __call__(self, inputs, training=None):
        x = self.input_transform(inputs)
        print("*** ExportModel call(),  inputs:\n")
        pp.pprint(inputs)
        HR()
        return self.trained_model(x)

In [77]:
# Call __init__ method of ExportModel here (not the __call__ method)
export_model = ExportModel(
    trained_model = trained_model,
    input_transform = tft_layer
)

INSIDE __init__
self.trained_model: <__main__.TrainedModel object at 0x159bbac10>
self.input_transform: <tensorflow_transform.output_wrapper.TransformFeaturesLayer object at 0x158871c40>
----------------------------------------


In [78]:
dir_ex(export_model)

<class '__main__.ExportModel'>

input_transform                         name                                    name_scope                              non_trainable_variables                 submodules                              trainable_variables                     trained_model                           variables                               with_name_scope                         

---
This combined model works on the raw data and produces exactly the same results as calling the trained model directly on the transformed batch:

In [79]:
# Access ExportModel's __call__ by invoking the instance like a function.
export_model_output = export_model(raw_data_batch)
export_model_output

*** ExportModel call(),  inputs:

{ 's': <tf.Tensor 'inputs:0' shape=(3,) dtype=string>,
  'x': <tf.Tensor 'inputs_1:0' shape=(3,) dtype=float32>,
  'y': <tf.Tensor 'inputs_2:0' shape=(3,) dtype=float32>}
----------------------------------------
===== class TrainedModel, function call() =====

THIS IS input x:
 Tensor("trained_model_1/stack_dict_4/stack:0", shape=(3, 4), dtype=float32)
THIS IS input training:
 None
----------------------------------------


<tf.Tensor: shape=(3, 10), dtype=float32, numpy=
array([[ 0.023,  0.099, -0.094,  0.124, -0.126,  0.105, -0.083, -0.26 ,
        -0.058, -0.018],
       [ 0.085, -0.051, -0.146,  0.19 , -0.227,  0.071, -0.071, -0.191,
        -0.108, -0.178],
       [ 0.15 ,  0.011, -0.01 ,  0.059, -0.272, -0.092, -0.172, -0.096,
        -0.073, -0.298]], dtype=float32)>

In [80]:
export_model_output.shape

TensorShape([3, 10])

In [81]:
tf.reduce_max(abs(export_model_output - trained_model_output)).numpy()

0.0

---
This `export_model` includes the `tft.TransformFeaturesLayer` and is entirely self-contained. You can save it and restore it in another environment and still get exactly the same result:

In [82]:
from datetime import datetime

# Use the current Unix timestamp to give each export its own directory.
ts = datetime.now().timestamp()
model_dir = str(pathlib.Path(f"tft_01_model/output_{int(ts)}"))

tf.saved_model.save(export_model, model_dir)

*** ExportModel call(),  inputs:

{ 's': <tf.Tensor 'inputs/s:0' shape=(3,) dtype=string>,
  'x': <tf.Tensor 'inputs/x:0' shape=(3,) dtype=float32>,
  'y': <tf.Tensor 'inputs/y:0' shape=(3,) dtype=float32>}
----------------------------------------
===== class TrainedModel, function call() =====

THIS IS input x:
 Tensor("trained_model_1/stack_dict_4/stack:0", shape=(3, 4), dtype=float32)
THIS IS input training:
 None
----------------------------------------
===== class TrainedModel, function call() =====

THIS IS input x:
 Tensor("trained_model_1/stack_dict_4/stack:0", shape=(None, 4), dtype=float32)
THIS IS input training:
 False
----------------------------------------
===== class TrainedModel, function call() =====

THIS IS input x:
 Tensor("stack_dict_4/PartitionedCall:0", shape=(None, 4), dtype=float32)
THIS IS input training:
 False
----------------------------------------
===== class TrainedModel, function call() =====

THIS IS input x:
 Tensor("stack_dict_4/PartitionedCall:0", 

In [83]:
reloaded = tf.saved_model.load(model_dir)
reloaded

<tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject at 0x1599e7610>

In [84]:
reloaded_model_output = reloaded(raw_data_batch)
reloaded_model_output

<tf.Tensor: shape=(3, 10), dtype=float32, numpy=
array([[ 0.023,  0.099, -0.094,  0.124, -0.126,  0.105, -0.083, -0.26 ,
        -0.058, -0.018],
       [ 0.085, -0.051, -0.146,  0.19 , -0.227,  0.071, -0.071, -0.191,
        -0.108, -0.178],
       [ 0.15 ,  0.011, -0.01 ,  0.059, -0.272, -0.092, -0.172, -0.096,
        -0.073, -0.298]], dtype=float32)>

In [85]:
reloaded_model_output.shape

TensorShape([3, 10])

In [86]:
tf.reduce_max(abs(export_model_output - reloaded_model_output)).numpy()

0.0