Alexander S. Lundervold, 20.04.22

# Introduction

The **Transform** component will grab the artifacts produced by our `ExampleGen` (our examples) and `SchemaGen` (the data schema). It will produce two artifacts: a computational graph (a TensorFlow graph) containing the preprocessing steps and the transformed examples stored as TFRecords (together with their statistics). 

Note that when using Transform, the preprocessing steps become part of the TensorFlow graph. When the TensorFlow graph is deployed, all the preprocessing steps will be performed on the server, making it easier (and less error-prone) to construct client-side setups and avoiding pitfalls when going from training to serving models. 
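
The underlying idea can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the TensorFlow Transform API: statistics are computed once over the training data and frozen into a function that is then reused unchanged at serving time. The names `make_z_scaler` and `scale` are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Callable, Sequence


def make_z_scaler(train_values: Sequence[float]) -> Callable[[float], float]:
    """Compute mean/std once on the training data and freeze them in a closure."""
    mean = fmean(train_values)
    std = pstdev(train_values)  # population standard deviation

    def scale(value: float) -> float:
        # The same frozen statistics are used at training and at serving time.
        return (value - mean) / std if std > 0 else 0.0

    return scale


train_ages = [2.0, 4.0, 6.0, 8.0]
scale_age = make_z_scaler(train_ages)

train_scaled = [scale_age(a) for a in train_ages]  # training time
served = scale_age(5.0)                            # serving time, one raw value
```

Because the serving path calls the very same function, the client only has to send raw values and cannot accidentally apply different statistics.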

The computations in TensorFlow Transform are implemented using [Apache Beam](https://beam.apache.org/), a high-performance, data-parallel processing framework.
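
A data-parallel framework fits this task because many of the statistics needed for preprocessing (a mean, for instance) can be computed from per-shard partial results that are merged afterwards. Here is a minimal pure-Python sketch of that pattern. It does not use the Beam API, and `partial_stats` and `merge_stats` are hypothetical names.

```python
from functools import reduce
from typing import List, Tuple

Partial = Tuple[float, int]  # (sum, count)


def partial_stats(shard: List[float]) -> Partial:
    """Summarize one shard independently (this step can run in parallel)."""
    return (sum(shard), len(shard))


def merge_stats(a: Partial, b: Partial) -> Partial:
    """Merge two partial results; the operation is associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])


shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
total, count = reduce(merge_stats, map(partial_stats, shards))
mean = total / count
```

Since merging is associative, the shards can be processed in any order and on any number of workers.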

See also the TFX guide about the Transform component: https://www.tensorflow.org/tfx/guide/transform.

This is what our pipeline will look like at the end of the notebook:

<img width=60% src="assets/pipeline_3.png">

# Setup

Import basic libraries:

In [None]:
%matplotlib inline
import os
from pathlib import Path

Check whether we're running on Colab:

In [None]:
try:
    import google.colab  # only importable inside a Colab runtime
    colab = True
except ImportError:
    colab = False

Set up data directories:

In [None]:
if colab:
    from google.colab import drive
    drive.mount('./gdrive')
    DATA = Path('./gdrive/MyDrive/ColabData/petfinder-mini/csv')
else:
    NB_DIR = Path.cwd()
    DATA = NB_DIR/'..'/'data'/'petfinder-mini'/'csv'
    
SPLIT_DATA = DATA/'..'/'split_csv'

Install TFX and import components:

In [None]:
if colab:
    !pip install -U tfx

> If on Colab, restart the runtime after running the above cell

In [None]:
import tensorflow as tf
import tfx

In [None]:
from tfx.components import CsvExampleGen
from tfx.components import StatisticsGen
from tfx.components import SchemaGen
from tfx.components import ExampleValidator
from tfx.components import Transform

Set up the interactive context for running TFX components:

In [None]:
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

In [None]:
context = InteractiveContext()

# Recreate the previous pipeline

In [None]:
# Generate examples
example_gen = CsvExampleGen(input_base=str(DATA)+'/')

# Generate statistics
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

# Automatic data schema (in a more realistic setting we would have 
# used a manually modified schema)
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])

# Validate examples
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])

## Execute the components

In [None]:
context.run(example_gen)

In [None]:
context.run(statistics_gen)

In [None]:
context.run(schema_gen)

In [None]:
context.run(example_validator)

In [None]:
context.show(schema_gen.outputs['schema'])

In [None]:
context.show(example_validator.outputs['anomalies'])

Now we've recreated the pipeline from the previous notebook:

<img src='assets/pipeline_2.png'>

# The Transform component

The data preprocessing will be done using the Transform component (which is based on the standalone library [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started)).

In [None]:
from tfx.components import Transform

In [None]:
#?Transform

To perform our preprocessing, we need to take a closer look at the dataset we're using. We have to preprocess the numerical, categorical, ordinal, text and other features in data type-specific ways. This typically requires some manual work, as we have to make a number of decisions about how to represent each feature. (However, it's possible to at least partially automate some of this. Some of you have, for example, seen the PyCaret library and its `setup` function, which figures out workable preprocessing steps: https://pycaret.gitbook.io/docs/get-started/functions/initialize#setting-up-environment.)

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(DATA/'petfinder-mini.csv')

In [None]:
df.head()

In [None]:
df.info()

Here are some descriptions gathered from looking at the data, reading the description on Kaggle ([PetFinder.my Adoption Prediction](https://www.kaggle.com/c/petfinder-adoption-prediction/data)), and consulting the TensorFlow Datasets catalog (https://www.tensorflow.org/datasets/catalog/pet_finder).

* `Type` - Type of animal (Cat, Dog)
* `Age` - Age of pet when listed, in months
* `Breed1` - Primary breed of pet (Refer to BreedLabels dictionary)
* `Gender` - Gender of pet (Male, Female)
* `Color1` - Color 1 of pet (Black, Brown, Cream, Gray, Golden, White, Yellow)
* `Color2` - Color 2 of pet (White, Brown, No Color, Gray, Cream, Golden, Yellow)
* `MaturitySize` - Size at maturity (Small, Medium, Large)
* `FurLength` - Fur length (Short, Medium, Long)
* `Vaccinated` - Pet has been vaccinated (Yes, No, Not Sure)
* `Sterilized` - Pet has been spayed / neutered (Yes, No, Not Sure)
* `Health` - Health Condition (Healthy, Minor Injury, Serious Injury, Not Specified)
* `Fee` - Adoption fee (0 = Free)
* `Description` - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
* `PhotoAmt` - The total number of uploaded photos for each pet (numerical)
* `AdoptionSpeed` - Categorical speed of adoption. Lower is faster. This is the value to predict.
    * 0 - Pet was adopted on the same day as it was listed
    * 1 - Pet was adopted between 1 and 7 days (1st week) after being listed
    * 2 - Pet was adopted between 8 and 30 days (1st month) after being listed
    * 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed
    * 4 - No adoption after 100 days of being listed

| Column          | Pet description               | Feature type   | Data type |
| --------------- | ----------------------------- | -------------- | --------- |
| `Type`          | Type of animal (`Dog`, `Cat`) | Categorical    | String    |
| `Age`           | Age                           | Numerical      | Integer   |
| `Breed1`        | Primary breed                 | Categorical    | String    |
| `Gender`        | Gender (`Male`, `Female`)     | Categorical    | String    |
| `Color1`        | Color 1                       | Categorical    | String    |
| `Color2`        | Color 2                       | Categorical    | String    |
| `MaturitySize`  | Size at maturity              | Categorical    | String    |
| `FurLength`     | Fur length                    | Categorical    | String    |
| `Vaccinated`    | Pet has been vaccinated       | Categorical    | String    |
| `Sterilized`    | Pet has been sterilized       | Categorical    | String    |
| `Health`        | Health condition              | Categorical    | String    |
| `Fee`           | Adoption fee                  | Numerical      | Integer   |
| `Description`   | Profile write-up              | Text           | String    |
| `PhotoAmt`      | Total uploaded photos         | Numerical      | Integer   |
| `AdoptionSpeed` | Categorical speed of adoption | Classification | Integer   |
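
The feature types described above can also be written down as a plain mapping, from which per-type column lists are derived. This is only a sketch: the names `FEATURE_TYPES` and `columns_of_type` are hypothetical and not part of TFX.

```python
from typing import Dict, List

# Column name -> feature type, following the column descriptions above.
FEATURE_TYPES: Dict[str, str] = {
    "Type": "categorical",
    "Age": "numerical",
    "Breed1": "categorical",
    "Gender": "categorical",
    "Color1": "categorical",
    "Color2": "categorical",
    "MaturitySize": "categorical",
    "FurLength": "categorical",
    "Vaccinated": "categorical",
    "Sterilized": "categorical",
    "Health": "categorical",
    "Fee": "numerical",
    "Description": "text",
    "PhotoAmt": "numerical",
    "AdoptionSpeed": "label",
}


def columns_of_type(feature_type: str) -> List[str]:
    """Return all columns with the given feature type, in declaration order."""
    return [col for col, kind in FEATURE_TYPES.items() if kind == feature_type]


categorical = columns_of_type("categorical")
numerical = columns_of_type("numerical")
text = columns_of_type("text")
```

Grouping columns this way makes it straightforward to apply one preprocessing strategy per feature type.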

## Define a preprocessing function

The preprocessing steps we want our Transform component to perform will have to be specified in a preprocessing function `preprocessing_fn`. 

As you'll see, we need to point the Transform component to a .py module that defines the preprocessing steps. We construct such a module below by saving a code cell to a file. 
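
As a rough illustration of the mechanism (a sketch only, not how TFX itself loads the module), the following writes a tiny module to disk and imports a `preprocessing_fn` from it by file path. The file name `toy_transforms.py` and its contents are made up for this example.

```python
import importlib.util
import tempfile
from pathlib import Path

module_source = '''
def preprocessing_fn(inputs):
    """Toy stand-in: append "_xf" to every feature name."""
    return {key + "_xf": value for key, value in inputs.items()}
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = Path(tmp_dir) / "toy_transforms.py"
    module_path.write_text(module_source)  # roughly what %%writefile does

    # Import the module from its file path and look up the function by name.
    spec = importlib.util.spec_from_file_location("toy_transforms", str(module_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    result = module.preprocessing_fn({"Age": 3})
```

The point is that the component only needs a path to a file that defines a function with the expected name.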

In [None]:
pets_transform_file = 'pets_transforms.py'

In [None]:
#?%%writefile

In [None]:
%%writefile {pets_transform_file}

from typing import Union
import tensorflow as tf
import tensorflow_transform as tft

LABEL_KEY = "AdoptionSpeed"

ONE_HOT_FEATURES = {
    'Type': 2,
    'Breed1': 166,
    'Gender': 2,
    'Color1': 7,
    'Color2': 7,
    'MaturitySize': 2,
    'FurLength': 3,
    'Vaccinated': 3,
    'Sterilized': 3,
    'Health': 3
    
}

NUMERICAL_FEATURES = [
    'Age',
    'Fee',
    'PhotoAmt' 
]

TEXT_FEATURES = {'Description': None}


def _transformed_name(key:str) -> str:
    return key + "_xf"



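def _convert_num_to_one_hot(label_tensor: tf.Tensor, num_labels: int = 2) -> tf.Tensor:
    """Convert integer indices into one-hot vectors.
    Args:
        label_tensor: A tensor of integer indices. Indices outside
            [0, num_labels) yield an all-zero vector.
        num_labels: The length of each one-hot vector.
    Returns:
        A float tensor of shape [batch_size, num_labels].
    """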
    one_hot_tensor = tf.one_hot(label_tensor, num_labels)
    return tf.reshape(one_hot_tensor, [-1, num_labels])

def _fill_in_missing(x: Union[tf.Tensor, tf.SparseTensor]) -> tf.Tensor:
    """Replace missing values in a SparseTensor.
    Fills in missing values of `x` with '' or 0, and converts to a
    dense tensor.
    Args:
      x: A `SparseTensor` of rank 2.  Its dense shape should have
        size at most 1 in the second dimension.
    Returns:
      A rank 1 tensor where missing values of `x` have been filled in.
    """
    if isinstance(x, tf.sparse.SparseTensor):
        default_value = "" if x.dtype == tf.string else 0
        x = tf.sparse.to_dense(
            tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),
            default_value,
        )
    return tf.squeeze(x, axis=1)


def preprocessing_fn(inputs: dict) -> dict:
    """tf.transform's callback function for preprocessing inputs.
    Args:
        inputs: map from feature keys to raw not-yet-transformed features.
    Returns:
        Map from string feature key to transformed feature operations.
    """
    
    outputs = {}

    for key in ONE_HOT_FEATURES:
        dim = ONE_HOT_FEATURES[key]
        int_value = tft.compute_and_apply_vocabulary(
            _fill_in_missing(inputs[key]), top_k=dim+1
        )
        outputs[_transformed_name(key)] = _convert_num_to_one_hot(
            int_value, num_labels=dim+1
        )

    for key in NUMERICAL_FEATURES:
        # Scale these features to the z-score.
        outputs[_transformed_name(key)] = tft.scale_to_z_score(inputs[key])
            
    for key in TEXT_FEATURES.keys():
        outputs[_transformed_name(key)] = _fill_in_missing(inputs[key])

        
    outputs[_transformed_name(LABEL_KEY)] = _fill_in_missing(inputs[LABEL_KEY])

    return outputs
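
To see what the one-hot branch of `preprocessing_fn` does conceptually, here is a pure-Python sketch that mimics it without TensorFlow: build a vocabulary from the most frequent values, map each value to its index, and one-hot encode the index. It only approximates `tft.compute_and_apply_vocabulary` followed by `tf.one_hot`. The helper names are hypothetical, and the sketch assumes that out-of-vocabulary values map to -1 (the default `default_value`), which one-hot encodes to an all-zero vector.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(values: List[str], top_k: int) -> Dict[str, int]:
    """Map the top_k most frequent values to indices, most frequent first."""
    most_common = Counter(values).most_common(top_k)
    return {value: index for index, (value, _) in enumerate(most_common)}


def one_hot(index: int, depth: int) -> List[float]:
    """One-hot encode an index; out-of-range indices (e.g. -1) give all zeros."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


# "" stands for a missing value that has already been filled in.
train_values = ["Dog", "Cat", "Dog", "Dog", "Cat", ""]
vocab = build_vocabulary(train_values, top_k=3)

encoded_dog = one_hot(vocab.get("Dog", -1), depth=3)
encoded_unseen = one_hot(vocab.get("Rabbit", -1), depth=3)
```

The vocabulary is learned from the training data only, so a value first seen at serving time ends up as the all-zero vector rather than raising an error.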

## Create and run a Transform component

In [None]:
import os

In [None]:
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(pets_transform_file))

In [None]:
context.run(transform)

### Inspect some transformed examples

In [None]:
train_uri = transform.outputs['transformed_examples'].get()[0].uri + '/Split-train'
train_uri

In [None]:
os.listdir(train_uri)

In [None]:
transformed_dataset = tf.data.TFRecordDataset(train_uri+'/transformed_examples-00000-of-00001.gz', 
                                              compression_type="GZIP")

In [None]:
transformed_dataset

Here is the first record:

In [None]:
for tfrecord in transformed_dataset.take(1):
    serialized_example = tfrecord.numpy()
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    print(example)
    print("#"*40)
    print("#"*40)
    print()

### The transform graph artifact

In [None]:
transform_graph_uri = transform.outputs['transform_graph'].get()[0].uri

In [None]:
transform_graph_uri

In [None]:
os.listdir(transform_graph_uri)

In [None]:
os.listdir(transform_graph_uri + '/metadata')

In [None]:
os.listdir(transform_graph_uri + '/transformed_metadata')

In [None]:
os.listdir(transform_graph_uri + '/transform_fn')
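
These directories are usually not read by hand. One way to work with them is the `tft.TFTransformOutput` wrapper from TensorFlow Transform, sketched below under the assumption that the `transform_graph` artifact directory can be passed to it as-is. The helper name `describe_transform_output` is hypothetical, and the import is deferred into the function so that the helper can be defined before TFX is installed.

```python
def describe_transform_output(transform_graph_dir: str) -> dict:
    """Return a printable description of each transformed feature."""
    # Deferred import: tensorflow_transform is installed together with TFX.
    import tensorflow_transform as tft

    tf_transform_output = tft.TFTransformOutput(transform_graph_dir)
    # Feature spec of the transformed examples, keyed by the "_xf" names.
    feature_spec = tf_transform_output.transformed_feature_spec()
    return {name: str(spec) for name, spec in feature_spec.items()}
```

In this notebook it could be called as `describe_transform_output(transform_graph_uri)` once the Transform component has run.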

# What have we done so far?

Here's our current pipeline:

<img width=60% src="assets/pipeline_3.png">

# What's next?

We now have a pipeline that ingests data, computes statistics, generates a data schema, applies the schema to validate examples, and preprocesses the examples. Next, we'll look at how to do the actual **training of machine learning models**.

The training will be done by a TFX **Trainer** component, which consumes the transformed examples from the `Transform` component together with the data schema.

We'll end up with the following pipeline:

<img width=100% src="assets/pipeline_4.png">