This notebook demonstrates how to use `MLTransform`, a `PTransform` that wraps the data processing transforms provided by Beam. These transforms prepare data for ML training and inference.

This notebook uses the data processing transforms defined in `apache_beam/ml/transforms/tft`. These modules are implemented with `tensorflow_transform`, but its details are abstracted away from the end user. Users simply pass in the data, for example as dictionaries whose keys are column names and whose values are scalars, lists, or NumPy arrays.

Import necessary modules.

In [68]:
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.ml.transforms.utils import ArtifactsFetcher

In [89]:
artifact_location = './my_artifacts'
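# Helper that deletes artifacts left over from previous runs.
def delete_artifact_location(artifact_location):
    import shutil
    import os
    if os.path.exists(artifact_location):
        shutil.rmtree(artifact_location)
import logging
# The trailing comma makes this a one-element tuple; without it the loop
# would iterate over the individual characters of the string.
for name in ("tensorflow_transform",):
    logging.getLogger(name).setLevel(logging.WARNING)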

## ComputeAndApplyVocabulary
Let us compute the vocabulary of the dataset and map each word to a unique index. This can be achieved using `ComputeAndApplyVocabulary`(LINK).

In [94]:
delete_artifact_location(artifact_location)
data = [
    {'x': ['I', 'love', 'pie']},
    {'x': ['I', 'love', 'going', 'to', 'the', 'park']}
]
options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    data = (
        p 
        | beam.Create(data)
        | MLTransform(artifact_location=artifact_location).with_transform(ComputeAndApplyVocabulary(columns=['x']))
        | beam.Map(print)
    )



INFO:tensorflow:Assets written to: ./my_artifacts/tftransform_tmp/6c0885c2e7f148369a4ceeed08f0b137/assets
INFO:tensorflow:struct2tensor is not available.
INFO:tensorflow:tensorflow_decision_forests is not available.
INFO:tensorflow:tensorflow_text is not available.
INFO:tensorflow:Assets written to: ./my_artifacts/tftransform_tmp/6f0a3df927ed4e2d977625225c6e04dc/assets


Row(x=array([1, 0, 4]))
Row(x=array([1, 0, 6, 2, 3, 5]))
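
To make the mapping concrete, here is a minimal pure-Python sketch of the idea behind computing and applying a vocabulary. It is not how `ComputeAndApplyVocabulary` is implemented. The helper names `compute_vocabulary` and `apply_vocabulary` are hypothetical, and the ordering rule (descending frequency, ties broken by descending string order) is an assumption chosen only because it reproduces the indices printed above.

```python
from collections import Counter


def compute_vocabulary(rows, column):
    """Return the distinct tokens of `column`, most frequent first.

    Ties are broken by descending string order. This tie-breaking rule is an
    assumption that matches the notebook output, not a documented guarantee.
    """
    counts = Counter(token for row in rows for token in row[column])
    return sorted(counts, key=lambda token: (counts[token], token), reverse=True)


def apply_vocabulary(rows, column, vocabulary):
    """Replace every token in `column` with its position in `vocabulary`."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [[index[token] for token in row[column]] for row in rows]


rows = [
    {'x': ['I', 'love', 'pie']},
    {'x': ['I', 'love', 'going', 'to', 'the', 'park']},
]
vocabulary = compute_vocabulary(rows, 'x')
encoded = apply_vocabulary(rows, 'x', vocabulary)
print(vocabulary)
print(encoded)
```

The vocabulary is computed over the whole dataset first and only then applied to each row, which is why the pipeline needs a full pass over the data before it can emit any output.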


The same data is used again in the TFIDF section below, where `TFIDF`(LINK) is applied after the vocabulary indices have been computed for each word.

## Fetching vocabulary artifacts

`ComputeAndApplyVocabulary` generates artifacts, in this case a file containing the full vocabulary of the dataset. To fetch them, use the `ArtifactsFetcher` class, which returns the vocabulary list and the path to the vocabulary file.

In [67]:
fetcher = ArtifactsFetcher(artifact_location=artifact_location)
# get vocab list
vocab_list = fetcher.get_vocab_list()
print(vocab_list)
# get vocab file path
vocab_file_path = fetcher.get_vocab_filepath()
print(vocab_file_path)
# get vocab size; get_vocab_size() requires the vocab file name
# (assumed here to be the file name at the end of the path returned above)
vocab_size = fetcher.get_vocab_size(vocab_filename='compute_and_apply_vocab')
print(vocab_size)

['love', 'I', 'to', 'the', 'pie', 'park', 'going']
./my_artifacts/transform_fn/assets/compute_and_apply_vocab

## TFIDF

In [84]:
from apache_beam.ml.transforms.tft import TFIDF

In [90]:
data = [
    {'x': ['I', 'love', 'pie']},
    {'x': ['I', 'love', 'going', 'to', 'the', 'park']}
]
delete_artifact_location(artifact_location)
options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    data = (
        p 
        | beam.Create(data)
        | MLTransform(artifact_location=artifact_location
                     ).with_transform(ComputeAndApplyVocabulary(columns=['x'])
                     ).with_transform(TFIDF(columns=['x']))
    )
    _ = data | beam.Map(print)



INFO:tensorflow:Assets written to: ./my_artifacts/tftransform_tmp/0a6a470f382c455fb68b594737de8e5f/assets
INFO:tensorflow:struct2tensor is not available.
INFO:tensorflow:tensorflow_decision_forests is not available.
INFO:tensorflow:tensorflow_text is not available.
INFO:tensorflow:Assets written to: ./my_artifacts/tftransform_tmp/d0e70fe851d94d03b924ab1a769fa964/assets
INFO:tensorflow:Assets written to: ./my_artifacts/tftransform_tmp/de7237d2106440a4b2080bf340a5562b/assets


Row(x=array([1, 0, 4]), x_tfidf_weight=array([0.33333334, 0.33333334, 0.4684884 ], dtype=float32), x_vocab_index=array([0, 1, 4]))
Row(x=array([1, 0, 6, 2, 3, 5]), x_tfidf_weight=array([0.16666667, 0.16666667, 0.2342442 , 0.2342442 , 0.2342442 ,
       0.2342442 ], dtype=float32), x_vocab_index=array([0, 1, 2, 3, 5, 6]))


`TFIDF` provides two artifacts in the output. Each is named after the processed column with the suffix `tfidf_weight` or `vocab_index` (here, `x_tfidf_weight` and `x_vocab_index`).

* `vocab_index`: indices of the words computed by `ComputeAndApplyVocabulary`.
* `tfidf_weight`: weight for each vocabulary index. The weight represents how important the word at that `vocab_index` is to the document.
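
As a rough illustration of where these weights come from, the following pure-Python sketch recomputes them from the vocabulary indices. It assumes a smoothed formula, `tf * (log((N + 1) / (df + 1)) + 1)`, where `tf` is the share of the document taken up by the word, `N` is the number of documents, and `df` is the number of documents containing the word. This formula is an assumption chosen because it reproduces the values printed above. The function name `tfidf` is hypothetical and the code is not Beam's implementation.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return (vocab_indices, weights) for each document of vocab indices.

    Assumed formula: tf * (log((N + 1) / (df + 1)) + 1).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each index occur?
    doc_freq = Counter(index for doc in documents for index in set(doc))
    results = []
    for doc in documents:
        counts = Counter(doc)
        indices = sorted(counts)
        weights = [
            (counts[i] / len(doc))
            * (math.log((n_docs + 1) / (doc_freq[i] + 1)) + 1)
            for i in indices
        ]
        results.append((indices, weights))
    return results


# Vocabulary indices produced by the ComputeAndApplyVocabulary step above.
documents = [[1, 0, 4], [1, 0, 6, 2, 3, 5]]
results = tfidf(documents)
for indices, weights in results:
    print(indices, [round(w, 4) for w in weights])
```

Words that occur in both documents (indices 0 and 1) receive a lower weight than words that occur in only one of them, which matches the pattern in the `x_tfidf_weight` column.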


## ScaleTo01

Now let us scale the data to the range 0 to 1. This is done by calculating the `min` and `max` values over the whole dataset and then computing
```
x = (x - x_min) / (x_max - x_min)
```

This can be achieved using `MLTransform` with the `ScaleTo01` data processing transform.

In [80]:
from apache_beam.ml.transforms.tft import ScaleTo01

In [83]:
data = [
    {'x': [1, 2, 3]}, {'x': [4, 5, 7]}, {'x': [10, 2, 10, 34, 100, 54, 20, 10, 2, 3, 11, 12]}]

# delete_artifact_location(artifact_location)
with beam.Pipeline() as p:
    _ = (
        p 
        | 'CreateData' >> beam.Create(data)
        | 'MLTransform' >> MLTransform(artifact_location=artifact_location).with_transform(ScaleTo01(columns=['x']))
        | 'PrintResults' >> beam.Map(print)
    )
    



INFO:tensorflow:Assets written to: ./my_artifacts/tftransform_tmp/4a0a3683ae774a74b877a27e48d1f796/assets
INFO:tensorflow:struct2tensor is not available.
INFO:tensorflow:tensorflow_decision_forests is not available.
INFO:tensorflow:tensorflow_text is not available.
INFO:tensorflow:Assets written to: ./my_artifacts/tftransform_tmp/8c6d7a67caa44232807f1d9d251e9ede/assets


Row(x=array([0.        , 0.01010101, 0.02020202], dtype=float32), x_max=array([100.], dtype=float32), x_min=array([1.], dtype=float32))
Row(x=array([0.03030303, 0.04040404, 0.06060606], dtype=float32), x_max=array([100.], dtype=float32), x_min=array([1.], dtype=float32))
Row(x=array([0.09090909, 0.01010101, 0.09090909, 0.33333334, 1.        ,
       0.53535354, 0.1919192 , 0.09090909, 0.01010101, 0.02020202,
       0.1010101 , 0.11111111], dtype=float32), x_max=array([100.], dtype=float32), x_min=array([1.], dtype=float32))


The output contains the values scaled to the range 0 to 1. Scaling uses the max and min values computed from the entire dataset.
The output also includes the artifacts `x_max` and `x_min`, which represent the max and min values of the entire dataset.
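
The scaling can be checked by hand with a small pure-Python sketch of min-max scaling over the whole dataset. The function name `scale_to_0_1` is hypothetical and the code only illustrates the arithmetic. It is not the `ScaleTo01` implementation, and raising an error for a constant column is a choice made for this sketch.

```python
def scale_to_0_1(rows, column):
    """Min-max scale `column` using the min and max of the entire dataset."""
    values = [value for row in rows for value in row[column]]
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # A constant column cannot be scaled this way; how to handle it
        # is left open in this sketch.
        raise ValueError('all values are identical; cannot scale')
    scaled = [[(value - x_min) / span for value in row[column]] for row in rows]
    return scaled, x_min, x_max


rows = [
    {'x': [1, 2, 3]},
    {'x': [4, 5, 7]},
    {'x': [10, 2, 10, 34, 100, 54, 20, 10, 2, 3, 11, 12]},
]
scaled, x_min, x_max = scale_to_0_1(rows, 'x')
print(x_min, x_max)
for row in scaled:
    print([round(value, 4) for value in row])
```

Because `x_min` and `x_max` come from all rows together, the first row does not span the full 0 to 1 range on its own. Only the dataset-wide maximum, 100, maps to 1.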