In [14]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

In [None]:
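! git clone https://github.com/apache/beam.git
# Each `!` command runs in its own subshell, so a separate `! cd` has no lasting effect.
# Install the SDK by its path relative to the current directory instead.
! pip install beam/sdks/python
! pip install tensorflow-transform --quiet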

This notebook shows how to use `MLTransform` to complete the following tasks:
* Use `write` mode in `MLTransform` to generate a vocabulary and assign an index value to each vocabulary item.
* Use `read` mode to load the generated vocabulary and assign indices to the items in a different dataset.

`MLTransform` uses the `ComputeAndApplyVocabulary` transform, which is implemented by using `tensorflow_transform` to generate the vocabulary.

[ComputeAndApplyVocabulary](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html#apache_beam.ml.transforms.tft.ComputeAndApplyVocabulary) is a data processing transform that computes a unique vocabulary from a dataset and then maps each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning tasks.

For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.

## Import the required modules

To use `ComputeAndApplyVocabulary` with `MLTransform`, install `tensorflow_transform` and the Apache Beam SDK version 2.53.0 or later.
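One way to confirm the version requirement before running the rest of the notebook is sketched below. It assumes the SDK is installed under the distribution name `apache-beam`. The helpers `parse_release` and `beam_is_recent_enough` are hypothetical names introduced for illustration, and the parser only looks at the leading numeric part of the version string, so suffixes such as `.dev` or `rc1` are ignored.

```python
import re
from importlib import metadata

# Minimum SDK version stated in this notebook.
MIN_BEAM_VERSION = (2, 53, 0)


def parse_release(version):
    """Return the leading numeric components of a version string as a 3-tuple or longer."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions such as "2.53" so tuple comparison behaves as expected.
    return parts + (0,) * max(0, 3 - len(parts))


def beam_is_recent_enough(minimum=MIN_BEAM_VERSION):
    """Return True or False, or None if apache-beam isn't installed."""
    try:
        installed = metadata.version("apache-beam")
    except metadata.PackageNotFoundError:
        return None
    return parse_release(installed) >= minimum


print(beam_is_recent_enough())
```

A result of `None` means the package wasn't found in the current environment, which usually indicates that the install cell hasn't run yet.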

In [16]:
import os
import tempfile
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary

In [17]:
artifact_location = tempfile.mkdtemp(prefix='compute_and_apply_vocab_')
artifact_location_with_frequency_threshold = tempfile.mkdtemp(prefix='compute_and_apply_vocab_frequency_threshold_')

In [18]:
documents = [
    {"feature": "the quick brown fox jumps over the lazy dog"},
    {"feature": "the five boxing wizards jump quickly in the sky"},
    {"feature": "dogs are running in the park"},
    {"feature": "the quick brown fox"}
]

In this example, `MLTransform` in `write` mode uses `ComputeAndApplyVocabulary` to generate a vocabulary from the incoming dataset. The generated vocabulary is stored in the location specified by `artifact_location`, so that `MLTransform` in `read` mode can later apply it to a different dataset.

In [19]:
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(documents)
  # Compute and apply vocabulary by using MLTransform.
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(write_artifact_location=artifact_location).with_transform(
          ComputeAndApplyVocabulary(columns=['feature'], split_string_by_delimiter=' '))
      )
  transformed_pcoll | "Print" >> beam.Map(print)



Row(feature=array([ 0,  1,  4,  3, 12, 10,  0, 11, 16]))
Row(feature=array([ 0, 14, 17,  5, 13,  8,  2,  0,  6]))
Row(feature=array([15, 18,  7,  2,  0,  9]))
Row(feature=array([0, 1, 4, 3]))
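
The indices in the output above are consistent with a vocabulary ordered by descending token frequency, with ties broken in reverse lexicographic order. The following pure-Python sketch reproduces them without Beam or `tensorflow_transform`. The ordering rule is an assumption inferred from this output, not a statement about the library's internals, and `build_vocabulary` and `apply_vocabulary` are hypothetical helper names.

```python
from collections import Counter

documents = [
    "the quick brown fox jumps over the lazy dog",
    "the five boxing wizards jump quickly in the sky",
    "dogs are running in the park",
    "the quick brown fox",
]


def build_vocabulary(texts, delimiter=" "):
    """Map each unique token to an index (hypothetical helper).

    Assumed ordering: most frequent token first, ties broken in
    reverse lexicographic order, which matches the output above.
    """
    counts = Counter(token for text in texts for token in text.split(delimiter))
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]), reverse=True)
    return {token: index for index, (token, _) in enumerate(ordered)}


def apply_vocabulary(text, vocabulary, delimiter=" "):
    """Replace each token in `text` with its vocabulary index."""
    return [vocabulary[token] for token in text.split(delimiter)]


vocabulary = build_vocabulary(documents)
encoded = [apply_vocabulary(text, vocabulary) for text in documents]
for row in encoded:
    print(row)
```

The word `the` occurs six times and therefore gets index `0`. The four words that occur twice (`quick`, `in`, `fox`, `brown`) take indices `1` through `4`, and the remaining single-occurrence words fill indices `5` through `18`.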


The `frequency_threshold` parameter identifies the elements that appear frequently in the dataset. This parameter limits the generated vocabulary to elements with an absolute frequency greater than or equal to the specified threshold. If you don't specify the parameter, the entire vocabulary is generated.

If the frequency of a vocabulary item is less than the threshold, it's assigned a default value. You can use the `default_value` parameter to set this value. Otherwise, it defaults to `-1`.

In [20]:
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(documents)
  # Compute and apply vocabulary by using MLTransform.
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(write_artifact_location=artifact_location_with_frequency_threshold).with_transform(
          ComputeAndApplyVocabulary(columns=['feature'], split_string_by_delimiter=' ', frequency_threshold=2))
      )
  transformed_pcoll | "Print" >> beam.Map(print)



Row(feature=array([ 0,  1,  4,  3, -1, -1,  0, -1, -1]))
Row(feature=array([ 0, -1, -1, -1, -1, -1,  2,  0, -1]))
Row(feature=array([-1, -1, -1,  2,  0, -1]))
Row(feature=array([0, 1, 4, 3]))
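
As a rough illustration of the thresholding shown above, the sketch below counts tokens in plain Python, keeps only those that occur at least `frequency_threshold` times, and maps everything else to `default_value`. `encode_with_threshold` is a hypothetical helper, and the tie-breaking order (reverse lexicographic) is an assumption inferred from the notebook output rather than documented behavior.

```python
from collections import Counter


def encode_with_threshold(texts, frequency_threshold=1, default_value=-1, delimiter=" "):
    """Hypothetical helper: index frequent tokens, map the rest to default_value."""
    tokenized = [text.split(delimiter) for text in texts]
    counts = Counter(token for tokens in tokenized for token in tokens)
    # Keep tokens whose absolute frequency meets the threshold.
    kept = [(count, token) for token, count in counts.items() if count >= frequency_threshold]
    # Assumed ordering: most frequent first, ties in reverse lexicographic order.
    kept.sort(reverse=True)
    vocabulary = {token: index for index, (_, token) in enumerate(kept)}
    return [[vocabulary.get(token, default_value) for token in tokens] for tokens in tokenized]


documents = [
    "the quick brown fox jumps over the lazy dog",
    "the five boxing wizards jump quickly in the sky",
    "dogs are running in the park",
    "the quick brown fox",
]

encoded = encode_with_threshold(documents, frequency_threshold=2)
for row in encoded:
    print(row)
```

With a threshold of 2, only `the`, `quick`, `in`, `fox`, and `brown` survive, so every other token is replaced by `-1`.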


## `MLTransform` for inference workloads

When `MLTransform` is in `write` mode, it produces artifacts, such as vocabulary files for `ComputeAndApplyVocabulary`. When `MLTransform` is used in `read` mode, it uses the previously generated vocabulary files to map the incoming text data. If an incoming token isn't found in the generated vocabulary, it's mapped to the `default_value` that was provided in `write` mode. In this case, the `default_value` is `-1`.

In [21]:
test_documents = [
    {'feature': 'wizards are flying in the sky'},
    {'feature': 'I love dogs'}
]

with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(test_documents)
  # Apply the previously saved vocabulary by using MLTransform in read mode.
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(read_artifact_location=artifact_location))

  transformed_pcoll | "Print" >> beam.Map(print)



Row(feature=array([ 5, 18, -1,  2,  0,  6]))
Row(feature=array([-1, -1, 15]))


In the previous step, because `read_artifact_location` is specified, no transforms are passed to `MLTransform`. Instead, `MLTransform` loads the transforms and artifacts that were saved earlier, in `write` mode, to the location specified by `write_artifact_location`.
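
To make the split between `write` and `read` concrete, here is a simplified sketch that doesn't use Beam: a "write" step saves a vocabulary to a text file, and a "read" step loads it and maps unseen tokens to `-1`. The vocabulary list is reconstructed from the indices printed earlier in this notebook. The one-token-per-line layout, the file name `vocab.txt`, and the helper names are assumptions made for illustration. They don't describe the artifact format that `MLTransform` actually writes.

```python
import os
import tempfile

# Vocabulary in index order, reconstructed from the write-mode output in this notebook.
vocabulary = [
    "the", "quick", "in", "fox", "brown", "wizards", "sky", "running",
    "quickly", "park", "over", "lazy", "jumps", "jump", "five", "dogs",
    "dog", "boxing", "are",
]


def write_vocabulary(tokens, directory):
    """'Write' step: save one token per line (assumed layout) and return the path."""
    path = os.path.join(directory, "vocab.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))
    return path


def read_vocabulary(path):
    """'Read' step: rebuild the token-to-index lookup from the saved file."""
    with open(path, encoding="utf-8") as f:
        return {token: index for index, token in enumerate(f.read().split("\n"))}


def encode(text, lookup, default_value=-1):
    """Map each token to its saved index, or to default_value if it is unseen."""
    return [lookup.get(token, default_value) for token in text.split(" ")]


with tempfile.TemporaryDirectory() as directory:
    lookup = read_vocabulary(write_vocabulary(vocabulary, directory))

test_documents = ["wizards are flying in the sky", "I love dogs"]
encoded = [encode(text, lookup) for text in test_documents]
for row in encoded:
    print(row)
```

The tokens `flying`, `I`, and `love` never appeared in the original dataset, so they map to `-1`, which matches the `read` mode output above.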