In [None]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License

In [None]:
! git clone https://github.com/apache/beam.git
# Each `!` command runs in its own subshell, so a separate `cd` does not persist; install by path instead.
! pip install beam/sdks/python
! pip install tensorflow-transform --quiet

Cloning into 'beam'...
remote: Enumerating objects: 1030577, done.
remote: Counting objects: 100% (1884/1884), done.
remote: Compressing objects: 100% (450/450), done.
remote: Total 1030577 (delta 911), reused 1734 (delta 844), pack-reused 1028693
Receiving objects: 100% (1030577/1030577), 558.71 MiB | 23.83 MiB/s, done.
Resolving deltas: 100% (532445/532445), done.
Updating files: 100% (13920/13920), done.
Processing ./beam/sdks/python
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting crcmod<2.0,>=1.7 (from apache-beam==2.54.0.dev0)
  Downloading crcmod-1.7.tar.gz (89 kB)
  Preparing metadata (setup.py) ... done
Collecting orjson<4,>=3.9.7 (from apache-beam==2.54.0.dev0)
  Downloading orjson-3.9.10-cp310-cp3

This notebook shows how to use the Apache Beam `MLTransform` to generate a vocabulary and assign an integer index to each vocabulary entry. It uses `ComputeAndApplyVocabulary`, which is implemented using `tensorflow_transform`. Specifically, the notebook shows how to:
* Split each input sentence into a list of words using a delimiter.
* Generate a vocabulary from the input dataset in write mode (`write_artifact_location`).
* Map the words of a test dataset to the generated vocabulary in read mode (`read_artifact_location`).

For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.

## Import the required modules

To use `MLTransform`, install `tensorflow_transform` and the Apache Beam SDK version 2.53.0 or later.
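Before running the rest of the notebook, it can help to confirm that the installed SDK meets the minimum version. The following is a minimal sketch that uses only the standard library. The helper names `version_tuple` and `installed_version` are hypothetical, and the parsing is a simplification that reads only the leading numeric components, so `2.54.0.dev0` is treated as `2.54.0`.

```python
import re
from importlib import metadata

MINIMUM_BEAM_VERSION = (2, 53, 0)  # minimum version stated in the text above


def version_tuple(version):
    """Parse the leading numeric components of a version string.

    Simplification (assumption): suffixes such as '.dev0' or 'rc1' are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


beam_version = installed_version("apache-beam")
if beam_version is None:
    print("apache-beam is not installed")
elif version_tuple(beam_version) < MINIMUM_BEAM_VERSION:
    print(f"apache-beam {beam_version} is older than 2.53.0")
else:
    print(f"apache-beam {beam_version} meets the minimum version")
```

For anything beyond a quick check, a dedicated version parser is likely a better choice than this regular expression.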

In [None]:
import os
import tempfile
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary

In [None]:
artifact_location = tempfile.mkdtemp(prefix='compute_and_apply_vocab_')
artifact_location_with_frequency_threshold = tempfile.mkdtemp(prefix='compute_and_apply_vocab_frequency_threshold_')

In [None]:
documents = [
    {"feature": "the quick brown fox jumps over the lazy dog"},
    {"feature": "the five boxing wizards jump quickly in the sky"},
    {"feature": "dogs are running in the park"},
    {"feature": "the quick brown fox"}
]

[ComputeAndApplyVocabulary](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html#apache_beam.ml.transforms.tft.ComputeAndApplyVocabulary) is a data processing transform that computes a unique vocabulary from a dataset and then maps each word or token to a distinct integer index. It facilitates transforming textual data into numerical representations for machine learning tasks.
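To make the mapping concrete, here is a minimal pure-Python sketch of the compute-and-apply idea. It is not how `tensorflow_transform` implements the transform. The helper names `compute_vocabulary` and `apply_vocabulary` are hypothetical, and the ordering rule (descending frequency, with ties broken in reverse lexicographic order) is an assumption chosen because it reproduces the indices printed by the pipeline in this notebook.

```python
from collections import Counter

documents = [
    "the quick brown fox jumps over the lazy dog",
    "the five boxing wizards jump quickly in the sky",
    "dogs are running in the park",
    "the quick brown fox",
]


def compute_vocabulary(texts, delimiter=" "):
    """Map each distinct token to an integer index."""
    counts = Counter(token for text in texts for token in text.split(delimiter))
    # Assumed ordering: most frequent first, ties in reverse lexicographic order.
    # The second sort is stable, so it preserves the tie order of the first.
    by_token = sorted(counts, reverse=True)
    ordered = sorted(by_token, key=lambda token: -counts[token])
    return {token: index for index, token in enumerate(ordered)}


def apply_vocabulary(text, vocabulary, delimiter=" ", default_value=-1):
    """Replace each token with its index, or default_value if it is unknown."""
    return [vocabulary.get(token, default_value) for token in text.split(delimiter)]


vocabulary = compute_vocabulary(documents)
for text in documents:
    print(apply_vocabulary(text, vocabulary))
```

Under these assumptions, the most frequent word, `the`, receives index `0`, and the first sentence is encoded as `[0, 1, 4, 3, 12, 10, 0, 11, 16]`.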

In [None]:
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(documents)
  # Compute and apply vocabulary using MLTransform
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(write_artifact_location=artifact_location).with_transform(
          ComputeAndApplyVocabulary(columns=['feature'], split_string_by_delimiter=' '))
      )
  transformed_pcoll | "Print" >> beam.Map(print)





Row(feature=array([ 0,  1,  4,  3, 12, 10,  0, 11, 16]))
Row(feature=array([ 0, 14, 17,  5, 13,  8,  2,  0,  6]))
Row(feature=array([15, 18,  7,  2,  0,  9]))
Row(feature=array([0, 1, 4, 3]))


The `frequency_threshold` parameter restricts the vocabulary to frequently occurring elements: only elements with an absolute frequency greater than or equal to the threshold are kept. If the threshold is not specified, the entire vocabulary is generated.
* If the frequency of an element is less than the threshold, the element is assigned a default value. Use the `default_value` parameter to set this value, which is `-1` by default.
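The effect of the threshold can be illustrated with a small pure-Python sketch. It is an approximation for intuition only, not the `tensorflow_transform` implementation. The function `compute_vocabulary` is hypothetical, and the tie-breaking order (reverse lexicographic among equally frequent tokens) is an assumption.

```python
from collections import Counter

documents = [
    "the quick brown fox jumps over the lazy dog",
    "the five boxing wizards jump quickly in the sky",
    "dogs are running in the park",
    "the quick brown fox",
]


def compute_vocabulary(texts, frequency_threshold=None, delimiter=" "):
    """Build a token-to-index map, dropping tokens below the threshold."""
    counts = Counter(token for text in texts for token in text.split(delimiter))
    if frequency_threshold is not None:
        # Keep only tokens whose absolute frequency is >= the threshold.
        counts = {t: c for t, c in counts.items() if c >= frequency_threshold}
    # Assumed ordering: most frequent first, ties in reverse lexicographic order.
    ordered = sorted(sorted(counts, reverse=True), key=lambda t: -counts[t])
    return {token: index for index, token in enumerate(ordered)}


vocabulary = compute_vocabulary(documents, frequency_threshold=2)
print(vocabulary)

default_value = -1  # tokens that were filtered out map to this value
encoded = [vocabulary.get(token, default_value) for token in documents[0].split(" ")]
print(encoded)
```

With a threshold of 2, only `the`, `quick`, `in`, `fox`, and `brown` remain in the vocabulary, so every other word in the first sentence is encoded as `-1`.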

In [None]:
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(documents)
  # Compute and apply vocabulary using MLTransform
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(write_artifact_location=artifact_location_with_frequency_threshold).with_transform(
          ComputeAndApplyVocabulary(columns=['feature'], split_string_by_delimiter=' ', frequency_threshold=2))
      )
  transformed_pcoll | "Print" >> beam.Map(print)



Row(feature=array([ 0,  1,  4,  3, -1, -1,  0, -1, -1]))
Row(feature=array([ 0, -1, -1, -1, -1, -1,  2,  0, -1]))
Row(feature=array([-1, -1, -1,  2,  0, -1]))
Row(feature=array([0, 1, 4, 3]))


## `MLTransform` for inference workloads

When `MLTransform` is in `write` mode, it produces artifacts, such as the vocabulary files for `ComputeAndApplyVocabulary`. When `MLTransform` is used in `read` mode, it loads the existing vocabulary files and maps the incoming text data based on the previously generated vocabulary. If an incoming word is not found in the generated vocabulary, it is mapped to `-1`.

In [None]:
test_documents = [
    {'feature': 'wizards are flying in the sky'},
    {'feature': 'I love dogs'}
]

with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(test_documents)
  # Apply the previously computed vocabulary by reading the saved artifacts
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(read_artifact_location=artifact_location))

  transformed_pcoll | "Print" >> beam.Map(print)



Row(feature=array([ 5, 18, -1,  2,  0,  6]))
Row(feature=array([-1, -1, 15]))


The preceding step doesn't pass any transforms to `MLTransform`. When `read_artifact_location` is specified, none are needed, because `MLTransform` saves both the artifacts and the transforms in the location given by `write_artifact_location`.

In the output, a word that is present in the previously generated vocabulary is assigned its index; otherwise, it is assigned the default value of `-1`.
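The write-then-read pattern can be mimicked in plain Python to show why the read step needs no transforms. This is a conceptual sketch only: the file name `vocabulary.json`, the JSON format, and the helpers `write_vocabulary` and `read_and_apply` are all hypothetical, and the artifacts that `MLTransform` actually stores are laid out differently. The ordering rule (descending frequency, ties in reverse lexicographic order) is likewise an assumption.

```python
import json
import os
import tempfile
from collections import Counter

train_documents = [
    "the quick brown fox jumps over the lazy dog",
    "the five boxing wizards jump quickly in the sky",
    "dogs are running in the park",
    "the quick brown fox",
]
test_documents = ["wizards are flying in the sky", "I love dogs"]


def write_vocabulary(texts, artifact_dir):
    """'Write' step: compute the vocabulary and persist it as an artifact."""
    counts = Counter(token for text in texts for token in text.split(" "))
    ordered = sorted(sorted(counts, reverse=True), key=lambda t: -counts[t])
    vocabulary = {token: index for index, token in enumerate(ordered)}
    with open(os.path.join(artifact_dir, "vocabulary.json"), "w") as f:
        json.dump(vocabulary, f)


def read_and_apply(texts, artifact_dir, default_value=-1):
    """'Read' step: load the saved vocabulary and map tokens to indices."""
    with open(os.path.join(artifact_dir, "vocabulary.json")) as f:
        vocabulary = json.load(f)
    return [
        [vocabulary.get(token, default_value) for token in text.split(" ")]
        for text in texts
    ]


artifact_dir = tempfile.mkdtemp(prefix="vocab_sketch_")
write_vocabulary(train_documents, artifact_dir)
encoded = read_and_apply(test_documents, artifact_dir)
print(encoded)
```

Because the read step only looks tokens up in the saved file, words that were never seen during the write step, such as `flying`, `I`, and `love`, fall back to `-1`.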