In [None]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

In [None]:
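! git clone https://github.com/apache/beam.git
# Install the SDK from the cloned source tree. A separate `! cd` line has no
# effect here, because each `!` command runs in its own subshell.
! pip install beam/sdks/python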
! pip install tensorflow-transform --quiet

## Use TF-IDF to weight terms

[TF-IDF](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html#apache_beam.ml.transforms.tft.ComputeAndApplyVocabulary) (Term Frequency-Inverse Document Frequency) is a numerical statistic used in text processing to reflect how important a word is to a document in a collection or corpus. It balances the frequency of a word in a document against its frequency in the entire corpus, giving higher value to more specific terms.

Use `TF-IDF` with `MLTransform`.

1. Compute the vocabulary of the dataset by using `ComputeAndApplyVocabulary`.
2. Use the output of `ComputeAndApplyVocabulary` to calculate the `TF-IDF` weights.

In this notebook, `MLTransform` will be run in `write` mode and `read` mode.

#### MLTransform in write mode

In write mode, `MLTransform` computes artifacts over the entire dataset, such as the vocabulary used in this notebook, saves them to the artifact location, and then uses them to transform the dataset. This workflow is useful for preparing data to train an ML model.

#### MLTransform in read mode

In read mode, `MLTransform` loads the artifacts generated in write mode and uses them to transform new data, without recomputing them.
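
The difference between the two modes can be illustrated with a minimal pure-Python sketch. It does not use Apache Beam: the helpers `write_mode` and `read_mode`, the file name `vocab.json`, and the first-seen ordering of the vocabulary are all hypothetical, chosen only to show that artifacts are computed and saved once and then reused.

```python
import json
import os
import tempfile


def write_mode(docs, artifact_dir):
    """Compute a vocabulary from the data, save it, and apply it."""
    # Hypothetical ordering: ids are assigned in first-seen order.
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    with open(os.path.join(artifact_dir, "vocab.json"), "w") as f:
        json.dump(vocab, f)
    return [[vocab[token] for token in doc] for doc in docs]


def read_mode(docs, artifact_dir):
    """Load the saved vocabulary and apply it without recomputing it."""
    with open(os.path.join(artifact_dir, "vocab.json")) as f:
        vocab = json.load(f)
    # Tokens that were not seen in write mode get the placeholder id -1.
    return [[vocab.get(token, -1) for token in doc] for doc in docs]


sketch_dir = tempfile.mkdtemp(prefix="sketch_")
train_ids = write_mode([["I", "love", "pie"]], sketch_dir)
new_ids = read_mode([["I", "love", "dogs"]], sketch_dir)
print(train_ids)  # [[0, 1, 2]]
print(new_ids)    # [[0, 1, -1]]
```

The second call never looks at the training data. It relies only on the saved file, which is the property that makes read mode suitable for data that arrives after the artifacts were generated.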

For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.


## Import the required modules

To use `MLTransform`, install `tensorflow_transform` and the Apache Beam SDK version 2.53.0 or later.

In [None]:
import tempfile
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import TFIDF, ComputeAndApplyVocabulary

In [None]:
artifact_location = tempfile.mkdtemp(prefix='TFIDF_')

data = [
    {
        'feature': ['I', 'love', 'pie']
    },
    {
        'feature': ['I', 'love', 'going', 'to', 'the', 'park']
    }
]

test_data = [
    {
        'feature': ['I', 'love', 'dogs']
    },
    {
        'feature': ['Dogs', 'are', 'running', 'in', 'the', 'park' ]
    }
]

### TF-IDF output

`TF-IDF` produces two output columns for a given input. For example, if you input `feature`, the output column names in the dictionary are `feature_vocab_index` and `feature_tfidf_weight`.

- `vocab_index`: indices of the words computed in the `ComputeAndApplyVocabulary` transform.
- `tfidf_weight`: the weight for each vocabulary index. The weight represents how important the word at that `vocab_index` is to the document.


In [None]:
# Write mode
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(data)

  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(write_artifact_location=artifact_location
                     ).with_transform(ComputeAndApplyVocabulary(columns=['feature'])
                     ).with_transform(TFIDF(columns=['feature']))
  )
  transformed_pcoll | "Print" >> beam.Map(print)



Row(feature=array([1, 0, 4]), feature_tfidf_weight=array([0.33333334, 0.33333334, 0.4684884 ], dtype=float32), feature_vocab_index=array([0, 1, 4]))
Row(feature=array([1, 0, 6, 2, 3, 5]), feature_tfidf_weight=array([0.16666667, 0.16666667, 0.2342442 , 0.2342442 , 0.2342442 ,
       0.2342442 ], dtype=float32), feature_vocab_index=array([0, 1, 2, 3, 5, 6]))
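
The weights printed above are numerically consistent with a smoothed TF-IDF formula, where the term frequency is the term count divided by the document length and the inverse document frequency is `log((N + 1) / (df + 1)) + 1` for `N` documents. The following pure-Python sketch applies that formula to the encoded `feature` arrays from the output. The formula is an assumption inferred from the printed numbers, and `tfidf_weights` is a hypothetical helper, not part of Apache Beam.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term_id: weight} dict per document (smoothed TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term id.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        weights = {}
        for term, count in sorted(Counter(doc).items()):
            tf = count / len(doc)
            # Assumed smoothing: add one to both counts, then add one to the log.
            idf = math.log((n_docs + 1) / (doc_freq[term] + 1)) + 1
            weights[term] = tf * idf
        results.append(weights)
    return results


# Encoded `feature` arrays copied from the write-mode output.
encoded_docs = [[1, 0, 4], [1, 0, 6, 2, 3, 5]]
sketch_weights = tfidf_weights(encoded_docs)
for weights in sketch_weights:
    print({term: round(weight, 7) for term, weight in weights.items()})
```

Terms that occur in both documents (ids 0 and 1) get an inverse document frequency of exactly 1, so their weight equals their term frequency. Terms that occur in only one document get the larger factor of about 1.405, which is why `pie` outweighs `I` and `love` in the first row.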


In [None]:
# Read mode
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(test_data)

  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(read_artifact_location=artifact_location)
  )
  transformed_pcoll | "Print" >> beam.Map(print)



Row(feature=array([ 1,  0, -1]), feature_tfidf_weight=array([0.33333334, 0.33333334, 0.4684884 ], dtype=float32), feature_vocab_index=array([0, 1, 6]))
Row(feature=array([-1, -1, -1, -1,  3,  5]), feature_tfidf_weight=array([0.2342442, 0.2342442, 0.9369768], dtype=float32), feature_vocab_index=array([3, 5, 6]))


Terms that are not in the vocabulary generated in write mode are called out-of-vocabulary (OOV) terms. In the read-mode output above, OOV terms such as `dogs` appear as `-1` in the `feature` column and are grouped under a single index (`6` here) in `feature_vocab_index`. OOV terms are still included in the TF-IDF computation, so they receive a weight like any other term: in the second row, the four OOV terms share one weight of about 0.94.
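
In the read-mode output above, OOV terms show up as `-1` in `feature` and as index `6` in `feature_vocab_index`. One mapping that reproduces those indices is to wrap each id into the range `[0, vocab_size)` with a modulo. The sketch below applies it to the printed `feature` arrays. The wrap-around rule and the helper `wrap_ids` are assumptions inferred from the output, not a description of the library's internals.

```python
from collections import Counter

# Ids 0 through 6 appear in the write-mode output, so the vocabulary has 7 entries.
sketch_vocab_size = 7

# `feature` arrays copied from the read-mode output; -1 marks an OOV term.
oov_rows = [[1, 0, -1], [-1, -1, -1, -1, 3, 5]]


def wrap_ids(ids, size):
    """Map every id into [0, size); in Python, -1 % 7 evaluates to 6."""
    return [i % size for i in ids]


wrapped_rows = [wrap_ids(ids, sketch_vocab_size) for ids in oov_rows]
term_freqs = []
for wrapped in wrapped_rows:
    counts = Counter(wrapped)
    # Term frequency per index: all OOV terms in a row collapse into one bucket.
    term_freqs.append({t: c / len(wrapped) for t, c in sorted(counts.items())})

print(wrapped_rows)        # [[1, 0, 6], [6, 6, 6, 6, 3, 5]]
print(sorted(term_freqs[1]))  # [3, 5, 6]
```

Under this assumption the four OOV terms in the second row share index 6 with a term frequency of 4/6, which is consistent with their combined weight being four times the weight of `the` or `park` in the printed output.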


**NOTE**: `TFIDF` accepts integer inputs, so you must first generate a vocabulary that assigns each word an index, for example with `ComputeAndApplyVocabulary`.