[Bug]: MLTransform drops elements if they are already transformed before. #29600

AnandInguva · 2023-12-04T05:59:02Z

What happened?

When the input PCollection contains duplicate elements, MLTransform outputs each distinct element only once and drops the remaining duplicates after transformation. This is not the expected behavior.

Note: MLTransform is an experimental feature in 2.50.0 through 2.52.0. Because of this bug, avoid using MLTransform with those versions if your data contains identical elements.

The fix is introduced in PR #29542 and targets 2.53.0.
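
To check whether a dataset would be affected on 2.50.0 through 2.52.0, one option is to count identical elements before running the pipeline. This is a minimal sketch in plain Python, assuming the data fits in memory and each element is a flat dict with hashable values. `find_duplicates` is a hypothetical helper and not part of Beam.

```python
from collections import Counter


def find_duplicates(rows):
    """Return a dict mapping each repeated row to the number of times it occurs."""
    # Dicts are unhashable, so represent each row as a sorted tuple of (key, value) pairs.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


data = [{'x': 'I'}, {'x': 'love'}, {'x': 'Beam'}, {'x': 'Beam'}, {'x': 'is'}, {'x': 'awesome'}]
duplicates = find_duplicates(data)
print(duplicates)  # {(('x', 'Beam'),): 2}
```

An empty result means the input has no identical elements, so this particular bug should not drop anything.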

Simple repro:

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary
import tempfile
data = [
    {
        'x': 'I'
    },
    {
        'x': 'love'
    },
    {
        'x': 'Beam'
    },
    {
        'x': 'Beam'
    },
    {
        'x': 'is'
    },
    {
        'x': 'awesome'
    },
]
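# Temporary directory where MLTransform writes its artifacts.
artifact_location = tempfile.mkdtemp()
# Map each distinct value in column 'x' to an integer vocabulary index.
compute_and_apply_vocabulary_fn = ComputeAndApplyVocabulary(columns=['x'])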
with beam.Pipeline() as p:
  transformed_data = (
      p
      | beam.Create(data)
      | MLTransform(write_artifact_location=artifact_location).with_transform(
          compute_and_apply_vocabulary_fn)
      | beam.Map(print))

Expected output

Row(x=array([4]))
Row(x=array([1]))
Row(x=array([0]))
Row(x=array([0]))
Row(x=array([2]))
Row(x=array([3]))

Actual output

Row(x=array([4]))
Row(x=array([1]))
Row(x=array([0]))
Row(x=array([2]))
Row(x=array([3]))
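
The title of the fix, "Use UUIDs instead of object hashes to avoid collisions" (#29542), points to the likely cause: if intermediate results are keyed by a hash of the element, two identical elements get the same key and one overwrites the other. The sketch below is plain Python that only illustrates that effect. It is not the actual MLTransform code, and `key_by_hash` and `key_by_uuid` are hypothetical names.

```python
import uuid


def key_by_hash(rows):
    """Key each row by a hash of its contents; identical rows share one key."""
    keyed = {}
    for row in rows:
        # Identical rows produce the same hash, so the later row overwrites the earlier one.
        keyed[hash(tuple(sorted(row.items())))] = row
    return keyed


def key_by_uuid(rows):
    """Key each row by a fresh UUID; every row keeps its own key."""
    return {uuid.uuid4().hex: row for row in rows}


data = [{'x': 'I'}, {'x': 'love'}, {'x': 'Beam'}, {'x': 'Beam'}, {'x': 'is'}, {'x': 'awesome'}]
print(len(key_by_hash(data)))  # 5: the second 'Beam' collapses into the first
print(len(key_by_uuid(data)))  # 6: all elements survive
```

The counts match the five rows in the actual output and the six rows in the expected output above.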

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

AnandInguva added the bug and awaiting triage labels Dec 4, 2023

AnandInguva added this to the 2.53.0 Release milestone Dec 4, 2023

AnandInguva mentioned this issue Dec 4, 2023 in "Use UUIDs instead of object hashes to avoid collisions" #29542 (merged, 3 tasks)

github-actions bot added the python and P2 labels Dec 4, 2023

AnandInguva closed this as completed in #29542 Dec 4, 2023

damccorm added the "done & done" label (issue has been reviewed after it was closed for verification, followups, etc.) Dec 5, 2023
