Add DeduplicationByUniqueId transform#10972

Closed
boyuanzz wants to merge 3 commits into apache:master from boyuanzz:dedup

@boyuanzz boyuanzz commented Feb 26, 2020

R: @lukecwik @robertwb



Contributor

Perhaps call this "DeduplicateByKey"?

Member

I would call it ChoosePerKey or something like that, since what it actually does is choose an arbitrary element for each key. Test cases should include different values for the same key (this will require flexible result matching).
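A minimal sketch of the "choose an arbitrary element per key" semantics and of the flexible matching such a test would need. It is written in plain Python rather than against Beam, and the names `choose_per_key` and `is_valid_result` are hypothetical, not part of this PR:

```python
from collections import defaultdict


def choose_per_key(pairs):
    """Keep one value per key. Which one is arbitrary; here, the first seen."""
    chosen = {}
    for key, value in pairs:
        chosen.setdefault(key, value)
    return chosen


def is_valid_result(pairs, result):
    """Flexible matcher: each input key appears once, with any of its values."""
    allowed = defaultdict(set)
    for key, value in pairs:
        allowed[key].add(value)
    return result.keys() == allowed.keys() and all(
        result[key] in allowed[key] for key in result)


# Key "a" has two different values, so either one is an acceptable output.
pairs = [("a", 1), ("a", 2), ("b", 3)]
result = choose_per_key(pairs)
assert is_valid_result(pairs, result)
```

The matcher accepts `{"a": 1, "b": 3}` and `{"a": 2, "b": 3}` alike, which is what makes the test robust to the runner's choice of element.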

Contributor Author

I think this transform rests on an assumption: a unique ID always maps to one specific value, so any two elements with the same unique ID also have the same value. That is why we can deduplicate values by this ID. One typical use is reading from a message queue in which every message is paired with a unique ID.

Member

I understand. It does allow inconsistency, but there is not much risk, since anyone will immediately see that a key might have two values and that they have to be careful. FWIW, in Java it is arranged slightly differently, as Distinct.withRepresentativeValueFn. But that approach is not as cross-language-friendly as precomputing the keys.

Contributor

I think if the user wants to specify things in terms of windows and triggers, it would be more natural to manually do windowing before this operation. Instead, perhaps higher-level semantic information could be provided (namely, over what interval should the deduplication occur), and windowing/triggering should be used to accomplish this.

boyuanzz changed the title from "[WIP] Add DeduplicationByUniqueId transform" to "Add DeduplicationByUniqueId transform" on Feb 29, 2020

Member

lukecwik left a comment

As discussed in person, mostly left comments for myself.

| core.WindowInto(
Sessions(self._session_size),
trigger=trigger.AfterCount(1),
accumulation_mode=trigger.AccumulationMode.DISCARDING)
Member

We need to be using an accumulating trigger.

Sessions(self._session_size),
trigger=trigger.AfterCount(1),
accumulation_mode=trigger.AccumulationMode.DISCARDING)
| core.CombinePerKey(self._DeduplicationCombineFn())
Member

This should be a GBK followed by a ParDo that only emits elements if it is the first pane.
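A rough sketch of the suggested shape, modeled in plain Python rather than Beam. The `Pane` class is a hypothetical stand-in for one trigger firing of a grouped key, and its `is_first` flag stands in for the pane information that the diff reads through `core.DoFn.PaneInfoParam`. The pane contents assume an accumulating trigger that fires after every element:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Pane:
    """Hypothetical model of one firing of a group-by-key for a single key."""
    key: str
    values: List[str]
    is_first: bool


def emit_first_pane_only(panes: Iterable[Pane]) -> Iterator[Tuple[str, str]]:
    """Emit one value per key, and only when the key's first pane fires."""
    for pane in panes:
        if pane.is_first and pane.values:
            yield pane.key, pane.values[0]


# With accumulation, later panes repeat earlier elements; they are dropped.
panes = [
    Pane("id1", ["m"], is_first=True),
    Pane("id1", ["m", "m"], is_first=False),
    Pane("id2", ["n"], is_first=True),
]
deduplicated = list(emit_first_pane_only(panes))
```

Filtering on the first pane means duplicates arriving in later firings never reach the output, regardless of how many values have accumulated.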

window=core.DoFn.WindowParam,
paneinfo=core.DoFn.PaneInfoParam):
id, value = kv
yield (id, WindowedValue(value, ts, [window], paneinfo))
Member

The deduplication key should also contain the window since we don't want to remove output that may have been assigned to multiple windows. Most SDFs will produce output in the global window.
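To make the point concrete, here is a small plain-Python sketch (not Beam code) in which the deduplication key is the pair of ID and window. The function name and the representation of windows as `(start, end)` tuples are assumptions for illustration only:

```python
def dedup_by_id_and_window(elements):
    """Drop repeats of the same (id, window) pair, keeping the first one.

    `elements` is an iterable of (unique_id, value, window) tuples, where
    a window is modeled as a hashable (start, end) tuple.
    """
    seen = set()
    for unique_id, value, window in elements:
        dedup_key = (unique_id, window)
        if dedup_key not in seen:
            seen.add(dedup_key)
            yield unique_id, value, window


# The same element assigned to two windows must survive in both of them;
# only the true repeat within window (0, 10) is removed.
elements = [
    ("id1", "m", (0, 10)),
    ("id1", "m", (5, 15)),
    ("id1", "m", (0, 10)),
]
kept = list(dedup_by_id_and_window(elements))
```

Keying on the ID alone would have kept only the first tuple and silently dropped the output for window `(5, 15)`.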

boyuanzz commented Mar 5, 2020

Discussed offline. We decided to pursue the timer/state approach. Closing this PR for now.
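The thread does not spell out the timer/state design, so the following is only a guess at its general shape, modeled in plain Python rather than Beam: per-key state records that an ID has been seen, and a per-key expiry (playing the role of a timer) clears that state after a deduplication interval. The function name, the `duration` parameter, and the assumption that events arrive in timestamp order are all hypothetical:

```python
def dedup_with_expiry(events, duration):
    """Emit the first event per key, then suppress repeats for `duration`.

    `events` is an iterable of (timestamp, key, value) tuples, assumed to be
    in timestamp order. `expires_at` models the per-key state together with
    the timer that would clear it.
    """
    expires_at = {}
    for timestamp, key, value in events:
        # Model the timer firing: forget the key once its interval is over.
        if key in expires_at and timestamp >= expires_at[key]:
            del expires_at[key]
        if key not in expires_at:
            expires_at[key] = timestamp + duration
            yield timestamp, key, value


events = [(0, "a", "x"), (3, "a", "x"), (12, "a", "x"), (13, "b", "y")]
emitted = list(dedup_with_expiry(events, duration=10))
```

Here the repeat at timestamp 3 is suppressed, while the one at timestamp 12 is emitted again because the interval started at 0 has expired. Bounding the state this way also matches the earlier suggestion to specify the interval over which deduplication should occur.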

@boyuanzz boyuanzz closed this Mar 5, 2020
@boyuanzz boyuanzz deleted the dedup branch March 5, 2020 22:29