-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-38464] Introduce OrderedMultiSetState #27071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
to be able to find serializer tests
|
||
/** | ||
* Remove the given element. If there are multiple instances of the same element, remove the | ||
* first one in insertion order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious : should we allow the user to choose LIFO or FIFO for the remove ?
* The most recently added element was removed. The result will contain the element added | ||
* before it. | ||
*/ | ||
REMOVED_LAST_ADDED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see tests checking these removal result types.
* An element was removed, it was not the most recently added, there are more elements. The | ||
* result will not contain any elements | ||
*/ | ||
REMOVED_OTHER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious why nothing is returned in this case, this seems inconsistent with REMOVED_LAST_ADDED which will return the element added before it.
SizeChangeInfo append(T element, long timestamp) throws Exception; | ||
|
||
/** Get iterator over all remaining elements and their timestamps, in order of insertion. */ | ||
Iterator<Tuple2<T, Long>> iterator() throws Exception; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the multi set changes (i.e. there is a removal) under the iterator what will happen? AI unit test for this would be good. It would be useful to understand any locking that has been considered or is in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(partial review, I haven't yet reviewed linked
variant and tests).
@Internal | ||
@Experimental |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 aren't those two mutually exclusive? Class can either be Internal
and not intended to be used outside of the Flink's repo, or Experimental
/PublicEvolving
/Public
depending on the stability of the said api
private final OrderedMultiSetState<RowData> smallState; | ||
private final OrderedMultiSetState<RowData> largeState; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Code would be a bit more self-documenting if you used here:
private final ValueStateMultiSetState<RowData> smallState;
private final LinkedMultiSetState<RowData> largeState;
```
And it would also fit the Java doc description then. So either change the types to concrete classes, or rephrase the java doc?
} | ||
|
||
private boolean isEmptyCaching(OrderedMultiSetState<RowData> state) throws IOException { | ||
state.loadCache(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would move this loadCache
call out of here, to the top of the execute
method.
} else { | ||
list.set(idx, toAdd); | ||
} | ||
valuesState.update(list); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to cache writes? Or SinkUpsertMaterializer
doesn't do more than a single write per input record?
/** | ||
* Simple implementation of {@link OrderedMultiSetState} based on plain {@code ValueState<List>}. | ||
*/ | ||
class ValueStateMultiSetState implements OrderedMultiSetState<RowData> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you benchmark the current SinkUpsertMaterializer
vs new one using ValueStateMultiSetState
?
*/ | ||
@Internal | ||
@Experimental | ||
public interface OrderedMultiSetState<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be actually named SequencedMultiSetState
? Take a look at SequencedSet and SequencedCollection?
/** | ||
* Add row, replacing any matching existing ones. | ||
* | ||
* @return RowKind.UPDATE_AFTER if an existing row was replaced; INSERT otherwise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outdated doc?
final boolean append = rowSqn == null; | ||
final boolean existed = highSqn != null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does it mean append
and existed
? Either rename or add a comment?
|
||
final Long oldSqn = append ? null : rowSqn; | ||
final long newSqn = append ? (existed ? highSqn + 1 : 0) : oldSqn; | ||
final long newSize = existed ? (append ? oldSize + 1 : oldSize) : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does !append
mean the newly added record overwritten previous record? Via upsert-key?
Those three lines above are not very readable, maybe they are lacking some comments? Maybe it would be better to express it via
if (append) {
if (existed) {
// comment explaining
oldSqn = ...
newSqn = ...
newSize = ...
} else {
// comment explaining
...
} else {
...
or
if (append && existed) {
// comment explaining
oldSqn = ...
newSqn = ...
newSize = ...
else if (...) {
...
?
/** | ||
* Add the given element using a normal (non-multi) set semantics: if a matching element exists | ||
* already, replace it (the timestamp is updated). | ||
*/ | ||
SizeChangeInfo add(T element, long timestamp) throws Exception; | ||
|
||
/** Add the given element using a multi-set semantics, i.e. append. */ | ||
SizeChangeInfo append(T element, long timestamp) throws Exception; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need both? 🤔
* @see org.apache.flink.api.common.state.ValueState | ||
*/ | ||
@Internal | ||
public class LinkedMultiSetState implements OrderedMultiSetState<RowData> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have test coverage for this quite complex class. Ditto for Value
and Adaptive
versions, so probably one test suite to be executed on different implementations with adaptive configured in a couple of different ways?
Plus some dedicated tests for switchover in Adaptive
version?
Introduce
OrderedMultiSetState
and 3 implementations (map, value, adaptive) to be used in SInkUpsertMaterializerV2.Test coverage is currently provided on the operator level (#27070).
I'm planning to add lower-level unit tests later to this PR.