[BEAM-13541] More intelligent caching of CoGBK values. #16354

Merged
merged 8 commits into from
Dec 31, 2021

Conversation

robertwb
Contributor

A minimal number of elements are cached for each tag, possibly in addition to a global number of elements cached.


@reuvenlax reuvenlax changed the title [BEAM-13541] More intellegent caching of CoGBK values. [BEAM-13541] More intelligent caching of CoGBK values. Dec 25, 2021
@robertwb
Contributor Author

Run Java PreCommit

1 similar comment
@aaltay
Member

aaltay commented Dec 28, 2021

Run Java PreCommit

@aaltay
Member

aaltay commented Dec 28, 2021

FYI, reference to the previously failing precommit: https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4426/

@robertwb
Contributor Author

R: @emilymye or @aaltay or @lukecwik

Member

@aaltay aaltay left a comment

LGTM. Thank you.

}

/**
* Assigns a monotonically increasing index to each item in teh underling Reiterator.
Member

nit: s/teh/the

Contributor Author

Done.

this.unions = Iterators.peekingIterator(unions);
this.containsTag = containsTag;
// Used to keep track of what has been observed so far.
private final int[] lastObserved;
Member

Why are these 3 variables arrays? Only index 0 is used.

Contributor Author

They are single-element arrays because we want to share these values among all copies of the reiterator. Basically, they act like pointers. Added a comment to clarify.
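
To illustrate the "arrays as pointers" idea, here is a minimal sketch, not the code from this PR. `SharedCellIterator`, `furthestRead`, and `copy()` are hypothetical names invented for the example. It assumes the goal is simply to let every copy of an iterator read and update one shared counter, which a one-element `int[]` allows because all copies hold a reference to the same array.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical illustration only; not the Beam reiterator from this PR. */
final class SharedCellIterator implements Iterator<Integer> {
  private final List<Integer> source;
  // Each copy has its own position.
  private int position;
  // One-element array acting as a mutable cell; every copy holds the same array.
  private final int[] furthestRead;

  SharedCellIterator(List<Integer> source) {
    this(source, 0, new int[] {0});
  }

  private SharedCellIterator(List<Integer> source, int position, int[] furthestRead) {
    this.source = source;
    this.position = position;
    this.furthestRead = furthestRead;
  }

  /** Copies the current position but shares the furthestRead cell. */
  SharedCellIterator copy() {
    return new SharedCellIterator(source, position, furthestRead);
  }

  @Override
  public boolean hasNext() {
    return position < source.size();
  }

  @Override
  public Integer next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    Integer value = source.get(position++);
    // Record the furthest position reached by any copy.
    furthestRead[0] = Math.max(furthestRead[0], position);
    return value;
  }

  int furthestRead() {
    return furthestRead[0];
  }

  public static void main(String[] args) {
    SharedCellIterator original = new SharedCellIterator(Arrays.asList(10, 20, 30));
    SharedCellIterator copy = original.copy();
    copy.next();
    copy.next();
    // The original has not advanced, yet it sees the progress made by its copy.
    System.out.println(original.furthestRead()); // prints 2
  }
}
```

A plain `int` field would be copied by value into each new iterator, so updates made through one copy would be invisible to the others.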

Contributor

@emilymye emilymye left a comment

just noticed a few typos

@Test
@SuppressWarnings("BoxedPrimitiveEquality")
public void testCachedResults() {
// Ensure we don't fail below due to odd VM settings.
Contributor

Out of curiosity, what odd VM settings?

Contributor Author

java.lang.Integer.IntegerCache.high. Clarified to be more explicit.

Member

@lukecwik lukecwik left a comment

It doesn't look like we will respect DEFAULT_IN_MEMORY_ELEMENT_COUNT. We will have at most DEFAULT_IN_MEMORY_ELEMENT_COUNT + (NUM_TAGS - 1) * DEFAULT_MIN_ELEMENTS_PER_TAG elements in memory. This is fine, but I wanted to confirm that this was your intent.
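
As a quick worked example of the bound stated in this comment, the sketch below plugs in the two constant values that appear in the diff under review (10_000 and 100). The class `CacheBound`, the method `maxElementsInMemory`, and the choice of three tags are hypothetical and only serve the illustration.

```java
/** Hypothetical helper that evaluates the bound quoted in the review comment. */
final class CacheBound {
  // Values taken from the constants shown in the diff under review.
  static final int DEFAULT_IN_MEMORY_ELEMENT_COUNT = 10_000;
  static final int DEFAULT_MIN_ELEMENTS_PER_TAG = 100;

  /** Global budget plus the per-tag minimum for every tag but one. */
  static long maxElementsInMemory(int numTags) {
    if (numTags < 1) {
      throw new IllegalArgumentException("numTags must be positive");
    }
    return DEFAULT_IN_MEMORY_ELEMENT_COUNT
        + (long) (numTags - 1) * DEFAULT_MIN_ELEMENTS_PER_TAG;
  }

  public static void main(String[] args) {
    // With 3 tags: 10_000 + 2 * 100
    System.out.println(maxElementsInMemory(3)); // prints 10200
  }
}
```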

@@ -98,8 +104,7 @@ public CoGbkResult(
 throw new IllegalStateException(
 "union tag " + unionTag + " has no corresponding tuple tag in the result schema");
 }
-List<Object> valueList = (List<Object>) valueMap.get(unionTag);
-valueList.add(value.getValue());
+valuesByTag.get(unionTag).add(value.getValue());
 }

 if (taggedIter.hasNext()) {
Member

nit: use guard style

if (!taggedIter.hasNext()) {
  valueMap = (List) valuesByTag;
  return;
}

// If we get here, there were more elements than we can afford to
// keep in memory, so we copy the re-iterable of remaining items
// and append filtered views to each of the sorted lists computed earlier.
...

Contributor Author

There are pros and cons to this, but I agree it's a slight improvement. Done.

private static class TagIterable<T> implements Iterable<T> {
int tag;
int cacheSize;
Supplier<Boolean> forceCache;
Member

?

Contributor Author

Leftover from a refactor, removing.

@@ -59,6 +60,8 @@

private static final int DEFAULT_IN_MEMORY_ELEMENT_COUNT = 10_000;

private static final int DEFAULT_MIN_ELEMENTS_PER_TAG = 100;
Member

Based upon the code it looks like this is used as a max per tag.

Contributor Author

The idea is that we will always cache at least this many values per tag, regardless of whether DEFAULT_IN_MEMORY_ELEMENT_COUNT was "used up" for other tags. I'll clarify.
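
One way to read this policy is sketched below. It is an assumption-laden simplification, not the merged implementation: `PerTagCachePolicy`, `selectCached`, and the parallel `tags`/`values` arrays are hypothetical. An element is kept in memory while the global budget lasts. Once the budget is spent, an element is still kept if its tag has fewer than the per-tag minimum cached.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simplification of "global budget plus a per-tag minimum". */
final class PerTagCachePolicy {
  /**
   * Returns, for each tag, the values this sketch would keep in memory.
   * tags[i] is the tag of values[i]; tags must lie in [0, numTags).
   */
  static List<List<Integer>> selectCached(
      int[] tags, int[] values, int numTags, int globalLimit, int minPerTag) {
    List<List<Integer>> cached = new ArrayList<>();
    for (int t = 0; t < numTags; t++) {
      cached.add(new ArrayList<>());
    }
    int globalUsed = 0;
    for (int i = 0; i < tags.length; i++) {
      List<Integer> forTag = cached.get(tags[i]);
      if (globalUsed < globalLimit) {
        // Still within the global budget: cache unconditionally.
        forTag.add(values[i]);
        globalUsed++;
      } else if (forTag.size() < minPerTag) {
        // Budget exhausted, but this tag has not yet reached its minimum.
        forTag.add(values[i]);
      }
      // Otherwise the element is not cached and would have to be re-read.
    }
    return cached;
  }

  public static void main(String[] args) {
    int[] tags = {0, 0, 0, 0, 1, 1, 1};
    int[] values = {1, 2, 3, 4, 5, 6, 7};
    // Global budget of 3, per-tag minimum of 2.
    System.out.println(selectCached(tags, values, 2, 3, 2)); // prints [[1, 2, 3], [5, 6]]
  }
}
```

In this run tag 0 consumes the whole global budget, yet tag 1 still gets its first two values cached, for a total of 3 + (2 - 1) * 2 = 5 cached elements.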

@Test
@SuppressWarnings("BoxedPrimitiveEquality")
public void testCachedResults() {
// Ensure we don't fail below due to a non-default java.lang.Integer.IntegerCache.high setting.
Member

?

Contributor Author

The assertion of which values are cached vs. re-created relies on not-so-small integers not being cached.

Member

I understand that. I was trying to highlight the spacing issue in the comment. Also consider using a different type which isn't interned by the JVM.

Contributor Author

Yeah. Strings have their own (more difficult to reason about) interning issues, and using a non-primitive gets really verbose here, which is why I stayed with integers.
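
For readers unfamiliar with the issue discussed in this thread, the sketch below shows the boxed-integer behavior the test has to guard against. `BoxedIdentityDemo` and `sameInstance` are hypothetical names. `Integer.valueOf` is documented to always cache values from -128 to 127 and may cache others, so reference equality on larger values depends on the JVM's `java.lang.Integer.IntegerCache.high` setting.

```java
/** Hypothetical demo of reference equality on boxed integers. */
final class BoxedIdentityDemo {
  /** True if two independent boxings of the same value yield the same object. */
  @SuppressWarnings("BoxedPrimitiveEquality")
  static boolean sameInstance(int value) {
    Integer a = Integer.valueOf(value);
    Integer b = Integer.valueOf(value);
    return a == b; // deliberate reference comparison, not equals()
  }

  public static void main(String[] args) {
    // Always true: -128..127 is guaranteed to be cached.
    System.out.println(sameInstance(100));
    // Usually false with default settings, but true if the cache's upper
    // bound has been raised above this value, so no output is asserted here.
    System.out.println(sameInstance(100_000));
  }
}
```

A test that uses `==` to tell cached values from re-created ones therefore needs values above the cache's upper bound, which is why the test comment mentions the non-default setting.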

@lukecwik
Member

Run Java PreCommit

@robertwb
Contributor Author

Running into BEAM-13575.

@robertwb
Contributor Author

Run Java PreCommit

@robertwb
Contributor Author

Same failure with the unrelated org.apache.beam.runners.flink.FlinkRequiresStableInputTest.testParDoRequiresStableInput. I'm going to go ahead and merge so we can get a cherry-pick going.

@robertwb robertwb merged commit 3ab26ce into apache:master Dec 31, 2021
robertwb added a commit to robertwb/incubator-beam that referenced this pull request Dec 31, 2021
lukecwik added a commit to lukecwik/incubator-beam that referenced this pull request Dec 31, 2021
@lukecwik
Member

SpotBugs consistently fails with:

DLS 	Dead store to tail in new org.apache.beam.sdk.transforms.join.CoGbkResult(CoGbkResultSchema, Iterable, int, int)

Fixed in #16407

lukecwik added a commit that referenced this pull request Dec 31, 2021
emilymye pushed a commit to emilymye/beam that referenced this pull request Jan 4, 2022
emilymye pushed a commit to emilymye/beam that referenced this pull request Jan 5, 2022
emilymye pushed a commit to emilymye/beam that referenced this pull request Jan 5, 2022
emilymye pushed a commit to emilymye/beam that referenced this pull request Jan 5, 2022
emilymye pushed a commit to emilymye/beam that referenced this pull request Jan 5, 2022
emilymye added a commit that referenced this pull request Jan 6, 2022
#16354, #16407) (#16421)

Co-authored-by: Robert Bradshaw <robertwb@google.com>
Co-authored-by: Lukasz Cwik <lukecwik@gmail.com>
laraschmidt added a commit to laraschmidt/beam that referenced this pull request Mar 22, 2022
tushar19 added a commit to twitter-forks/beam that referenced this pull request Mar 25, 2022