
[BEAM-5987] Cache and share materialized side inputs between Spark tasks #7091

Closed

Conversation

@dmvk (Member) commented Nov 20, 2018

We should try to reuse deserialized side inputs among Spark tasks.



@dmvk dmvk force-pushed the dejv/spark_shared_cached_side_inputs branch from b997857 to 976678f Compare November 20, 2018 17:41
@dmvk (Member Author) commented Nov 20, 2018

Run Spark ValidatesRunner

Collections.synchronizedMap(new WeakHashMap<>());

/**
* Id that is consistent among executors. We cannot use stepName because of possible collisions.
Contributor

I don't get the meaning of 'consistent' here. Do you mean random (most likely distinct) even within one JVM?

Member Author

After deserialization on the executor side

Contributor

ok, I see.
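In other words (a hypothetical sketch, not the PR's actual code): the id is generated once when the instance is constructed on the driver and is serialized with it, so every executor observes the same value after deserialization, unlike stepName, which may collide across transforms.

import java.io.Serializable;
import java.util.UUID;

class SideInputHolder implements Serializable {
  // Generated once on the driver; travels with the serialized instance,
  // so all executors see the same id after deserialization.
  private final String sideInputId = UUID.randomUUID().toString();
}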

@VaclavPlajt (Contributor) left a comment

LGTM

@dmvk (Member Author) commented Nov 21, 2018

This still needs some effort, as it does not handle the case when a side input is used in different DoFns.

@mareksimunek mareksimunek force-pushed the dejv/spark_shared_cached_side_inputs branch 3 times, most recently from b804a31 to 6a22322 Compare November 27, 2018 14:46
@mareksimunek (Contributor)

Run Spark ValidatesRunner

@mareksimunek mareksimunek force-pushed the dejv/spark_shared_cached_side_inputs branch 2 times, most recently from ab6adc1 to a84bbc1 Compare November 29, 2018 16:01
@mareksimunek (Contributor)

Run Spark ValidatesRunner

@mareksimunek mareksimunek force-pushed the dejv/spark_shared_cached_side_inputs branch from a84bbc1 to 80e147c Compare December 3, 2018 13:19
* Side inputs are stored in {@link Cache} with weakValues so if there is no reference to a value,
* sideInput is garbage collected.
*/
public class SideInputStorage {
@iemejia (Member) commented Dec 10, 2018

Make this package-private, and the same for the constructor; keep access as tight as needed.

Contributor

done
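For context, a minimal sketch of the weak-values cache the javadoc above describes, assuming plain Guava (the PR uses Beam's vendored copy) and a simplified String key:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

class WeakValuesCacheSketch {
  // weakValues() holds cached values via WeakReferences: once no strong
  // reference to a materialized side input remains outside the cache,
  // GC may reclaim it and the entry disappears.
  private static final Cache<String, Object> CACHE =
      CacheBuilder.newBuilder().weakValues().build();
}

As the later comments show, this turned out to be too eager: the value becomes collectible as soon as the last reader drops its reference.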

* Keep references for the whole lifecycle of CachedSideInputReader otherwise sideInput needs to
* be de-serialized again.
*/
private Set<?> sideInputReferences = new HashSet<>();
Member

Aren't values ever removed from here? Or am I misreading this one? It seems like it can overflow and even prevent SideInputStorage from being GCed.

Contributor

References were removed because a different solution was used; more details below.

@iemejia (Member) commented Dec 10, 2018

It would be nice to test that this behaves as expected and does not leak (not being GCed) and does not rematerialize.

@iemejia (Member) left a comment

Left some comments. Given the potential issue of growing memory use as a side effect, it would be really nice to add some test(s).

@mareksimunek mareksimunek force-pushed the dejv/spark_shared_cached_side_inputs branch from 0b542a9 to cc404ac Compare January 7, 2019 14:30
@mareksimunek (Contributor) commented Jan 7, 2019

After trying several approaches we decided to keep it simple and go with expireAfterAccess to drop values from the cache.

The solution with weak values didn't bring the desired behavior: when MultiDoFnFunction finished on an executor, the de-serialized side input was immediately garbage collected, because its references were lost once the CachedSideInputReader reached end of life. A side input stayed in the cache only while multiple MultiDoFnFunction runs overlapped. In our case a single JVM de-serialized the same side input up to 10 times.

With expireAfterAccess the side input is de-serialized only once. I chose a 5-minute eviction duration as the best compromise, but I am open to discussion on whether it should be configurable.

A disadvantage of the expireAfterAccess solution could be potentially higher memory consumption if SideInputStorage isn't accessed for a long time, so nothing can be evicted. I don't know how to recognize when MultiDoFnFunction has finished so that I could call cache.cleanup() to trigger eviction of expired items. I'm also not sure whether this is even a problem.
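A minimal sketch of the chosen cache setup, assuming plain Guava (the PR uses Beam's vendored copy) and a simplified String key in place of the PR's <view, window> key:

import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

class ExpiringSideInputCacheSketch {
  // Entries are dropped 5 minutes after their last access, so a side
  // input is de-serialized once per executor JVM and reused by all tasks
  // that touch it within that window, without pinning memory forever.
  private static final Cache<String, Object> CACHE =
      CacheBuilder.newBuilder()
          .expireAfterAccess(5, TimeUnit.MINUTES)
          .build();
}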

@mareksimunek (Contributor)

Run Spark ValidatesRunner


@mareksimunek mareksimunek force-pushed the dejv/spark_shared_cached_side_inputs branch from 8c5c5b0 to 72aee8e Compare January 11, 2019 15:18
@mareksimunek (Contributor)

Run Java PreCommit

@VaclavPlajt VaclavPlajt force-pushed the dejv/spark_shared_cached_side_inputs branch from 72aee8e to 0e32d96 Compare January 14, 2019 11:15
@iemejia (Member) commented Jan 16, 2019

Run Spark ValidatesRunner

@iemejia (Member) left a comment

I left two comments on possible issues that I am not 100% sure are correct with this PR.
Also pinging @amitsela to see if he has something to say, in particular on the static state. Thanks!

class SideInputStorage {

/** JVM deserialized side input cache. */
private static final Cache<Key<?>, Optional<?>> materializedSideInputs =
Member

I am a bit worried about the possible consequences of a collision of the Key<view, window> tuple, in particular if a bad implementation of equals is around. This is not specific to this PR, but since the state is now static, the likelihood of this happening is bigger.

Member

How do other runners cache side inputs? By which key? This sounds like something the SDK could provide guidance on (@kennknowles).

Member

The window should use windowCoder.structuralValue(window) which is required to behave identically to a full serialization for shuffle purposes. The view itself is just a tag so it should have good enough equals as-is. There is already caching in the now-donated Dataflow Java worker, if you look through uses and subclasses of SideInputReader.

Contributor

I also think you have bigger troubles if you have a collision in view or window (the Spark runner relies on that in more places), so I will leave it as it is. Is that ok?
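For reference, a minimal sketch of the keying Kenn describes above; the class and member names are hypothetical, not the PR's actual Key implementation:

import java.util.Objects;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.PCollectionView;

final class SideInputKey {
  private final PCollectionView<?> view; // a tag; its equals is usable as-is
  private final Object windowStructuralValue;

  SideInputKey(PCollectionView<?> view, BoundedWindow window, Coder<BoundedWindow> windowCoder) {
    this.view = view;
    // structuralValue is required to compare equal exactly when the
    // encoded forms of the windows are equal.
    this.windowStructuralValue = windowCoder.structuralValue(window);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof SideInputKey)) {
      return false;
    }
    SideInputKey that = (SideInputKey) o;
    return view.equals(that.view)
        && windowStructuralValue.equals(that.windowStructuralValue);
  }

  @Override
  public int hashCode() {
    return Objects.hash(view, windowStructuralValue);
  }
}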

@@ -86,9 +55,27 @@ private CachedSideInputReader(SideInputReader delegate) {
@Override
public <T> T get(PCollectionView<T> view, BoundedWindow window) {
@iemejia (Member) commented Jan 16, 2019

I am worried about the possible semantic consequences of CachedSideInputReader.get() returning a null value when it is not in the Cache. Wouldn't it imply that a window could get an empty side input assigned?
The documentation on this is not really clear (pinging @kennknowles to see if I am misreading it).
I wonder if there is a test to validate that this cannot happen, or if we can create one somehow?

Member

Ah, the meaning of that comment is that null is a value. You can have a PCollection<@Nullable Foo> that contains just one copy of null and use View.asSingleton() and the side input returns the null.

In other words, get must return a value of type T. But the type T may itself be @Nullable Something. The annotation on SideInputReader should be removed. It is incorrect if we use a static analysis that understands this. Findbugs does not understand this, but we should aspire for our annotations to be correct so the documentation is clear.

@@ -85,6 +88,13 @@ private SideInputBroadcast createBroadcastHelper(
PCollectionView<?> view, JavaSparkContext context) {
Tuple2<byte[], Coder<Iterable<WindowedValue<?>>>> tuple2 = pviews.get(view);
SideInputBroadcast helper = SideInputBroadcast.create(tuple2._1, tuple2._2);
String pCollectionName =
view.getPCollection() != null ? view.getPCollection().getName() : "UNKNOWN";
LOG.info(
Member

Maybe LOG.debug?

Contributor

changed to debug

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import com.google.common.cache.Cache;
Member

Vendored Guava?

Contributor

fixed to vendored

SizeEstimator.estimate(result));
return Optional.ofNullable(result);
});
return optionalResult.orElse(null);
Member

Ismaël is right. The delegate.get is not @Nullable at this place in the abstraction. You don't need to check for it (unless there's some other bug somewhere) and you shouldn't convert Optional.absent() to null.

Contributor

I chose this solution because the Guava Cache doesn't allow null values, and I didn't realize I would break the semantic meaning. I will try to find a different solution.

Member

Ah, thank you for clarifying. That is a good attempt. The problem is that these will incorrectly be turned into the same thing:

Optional.ofNullable(null).orElse(null) == null

Optional.ofNullable(Optional.absent()).orElse(null) == null

The fact that Optional.of(null) throws NPE is a mistake in the design (both Java and Guava). Maybe the point of the design is to convince people to not use null, which is a billion dollar good idea. But it makes Optional<T> not correctly parametric in T.

I think that if you actually convert null into Optional.of(Optional.absent()) and other values v into Optional.of(Optional.of(v)) you can simulate the behavior it should have had in the first place. Or you could make your own little replacement of Optional.

Contributor

Thanks for the suggestions; the Optional combo would not be very readable.
I made my own wrapper that simply wraps the value, so I can put null into the cache.
https://github.com/apache/beam/pull/7091/files#diff-b123f0f1ca9646966a641a458b74cfbcR92
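A minimal sketch of such a wrapper, with a hypothetical name (the linked diff contains the actual implementation): the holder object itself is always non-null, so it can be stored in a Guava Cache even when the wrapped side-input value is null.

final class CachedValue<T> {
  private final T value; // may legitimately be null

  private CachedValue(T value) {
    this.value = value;
  }

  static <T> CachedValue<T> of(T value) {
    return new CachedValue<>(value);
  }

  T get() {
    return value;
  }
}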

@mareksimunek mareksimunek force-pushed the dejv/spark_shared_cached_side_inputs branch from 0e32d96 to ddfe7dd Compare January 22, 2019 16:17
@iemejia (Member) left a comment

LGTM, will do some minor touches and rebase manually to merge. Thanks a lot @mareksimunek and @dmvk !

@iemejia iemejia changed the title [BEAM-5987] Spark: Share cached side inputs between tasks. [BEAM-5987] Cache and share materialized side inputs between Spark tasks Feb 8, 2019
iemejia added a commit that referenced this pull request Feb 8, 2019
@iemejia (Member) commented Feb 8, 2019

Merged!
