
[SPARK-15074][Shuffle] Cache shuffle index file to speedup shuffle fetch #12944

Closed (wants to merge 1 commit)

Conversation

sitalkedia

What changes were proposed in this pull request?

Shuffle fetch on a large intermediate dataset is slow because the shuffle service opens and closes the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch.
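For context, a minimal sketch of this kind of index cache, based on Guava's LoadingCache (which the patch imports); the class name, method names, and file-format handling below are illustrative assumptions, not the exact code in this PR:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

// Illustrative cache keyed by the index file; each entry holds all offsets
// read from that file, so later block fetches never reopen it.
class IndexCacheSketch {
  private final LoadingCache<File, long[]> indexCache;

  IndexCacheSketch(int maxEntries) {
    indexCache = CacheBuilder.newBuilder()
        .maximumSize(maxEntries)
        .build(new CacheLoader<File, long[]>() {
          @Override
          public long[] load(File indexFile) throws IOException {
            // A shuffle index file is a sequence of longs: one offset per
            // partition plus a final end offset.
            long[] offsets = new long[(int) (indexFile.length() / 8)];
            try (DataInputStream in = new DataInputStream(new FileInputStream(indexFile))) {
              for (int i = 0; i < offsets.length; i++) {
                offsets[i] = in.readLong();
              }
            }
            return offsets;
          }
        });
  }

  // Offset and length of one reducer's block, served from memory on cache hits.
  long[] getOffsetAndLength(File indexFile, int reduceId) throws ExecutionException {
    long[] offsets = indexCache.get(indexFile);
    return new long[] { offsets[reduceId], offsets[reduceId + 1] - offsets[reduceId] };
  }
}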

How was this patch tested?

Tested by running a job on the cluster and the shuffle read time was reduced by 50%.

@AmplabJenkins

Can one of the admins verify this patch?

@holdenk (Contributor) commented May 6, 2016

So a very minor style thing: it seems like the rest of the configuration values are exposed through accessor methods on TransportConf rather than calling getInt directly; it might be better to expose this in the same way as serverThreads() or numConnectionsPerPeer()?
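For illustration, the accessor-method style being suggested might look like the following; the method name, config key, and default are assumptions, not code from this patch:

// Hypothetical accessor on TransportConf, in the style of serverThreads()
// or numConnectionsPerPeer(); the key and default below are illustrative.
public int indexCacheEntries() {
  return getInt("spark.shuffle.service.index.cache.size", 1024);
}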

public long getLength() {
  return length;
}
}
@HyukjinKwon (Member) commented May 6, 2016


And a newline at the end of this file maybe.

@sitalkedia (Author)

will fix, thanks

Contributor

The convention seems to be to not have newlines at the end of files.

@HyukjinKwon (Member) commented Jul 19, 2016

It seems it was rebased but I guess I meant below:

-}
+}
\ No newline at end of file

meaning

[screenshot of the diff showing no newline at the end of the file]

EDITED: Oh, I just found https://github.com/apache/spark/blob/master/scalastyle-config.xml#L281-L282 and https://github.com/apache/spark/blob/master/scalastyle-config.xml#L117

@sitalkedia (Author)

fixed.

@sitalkedia (Author)

@holdenk - TransportConf is not specific to the shuffle service; it is used to create the transport client in other modules as well. Since the number of index cache entries is very specific to the ShuffleService, I did not want to expose that as an API in TransportConf. Let me know what you think about it.

@sitalkedia (Author)

cc - @rxin

@sitalkedia (Author)

cc - @srowen

@sitalkedia (Author)

Can anyone take a look at it?
cc - @rxin

ByteBuffer buffer = ByteBuffer.allocate(size);
DataInputStream dis = new DataInputStream(new FileInputStream(indexFile));
dis.readFully(buffer.array());
dis.close();
Contributor

close() in finally block?

@sitalkedia (Author)

done
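For reference, a minimal sketch of the suggested fix, assuming the same variables as in the snippet above; try-with-resources would work equally well:

ByteBuffer buffer = ByteBuffer.allocate(size);
DataInputStream dis = new DataInputStream(new FileInputStream(indexFile));
try {
  dis.readFully(buffer.array());
} finally {
  dis.close();  // closed even if readFully throws
}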

@@ -95,6 +109,9 @@ public ExternalShuffleBlockResolver(TransportConf conf, File registeredExecutorF
       Executor directoryCleaner) throws IOException {
     this.conf = conf;
     this.registeredExecutorFile = registeredExecutorFile;
+    int indexCacheEntries = conf.getInt(SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES,
+        DEFAULT_SPARK_SHUFFLE_SERVICE_INDEX_CACHE_ENTRIES);
+    this.shuffleIndexCache = new ShuffleIndexCache(indexCacheEntries);
Contributor

= new ShuffleIndexCache(conf.getInt("spark.shuffle.service.index.cache.size", 1024))

@sitalkedia (Author)

done.

@@ -53,7 +53,7 @@ private[spark] class CoarseGrainedExecutorBackend(
   private[this] val ser: SerializerInstance = env.closureSerializer.newInstance()

   override def onStart() {
-    logInfo("Connecting to driver: " + driverUrl)
+    logInfo("Connecting to driver, skedia1: " + driverUrl)
Contributor

?

@sitalkedia (Author)

my bad, removed it.

@ericl (Contributor) commented Jul 20, 2016

lgtm @JoshRosen

@sitalkedia (Author)

@JoshRosen, can you take a look?

-      long offset = in.readLong();
-      long nextOffset = in.readLong();
+      ShuffleIndexInformation shuffleIndexInformation = shuffleIndexCache.get(indexFile);
+      ShuffleIndexRecord shuffleIndexRecord = shuffleIndexInformation.getIndex(reduceId);
Contributor

It turns out that this call will fail with ArrayIndexOutOfBoundsException if the reduceId is too large. In the old code, an invalid reduceId would lead to an IOException because we'd skip past the end of the input stream and try to read.

However, I don't think that this subtle change in behavior is going to necessarily cause problems from the caller's perspective since ArrayIndexOutOfBoundsException is also a RuntimeException and this code was already throwing RuntimeException in the "index file is missing" error case. Therefore, this looks good to me!
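For illustration, getIndex presumably does an array lookup over the cached offsets along these lines, which is why an out-of-range reduceId now surfaces as ArrayIndexOutOfBoundsException rather than an IOException; the offsets field and the ShuffleIndexRecord constructor signature below are assumptions:

// Hypothetical lookup into the in-memory offsets array; offsets[reduceId] and
// offsets[reduceId + 1] bound the block, so a reduceId past the end of the
// array throws ArrayIndexOutOfBoundsException instead of failing while
// reading past the end of a stream.
public ShuffleIndexRecord getIndex(int reduceId) {
  long offset = offsets[reduceId];
  long nextOffset = offsets[reduceId + 1];
  return new ShuffleIndexRecord(offset, nextOffset - offset);
}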

@JoshRosen (Contributor)

LGTM. I suppose we could also add similar functionality to the non-shuffle-service version of IndexShuffleBlockResolver, but I think that's a much lower priority because I suspect that most folks who want to optimize production shuffle performance will be using the external shuffle service anyways.

I've re-tested this locally and have confirmed that it still compiles and passes relevant tests, so I'm going to merge this to master. Thanks @sitalkedia!

@asfgit closed this in 9c15d07 on Aug 4, 2016
import com.google.common.cache.LoadingCache;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import sun.nio.ch.IOUtil;
Contributor

Doh, it looks like this unused import is breaking things. Let me hotfix to remove it.

JoshRosen added a commit to JoshRosen/spark that referenced this pull request Aug 4, 2016
@JoshRosen (Contributor)

Hotfixing in #14499 to fix the build. My bad.

asfgit pushed a commit that referenced this pull request Aug 4, 2016
Author: Josh Rosen <joshrosen@databricks.com>

Closes #14499 from JoshRosen/hotfix.
@sitalkedia (Author)

Thanks @JoshRosen !
