
SPARK-5199. FS read metrics should support CombineFileSplits and track bytes from all FSs #4050

Closed
wants to merge 4 commits

Conversation

@sryza (Contributor) commented Jan 14, 2015

...mbineFileSplits

@SparkQA commented Jan 14, 2015

Test build #25567 has started for PR 4050 at commit 9962dd0.

  • This patch merges cleanly.

@@ -219,6 +220,9 @@ class HadoopRDD[K, V](
    val bytesReadCallback = if (split.inputSplit.value.isInstanceOf[FileSplit]) {
      SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
        split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
    } else if (split.inputSplit.value.isInstanceOf[CombineFileSplit]) {
      SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
        split.inputSplit.value.asInstanceOf[CombineFileSplit].getPath(0), jobConf)


Can you push this logic down into SparkHadoopUtil so that we don't duplicate it in two places (HadoopRDD and NewHadoopRDD)?

sryza (Contributor Author)

The issue is that those are actually two different classes. There's a CombineFileSplit for the old MR API (used by HadoopRDD) and a CombineFileSplit for the new one (used by NewHadoopRDD).


Yes, SparkHadoopUtil can check for those classes. It can have a matcher on the 4 classes (2 new and 2 old), so the call from HadoopRDD would be something like:
SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(split.inputSplit, jobConf)
Not a big deal I guess, since SparkHadoopUtil will then have four cases, but at least the logic is centralized.
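
A rough sketch of that kind of centralization, for illustration only (the helper name firstPathOfSplit is hypothetical, and this is not the code that was ultimately merged):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileSplit => OldFileSplit}
import org.apache.hadoop.mapred.lib.{CombineFileSplit => OldCombineFileSplit}
import org.apache.hadoop.mapreduce.lib.input.{CombineFileSplit => NewCombineFileSplit, FileSplit => NewFileSplit}

// One matcher over the four split classes (old- and new-API FileSplit and
// CombineFileSplit). The old-API cases are listed first so they win if the
// class hierarchies overlap.
def firstPathOfSplit(split: AnyRef): Option[Path] = split match {
  case s: OldCombineFileSplit => Some(s.getPath(0))
  case s: OldFileSplit        => Some(s.getPath)
  case s: NewCombineFileSplit => Some(s.getPath(0))
  case s: NewFileSplit        => Some(s.getPath)
  case _                      => None
}

With something like this, HadoopRDD and NewHadoopRDD would only pass their raw split to SparkHadoopUtil instead of repeating the instanceOf checks.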

@ksakellis

This mostly LGTM. My only concern is the proliferation of copy-pasted code between HadoopRDD and NewHadoopRDD.

@SparkQA commented Jan 14, 2015

Test build #25567 has finished for PR 4050 at commit 9962dd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25567/

@@ -219,6 +220,9 @@ class HadoopRDD[K, V](
    val bytesReadCallback = if (split.inputSplit.value.isInstanceOf[FileSplit]) {
      SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
        split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
    } else if (split.inputSplit.value.isInstanceOf[CombineFileSplit]) {
      SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(

Contributor


Is it guaranteed that all paths in the CombineFileSplit have the same filesystem?

Also, one related question after digging around a bit more: Hadoop's FileSystem.getAllStatistics() returns a list in which one filesystem can only be distinguished from another by its scheme. What happens if two different hdfs:// filesystems are read from within the same thread (for instance, if two HadoopRDDs are coalesced)? Is the assumption that this will never happen?
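
For reference, the scheme-based lookup under discussion amounts to something like the sketch below (the helper name statsForPath is mine; it just mirrors the filtering SparkHadoopUtil does at this point in the PR):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.JavaConverters._

// Statistics entries are keyed only by URI scheme, so two distinct hdfs://
// filesystems read on the same thread would both match the single "hdfs"
// entry, and their byte counts could not be told apart.
def statsForPath(path: Path, conf: Configuration): Seq[FileSystem.Statistics] = {
  val scheme = path.getFileSystem(conf).makeQualified(path).toUri.getScheme
  FileSystem.getAllStatistics.asScala.filter(_.getScheme == scheme)
}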

@SparkQA commented Jan 20, 2015

Test build #25848 has started for PR 4050 at commit ff8a4cb.

  • This patch merges cleanly.

@SparkQA commented Jan 20, 2015

Test build #25848 has finished for PR 4050 at commit ff8a4cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25848/

@shenh062326 (Contributor)

If we use an InputFormat whose splits are not instances of org.apache.hadoop.mapreduce.lib.input.{CombineFileSplit, FileSplit}, then we can't get any input metrics.

    case _ => None
    val bytesReadCallback = inputMetrics.bytesReadCallback.orElse {
      val inputSplit = split.inputSplit.value
      if (inputSplit.isInstanceOf[FileSplit] || inputSplit.isInstanceOf[CombineFileSplit]) {
Contributor

This is fine as is, but FYI you can do the same thing with a pattern match:

split.inputSplit.value match {
   case _: FileSplit | _: CombineFileSplit => SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(jobConf)
   case _ => None
}

sryza (Contributor Author)

Ah, yours looks prettier, will switch it.

@sryza (Contributor Author) commented Jan 26, 2015

If we use an InputFormat whose splits are not instances of org.apache.hadoop.mapreduce.lib.input.{CombineFileSplit, FileSplit}, then we can't get any input metrics.

This is the desired behavior. The input metrics are currently only able to track the bytes read from a Hadoop-compatible file system. Many InputFormats (e.g. DBInputFormat) don't read from Hadoop-compatible file systems, so reporting "bytes read" would be misleading.

@SparkQA commented Jan 26, 2015

Test build #26109 has started for PR 4050 at commit 0d504f1.

  • This patch merges cleanly.

    val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
    val scheme = qualifiedPath.toUri().getScheme()
    val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
  private def getFileSystemThreadStatistics(conf: Configuration): Seq[AnyRef] = {
Contributor

Does this need to take a conf object anymore? Can we just remove the confs throughout this call stack?
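
As a minimal sketch of where that leads (the function name is mine, not the patch's), once the per-scheme filtering is gone the lookup no longer needs a Path, and arguably no Configuration either:

import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Sum the bytes read on the current thread across every registered
// FileSystem. Statistics.getThreadStatistics() only exists on newer Hadoop
// releases (later in this thread: 2.5 has it, 2.3 does not), so real code
// has to guard the call.
def bytesReadOnThread(): Long =
  FileSystem.getAllStatistics.asScala.map(_.getThreadStatistics.getBytesRead).sum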

@SparkQA commented Jan 26, 2015

Test build #26109 has finished for PR 4050 at commit 0d504f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26109/

@pwendell (Contributor)

Looking good, added some comments.

One thing: could we change the title here to reflect the actual change (maybe we could even open a new JIRA or something)? This is now a broader change, something like "Track Hadoop FS reads from all filesystems accessed inside a task".

Second, we don't actually add a test here for the case that was reported, namely the use of CombineFileSplits. Can we explicitly test that to make sure we don't regress behavior?

sryza changed the title from "SPARK-5199. Input metrics should show up for InputFormats that return Co..." to "SPARK-5199. FS read metrics should support CombineFileSplits and track bytes from all FSs" on Jan 26, 2015
@sryza (Contributor Author) commented Jan 26, 2015

Edited the JIRA title and added tests for the CombineFileSplits. Tested both against Hadoop 2.3 (which doesn't support getFSBytesReadCallback) and Hadoop 2.5 (which does).
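
Since both Hadoop 2.3 and 2.5 are supported by a single build, the per-thread statistics API has to be reached via reflection (the getFileSystemThreadStatisticsMethod frame in the stack trace further down shows this). A rough, hypothetical sketch of such a probe, with my own names rather than the actual SparkHadoopUtil code:

import java.lang.reflect.Method
import org.apache.hadoop.fs.FileSystem
import scala.util.Try

// Probe for FileSystem.Statistics#getThreadStatistics, which Hadoop 2.5
// provides but Hadoop 2.3 does not. Returns None on older versions, so a
// caller can simply skip installing the bytes-read callback there.
def threadStatsMethod: Option[Method] =
  Try(classOf[FileSystem.Statistics].getDeclaredMethod("getThreadStatistics")).toOption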

@SparkQA commented Jan 26, 2015

Test build #26124 has started for PR 4050 at commit 864514b.

  • This patch merges cleanly.

@SparkQA commented Jan 27, 2015

Test build #26124 has finished for PR 4050 at commit 864514b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26124/

@pwendell (Contributor)

Cool - thanks Sandy!

asfgit closed this in b1b35ca on Jan 27, 2015
@rxin (Contributor) commented Feb 2, 2015

Hey FYI, I think this patch is causing some problems. I got the following exception today:

15/02/02 01:12:20 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, ip-172-31-13-57.us-west-2.compute.internal): java.lang.ClassNotFoundException: org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatisticsMethod(SparkHadoopUtil.scala:180)
    at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:120)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:118)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:117)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:281)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:248)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:281)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:248)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:281)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:248)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

@sryza (Contributor Author) commented Feb 2, 2015

@rxin this should be fixed by SPARK-5492
