
[SPARK-23270][Streaming][WEB-UI]FileInputDStream Streaming UI 's records should not be set to the default value of 0, it should be the total number of rows of new files. #20437

Closed
wants to merge 2 commits

Conversation

guoxiaolongzte

What changes were proposed in this pull request?

FileInputDStream's record count in the Streaming UI should not be set to the default value of 0; it should be the total number of rows of the new files.
---------------------------------------- FileInputDStream.scala ----------------------------------------
val inputInfo = StreamInputInfo(id, 0, metadata)  // numRecords is hard-coded to the default value of 0
ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

case class StreamInputInfo(
    inputStreamId: Int, numRecords: Long, metadata: Map[String, Any] = Map.empty)
---------------------------------------- end FileInputDStream.scala ----------------------------------------

---------------------------------------- DirectKafkaInputDStream.scala ----------------------------------------
val inputInfo = StreamInputInfo(id, rdd.count, metadata)  // numRecords is set to rdd.count
ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

case class StreamInputInfo(
    inputStreamId: Int, numRecords: Long, metadata: Map[String, Any] = Map.empty)
---------------------------------------- end DirectKafkaInputDStream.scala ----------------------------------------
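For context, a rough sketch of the intended change in FileInputDStream (a simplification, not the final patch; it assumes id, metadata, validTime and the RDDs built from the batch's new files, here called rdds, are in scope as in the quoted code):

// Before: the record count reported to the Streaming UI is hard-coded to 0.
val inputInfo = StreamInputInfo(id, 0, metadata)

// Proposed: report the total number of rows across the new files,
// analogous to what DirectKafkaInputDStream reports via rdd.count.
val proposedInputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata)
ssc.scheduler.inputInfoTracker.reportInfo(validTime, proposedInputInfo)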

test method:
./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark/tmp/

fix after: [screenshot of the Streaming UI after the fix, omitted]

How was this patch tested?

manual tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

[SPARK-23270][Streaming][WEB-UI]FileInputDStream Streaming UI 's records should not be set to the default value of 0, it should be the total number of rows of new files.
@@ -157,7 +157,9 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
     val metadata = Map(
       "files" -> newFiles.toList,
       StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
-    val inputInfo = StreamInputInfo(id, 0, metadata)
+    var numRecords = 0L

Member

I'm not sure if this change is correct, but you should write it as rdds.map(_.count).sum.

guoxiaolongzte commented Jan 31, 2018

Thanks for your review.

@@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
     val metadata = Map(
       "files" -> newFiles.toList,
       StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
-    val inputInfo = StreamInputInfo(id, 0, metadata)
+    val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata)

Contributor

This will kick off a new Spark job to read the files and count the records, which introduces obvious overhead, whereas count in DirectKafkaInputDStream only calculates offsets.
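For illustration only (this is a sketch, not the actual Spark source): the Kafka count can be computed on the driver from offset ranges alone, while counting file-based RDDs means running a job that reads every new file.

// Kafka side: the record count is pure offset arithmetic, no job is launched.
case class OffsetRange(fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}
val offsetRanges = Seq(OffsetRange(0L, 100L), OffsetRange(100L, 250L))
val kafkaNumRecords = offsetRanges.map(_.count).sum  // 250, computed on the driver

// File side: counting the new files' RDDs launches a real Spark job that scans the data.
// val fileNumRecords = rdds.map(_.count).sum        // reads every new file a second time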

Author

Because of this small overhead, 'Records' should just not be recorded? This is an obvious bug.

Contributor

This is not a small overhead. The change will read/scan all of the new files, which is a big overhead for a streaming application (the data is unnecessarily read twice).

Author

I see what you mean. I'll try to make it read the data only once. Can you give me some ideas?

Contributor

I'm not sure if there's a solution to fix it here.

Author

What about asynchronous processing? It would not affect the main path of the streaming job, and we could still get the number of records.
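A minimal sketch of that asynchronous idea (hypothetical, not part of this PR; countNewFiles stands in for rdds.map(_.count).sum):

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

// Stand-in for rdds.map(_.count).sum, i.e. the job that scans the new files.
def countNewFiles(): Long = { Thread.sleep(100); 42L }

// The count is kicked off outside the main path of batch generation...
val futureCount: Future[Long] = Future(countNewFiles())

// ...but reportInfo needs numRecords when the batch is reported, so something
// still has to wait for (or drop) this result.
val numRecords: Long = Await.result(futureCount, 10.seconds)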

Contributor

I'm not in favor of such a change. No matter whether the processing is sync or async, because reportInfo is invoked here, you have to wait for the counting to finish.

Anyway, I think reading the data twice is unacceptable for a streaming scenario (even for a batch scenario). I guess the previous code set it to 0 intentionally.

Author

What if we add a switch parameter whose default value is false?

If it is set to true, the new files are counted (read again), so that the records can be reported correctly.

Of course, when the parameter is set to true, streaming performance will be affected.
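A rough sketch of what such a switch could look like (the configuration key below is made up for illustration, it is not an existing Spark setting; id, metadata, validTime and rdds are assumed to be in scope as in FileInputDStream.compute):

// Hypothetical flag, off by default so the normal path keeps reporting 0.
val countNewFileRecords = ssc.sparkContext.getConf
  .getBoolean("spark.streaming.fileStream.countRecords", false)

// Only pay for the extra scan of the new files when the user opts in.
val numRecords = if (countNewFileRecords) rdds.map(_.count).sum else 0L
val inputInfo = StreamInputInfo(id, numRecords, metadata)
ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)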

Contributor

I don't think that's a good idea. Actually, I'm inclined to leave it as it is.

Author

I am very sad. I'm looking into whether there's a better way.

@AmplabJenkins

Can one of the admins verify this patch?
