
[SPARK-28912][STREAMING] Fixed MatchError in getCheckpointFiles() #25654

Closed
wants to merge 7 commits

Conversation

@avkgh (Contributor) commented Sep 2, 2019

What changes were proposed in this pull request?

This change fixes issue SPARK-28912.

Why are the changes needed?

If the checkpoint directory is set to a name that matches the regex pattern used for checkpoint files, the logs are flooded with MatchError exceptions and old checkpoint files are not removed.
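
For context, here is a minimal sketch (not the PR's code; the path and file names are made up) of the Scala behavior underneath: Regex.findFirstIn succeeds on a substring anywhere in the input, while a match { case REGEX(...) } pattern requires the regex to match the whole string and throws MatchError otherwise. A filter on the full path can therefore pass entries whose final name later fails the pattern match.

```scala
val REGEX = ("checkpoint-" + """([\d]+)([\w\.]*)""").r

// Substring search finds "checkpoint-01" inside the *parent* directory name,
// so any entry under a directory named checkpoint-01 passes a path-based filter.
REGEX.findFirstIn("hdfs://localhost:9000/checkpoint-01/some-rdd-dir") // Some("checkpoint-01")

// A pattern match requires the whole string to match; this entry's own
// name does not, so the match throws scala.MatchError.
"some-rdd-dir" match { case REGEX(time, suffix) => (time, suffix) }
```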

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually.

  1. Start Hadoop in pseudo-distributed mode.

  2. In another terminal, run the command nc -lk 9999

  3. In the Spark shell, execute the following statements:

    val ssc = new StreamingContext(sc, Seconds(30))
    ssc.checkpoint("hdfs://localhost:9000/checkpoint-01")
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()

@dongjoon-hyun (Member)

Welcome to the Apache Spark community. Thank you for your first contribution, @avkgh.

@dongjoon-hyun (Member)

ok to test

@@ -102,7 +102,7 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
 private[streaming]
 object Checkpoint extends Logging {
   val PREFIX = "checkpoint-"
   val REGEX = (PREFIX + """([\d]+)([\w\.]*)""").r

Member:

Could you add a unit test case to streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala?
Please use the test case name SPARK-28912 Fixed MatchError in getCheckpointFiles. It should fail without your patch.

@SparkQA commented Sep 2, 2019

Test build #110022 has finished for PR 25654 at commit b990196.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -102,7 +102,7 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
 private[streaming]
 object Checkpoint extends Logging {
   val PREFIX = "checkpoint-"
-  val REGEX = (PREFIX + """([\d]+)([\w\.]*)""").r
+  val REGEX = (PREFIX + """([\d]{9,})([\w\.]*)""").r

@HyukjinKwon (Member):

I think it will technically introduce a behaviour change, since the code targets supporting checkpoint- names with numbers. Let's clarify it.

@avkgh (Contributor, Author) replied Sep 3, 2019:

The intention behind this change was to skip invalid (or perhaps too old) checkpoint files, since the numeric part of a checkpoint file name is the current time in milliseconds and therefore cannot be shorter than 9 digits.
This caused some unit tests to fail because they use ManualClock, which reports a fake time and allows generation of shorter checkpoint file names like checkpoint-2000 (where 2000 is supposedly the current time in milliseconds).
I now consider this regex change redundant, since filtering out directories and matching only the final component of a path (p.getName) should be sufficient to prevent MatchErrors.
I will revert it to fix the unit test failures.

@@ -102,7 +102,7 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
 private[streaming]
 object Checkpoint extends Logging {
   val PREFIX = "checkpoint-"
-  val REGEX = (PREFIX + """([\d]+)([\w\.]*)""").r
+  val REGEX = (PREFIX + """([\d]{9,})([\w\.]*)""").r

@srowen (Member):

This seems to make it always expect 9+ digits, whereas it currently accepts 1 or more. Your case is checkpoint-01, so I'm missing how this works?

@avkgh (Contributor, Author) replied Sep 3, 2019:

Please see my previous reply to @HyukjinKwon; I explained there what was intended.
In my case, checkpoint-01 is the checkpoint directory name, which matches the current regex accepting 1 or more digits.
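
For illustration, a quick sketch (not part of the patch) comparing the two patterns against that name:

```scala
val oldRegex = ("checkpoint-" + """([\d]+)([\w\.]*)""").r    // 1 or more digits
val newRegex = ("checkpoint-" + """([\d]{9,})([\w\.]*)""").r // 9 or more digits

oldRegex.findFirstIn("checkpoint-01") // Some("checkpoint-01")
newRegex.findFirstIn("checkpoint-01") // None: only 2 digits
```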

@srowen (Member) left a comment:

Got it. I agree with leaving out the regex change then.
I'm OK with this; all the better if you can convert your repro into a simple unit test.

@SparkQA commented Sep 3, 2019

Test build #110043 has finished for PR 25654 at commit 812d867.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

Nice catch. +1 on a unit test, since "old checkpoint files are not removed" can be asserted.

@dongjoon-hyun (Member)

Gentle ping, @avkgh.

@avkgh (Contributor, Author) commented Sep 5, 2019

I will add a unit test when I have the time; I plan to do it within the next 2 days.

@SparkQA commented Sep 6, 2019

Test build #110239 has finished for PR 25654 at commit 1835fe4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  checkpointTimes.foreach(tm => new checkpointWriter.CheckpointWriteHandler(
    Time(tm), Array.fill[Byte](10)(1), clearCheckpointDataLater = false).run())
} catch {
  case ex: MatchError => fail("Should not throw MatchError", ex)

Member:

You can just let this exception fly
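
That is, drop the try/catch and run the handlers directly; an unexpected MatchError will fail the test by itself. A sketch of the suggested shape:

```scala
// No try/catch needed: if the checkpoint cleanup hits a MatchError,
// the test fails with that exception anyway.
checkpointTimes.foreach(tm => new checkpointWriter.CheckpointWriteHandler(
  Time(tm), Array.fill[Byte](10)(1), clearCheckpointDataLater = false).run())
```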

@@ -847,6 +847,38 @@ class CheckpointSuite extends TestSuiteBase with DStreamCheckpointTester
    checkpointWriter.stop()
  }

  test("SPARK-28912: Fix MatchError in getCheckpointFiles") {
    val tempDir = Utils.createTempDir()

@srowen (Member):

You can use withTempDir { tempDir =>
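
Roughly, the helper owns the directory's lifecycle, so the manual create/delete calls go away (a sketch assuming the SparkFunSuite withTempDir helper):

```scala
withTempDir { tempDir =>
  // tempDir is created before the block and recursively deleted afterwards,
  // even if an assertion inside throws.
  val checkpointDir = tempDir.getAbsolutePath + "/checkpoint-01"
  // ... exercise Checkpoint.getCheckpointFiles(checkpointDir) ...
}
```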

@gaborgsomogyi (Contributor):

Ah @srowen, you were quicker with this, +1.

val fakeRddPath = new Path(checkpointDir, java.util.UUID.randomUUID().toString)
fakeRddPath.getFileSystem(hadoopConf).mkdirs(fakeRddPath)

val checkpointTimes = (1 to 20).map(_ * 1000)

Member:

If you like: 1000 to 20000 by 1000
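
Both expressions produce the same twenty values; the second is a plain stepped Range and skips the intermediate map:

```scala
(1 to 20).map(_ * 1000) // Vector(1000, 2000, ..., 20000)
1000 to 20000 by 1000   // Range with the same elements, no map needed
```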

@@ -847,6 +845,38 @@ class CheckpointSuite extends TestSuiteBase with DStreamCheckpointTester
    checkpointWriter.stop()
  }

  test("SPARK-28912: Fix MatchError in getCheckpointFiles") {
    val tempDir = Utils.createTempDir()

@gaborgsomogyi (Contributor):

Maybe withTempDir { tempDir =>?

val tempDir = Utils.createTempDir()
val checkpointDir = tempDir + "/checkpoint-01"

Utils.deleteRecursively(tempDir)

@gaborgsomogyi (Contributor):

Why is it needed?

new CheckpointWriter(mock(classOf[JobGenerator]), conf, checkpointDir, hadoopConf)

// Create a fake RDD checkpoint dir to emulate SparkContext.setCheckpointDir()
val fakeRddPath = new Path(checkpointDir, java.util.UUID.randomUUID().toString)

@gaborgsomogyi (Contributor):

import java.util.UUID?

assert(Checkpoint.getCheckpointFiles(checkpointDir).map(_.getName) == expectedCheckpoints)

Utils.deleteRecursively(tempDir)
checkpointWriter.stop()

@gaborgsomogyi (Contributor):

This would be better in a finally block, because if anything throws an exception, checkpointWriter will stay open.
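
Something along these lines (a sketch of the suggested structure, reusing the names already in the test):

```scala
val checkpointWriter =
  new CheckpointWriter(mock(classOf[JobGenerator]), conf, checkpointDir, hadoopConf)
try {
  // ... write checkpoints and assert on Checkpoint.getCheckpointFiles ...
} finally {
  checkpointWriter.stop() // runs even if an assertion above throws
  Utils.deleteRecursively(tempDir)
}
```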

@SparkQA commented Sep 6, 2019

Test build #110250 has finished for PR 25654 at commit 6c50007.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 6, 2019

Test build #110242 has finished for PR 25654 at commit 4c3300d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 6, 2019

Test build #110252 has finished for PR 25654 at commit eebc393.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

Thank you so much for adding the UT, @avkgh. I'd like to recommend replacing the existing UT with the following, for two reasons.

  • The current UT is a little overkill: this PR only touches getCheckpointFiles, and the new UT had better focus on verifying that change. We can skip CheckpointWriter to simplify this.
  • This PR contains two fixes: it avoids (1) matching the parent path and (2) matching directory names. The current UT seems to cover (2) only, so it is unclear whether both are verified.
  test("SPARK-28912: Fix MatchError in getCheckpointFiles") {
    withTempDir { tempDir =>
      val fs = FileSystem.get(tempDir.toURI, new Configuration())
      val checkpointDir = tempDir.getAbsolutePath() + "/checkpoint-01"
      assert(Checkpoint.getCheckpointFiles(checkpointDir, Some(fs)).length === 0)

      // Ignore files whose parent path match.
      fs.create(new Path(checkpointDir, "this-is-matched-before-due-to-parent-path")).close()
      assert(Checkpoint.getCheckpointFiles(checkpointDir, Some(fs)).length === 0)

      // Ignore directories whose names match.
      fs.mkdirs(new Path(checkpointDir, "checkpoint-1000000000"))
      assert(Checkpoint.getCheckpointFiles(checkpointDir, Some(fs)).length === 0)
    }
  }

@avkgh (Contributor, Author) commented Sep 6, 2019

OK, I will replace the unit test.

@SparkQA commented Sep 7, 2019

Test build #110268 has finished for PR 25654 at commit 90f1e8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, and congratulations on your first contribution, @avkgh!
Thank you, @srowen, @HyukjinKwon, @gaborgsomogyi!

Merged to master/2.4.

dongjoon-hyun pushed a commit that referenced this pull request Sep 7, 2019

Closes #25654 from avkgh/SPARK-28912.

Authored-by: avk <nullp7r@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 723faad)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun (Member)

You are added to the Apache Spark contributor group, @avkgh.

@avkgh (Contributor, Author) commented Sep 7, 2019

Thank you, @dongjoon-hyun, @HyukjinKwon, @srowen, @gaborgsomogyi!

@gatorsmile (Member)

[error] /home/jenkins/workspace/spark-branch-2.4-compile-maven-hadoop-2.6/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala:850: not found: value withTempDir
[error]     withTempDir { tempDir =>
[error]     ^
[error] one error found
[error] Compile failed at Sep 6, 2019 6:08:02 PM [8.547s]

We need to revert this from 2.4 branch.

@gatorsmile (Member)

Reverted.

@gatorsmile (Member)

@avkgh Could you submit a PR against 2.4 branch?

@avkgh (Contributor, Author) commented Sep 7, 2019

@gatorsmile I submitted a PR against branch-2.4.

@dongjoon-hyun (Member)

Oops. My bad. Thank you for recovering branch-2.4, @gatorsmile!

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019