[SPARK-18373][SPARK-18529][SS][Kafka]Make failOnDataLoss=false work with Spark jobs #15820
Conversation
```scala
  }
}).start()

val testTime = 1.minutes
```
I changed this to 20 minutes and test it locally. It passed.
Use https://github.com/apache/spark/pull/15820/files?w=1 to review the PR with whitespace changes ignored.
Test build #68380 has finished for PR 15820 at commit
In general, getAndIgnoreLostData is hard to read due to the length and early returns, and I'm pretty sure I've missed something.
I know the actual underlying cases are complicated to get right, but is it possible to refactor it?
```scala
  logWarning(s"Buffer miss for $groupId $topicPartition [$offset, ${record.offset})")
}
nextOffsetInFetchedData = record.offset + 1
return record
```
Are these two early returns actually necessary?
```scala
}

private def reset(): Unit = {
  nextOffsetInFetchedData = -2
```
This use of -2 as a magic number here and earlier in the file is a little misleading, since the new consumer won't actually let you seek to -2 as a means of indicating earliest
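A named sentinel makes the intent clearer than a raw -2. Here is a minimal sketch of the idea (the name `UNKNOWN_OFFSET` matches what a later revision of this PR adopted; the object and `needsSeek` helper are illustrative stand-ins, not the PR's code):

```scala
object ConsumerPositionSketch {
  // Sentinel meaning "we don't know the next offset in the fetched data".
  // Unlike a bare -2, the name cannot be confused with a broker-side
  // "earliest" magic value, which the new consumer API won't accept in
  // seek() anyway.
  val UNKNOWN_OFFSET: Long = -2L

  var nextOffsetInFetchedData: Long = UNKNOWN_OFFSET

  def reset(): Unit = {
    nextOffsetInFetchedData = UNKNOWN_OFFSET
  }

  // A fresh seek is needed whenever the buffered position does not match
  // the requested offset (including right after reset()).
  def needsSeek(requested: Long): Boolean =
    nextOffsetInFetchedData != requested
}
```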
```scala
}
} else if (!fetchedData.hasNext()) {
  // The last pre-fetched data has been drained.
  seek(offset)
```
I don't think it's necessary to seek every time the fetched data is empty; in normal operation the poll should return the next offset, right?
```scala
}

logDebug(s"Get $groupId $topicPartition nextOffset $nextOffsetInFetchedData requested $offset")
var outOfOffset = false
```
Can this var be eliminated by just using a single try around the if / else? It's the same catch condition in either case
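The var can indeed go away by hoisting a single try around both branches. A hedged sketch of the shape (the exception class here is a stand-in for Kafka's `OffsetOutOfRangeException`, and `fetch` abstracts whatever each branch actually does):

```scala
// Stand-in for org.apache.kafka.clients.consumer.OffsetOutOfRangeException.
class OffsetOutOfRange(msg: String) extends RuntimeException(msg)

// One try around the if/else replaces the `outOfOffset` var: both branches
// share a single catch, and recovery happens in exactly one place.
def getRecord(offset: Long, inBuffer: Boolean)(fetch: Long => Long): Long =
  try {
    if (inBuffer) {
      fetch(offset) // read from the already-fetched data
    } else {
      // in the real consumer: seek(offset); poll(pollTimeoutMs)
      fetch(offset)
    }
  } catch {
    case _: OffsetOutOfRange =>
      // recover, e.g. by restarting from the earliest available offset
      -1L
  }
```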
```scala
  return null
} else {
  // Case 4 or 5
  getAndIgnoreLostData(offset, untilOffset, pollTimeoutMs)
```
- Why isn't this an early return?
- Unless I'm misreading, this is a recursive call without changing the arguments. Why is it guaranteed to terminate?
Wow, looks like the new GitHub comment interface did all kinds of weird things; apologies about that.
@koeninger Thanks for reviewing. Refactored the code to avoid using early returns and addressed your comments.
Test build #68421 has finished for PR 15820 at commit
Thanks, I think that version is easier to read, and hopefully in normal operation it won't be recursing much, so the loss of @tailrec won't be an efficiency hit.
I'm mostly concerned about clarifying the timeout situation at this point.
```scala
  null
} else {
  // beginningOffset <= offset <= min(latestOffset - 1, untilOffset - 1)
  getAndIgnoreLostData(offset, untilOffset, pollTimeoutMs)
```
I'm clearer on why this terminates, but I think it's worth a comment, since it's a mutually recursive call without changing arguments.
+1
Especially since, with the loss of @tailrec, this must now be shown to terminate within a limited stack size, and should do so under most stack size configurations.
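One way to make termination evident without @tailrec is to phrase the retry as a loop whose cursor strictly increases. A hypothetical pure sketch (`available` stands in for "this offset still exists in Kafka"; this is not the PR's code):

```scala
// Find the earliest still-available offset in [offset, untilOffset),
// or -1 if there is none. The cursor only moves forward, so the loop
// runs at most (untilOffset - offset) times; termination is obvious and
// no stack is consumed, unlike the mutually recursive formulation.
def earliestAvailable(offset: Long, untilOffset: Long)(available: Long => Boolean): Long = {
  var cursor = offset
  while (cursor < untilOffset) {
    if (available(cursor)) return cursor
    cursor += 1
  }
  -1L
}
```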
```scala
  getRecordFromFetchedData(offset, untilOffset)
} catch {
  case e: OffsetOutOfRangeException =>
    logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e)
```
I think it's worth the warning explicitly stating that data has been lost
```scala
if (!fetchedData.hasNext()) {
  // We cannot fetch anything after `poll`. Two possible cases:
  // - `beginningOffset` is `offset` but there is nothing for `beginningOffset` right now.
  // - Cannot fetch any data before timeout.
```
As a user, I'm not sure that setting failOnDataLoss=false would make me know that a timeout would cause me to miss data in my spark job (that might otherwise still be in kafka)
+1
If throwing an exception here, test("access offset 0 in Spark job but the topic has been deleted") will fail. It seems a reasonable case.
I don't think topic deletion is anywhere near as common as timeouts, and topic deletion is something user initiated. As a user, is skipping the rest of the offsets in a timeout really what you would want to happen (it isn't for me)? If so, does this need a separate configuration?
Test build #68508 has finished for PR 15820 at commit
```scala
    untilOffset: Long,
    pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = {
  // scalastyle:off
  // When `failOnDataLoss` is `false`, we need to handle the following cases (note: untilOffset and latestOffset are exclusive):
```
I don't get this: there is no reference to `latestOffset` in this method. What does it refer to?
`latestOffset` means the latest offset that we get from Kafka.
Also, is beginningOffset the earliest offset available in Kafka? Then let's rename this to earliestOffset; "earliest" and "latest" are the more popular terms in the context of Kafka.
Yes. Will update the terms.
Okay, I can guess that you used beginningOffset because of consumer.seekToBeginning. But then we should be consistent with seekToEnd as well by calling the latest offset endOffset. Rather, let's be consistent with the more well-known names "earliest" and "latest".
Two main high-level points:
- Should not have separate code paths for failOnDataLoss = true/false.
  - Merge the if conditions in getRecordFromFetchedData to reduce it from 4 cases to 3.
- Needs better inline documentation, as the code is not easy to understand. Sometimes it's better to document using ASCII art ;)
```scala
// 3. The topic is deleted.
//    There is nothing to fetch, return null.
// 4. The topic is deleted and recreated, and `beginningOffset <= offset <= untilOffset - 1 <= latestOffset - 1`.
//    We cannot detect this case. We can still fetch data like nothing happens.
```
nit: nothing happened.
```scala
/**
 * Get the earliest record in [offset, untilOffset) from the fetched data. If there is no such
 * record, returns null. Must be called after `poll`.
```
nit: return null. Note that this method must be ...
This is not true. It sounds like poll() has to be called immediately before this method is called, which is not the case in the above usage, where poll is called only if the conditions match. Rather say: "This must be called after some data has already been fetched using poll."
```scala
  null
} else {
  val record = fetchedData.next()
  if (record.offset >= untilOffset) {
```
Can you document in what cases this can happen?
```scala
} else {
  val record = fetchedData.next()
  if (record.offset >= untilOffset) {
    logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $untilOffset)")
```
Please improve the log message based on when this can happen, something like: "There may have been some data loss because some data may have been aged out in Kafka and is therefore unavailable for processing. If you want your streaming query to fail in such cases, set option ...... "
```scala
  logInfo(s"Initial fetch for $topicPartition $offset")
  seek(offset)
  poll(pollTimeoutMs)
} else if (!fetchedData.hasNext()) {
```
can these two conditions be merged as

```scala
if (offset != nextOffsetInFetchedData || !fetchedData.hasNext()) {
  seek(offset)
  poll(pollTimeoutMs)
}
```
```scala
// beginningOffset <= offset <= min(latestOffset - 1, untilOffset - 1)
//
// This will happen when a topic is deleted and recreated, and new data are pushed very fast
// , then we will see `offset` disappears first then appears again. Although the parameters
```
nit: comma after new line
```scala
    untilOffset: Long,
    pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = {
  val beginningOffset = getBeginningOffset()
  if (beginningOffset <= offset) {
```
nit: it's easier to think about when "offset" comes first in the condition, i.e. offset >= beginningOffset
```scala
  }
} else {
  if (beginningOffset >= untilOffset) {
    // offset <= untilOffset - 1 < beginningOffset
```
Improve the docs by adding a plain-word explanation, e.g. "the required offset is earlier than the available offsets".
Also improve the printed warning; I gave an example in another comment.
```scala
  null
} else {
  // offset < beginningOffset <= untilOffset - 1
  logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $beginningOffset)")
```
improve docs and warning.
Test build #68731 has finished for PR 15820 at commit
Test build #68733 has finished for PR 15820 at commit
Test build #68735 has finished for PR 15820 at commit
As we spoke offline, this can be further simplified. I am posting my detailed comments anyway, though not all of them are applicable if you are going to refactor the code.
```scala
    untilOffset: Long,
    pollTimeoutMs: Long,
    failOnDataLoss: Boolean): ConsumerRecord[Array[Byte], Array[Byte]] = {
  require(offset < untilOffset, s"offset: $offset, untilOffset: $untilOffset")
```
Give a better error message, saying "offset must always be less than untilOffset".
```scala
// The last pre-fetched data has been drained.
// Seek to the offset because we may call seekToBeginning or seekToEnd before this.
seek(offset)
poll(pollTimeoutMs)
```
Why haven't these two cases been merged?
```scala
  logDebug(s"Polled $groupId ${p.partitions()} ${r.size}")
  fetchedData = r.iterator
}

private def getCurrentOffsetRange(): (Long, Long) = {
```
getCurrentOffsetRange -> getValidOffsetRange or getAvailableOffsetRange, to make it clearer what this is. And add docs.
```scala
/**
 * Get the record for the given offset, waiting up to timeout ms if IO is necessary.
 * Sequential forward access will use buffers, but random access will be horribly inefficient.
 *
 * If `failOnDataLoss` is `false`, it will try to get the earliest record in
```
nit: earliest available record
```scala
/**
 * Get the earliest record in [offset, untilOffset) from the fetched data. If there is no such
 * record, return null. Note that this must be called after some data has already been fetched
 * using poll.
```
... if there is no such record, return null and clear the fetched data.
```scala
/**
 * Get the record for the given offset, waiting up to timeout ms if IO is necessary.
 * Sequential forward access will use buffers, but random access will be horribly inefficient.
 *
 * If `failOnDataLoss` is `false`, it will try to get the earliest record in
 * `[offset, untilOffset)` when some illegal state happens. Otherwise, an `IllegalStateException`
```
"Otherwise" means when failOnDataLoss is false but there is no illegal state? It's confusing. Rather, just rewrite it as:

When `failOnDataLoss` is `true`, this will either return the record at offset if available, or throw an exception.

When `failOnDataLoss` is `false`, this will either return the record at offset if available, or return the next earliest available record < untilOffset, or null. It will not throw any exception.
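That contract can be modeled as a small pure function over a map standing in for the partition's surviving records (illustrative only; `log` and this helper are not the PR's code):

```scala
// failOnDataLoss = true:  record at `offset`, or an exception.
// failOnDataLoss = false: record at `offset`, else the next available
// record before `untilOffset`, else null; never an exception.
def get(
    offset: Long,
    untilOffset: Long,
    failOnDataLoss: Boolean,
    log: Map[Long, String]): String =
  log.get(offset).getOrElse {
    if (failOnDataLoss) {
      throw new IllegalStateException(s"Cannot fetch offset $offset: data was lost")
    } else {
      ((offset + 1) until untilOffset)
        .collectFirst { case o if log.contains(o) => log(o) }
        .orNull
    }
  }
```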
```scala
if (record.offset >= untilOffset) {
  // This may happen when records are aged out.
  val message =
    if (failOnDataLoss) {
```
There is already an if statement on failOnDataLoss in reportDataLoss; having another one looks too complicated. I think it's best to not have an if statement here, and the same for other places; this is too much complication just for documentation.
Test build #68811 has finished for PR 15820 at commit
```scala
 * [offset, untilOffset) are invalid (e.g., the topic is deleted and recreated), it will return
 * `UNKNOWN_OFFSET`.
 */
private def getNextEarliestOffset(offset: Long, untilOffset: Long): Long = {
```
nit: rename to getEarliestAvailableOffsetBetween(...)
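Under the suggested name, the helper's contract is: the earliest offset in [offset, untilOffset) that still falls inside the available range, or UNKNOWN_OFFSET if the whole window is invalid. A pure sketch against an (earliest, latest) range (simplified; the real method consults the consumer):

```scala
val UNKNOWN_OFFSET: Long = -2L

// earliest/latest describe the range still available in Kafka
// (latest exclusive, matching the PR's convention).
def getEarliestAvailableOffsetBetween(
    offset: Long,
    untilOffset: Long,
    earliest: Long,
    latest: Long): Long = {
  // Move the requested start forward to the first available offset...
  val candidate = math.max(offset, earliest)
  // ...and keep it only if it is still inside both windows.
  if (candidate < untilOffset && candidate < latest) candidate else UNKNOWN_OFFSET
}
```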
```diff
 /** Iterator to the already fetch data */
 private var fetchedData = ju.Collections.emptyIterator[ConsumerRecord[Array[Byte], Array[Byte]]]
-private var nextOffsetInFetchedData = -2L
+private var nextOffsetInFetchedData = UNKNOWN_OFFSET

 /**
  * Get the record for the given offset, waiting up to timeout ms if IO is necessary.
```
nit: update the docs to clarify earlier that this may not return the record at offset. For example: "Get the record for the given offset if available. Otherwise it will either throw an error (if failOnDataLoss = true), or return the next available offset within [offset, untilOffset]." Use @param to explain pollTimeoutMs and the others in more detail.
```scala
 */
private def fetchData(
    offset: Long,
    untilOffset: Long,
```
untilOffset and failOnDataLoss are not really needed.
```scala
  val topic = newTopic()
  topics += topic
  testUtils.createTopic(topic, partitions = 1)
case 1 =>
```
Update the test to recreate the same topics.
```scala
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
KafkaSourceSuite.globalTestUtils = testUtils
val query = kafka.map(kv => kv._2.toInt).writeStream.foreach(new ForeachWriter[Int] {
```
add explanation.
Move this test lower, after the basic tests.
Test build #68878 has finished for PR 15820 at commit
Because the comment made by me and +1'ed by marmbrus is hidden at this point, I just want to re-iterate that this patch should not skip the rest of the partition in the case that a timeout happens.
```scala
// `OffsetOutOfRangeException` to let the caller handle it.
// - Cannot fetch any data before timeout. TimeoutException will be thrown.
val (earliestOffset, latestOffset) = getAvailableOffsetRange()
if (offset < earliestOffset || offset >= latestOffset) {
```
@koeninger Just updated the timeout logic. It will check the current available offset range and use it to distinguish these two cases.
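The disambiguation reduces to a range check against the available offsets. A sketch of the decision (the types here are illustrative, not the PR's):

```scala
sealed trait EmptyPollReason
case object DataLost extends EmptyPollReason // offset outside [earliest, latest): data loss
case object TimedOut extends EmptyPollReason // offset valid; poll just returned nothing in time

// earliest/latest come from a getAvailableOffsetRange()-style helper;
// latest is exclusive, matching the PR's convention.
def diagnoseEmptyPoll(offset: Long, earliest: Long, latest: Long): EmptyPollReason =
  if (offset < earliest || offset >= latest) DataLost else TimedOut
```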
Test build #68948 has finished for PR 15820 at commit
Test build #68949 has finished for PR 15820 at commit
Test build #68952 has finished for PR 15820 at commit
```scala
seek(offset)
poll(pollTimeoutMs)
// The following loop is basically for `failOnDataLoss = false`. When `failOnDataLoss` is
// `false`, firstly, we will try to fetch the record at `offset`. If no such record, then we
```
nit:
- firstly --> first,
- "no such record" --> "no such record exists"

Overall +1, thanks for this explanation.
LGTM, just a few documentation improvements. This is much cleaner and easier to understand now.
```diff
  */
-def get(offset: Long, pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = {
+def get(
```
Can you also document that it can return null.
```scala
case e: OffsetOutOfRangeException =>
  // When there is some error thrown, it's better to use a new consumer to drop all cached
  // states in the old consumer. We don't need to worry about the performance because this
  // is not a normal path.
```
normal --> common
```scala
val INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_FALSE =
  """
    |There may have been some data loss because some data may have been aged out in Kafka or
    | the topic has been deleted and is therefore unavailable for processing. If you want your
```
nit: better grammar
Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed.
Similarly change below.
```scala
  )
}

test("Delete a topic when a Spark job is running") {
```
nit: D -> d, for consistency.
Test build #68969 has finished for PR 15820 at commit
@tdas I did some changes to make the stress test stable and ran it for 20 minutes without errors. I also confirmed the warning logs did appear in the unit test logs.
Test build #68978 has finished for PR 15820 at commit
Test build #3431 has started for PR 15820 at commit |
@zsxwing should the title include SPARK-18529 per your comment on the JIRA?
Thanks for pointing that out. Updated.
Test build #3432 has finished for PR 15820 at commit
LGTM. Merging this in master and 2.1 |
…with Spark jobs ## What changes were proposed in this pull request? This PR adds `CachedKafkaConsumer.getAndIgnoreLostData` to handle corner cases of `failOnDataLoss=false`. It also resolves [SPARK-18529](https://issues.apache.org/jira/browse/SPARK-18529) after refactoring codes: Timeout will throw a TimeoutException. ## How was this patch tested? Because I cannot find any way to manually control the Kafka server to clean up logs, it's impossible to write unit tests for each corner case. Therefore, I just created `test("stress test for failOnDataLoss=false")` which should cover most of corner cases. I also modified some existing tests to test for both `failOnDataLoss=false` and `failOnDataLoss=true` to make sure it doesn't break existing logic. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15820 from zsxwing/failOnDataLoss. (cherry picked from commit 2fd101b) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
What changes were proposed in this pull request?

This PR adds `CachedKafkaConsumer.getAndIgnoreLostData` to handle corner cases of `failOnDataLoss=false`. It also resolves SPARK-18529 after refactoring the code: a timeout will throw a TimeoutException.

How was this patch tested?

Because I cannot find any way to manually control the Kafka server to clean up logs, it's impossible to write unit tests for each corner case. Therefore, I just created `test("stress test for failOnDataLoss=false")`, which should cover most of the corner cases. I also modified some existing tests to test for both `failOnDataLoss=false` and `failOnDataLoss=true` to make sure this doesn't break existing logic.