KAFKA-4485; Follower should be in the isr if its FetchRequest has fetched up to the logEndOffset of leader #2208
Conversation
cc @junrao
if(!inSyncReplicas.contains(replica) &&
assignedReplicas.map(_.brokerId).contains(replicaId) &&
replica.logEndOffset.offsetDiff(leaderHW) >= 0) {
logReadResult.info.fetchOffsetMetadata.messageOffset >= Math.max(replica.lastLeaderLogEndOffset, logReadResult.hw)) {
This approach solves the problem stated in the ticket when the replication factor is >= 3, but it does not seem to solve the problem for replication factor = 2. When the replication factor is 2 and the ISR contains only the leader, the high watermark is essentially the log end offset of the leader. If we compare the fetch starting offset with the max of lastLeaderLogEndOffset and the current high watermark, the follower may still always appear to be lagging behind if there are small, frequent produce requests.
This might not be that bad: as long as the replica is fetching faster than the produce rate, it will eventually catch up. I think this patch is an improvement but does not solve all the issues. Let me think a bit more to see if we can have a more thorough solution.
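To make the concern concrete, here is a toy Python simulation (illustrative only, not Kafka code; all names are made up) of the replication-factor-2 case, where the HW tracks the leader's LEO and steady small produce requests keep the follower's fetch offset below it:

```python
# Illustrative simulation: with replication factor 2 and ISR = {leader},
# the HW equals the leader's LEO, so a follower whose fetch offset is
# compared against the HW can appear to lag forever while small produce
# requests keep arriving between its fetches.

def follower_catches_up(produce_per_tick, ticks):
    leader_leo = 100
    follower_offset = 90
    for _ in range(ticks):
        hw = leader_leo                  # ISR = {leader} => HW == leader LEO
        if follower_offset >= hw:        # the expansion check under discussion
            return True
        follower_offset = leader_leo     # follower fetches everything seen so far
        leader_leo += produce_per_tick   # new messages arrive before the next fetch
    return False

print(follower_catches_up(produce_per_tick=1, ticks=1000))  # False: never admitted
print(follower_catches_up(produce_per_tick=0, ticks=1000))  # True: catches up once produces stop
```

The point is not the exact numbers but that any nonzero steady produce rate keeps the comparison failing forever under this rule.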
I just updated the solution in https://issues.apache.org/jira/browse/KAFKA-4485. I think it should fix the problem for replication factor = 2 as well. Can you take a look?
@junrao @ijuma I have implemented the solution described in https://issues.apache.org/jira/browse/KAFKA-4485. Since this patch will change the way we determine the ISR set and the high watermark, it is better to get more eyes on it. Can you take a look?
private def maybeIncrementLeaderHW(leaderReplica: Replica): Boolean = {
val allLogEndOffsets = inSyncReplicas.map(_.logEndOffset)
private def maybeIncrementLeaderHW(leaderReplica: Replica, curTime: Long = time.milliseconds): Boolean = {
val allLogEndOffsets = assignedReplicas.filter(curTime - _.lastCaughtUpTimeMs <= replicaManager.config.replicaLagTimeMaxMs).map(_.logEndOffset)
Hmm, not sure about this. One of the things that we have to guarantee is that a committed message (i.e., any message with offset < HW) must be present in every ISR. With this change, do we still guarantee that? It seems that assignedReplicas.filter(curTime - _.lastCaughtUpTimeMs <= replicaManager.config.replicaLagTimeMaxMs) could be a subset of ISR.
@junrao Yes, it is still guaranteed, by making sure that HW >= LEO of every replica in the ISR in maybeExpandIsr(). assignedReplicas.filter(curTime - _.lastCaughtUpTimeMs <= replicaManager.config.replicaLagTimeMaxMs) will be a superset of the ISR, because the lag of any replica in the ISR should not exceed replicaManager.config.replicaLagTimeMaxMs.
Ah, I see your point. You are right, it may be a subset of the ISR because maybeShrinkIsr() is called periodically. I just fixed this problem by explicitly including the ISR in the set considered here.
@lindong28 : Thanks for the patch. A few comments.
if(!inSyncReplicas.contains(replica) &&
assignedReplicas.map(_.brokerId).contains(replicaId) &&
replica.logEndOffset.offsetDiff(leaderHW) >= 0) {
logReadResult.info.fetchOffsetMetadata.messageOffset >= logReadResult.hw &&
logReadResult.fetchTimeMs - replica.lastCaughtUpTimeMs <= replicaManager.config.replicaLagTimeMaxMs) {
Now that we changed how HW is advanced, do we still need the second test?
Yes, it is still needed. Otherwise the high watermark may decrease when a replica catches up.
For example, let's say replication factor = 2 and the isr includes only the leader. At time t1, hw = 100 and the leader's LEO = 100. The leader receives the follower's fetch request at t1 with offset = 90, and sends back a fetch response covering the offset range [90, 100). At time t2, hw = 120 and the leader's LEO = 120. The leader receives the follower's fetch request at t2 with offset = 100. Suppose t2 - t1 < replicaLagTimeMaxMs. Without the 2nd check, the follower would be added to the isr set and the hw would decrease from 120 to 100.
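The scenario above can be checked with a toy calculation (a hypothetical sketch, not the actual Kafka code): if the follower were admitted to the ISR at t2 and the HW were then recomputed as the minimum LEO over the ISR, the HW would move backwards.

```python
# Hypothetical sketch: HW taken as the smallest log end offset among
# ISR members. Admitting a lagging follower would pull the HW back.

def hw_after_expand(isr_leos, follower_leo):
    # recompute HW as the minimum LEO over the ISR plus the new member
    return min(isr_leos + [follower_leo])

# t2: leader LEO = 120, hw = 120; follower fetches at offset 100
hw_before = 120
hw_after = hw_after_expand(isr_leos=[120], follower_leo=100)
print(hw_before, "->", hw_after)  # 120 -> 100: the HW would decrease
```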
Will it? The follower can only be added to ISR if its fetch offset is >= HW, right?
Wait... I misunderstood the 2nd check.
@junrao I just realized that the second test is assignedReplicas.map(_.brokerId).contains(replicaId). My comment above is related to the 3rd test, logReadResult.info.fetchOffsetMetadata.messageOffset >= logReadResult.hw.
I think we still need the 2nd test, and its purpose is not affected by this patch. The 2nd test prevents a broker from being added to the ISR if that broker is not in the replica set of the partition. A broker outside the replica set may send fetch requests to the leader during partition reassignment, if the controller has shrunk the replica set of the partition but the out-of-date follower isn't aware of this yet. That scenario can still happen after this patch.
Ah, I see. If we do the test suggested above, maybeExpandIsr() will be slightly inaccurate in theory, depending on how often maybeShrinkIsr() is executed. Say replicaLagTimeMaxMs = 10 seconds and maybeShrinkIsr() is executed every 2 seconds. Then a replica in the ISR may lag behind the leader's LEO by up to 12 seconds, which means the HW may lag behind the leader's LEO by up to 12 seconds. Thus a replica may lag behind the leader's LEO by up to 12 seconds even if its fetch offset >= hw. Applying this example to the test above, logReadResult.fetchTimeMs - replica.lastCaughtUpTimeMs can be larger than replicaManager.config.replicaLagTimeMaxMs even if logReadResult.info.fetchOffsetMetadata.messageOffset >= logReadResult.hw.
Therefore I think this check is nice to have. On the other hand it should be OK to remove it as well, since the 20% inaccuracy happens with low probability and is probably not a big deal in practice.
Another benefit is readability. Because logReadResult.info.fetchOffsetMetadata.messageOffset >= logReadResult.hw doesn't guarantee logReadResult.fetchTimeMs - replica.lastCaughtUpTimeMs <= replicaManager.config.replicaLagTimeMaxMs, having this in the check allows us to explicitly enforce the requirement of the ISR (i.e. a replica is in the ISR iff its replica lag <= replicaLagTimeMaxMs) in the code on a best-effort basis (because maybeShrinkIsr() doesn't enforce it strictly).
@junrao Thanks for the review. Is there anything I need to do for this patch?
@lindong28 : Thanks for the explanation. It makes sense. The only thing is that currently in ReplicaFetcherThread.shouldFollowerThrottle(), we need to know if a follower is in ISR or not, and this is purely based on the fetch offset and the HW. So, just checking the HW here makes the decision on whether a replica is in-sync more consistent between the leader and the follower.
@junrao I see. I just removed this check to make them more consistent, and I added a comment in maybeExpandIsr() to explain it. I think both implementations will work, with or without this check.
* Else if the FetchRequest reads up to the log end offset of the leader when the previous fetch request was received,
* set the lastCaughtUpTimeMsUnderlying to the time when the previous fetch request was received.
*/
def maybeUpdateCatchUpTimestamp(logReadResult: LogReadResult) {
If we don't need to change the logic in maybeExpandIsr(), could the logic in this method be just folded into updateLogReadResult()?
You are right, this can be folded into updateLogReadResult(). They needed to be separate in the first version of this patch; I forgot to change it in the second version. I will update it now.
* set the lastCaughtUpTimeMsUnderlying to the time when the previous fetch request was received.
*/
def maybeUpdateCatchUpTimestamp(logReadResult: LogReadResult) {
if (logReadResult.info.fetchOffsetMetadata.messageOffset >= logReadResult.leaderLogEndOffset)
If this check is true, the check in line 68 must also be true. So, it seems just doing the check in line 68 is enough?
It makes a difference when this is the first fetch request from this follower to this leader after the leader starts. Also, this can make lastCaughtUpTimeMsUnderlying slightly more accurate in general. Thus I think this check is nice to have, but it probably still works even if we remove it.
@@ -214,7 +216,7 @@ class ReplicaManager(val config: KafkaConfig,
def startup() {
// start ISR expiration thread
scheduler.schedule("isr-expiration", maybeShrinkIsr, period = config.replicaLagTimeMaxMs, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("isr-expiration", maybeShrinkIsr, period = config.replicaLagTimeMaxMs/5, unit = TimeUnit.MILLISECONDS)
Is this change intended?
Yes, it is intended. I think this improves the guarantee of replicaLagTimeMaxMs. Otherwise, a replica can stay in the ISR for up to 2 x replicaLagTimeMaxMs even if it is out of sync. Say replicaLagTimeMaxMs = 100 sec and the follower's lastCatchUpTime = t1. maybeShrinkIsr() is executed at time t1 + 99 and at time t1 + 199. The follower will be able to stay in the ISR for 199 seconds and is only removed from the ISR at time t1 + 199. We can reduce this inaccuracy to 20% by running maybeShrinkIsr once every replicaLagTimeMaxMs/5.
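As a sanity check on the arithmetic (an illustrative sketch; the helper name is made up, not part of the patch), the worst-case time a stale replica stays in the ISR is roughly the configured max lag plus one scheduling period:

```python
# A shrink run just before the lag threshold is crossed misses the
# replica; it is only removed on the following run, so the worst case
# is the configured max lag plus one scheduling period.

def max_stay_in_isr_ms(replica_lag_time_max_ms, shrink_period_ms):
    return replica_lag_time_max_ms + shrink_period_ms

lag_max = 100_000
print(max_stay_in_isr_ms(lag_max, lag_max))       # 200000: up to 2x with period = lag max
print(max_stay_in_isr_ms(lag_max, lag_max // 5))  # 120000: 20% overshoot with period = lag max / 5
```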
@lindong28 : Thanks for the updated patch. Left a few more comments. Do you have any test results to share? Basically, how effective is this change in dealing with the issue of constant small requests at the leader, and is there any degradation in other situations?
@@ -298,7 +300,8 @@ class Partition(val topic: String,
// check if the HW of the partition can now be incremented
// since the replica maybe now be in the ISR and its LEO has just incremented
maybeIncrementLeaderHW(leaderReplica)
// TODO: is this maybeIncrementLeaderHW() necessary?
I think we need this since every time the follower's fetch offset advances, we may need to advance the HW.
I see. So we call maybeIncrementLeaderHW() here not because the replica may have just been added to the ISR, but because its LEO may have just advanced. I was misled by the comment, because it says ...since the replica maybe now be in the ISR and its LEO has just incremented. Having a replica newly added to the ISR should not increment the HW. It seems clear enough to keep the call to maybeIncrementLeaderHW() here. It's not a big deal; I will simply update the comment.
@@ -36,25 +36,41 @@ class Replica(val brokerId: Int,
// for local replica it is the log's end offset, for remote replicas its value is only updated by follower fetch
@volatile private[this] var logEndOffsetMetadata: LogOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata
// The log end offset value at the time the leader receives last FetchRequest from this follower.
// This is used to determine the last catch-up time of the follower
@volatile var lastFetchLeaderLogEndOffset: Long = 0L
Should we initialize this to MAX_LONG? Otherwise, the first fetch request from a follower will make it a caught-up replica.
I agree that we should by default consider a replica to be out-of-sync. I have made the following two fixes. I have kept lastFetchLeaderLogEndOffset as 0L, because the first fetch request from a follower will set lastCaughtUpTimeMsUnderlying to lastFetchTimeMs, which is initialized to 0L in Replica.java.
- Initialize lastCaughtUpTimeMsUnderlying to AtomicLong(0L) in Replica.java.
- When a broker receives a LeaderAndIsrRequest making it the leader for a partition, reset lastCaughtUpTimeMsUnderlying of all replicas of this partition to 0L.
// The time when the leader receives last FetchRequest from this follower
// This is used to determine the last catch-up time of the follower
@volatile var lastFetchTimeMs: Long = 0L
It doesn't seem this is used?
It is used in updateLogReadResult(). When the offset of a fetch request >= lastFetchLeaderLogEndOffset, lastCaughtUpTime is set to lastFetchTimeMs (the time the previous fetch request was received).
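For readers following along, here is a rough Python sketch of the update rule being discussed (field and method names approximate the patch; this is not the actual Scala code):

```python
# Sketch of the lastCaughtUpTimeMs update rule: a follower is "caught up"
# as of time T if its fetch offset reaches the leader's LEO as of time T.

class ReplicaState:
    def __init__(self):
        self.last_caught_up_time_ms = 0
        self.last_fetch_leader_leo = 0   # leader LEO when the previous fetch arrived
        self.last_fetch_time_ms = 0      # when the previous fetch arrived

    def update_on_fetch(self, fetch_offset, leader_leo, fetch_time_ms):
        if fetch_offset >= leader_leo:
            # caught up to the leader's current LEO
            self.last_caught_up_time_ms = fetch_time_ms
        elif fetch_offset >= self.last_fetch_leader_leo:
            # caught up to where the leader was at the previous fetch
            self.last_caught_up_time_ms = self.last_fetch_time_ms
        self.last_fetch_leader_leo = leader_leo
        self.last_fetch_time_ms = fetch_time_ms

r = ReplicaState()
r.update_on_fetch(fetch_offset=100, leader_leo=110, fetch_time_ms=1000)
r.update_on_fetch(fetch_offset=110, leader_leo=120, fetch_time_ms=2000)
print(r.last_caught_up_time_ms)  # 1000: caught up to the leader's LEO as of the first fetch
```

The second branch is what lets a follower make progress on lastCaughtUpTimeMs even when constant small produces keep it from ever matching the leader's current LEO.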
* set the lastCaughtUpTimeMsUnderlying to the time when the current fetch request was received.
*
* Else if the FetchRequest reads up to the log end offset of the leader when the previous fetch request was received,
* set the lastCaughtUpTimeMsUnderlying to the time when the previous fetch request was received.
Perhaps we can add a comment why we need to do this? Basically the situation where the leader gets constant small produce requests.
Sure. Added comment now.
@@ -217,7 +219,8 @@ class ReplicaManager(val config: KafkaConfig,
def startup() {
// start ISR expiration thread
scheduler.schedule("isr-expiration", maybeShrinkIsr, period = config.replicaLagTimeMaxMs, unit = TimeUnit.MILLISECONDS)
// A follower can lag behind leader for up to config.replicaLagTimeMaxMs x (1 + 20%) before it is removed from ISR
scheduler.schedule("isr-expiration", maybeShrinkIsr, period = config.replicaLagTimeMaxMs / 5, unit = TimeUnit.MILLISECONDS)
Thanks for the explanation. It makes sense. I am just a bit concerned about the overhead of calling maybeShrinkIsr() too frequently. Perhaps we can do /2?
Sure. I will change this to /2.
On the other hand, I am not sure it is expensive. When the ISR of most partitions doesn't have to be shrunk, the cost of maybeShrinkIsr() is comparing the lastCaughtUpTimeMs of all replicas with the current time. Would this cost still be a concern if all of this is done in memory without a system call? I can update the patch to call time.milliseconds() only once.
Now that we removed the check in maybeExpandIsr(), it is probably more important to run maybeShrinkIsr() sooner rather than later. Note that if we don't shrink the ISR soon enough, message duplication is more likely. Here is the scenario:
- Replication factor = 3 and min isr = 2. The current ISR includes all 3 replicas.
- Replica max lag = 10 seconds, request timeout = 12 seconds, and we run maybeShrink() every 5 seconds.
- At time t, a follower stops fetching but doesn't go offline.
- At time t, a produce request with ack = -1 is received by the leader and the message is appended to the log. The leader now waits for all replicas in the ISR to fetch this message before replying to the producer.
- At time t + 12, the leader sends a request timeout error to the producer. The producer retries, and the message is appended to the log again.
- At time t + 15, that follower is removed from the ISR. The HW is advanced to include the message twice, so a consumer will receive that message twice.
To prevent this scenario we need the request timeout to be larger than the max replica lag. If we allow maybeShrink() to run infrequently, the producer's request timeout needs to be larger, which is not desirable.
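The timeline above can be sketched as a small simulation (illustrative only; function and parameter names are made up, and the numbers are the ones from the scenario):

```python
# Rough timeline check: the producer times out and retries before the
# stalled follower leaves the ISR, so the message lands in the log twice.

def message_copies(request_timeout_s, shrink_interval_s, replica_lag_max_s, t_stall=0):
    # follower is removed at the first shrink run after its lag exceeds the max
    removal = ((t_stall + replica_lag_max_s) // shrink_interval_s + 1) * shrink_interval_s
    copies = 1
    if t_stall + request_timeout_s < removal:
        copies += 1  # producer retry appended before the ack could be sent
    return copies

print(message_copies(request_timeout_s=12, shrink_interval_s=5, replica_lag_max_s=10))  # 2: duplicated
print(message_copies(request_timeout_s=20, shrink_interval_s=5, replica_lag_max_s=10))  # 1: timeout outlasts removal
```

This matches the argument: either raise the request timeout above the worst-case removal delay, or shrink the ISR more often.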
@junrao Yes, this patch can observably reduce the rolling bounce time of a Kafka cluster in my experiments, since we only bounce a new broker when there is no URP in the cluster. Initially, when I was testing the rolling bounce time of a cluster of 150 brokers on 15 machines, I found that the rolling bounce could not finish and got killed by the deployment tool after 1 hour, because one broker just kept reporting URPs. This happened consistently for more than 5 tries. With this patch, the rolling bounce of the entire cluster finishes in 45 minutes. I haven't observed any specific degradation due to this patch.
@@ -77,7 +77,8 @@
public static final String REQUEST_TIMEOUT_MS_DOC = "The configuration controls the maximum amount of time the client will wait "
+ "for the response of a request. If the response is not received before the timeout "
+ "elapses the client will resend the request if necessary or fail the request if "
+ "retries are exhausted.";
+ "retries are exhausted. request.timeout.ms should be larger than replica.lag.time.max.ms "
+ "to reduce message duplication caused by unnecessary producer retry.";
Perhaps make it clear that replica.lag.time.max.ms is a broker side config. Also, since this is specific to the producer, could we add it only to producer config?
Sure. Fixed now.
@@ -273,20 +273,23 @@ class Partition(val topic: String,
}
/**
* Check and maybe expand the ISR of the partition.
* Check and maybe expand the ISR of the partition. A replica is in ISR of the partition if and only if
* replica's LEO >= HW and replica's lag <= replicaLagTimeMaxMs.
Is the comment on replicaLagTimeMaxMs still relevant?
It is outdated. The comment is updated now.
replica.logEndOffset.offsetDiff(leaderHW) >= 0) {
// This approximates the requirement logReadResult.fetchTimeMs - replica.lastCaughtUpTimeMs <= replicaManager.config.replicaLagTimeMaxMs.
// We don't directly specify the above requirement in order to make maybeExpandIsr() consistent with ReplicaFetcherThread.shouldFollowerThrottle()
// A offset replica whose lag > ReplicaFetcherThread may still exceed hw because maybeShrinkIsr() is called periodically
The comment reads a bit verbose. How about the following?
Technically, a replica shouldn't be in ISR if it hasn't caught up for longer than replicaLagTimeMaxMs, even if its log end offset is >= HW. However, to be consistent with how the follower determines whether a replica is in-sync, we only check HW.
Sure. Comment is updated.
// This approximates the requirement logReadResult.fetchTimeMs - replica.lastCaughtUpTimeMs <= replicaManager.config.replicaLagTimeMaxMs.
// We don't directly specify the above requirement in order to make maybeExpandIsr() consistent with ReplicaFetcherThread.shouldFollowerThrottle()
// A offset replica whose lag > ReplicaFetcherThread may still exceed hw because maybeShrinkIsr() is called periodically
logReadResult.info.fetchOffsetMetadata.messageOffset >= logReadResult.hw) {
I think we should use leaderReplica.highWatermark instead of logReadResult.hw since the former is more precise. Actually, could we just keep the original code?
Sure. Replaced with replica.logEndOffset.offsetDiff(leaderHW) >= 0
@@ -362,12 +365,18 @@ class Partition(val topic: String,
* 1. Partition ISR changed
* 2. Any replica's LEO changed
*
* We only increase HW if HW is smaller than LEO of all replicas whose lag <= replicaLagTimeMaxMs.
* This means that if a replica does not lag much behind leader and the replica's LEO is smaller than HW, HW will
* wait for this replica to catch up so that this replica can be added to ISR set.
Tweak the comment a bit to the following. What do you think?
The HW is determined by the smallest log end offset among all replicas that are in sync or are considered caught-up. This way, if a replica is considered caught-up, but its log end offset is smaller than HW, we will wait for this replica to catch up to the HW before advancing the HW. This helps the situation when the ISR only includes the leader replica and a follower tries to catch up. If we don't wait for the follower when advancing the HW, the follower's log end offset may keep falling behind the HW (determined by the leader's log end offset) and therefore will never be added to ISR.
Sure. Comment is updated now.
def resetLastCatchUpTime() {
lastFetchLeaderLogEndOffset = 0L
lastFetchTimeMs = 0L
lastCaughtUpTimeMsUnderlying.set(0L)
Hmm, when the leadership switches to a replica, this could cause an in-sync replica to be out of ISR immediately before it can make a fetch request to the new leader. In this case, we want to give every ISR the benefit of doubt and assume that it's caught up at this moment. For non-in-sync replicas, we want to assume that initially it's not caught up.
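A minimal sketch of the initialization policy suggested here (hypothetical helper, not the actual patch): on leadership change, ISR members get the benefit of the doubt and everyone else starts out of sync.

```python
# On becoming leader: treat current ISR members as caught up right now,
# and non-ISR replicas as never having caught up.

def initial_last_caught_up_ms(replica_id, isr, now_ms):
    return now_ms if replica_id in isr else 0

isr = {1, 2}
print(initial_last_caught_up_ms(2, isr, now_ms=5000))  # 5000: in-sync replica starts caught up
print(initial_last_caught_up_ms(3, isr, now_ms=5000))  # 0: out-of-sync until it actually catches up
```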
Since maybeShrinkIsr() is executed periodically on the order of seconds, most likely the follower's lastCaughtUpTime will be updated before maybeShrinkIsr() can be called to remove it from the ISR. This method is called so that the hw can be incremented despite a follower that has just gone offline.
But I just realized that we only need to call resetLastCatchUpTime() for those replicas that are not in the ISR. I will update the patch to fix it.
@@ -476,7 +476,8 @@ object KafkaConfig {
val ControllerMessageQueueSizeDoc = "The buffer size for controller-to-broker-channels"
val DefaultReplicationFactorDoc = "default replication factors for automatically created topics"
val ReplicaLagTimeMaxMsDoc = "If a follower hasn't sent any fetch requests or hasn't consumed up to the leaders log end offset for at least this time," +
" the leader will remove the follower from isr. replica.lag.time.max.ms should be smaller than request.timeout.ms" +
" to reduce message duplication caused by unnecessary producer retry."
Is this necessary given the change in client config?
I thought it was good to have it in both places, but it is not necessary. I will remove it.
val readToEndOfLog = initialLogEndOffset.messageOffset - logReadInfo.fetchOffsetMetadata.messageOffset <= 0
LogReadResult(logReadInfo, localReplica.highWatermark.messageOffset, partitionFetchSize, readToEndOfLog, None)
LogReadResult(logReadInfo, initialHighWatermark, initialLogEndOffset, fetchTimeMs, partitionFetchSize, None)
Now that there are quite a few params of type long, could we use named params to instantiate LogReadResult? Ditto below.
I am not sure we need named params here, because the variable names are already very close to the parameter names and should explain what they stand for. I can update it to use named parameters as well; please let me know.
The reason for using named parameters is really to avoid accidentally passing in params in the wrong order since the types are the same. I recommend that we use that for code safety.
I see. Fixed now.
@@ -126,7 +123,7 @@ class IsrExpirationTest {
// Make the remote replica not read to the end of log. It should be not be out of sync for at least 100 ms
for(replica <- partition0.assignedReplicas() - leaderReplica)
replica.updateLogReadResult(new LogReadResult(FetchDataInfo(new LogOffsetMetadata(10L), MemoryRecords.EMPTY), -1L, -1, false))
replica.updateLogReadResult(new LogReadResult(FetchDataInfo(new LogOffsetMetadata(10L), MemoryRecords.EMPTY), 10L, 15L, time.milliseconds, -1))
Could we use named parameters when initializing LogReadResult? Ditto below.
Sure. Fixed now.
@@ -165,6 +162,10 @@ class IsrExpirationTest {
allReplicas.foreach(r => partition.addReplicaIfNotExists(r))
// set in sync replicas for this partition to all the assigned replicas
partition.inSyncReplicas = allReplicas.toSet
// make the remote replica read to the end of log
for(replica <- partition.assignedReplicas() - leaderReplica)
replica.updateLogReadResult(new LogReadResult(FetchDataInfo(new LogOffsetMetadata(10L), MemoryRecords.EMPTY), 10L, 10L, time.milliseconds, -1))
It's a bit weird to update LogReadResult to a specific offset in a util function. Should the caller do this or at least pass in a required offset?
I agree. How about I replace 10L with 0L here? This seems more general.
Refer to this link for build results (access rights to CI server needed).
private static final String REQUEST_TIMEOUT_MS_DOC = CommonClientConfigs.REQUEST_TIMEOUT_MS_DOC;
private static final String REQUEST_TIMEOUT_MS_DOC = CommonClientConfigs.REQUEST_TIMEOUT_MS_DOC
+ " request.timeout.ms should be larger than replica.lag.time.max.ms "
+ "to reduce message duplication caused by unnecessary producer retry.";
Hmm, this should be in the producer config, right? Also, could we mention that replica.lag.time.max.ms is a broker side config? Otherwise, people would assume this is another producer side config.
Sorry, my mistake. It is fixed now.
val r = getOrCreateReplica(replica)
if (!partitionStateInfo.isr.contains(replica))
  r.resetLastCatchUpTime()
})
For replicas in ISR, it seems that we need to set lastCaughtUpTime to be now so that they don't get dropped out of ISR immediately since shrinkIsr() can be called at any time.
If the broker is already the leader of this partition, then the broker already has an up-to-date lastCaughtUpTime for each follower. If the broker was not the leader of this partition, in the worst case the isr will temporarily drop below the min isr of this partition, making this partition unavailable for produce operations. The producer will retry upon NotEnoughReplicasException and should succeed soon. It seems OK, because when there is a leadership change from one broker to another, the producer needs to update metadata and retry anyway.
On the other hand, if we set lastCaughtUpTime to now, then in the worst case a replica will stay in ISR for 2 x ReplicaLagTimeMaxMs, which seems worse because it breaks the semantics of ReplicaLagTimeMaxMs (in addition to how we execute maybeShrinkIsr()).
What do you think? I can set lastCaughtUpTime to now if you think that is better.
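The trade-off being debated hinges on when a follower counts as out of sync. A minimal illustrative sketch of that check (class and method names are hypothetical, not the actual Kafka implementation):

```java
// Illustrative lag check in the spirit of maybeShrinkIsr(): a follower is
// considered out of sync once it has not been caught up for replicaLagTimeMaxMs.
// If lastCaughtUpTimeMs were reset to "now" on a leadership change, a lagging
// replica could survive in the ISR for up to 2 x replicaLagTimeMaxMs.
class IsrLagCheckSketch {
    static boolean isOutOfSync(long nowMs, long lastCaughtUpTimeMs, long replicaLagTimeMaxMs) {
        return nowMs - lastCaughtUpTimeMs > replicaLagTimeMaxMs;
    }
}
```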
@junrao There is another optimization I will make after your suggestion. That is, when broker receives LeaderAndIsrRequest to become leader of a partition, for each follower of this partition, the lastFetchTimeMs will be set to curTime and lastFetchLeaderLogEndOffset will be set to current LEO.
Yes, it makes sense to set lastFetchTimeMs to curTime. Could we just leave lastFetchLeaderLogEndOffset at 0 or -1?
Another thing is that we probably only need to do this optimization when the leader is indeed changing.
if (logReadResult.info.fetchOffsetMetadata.messageOffset >= logReadResult.leaderLogEndOffset)
  lastCaughtUpTimeMsUnderlying.set(logReadResult.fetchTimeMs)
else if (logReadResult.info.fetchOffsetMetadata.messageOffset >= lastFetchLeaderLogEndOffset)
  lastCaughtUpTimeMsUnderlying.set(lastFetchTimeMs)
I was thinking what if on the first follower fetch request, lastCaughtUpTimeMsUnderlying is set to logReadResult.fetchTimeMs and in the 2nd call, it's set to lastFetchTimeMs. But this is probably not a problem since lastFetchTimeMs in the 2nd call will be the same as logReadResult.fetchTimeMs in the first call.
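To make the two-step update in this thread concrete, here is a hedged sketch of the caught-up bookkeeping (class and field names mirror the diff but this is not the actual Replica code):

```java
// Hypothetical sketch: the leader updates a follower's lastCaughtUpTimeMs on
// each fetch, using the (leader LEO, time) snapshot taken at the previous fetch.
class ReplicaStateSketch {
    private long lastCaughtUpTimeMs = 0L;
    private long lastFetchLeaderLogEndOffset = 0L;
    private long lastFetchTimeMs = 0L;

    // Called when the leader processes a fetch from this follower.
    void onFetch(long fetchOffset, long leaderLogEndOffset, long fetchTimeMs) {
        if (fetchOffset >= leaderLogEndOffset) {
            // Follower has read up to the leader's current LEO: caught up as of this fetch.
            lastCaughtUpTimeMs = fetchTimeMs;
        } else if (fetchOffset >= lastFetchLeaderLogEndOffset) {
            // Follower has reached the leader's LEO as of the previous fetch,
            // so it was caught up at that earlier time.
            lastCaughtUpTimeMs = lastFetchTimeMs;
        }
        lastFetchLeaderLogEndOffset = leaderLogEndOffset;
        lastFetchTimeMs = fetchTimeMs;
    }

    long lastCaughtUpTime() {
        return lastCaughtUpTimeMs;
    }
}
```

As the comment above notes, the first-fetch case is benign: the lastFetchTimeMs used by the second fetch equals the fetchTimeMs of the first fetch.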
if(logReadResult.isReadFromLogEnd) {
  lastCaughtUpTimeMsUnderlying.set(time.milliseconds)
}
def resetLastCatchUpTime() {
resetLastCatchUpTime => resetLastCaughtUpTime
Fixed now.
hw = 0L,
leaderLogEndOffset = 0L,
fetchTimeMs =time.milliseconds,
readSize = -1))
Hmm, is this right? It seems that we need to set lastCaughtUpTime only on the leader replica.
I think it is right. lastCaughtUpTime doesn't matter for the leader replica since the leader is always the most up-to-date replica. getPartitionWithAllReplicasInIsr() is used in IsrExpirationTest.java to initialize the replicas of a given partition so that all its replicas are in ISR. Thus we need to set lastCaughtUpTime of all these replicas to now.
Ok, thanks for the explanation. Could you add a space after = in =time.milliseconds?
Sure. Fixed now.
@@ -210,7 +210,9 @@
/** <code>request.timeout.ms</code> */
public static final String REQUEST_TIMEOUT_MS_CONFIG = CommonClientConfigs.REQUEST_TIMEOUT_MS_CONFIG;
private static final String REQUEST_TIMEOUT_MS_DOC = CommonClientConfigs.REQUEST_TIMEOUT_MS_DOC;
private static final String REQUEST_TIMEOUT_MS_DOC = CommonClientConfigs.REQUEST_TIMEOUT_MS_DOC
+ " request.timeout.ms should be larger than replica.lag.time.max.ms "
Could we mention that replica.lag.time.max.ms is a broker side config?
Sorry, I forgot that. It is updated now.
Hmm, did you push?
val r = getOrCreateReplica(replica)
if (!partitionStateInfo.isr.contains(replica))
  r.resetLastCaughtUpTime()
})
I think we just need to optimize for the common case. The common case is that a replica is switching from follower to leader and all existing ISRs are expected to continue to be caught up. So, setting lastCaughtUpTime to now for ISRs avoids ISR churn. This is what the original code was trying to do (by initializing lastCaughtUpTimeMsUnderlying to the current time when a replica is created). It is true that in the worst case, a replica will be removed from ISR after 2 x ReplicaLagTimeMaxMs, but that should be rare.
I see. It is fixed as you suggested. Thanks for the explanation.
@@ -592,13 +600,13 @@ class ReplicaManager(val config: KafkaConfig,
_: ReplicaNotAvailableException |
_: OffsetOutOfRangeException) =>
LogReadResult(FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY), -1L,
partitionFetchSize, false, Some(e))
-1L, -1L, partitionFetchSize, Some(e))
Could we use named params here and below?
Sure. All instantiations of LogReadResult are using named params now.
I didn't change this one because most values are already -1L in this instance.
(assignedReplicas() - leaderReplica).foreach(replica => {
  val lastCaughtUpTimeMs = if (inSyncReplicas.contains(replica)) curTimeMs else 0L
  replica.resetLastCaughtUpTime(curLeaderLogEndOffset, curTimeMs, lastCaughtUpTimeMs)
})
Nit: a multi-line foreach can be written more clearly like this (eliminating some parens):
foreach { replica =>
  ...
}
There's one other case like this in the diff.
Thanks. I have fixed this in two places.
@lindong28 : Thanks for the patience. A few more comments on the latest patch.
(assignedReplicas() - leaderReplica).foreach{replica =>
  val lastCaughtUpTimeMs = if (inSyncReplicas.contains(replica)) curTimeMs else 0L
  replica.resetLastCaughtUpTime(curLeaderLogEndOffset, curTimeMs, lastCaughtUpTimeMs)
}
A couple of things.
- I think we only need to call resetLastCaughtUpTime() if isNewLeader is true.
- It's a bit confusing to explicitly pass in lastFetchLeaderLogEndOffset and lastFetchTimeMs. I am thinking perhaps it's better to have two separate methods resetAsCaughtUp(currentTime: Long) and resetAsNotCaughtUp(). For in-sync replicas, we call the former and for other replicas, we call the latter. In resetAsCaughtUp(), we set lastCaughtUpTime to currentTime, and also set lastFetchLeaderLogEndOffset = 0 and lastFetchTimeMs = currentTime, which makes it a bit hard for the replica to get dropped out of ISR. In resetAsNotCaughtUp(), we set the lastCaughtUpTime to 0 and also set lastFetchLeaderLogEndOffset = MAX_LONG and lastFetchTimeMs = 0, which makes it a bit hard for the replica to get added back to ISR.
- I think we need to call resetLastCaughtUpTime() even if isNewLeader is false. For example, say a follower is fully caught up before it goes offline. If the leader doesn't reset lastCaughtUpTime of this follower, then for the following ReplicaLagTimeMaxMs the replica will stay in ISR and the hw of this partition cannot increase, which essentially makes this partition unavailable to producers.
- I think we should initialize lastFetchLeaderLogEndOffset to the current LEO of the leader and lastFetchTimeMs to the current time regardless of whether the follower is in ISR. Although the variables are named lastFetch* and updated when a fetch request is received, they actually record the latest snapshot of a (leader LEO, time) pair. They should be interpreted this way: the leader's LEO is x at time t; if a follower's LEO >= x, then its lastCaughtUpTimeMs >= t. Given this interpretation, it probably doesn't make sense to set lastFetchLeaderLogEndOffset = MAX_LONG and lastFetchTimeMs = 0.
Thanks for the explanation. Both make sense.
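A hedged sketch of the leadership-change reset agreed on above (names are illustrative, not the actual Partition/Replica code): every follower gets the (current leader LEO, now) snapshot, and only followers already in the ISR keep a fresh lastCaughtUpTimeMs so they are not dropped from the ISR before their first fetch:

```java
// Hypothetical sketch of resetting follower state when a broker becomes leader.
class FollowerResetSketch {
    long lastFetchLeaderLogEndOffset;
    long lastFetchTimeMs;
    long lastCaughtUpTimeMs;

    // Snapshot the leader's current LEO and the current time for every follower;
    // in-sync followers are also treated as caught up as of "now".
    static FollowerResetSketch resetOnBecomeLeader(boolean inIsr, long curLeaderLogEndOffset, long curTimeMs) {
        FollowerResetSketch s = new FollowerResetSketch();
        s.lastFetchLeaderLogEndOffset = curLeaderLogEndOffset;
        s.lastFetchTimeMs = curTimeMs;
        s.lastCaughtUpTimeMs = inIsr ? curTimeMs : 0L;
        return s;
    }
}
```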
@@ -36,25 +36,51 @@ class Replica(val brokerId: Int,
// for local replica it is the log's end offset, for remote replicas its value is only updated by follower fetch
@volatile private[this] var logEndOffsetMetadata: LogOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata
// The log end offset value at the time the leader receives last FetchRequest from this follower.
// This is used to determine the last catch-up time of the follower
catch-up => caught up; There are a few other places like that.
Sorry. Previously I corrected all places by doing grep catchup -Iirn but missed this. Just now I corrected two places after doing grep catch-up -Iirn. I hope I have caught everything like this.
@@ -36,25 +36,51 @@ class Replica(val brokerId: Int,
// for local replica it is the log's end offset, for remote replicas its value is only updated by follower fetch
@volatile private[this] var logEndOffsetMetadata: LogOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata
// The log end offset value at the time the leader receives last FetchRequest from this follower.
// This is used to determine the last catch-up time of the follower
@volatile private var lastFetchLeaderLogEndOffset: Long = 0L
I think it's better to initialize the replica to the notCaughtUp state, i.e., lastCaughtUpTime = 0, lastFetchLeaderLogEndOffset = MAX_LONG and lastFetchTimeMs = 0?
@@ -210,7 +210,9 @@
/** <code>request.timeout.ms</code> */
public static final String REQUEST_TIMEOUT_MS_CONFIG = CommonClientConfigs.REQUEST_TIMEOUT_MS_CONFIG;
private static final String REQUEST_TIMEOUT_MS_DOC = CommonClientConfigs.REQUEST_TIMEOUT_MS_DOC;
private static final String REQUEST_TIMEOUT_MS_DOC = CommonClientConfigs.REQUEST_TIMEOUT_MS_DOC
+ " request.timeout.ms should be larger than value of broker side config replica.lag.time.max.ms"
should be larger than value of broker side config replica.lag.time.max.ms => should be larger than replica.lag.time.max.ms, a broker side configuration ?
Sure. Fixed now.
@lindong28 : Thanks for the patch. LGTM |
Finally had a chance to go through this in detail, seems like a nice improvement. |
…ched up to the logEndOffset of leader Author: Dong Lin <lindong28@gmail.com> Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com> Closes apache#2208 from lindong28/KAFKA-4485
No description provided.