KAFKA-4089: KafkaProducer expires batch when metadata is stale #1791
Conversation
* The metadata has expired if an update is explicitly requested or an update is due now.
*/
public synchronized boolean hasExpired(long nowMs) {
    return this.needUpdate || this.timeToNextUpdate(nowMs) == 0;
It seems that the body of this method is equivalent to the following, which is already used in the method timeToNextUpdate(..). This duplicates the same logic in two methods:

long timeToExpire = needUpdate ? 0 : Math.max(this.lastSuccessfulRefreshMs + this.metadataExpireMs - nowMs, 0);
return timeToExpire == 0;
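For illustration, one way to remove the duplication might be to extract the shared computation into a private helper that both hasExpired(..) and timeToNextUpdate(..) call. A minimal sketch, assuming the field names from the snippet above (this is not the actual patch):

// Hypothetical helper so the expiry arithmetic lives in exactly one place.
private long timeToExpire(long nowMs) {
    // 0 means "expired now": an update was explicitly requested, or the
    // last successful refresh is older than metadataExpireMs.
    return needUpdate ? 0 : Math.max(this.lastSuccessfulRefreshMs + this.metadataExpireMs - nowMs, 0);
}

public synchronized boolean hasExpired(long nowMs) {
    return timeToExpire(nowMs) == 0;
}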
ping @junrao. Will you be able to take a look? Some context on the patch: the current expiration mechanism is based on the muted partitions in the RecordAccumulator.
Overall I think this change makes sense after a discussion with @becketqin @MayureshGharat @sutambe
It is interesting to summarize the evolution of the effect of request timeout on batches in the accumulator. Let me know if you see any errors. cc others who were involved in those patches @junrao @hachikuji @ijuma
- KIP-19: if partition metadata is missing, then a batch can expire once it has been ready for longer than the request timeout.
- KAFKA-2805: removed the check on missing metadata, i.e., any batch that has been ready for longer than the request timeout can expire. (This was to handle the case of a cluster being down and thus no metadata refresh.)
- KAFKA-3388: expire only if we think we cannot send; and we infer that we can send based on the fact that there is an in-flight request that has not yet timed out.
- KAFKA-4089: now adds the additional check that metadata isn't too old/stale before expiring, i.e., metadata exists and is within metadata-max-age + (request timeout + backoff) * 3.
I left a couple of minor comments in the patch and may have a few more as I'd like to do another pass. In the meantime, can you address these and think about updating our HTML docs? I think the request timeout documentation is woefully inadequate, especially with these nuances at play. It may also be reasonable to add a one-liner to the original KIP-19 wiki stating that there have been a few changes to its interpretation and that the user should refer to the current HTML docs as the source of truth. In fact, there are a few statements in that KIP that I find are no longer true (even prior to this patch).
@@ -47,6 +47,8 @@
public static final long TOPIC_EXPIRY_MS = 5 * 60 * 1000;
private static final long TOPIC_EXPIRY_NEEDS_UPDATE = -1L;
private static final long DEFAULT_RETRY_BACKOFF_MS = 100L;
private static final long DEFAULT_METADATA_MAX_AGE_MS = 60 * 60 * 1000L;
Pre-existing, but this is slightly inconsistent with the max age default in the producer config (which is five minutes). One hour seems too high for this.
Restored DEFAULT_METADATA_MAX_AGE_MS to 5 mins.
@@ -223,17 +223,29 @@ private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, C
* Abort the batches that have been sitting in RecordAccumulator for more than the configured requestTimeout
* due to metadata being unavailable
This comment needs to be updated.
fixed.
// (1) the partition does not have a batch in flight.
// (2) Either the metadata is too stale or we don't have a leader for a partition.
//
// The first condition prevents later batches from being expired while an earlier batch is
The current comment makes a statement that seems to apply more generally and only later narrows it down to the strict ordering case. So I would make this a little clearer by stating at the beginning itself that this applies to max.in.flight=1. Something like:

In the case where the user wishes to achieve strict ordering (i.e., max.in.flight.requests.per.connection=1), the first condition helps ensure that batches also expire in order. (Partitions are only muted if strict ordering is enabled and there are in-flight batches.)
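For context, the muting behavior being referred to looks roughly like the following in Sender (a sketch pieced together from the identifiers quoted in this thread — guaranteeMessageOrder, mutePartition — rather than the exact patch):

// Partitions are muted only when strict ordering is requested
// (guaranteeMessageOrder == true, i.e. max.in.flight.requests.per.connection=1),
// and only while a batch for that partition is in flight.
if (guaranteeMessageOrder) {
    for (List<RecordBatch> batchList : batches.values()) {
        for (RecordBatch batch : batchList)
            this.accumulator.mutePartition(batch.topicPartition);
    }
}

The partition is unmuted again when its in-flight batch completes, which is what makes the !muted check a per-partition "no batch in flight" test.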
// Note that `muted` is only ever populated if `max.in.flight.request.per.connection=1` so this protection
// We check if the batch should be expired if we know that we can't make progress on a given
// topic-partition. Specifically, we check if
// (1) the partition does not have a batch in flight.
the partition does not have an in-flight batch
// is only active in this case. Otherwise the expiration order is not guaranteed.
if (!muted.contains(tp)) {
//
// The second condition allows expiration of lingering batches if we don't have a leader for
Do you mean ready (as opposed to lingering)?
//
// Finally, we expire batches if the last metadata refresh was too long ago. We might run in to
// this situation when the producer is disconnected from all the brokers. Note that stale metadata
// is significantly longer than metadata.max.age.
"Significantly longer" needs to be made more clear/precise - see my overall comment.
@@ -375,6 +382,19 @@ public void wakeup() {
    this.client.wakeup();
}

/* metadata becomes "stale" for batch expiry purpose when the time since the last successful update exceeds
 * the metadataStaleMs value. This value must be greater than the metadata.max.age and some delta to account
 * for a few retries and transient network disconnections. A small number of retries (3) are chosen because
I'm not terribly thrilled about a hard-coded retry limit here. Let me think a little more on this but I understand that there may not be a much better way.
Discussed offline. This can instead be the producer's retry count. If the retry count is zero, then we will have to allow for at least one metadata request past its max age. So the staleness threshold will be: max(retryCount, 1) * (backoff + requestTimeout) + maxAge.
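A quick sketch of that threshold (parameter names are illustrative; the shape matches the getMetadataStaleMs helper quoted later in this review):

static long metadataStaleMs(long maxAgeMs, long backoffMs, long requestTimeoutMs, int retries) {
    // With retries == 0 we still allow for at least one metadata request past max age.
    return maxAgeMs + Math.max(retries, 1) * (backoffMs + requestTimeoutMs);
}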
Force-pushed a9f3661 to 8358632.
Two more points:
//
// Finally, we expire batches if the last metadata refresh was too long ago. I.e., > {@link Sender#metadataStaleMs}.
// We might run in to this situation when the producer is disconnected from all the brokers.
if (!muted.contains(tp) && (isMetadataStale || cluster.leaderFor(tp) == null)) {
Rather than using !muted, I think it would be much clearer to alias this to a boolean named guaranteeExpirationOrder, similar to the guaranteeMessageOrder boolean in Sender.
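Something along these lines (a sketch of the suggestion; guaranteeExpirationOrder is the reviewer's proposed name, not code from the patch):

// Alias the muted-set check so the intent is explicit at the call site.
boolean guaranteeExpirationOrder = !muted.contains(tp);
if (guaranteeExpirationOrder && (isMetadataStale || cluster.leaderFor(tp) == null)) {
    // ... batch expiry check for this partition ...
}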
A couple of minor comments while I review the more interesting parts. :)
* The max allowable age of metadata.
*/
public long maxAge() {
    return this.metadataExpireMs;
Why are we using two different names here and elsewhere? It would be good to be consistent. Also, whatever we choose, we generally include the unit at the end (e.g. maxAgeMs).
metadata.maxAgeMs makes sense to me; however, metadata.maxMetadataAgeMs sounds redundant.
@@ -334,6 +334,7 @@ private KafkaProducer(ProducerConfig config, Serializer<K> keySerializer, Serial
} catch (Throwable t) {
    // call close methods if internal objects are already constructed
    // this is to prevent resource leak. see KAFKA-2121
    t.printStackTrace();
Left by mistake?
right. zapped.
Force-pushed 8358632 to 76c453d.
* the backoff, staleness determination is delayed by that factor.
*/
static long getMetadataStaleMs(long metadataMaxAgeMs, int requestTimeoutMs, long refreshBackoffMs, int retries) {
    return metadataMaxAgeMs + Math.max(retries, 1) * (requestTimeoutMs + refreshBackoffMs);
Is 1 correct here? The comment says 3 retries.
fixed.
@@ -49,9 +49,11 @@
public static final long TOPIC_EXPIRY_MS = 5 * 60 * 1000;
private static final long TOPIC_EXPIRY_NEEDS_UPDATE = -1L;
private static final long DEFAULT_RETRY_BACKOFF_MS = 100L;
private static final long DEFAULT_METADATA_MAX_AGE_MS = 300 * 1000L;
Would it be better to put that in ConsumerConfig so that we can reference the constant when defining METADATA_MAX_AGE_CONFIG in ConsumerConfig?
fixed.
this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);

boolean stale = isMetadataStale(now, metadata, this.metadataStaleMs);
if (stale || !result.unknownLeaderTopics.isEmpty()) {
Hmm, is the logic here correct? Let's say there is a network issue and the producer can't talk to any broker. None of the leaders will be missing (the cached metadata still lists a leader for every partition), so it will take at least 300 seconds before the metadata becomes stale. Then some of the records will have to sit in the buffer pool much longer than the default 30-second request timeout before being expired. This seems like a big change in behavior.
Also, would it be better to consolidate the check here into accumulator.abortExpiredBatches()?
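For concreteness, the rough arithmetic behind that concern, using the defaults quoted in this thread (a sketch; the variable names are illustrative):

int retries = 0;                 // old producer default; the patch treats this as at least 1
long maxAgeMs = 300_000L;        // metadata.max.age.ms default (5 minutes)
long requestTimeoutMs = 30_000L; // request.timeout.ms default
long backoffMs = 100L;           // retry.backoff.ms default
long staleMs = maxAgeMs + Math.max(retries, 1) * (requestTimeoutMs + backoffMs);
// staleMs = 330,100 ms, i.e. ~5.5 minutes before batches can expire,
// versus roughly the 30-second request timeout under the previous behavior.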
The condition in Sender.java is a quick check to avoid expensive iteration in abortExpiredBatches when we can. I.e., when metadata is fresh and the leaders for all partitions are known, there's no reason to drop into the iteration in abortExpiredBatches; we'll not expire any batches in that case.
@becketqin @sutambe: I am not sure that I fully understand the motivation of the change. For the MirrorMaker issue that you mentioned, is it because the metadata update request takes long? Couldn't you just configure a larger request timeout for the producer in MirrorMaker?
@junrao @ijuma thanks for taking a look. Initially, I had also wondered about bumping up the request timeout to a much higher value (depending on how many batches are pending in the partition). It could work, but I think there are shortcomings in the original request timeout proposal (KIP-19), its implementation, and the follow-ups that are worth summarizing.

First off, with KIP-19 the primary objective was to add a request timeout to requests that were out on the wire. Without that, sudden broker death (i.e., killed, disk failure, etc.) could cause clients to hang and the accumulator to fill up even with a new leader being available. KIP-19's scope broadened quite a bit to include the application-level send timeout: per the original KIP, a batch in the accumulator could be expired after the request timeout, but only if its partition metadata was missing. That condition did not work well with the scenario of a cluster being down, which is why the metadata check was removed in KAFKA-2805, and in KAFKA-3388 we sort of went back to the metadata check (specifically, see #1056 (diff)).

Bumping up the request timeout works but is an artificial way to get around this. It would be clearer to have a separate accumulator timeout, but that leaks too much of the producer's internals to the user IMO, and there are still various nuances to consider.

So yes, there is a change in behavior, but I think we have already been making similar changes in the referenced jiras, so its effect is not as significant as it seems. This is especially so because KIP-19 made gross oversimplifications on accumulator timeouts as I described above; it is reasonable to believe that users are for the most part unaware of these nuances. In fact, one only has to read the current javadoc for ...
Force-pushed 1e5dbcc to b01ae18.
// Test batches not in retry
for (int i = 0; i < appends; i++) {
    accum.append(tp1, 0L, key, value, null, maxBlockTimeMs);
    assertEquals("No partitions should be ready.", 0, accum.ready(cluster, time.milliseconds()).readyNodes.size());
}
// Make the batches ready due to batch full
// subtest1: Make the batches ready due to batch full
Rather than having one giant test which includes these subtests, could we split them into separate test cases? It's easier to debug failures that way since each test case is more isolated. It also lets you provide a meaningful name for each test, which makes the nature of a failure clearer just from looking at the build output.
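For example, the split might look something like this (a hypothetical sketch — the test names and the newAccumulator()/appendRecords() helpers are illustrative, not from the patch):

// Each subtest becomes its own @Test with a descriptive name;
// the shared setup moves into helper methods.
@Test
public void testBatchesReadyWhenFull() {
    RecordAccumulator accum = newAccumulator();
    appendRecords(accum, appends);
    assertEquals("Our partition's leader should be ready", 1,
        accum.ready(cluster, time.milliseconds()).readyNodes.size());
}

@Test
public void testBatchExpiryWhenMetadataIsStale() {
    RecordAccumulator accum = newAccumulator();
    appendRecords(accum, appends);
    // ... advance time past the staleness threshold and assert the batches expire ...
}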
@junrao We can bump up the request timeout to avoid expiration of the batches in the accumulator, but it is really a workaround, as Joel said. The following is my take on KIP-19 and the timeouts.

Request timeout really serves the purpose of preventing the clients from waiting for a response forever. The reason the clients time out a request is that they have no idea whether the broker can still make progress. It is something like: my laptop does not respond to anything and I don't know what happened, so I reboot it.

Ideally we should have a separate timeout configuration for batches in the accumulator, to prevent the batches from sitting in the accumulator forever. If the accumulator were a completely independent module, it would be similar to something like a ...

Another important reason we do not have a separate timeout config for the batches in the accumulator is that this timeout is not an SLA, i.e. we do not provide a guarantee of a strict batch timeout in the accumulator. This is different from the request timeout, which is both an SLA and a "panic then reboot". That is a compromise we made in KIP-19 because it is difficult to do and sometimes unreasonable.

So my understanding of KIP-19 is that:
As of the past changes, ...
@junrao - thanks again for taking the time to think through this. I agree with the theme of these comments that we should ideally provide guaranteed timeouts to users. I'm also certain that most users currently have the wrong perception that there is such a thing as a guaranteed timeout in the producer today. The reality is that there are so many ifs and buts in the current code that they will be surprised to learn that we actually do not have a guaranteed timeout from when the send returns to when the callback fires (or the future is ready).

I think we are all in agreement that increasing the request timeout kind of works. The problem with this approach is that it forces exposing users to the intricacies of where the request timeout is applicable. As I mentioned earlier, I think KIP-19 (which was originally for wire timeouts) became overloaded with other timeouts, and the end state is inadequate. Even in its end state the intent was: if we can make progress, then do not time out batches in the accumulator; but over time the implementation diverged from that intent.

The root problem seems to be that we are using the request timeout for the time spent in the accumulator as well (and doing a poor job of that to boot). The clock starts ticking when the request is added to the accumulator; however, it is reset when the request is written out on the wire. So the code clearly admits guilt :) - it is overloading what was intended to be a wire timeout for the accumulator. So in order to accommodate a high-volume producer (such as a mirror maker that is catching up), in which batches may spend considerable time in the accumulator, it forces the user to set an artificially high request timeout. Again, this works, but at the expense of increasing the time-to-detect a broker failure. Also, reducing the buffer size is precisely the opposite of what should be done in a high-volume producer.

As suggested earlier in this conversation, we could have an explicit accumulator timeout to avoid these hidden effects. However, that exposes users to internals of the producer. I think @becketqin had a better idea - which is to add an explicit callback/future timeout. While I know that many are loath to add more timeouts, I think this addresses the problem and gives us the opportunity to actually provide intuitive guarantees to users. So the timeouts would end up being:
Any thoughts? We can put together a KIP if this sounds reasonable.
btw, this was touched upon in KIP-19 and we wanted to avoid the separate timeout due to its interaction with retries. So if we do this then we would document that ...
@jjkoshy, @becketqin: Yes, having a separate callback.timeout seems easier to understand than piggybacking on metadata.max.age.ms.

About the buffer pool size in the producer in MM: I am curious how large you set it to. Batching is definitely the most effective way of getting higher throughput, but setting it larger beyond a certain point won't help any more and only increases the amount of time a message spends in the accumulator. For example, the default 32MB can still allow the producer to batch about 3KB per partition with 10K partitions.
@jjkoshy I prefer ... Assuming we'll set a large default value for ... Thoughts?
@junrao I found it is a little difficult to tune the buffer pool size because it is shared across all the partitions. For example, in a bootstrap case, we may have topic T with a lot of data to copy while the other topics do not have much to catch up on. At a certain point, what could happen is that all the 32 MB of memory is taken by topic T, and that would be 2K batches for that topic assuming a 16 KB batch size. In this case, the producer may still expire the batches even though the buffer pool is already small. Another thing is that we actually allocate batches at ...

So it is kind of hard to tune the buffer pool size: on one hand 32 MB may still not be small enough to avoid expiration, but on the other hand 32 MB is too small for throughput.
@jjkoshy adding the callback timeout seems nice and makes things clearer. On the other hand, can we have a single timeout which Kafka internally splits up into these three, and can we also factor in retries? That way the user has to provide a single timeout, and the request can block during the initial phase or get back the results within that time period. I understand having a single timeout might not be possible, but IMO it would be good to give it a thought since we are planning to add one more timeout config.
@MayureshGharat I may have misunderstood, but exposing a single timeout and splitting internally sounds like it would make matters much worse. In fact, the suggestion above was to add an additional timeout to fix the issue that we are overloading one timeout for both the request timeout and the time spent in the accumulator.
Force-pushed b01ae18 to 4aa79b7.
Is it worth closing this PR until we have the KIP?
Closing the PR in anticipation of KIP-91.