[BEAM-2455] Backlog size retrieval for Kinesis source #3551
pawel-kaczmarczyk wants to merge 2 commits into apache:master from ocadotechnology:kinesis_backlog_size
Conversation
@jbonofre @davorbonaci @jkff, can you please take a look?

R: @jbonofre
        .setStreamName(streamName)
        .setInitialPosition(
            new StartingPoint(checkNotNull(initialPosition, "initialPosition")))

public Read withStreamName(String streamName) {
This is a backwards-incompatible change to a public API: it will cause user code to stop compiling. Beam has had an API-stable release, so we can no longer make incompatible changes: please revert this one, or make it backwards-compatible. Same applies to other renames and signature changes in this PR.
Thanks @jkff. I thought there was still time to shape the API of the Kinesis connector as it's still at an early development stage. It also has the @Experimental(Experimental.Kind.SOURCE_SINK) annotation, which states that there might be incompatible changes in the API.
So in case we want to keep it backward-compatible, I would suggest marking the old "from" methods as deprecated and keeping the newly added "with" methods (as per your suggestion in https://issues.apache.org/jira/browse/BEAM-1428).
There was also a rename of the KinesisClientProvider interface to AWSClientsProvider. This one fortunately shouldn't break anything, as KinesisClientProvider was not public (though it should be, as was fairly pointed out in #3540). So the interface, along with the corresponding builder method "withAWSClientsProvider", can be safely renamed.
What do you think?
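For illustration only, a minimal sketch of the deprecation approach being proposed, assuming an AutoValue-style toBuilder() inside KinesisIO.Read; the old `from` signature below is a guess, since the previous method shapes are not shown in this excerpt:

```java
/**
 * Reads from the given Kinesis stream.
 *
 * @deprecated use {@link #withStreamName(String)} instead.
 */
@Deprecated
public Read from(String streamName) {
  // Keep the old entry point compiling by delegating to the new "with" method.
  return withStreamName(streamName);
}

/** Reads from the given Kinesis stream. */
public Read withStreamName(String streamName) {
  return toBuilder().setStreamName(streamName).build();
}
```

With this shape, existing user code calling the old methods keeps compiling, while new code can move to the with* variants before the deprecated methods are eventually removed.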
Oh right, I didn't notice it was marked Experimental. I guess then it's fine to proceed with the rest of the changes in this PR; I defer further review to JB.
Hi @jbonofre, did you have a chance to take a look at this PR?

On it pretty soon.

Hi @jbonofre, it's just a reminder ;)

R: @rangadi can you review this?
 * When this limit is exceeded the runner might decide to scale the amount of resources
 * allocated to the pipeline in order to speed up ingestion.
 */
public Read withAllowedRecordLateness(Duration allowedRecordLateness) {
AllowedLateness in Beam refers to event time (i.e. how late a record's event time can be compared to the current watermark). Here it looks like you want to let the user provide some buffer before the source starts counting the backlog. Could you rename it?
Looks like the current implementation essentially disables the backlog if this is > 0, right?
As I wrote in the general comment, this parameter should disable scaling up when events are late by no more than allowedRecordLateness. So maybe I will rename it to "upToDateThreshold" so it's not confused with the AllowedLateness concept in Beam. What do you think?
You are right that the current implementation will actually never check the backlog size, due to the way the reader's watermark is implemented. I'm planning to fix this in https://issues.apache.org/jira/browse/BEAM-2467. Maybe I should have started the work with that other ticket, but I noticed the dependency after submitting this PR. I've also reported 2 other improvements for this Kinesis source that will be submitted once these 2 are ready.
Yes, a different name would be better. A "threshold" suffix seems appropriate.
The order of fixes could be either way, whichever is convenient for you. If this one goes in first, please add a TODO comment here noting that the feature does not work until the watermark is implemented as in the JIRA.
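A rough sketch of how the renamed setter and the requested TODO could look; the javadoc wording and the builder internals here are assumptions, not the PR's final code:

```java
/**
 * Specifies how much the records may lag behind real time before the source
 * starts reporting a non-zero backlog to the runner.
 *
 * <p>TODO: this has no effect until {@link KinesisReader#getWatermark()} is
 * fixed as described in https://issues.apache.org/jira/browse/BEAM-2467.
 */
public Read withUpToDateThreshold(Duration upToDateThreshold) {
  return toBuilder().setUpToDateThreshold(upToDateThreshold).build();
}
```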
}
long backlogBytes = lastBacklogBytes;
try {
  backlogBytes = kinesis.getBacklogBytes(source.getStreamName(), watermark);
This method might be called quite often. Since the stats are updated only once every minute, you could cache the values for a few seconds. Since getWatermark() returns now, does getBacklogBytes ever return non-zero?
The getWatermark() method must be fixed in order to enable backlog size counting, as I mentioned in my previous comment.
Do you think caching is essential now? Setting the allowedRecordLateness parameter to some reasonable number should significantly reduce the number of calls here.
I would think so, but you can postpone it with a comment noting the performance concern. With backlogThreshold (lateness), this might be moot when the pipeline is up to date, but the extra latency penalty kicks in when the pipeline already has a high backlog, which can exacerbate the problem that caused the delay in the first place.
Caching requires just storing the last backlog value and request time in the local transient state of the reader.
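A minimal sketch of that caching, reusing the names already visible in this diff (lastBacklogBytes, backlogBytesLastCheckTime, backlogBytesCheckThreshold, kinesis.getBacklogBytes); the exception type and the surrounding reader class are assumptions:

```java
@Override
public long getTotalBacklogBytes() {
  // Serve the cached value while it is still fresh; CloudWatch metrics are only
  // refreshed about once a minute, so asking more often just adds RPC latency.
  if (backlogBytesLastCheckTime.plus(backlogBytesCheckThreshold).isAfterNow()) {
    return lastBacklogBytes;
  }
  try {
    lastBacklogBytes = kinesis.getBacklogBytes(source.getStreamName(), getWatermark());
    backlogBytesLastCheckTime = Instant.now();
  } catch (TransientKinesisException e) {
    // On a transient metrics failure, keep reporting the previous estimate.
  }
  return lastBacklogBytes;
}
```

The cache lives only in the reader's transient state, so nothing extra needs to be persisted in checkpoints.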
rangadi left a comment
Thanks for the PR. Could you add some description of what this does and any relevant design decisions? The linked JIRA does not have much info either.
Thanks for your review @rangadi. The only purpose of this backlog size counting is to enable autoscaling. The Kinesis API does not provide any information about the remaining events in the stream, so I had to use the CloudWatch service in order to get that info. I've also introduced the allowedRecordLateness parameter in order to limit the autoscaling when events are not that late. It helped in our case, where we observed worker count fluctuations due to temporary event bursts.
Replied inline to the comments in the code. I do see why you have 'threshold'; my comment was mainly about the 'lateness' in the name.

Ok, I've just applied the suggested changes. Can you please take a look?

Hi @pawel-kaczmarczyk, thanks for the updates. Could you ask someone familiar with KinesisIO to +1 the overall solution? I don't know much about Kinesis.
 * speed up ingestion.
 *
 * <p>
 * TODO: This feature will not work properly with current {@link KinesisReader#getWatermark()}
Not sure I understand this comment. It seems to say that the feature is currently working incorrectly; in that case, I'd suggest removing the feature and re-adding it when it's possible to make it work correctly, because there's no benefit to the user in having the API expose a feature that doesn't work. However, you say above that you've been running this and it was working for you. Can you clarify?
Thanks @jkff for your review. This feature works for us in our custom Kinesis connector implementation for the Dataflow SDK, where we have implemented some features required for production usage. Now I'm trying to port these features to the Beam SDK (BEAM-2455, BEAM-2467, BEAM-2468, BEAM-2469). We already have all the required code, but it just seemed a better option to split the changes into a few pull requests for clarity. Now I see that I should have submitted BEAM-2467 before this one, but as I explained to Raghu, I noticed the dependency after submitting this PR.
So if it's not a problem, I would leave this feature with a TODO comment saying that it will be fixed in BEAM-2467.
if (backlogBytesLastCheckTime.plus(backlogBytesCheckThreshold).isAfterNow()) {
  return lastBacklogBytes;
}
long backlogBytes = lastBacklogBytes;
Not sure why you need this: why not just do lastBacklogBytes = kinesis.getBacklogBytes at line 178, and get rid of the backlogBytes variable and line 185?
It just seemed clearer, but it's not needed. I'll apply your suggestion.
Also - @rtshadow wrote the original Kinesis connector; maybe he can help with this review too?
public long getTotalBacklogBytes() {
  Instant watermark = getWatermark();
  if (watermark.plus(upToDateThreshold).isAfterNow()) {
    return 0L;
@pawel-kaczmarczyk I don't understand why we cannot simply return the actual number of backlog bytes. It should be up to the execution service to decide whether that number is high enough that scaling should occur.
We can simply return the backlog bytes, and it works well in most cases. However, in our case we observed unnecessary scaling during short spikes. There were also issues when there were not that many events in the stream backlog but Kinesis was slow to return them. In such a situation events lag behind only for a minute or two, but the runner may still decide to scale up.
Fair enough.
Can you please write that explanation in a comment in the code? Otherwise it's hard to understand why it is here.
Also, I'm not a big fan of "upToDateThreshold", but can't really come up with anything much better.
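For example, the explanation could be folded into the existing check roughly like this; the comment wording is only a suggestion, not the PR's final text:

```java
// Consider the pipeline "caught up" if the watermark lags by no more than
// upToDateThreshold. This avoids reporting a backlog (and triggering scale-up)
// during short event spikes, or when Kinesis is briefly slow to return records
// even though the remaining backlog is small.
if (watermark.plus(upToDateThreshold).isAfterNow()) {
  return 0L;
}
```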
This sounds like something that should be fixed in the Dataflow Runner's streaming autoscaling logic (which, conveniently, is @rangadi's realm :) ) rather than in KinesisIO, though I'm ok keeping it as a temporary workaround while the Dataflow Runner is being fixed. Could you file a JIRA explaining the behavior you observed from the Dataflow runner's autoscaling, and link it from here, with a TODO to remove the knob?
Also @rangadi, can you confirm that according to your understanding this knob is in fact conceptually a good workaround for the poor behavior of streaming autoscaling in this case?
Sort of, yes. In Dataflow autoscaling, some of the important parameters are not tunable. E.g. it upscales if the estimated time for the remaining work is longer than 15 seconds. Not all pipelines might be this sensitive. Also, in this case the actual backlog is reported by another metrics aggregator, which seems to update only every minute.
This is a reasonable workaround. I'm not a fan of the name of the method either, but don't have a much better suggestion.
  if (watermark.plus(upToDateThreshold).isAfterNow()) {
    return 0L;
  }
  if (backlogBytesLastCheckTime.plus(backlogBytesCheckThreshold).isAfterNow()) {
is this method called so often that we need to cache the response from CloudWatch?
I'm not sure how often this gets called in other runners, but it wasn't a problem in the case of Dataflow. This was suggested by @rangadi as a precaution.
Another important aspect is that the comment implied this backlog is updated on Kinesis only at minute granularity. Fetching it more often does not have any advantage.
Dataflow calls the getWatermark() method after every "bundle", which could take only a few milliseconds or at most on the order of a second. Assuming fetching the backlog adds latency due to an RPC to an external system, it could severely affect performance without providing any benefit.
Be sure to do all of the following to help us incorporate your contribution quickly and easily:
- Make sure the PR title is formatted like: [BEAM-<Jira issue #>] Description of pull request
- Make sure tests pass via mvn clean verify.
- Replace <Jira issue #> in the title with the actual Jira issue number, if there is one.
- If this contribution is large, please file an Apache Individual Contributor License Agreement.