
[SPARK-18779][STREAMING][KAFKA] Messages being received only from one partition when using Spark Streaming integration for Kafka 0.10 with kafka client library at 0.10.1 #16278

Closed

Conversation

@pnakhe commented Dec 14, 2016

What changes were proposed in this pull request?

This pull request fixes SPARK-18779. When the Kafka 0.10.1.0 client library is used, messages are read from only one partition. The Spark Streaming integration for Kafka 0.10 currently ships with Kafka client 0.10.0.1, with which messages are read from all partitions; switching to the 0.10.1.0 client causes messages to be read from only one partition.

In the ConsumerStrategy class there is a pause() on the consumer, but the consumer is never resumed, and that appears to be causing the issue. The KafkaConsumer implementation changed between 0.10.0.1 and 0.10.1.0, which exposed the problem: the pause/resume logic was reworked in the newer client, so a pause() without a matching resume() leaves the partitions paused. The fix is to resume the consumer in the latestOffsets method of the DirectKafkaInputDStream class, before the positions are queried.

This patch fixes the issue.
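For illustration, here is a minimal Scala sketch of what the resume-before-position change amounts to. The method signature and the `consumer` parameter are stand-ins for the fields of DirectKafkaInputDStream, not the exact Spark source:

```scala
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.Consumer
import org.apache.kafka.common.TopicPartition

// Sketch of latestOffsets with the proposed fix applied. The consumer was
// paused earlier (in ConsumerStrategy); resuming it before asking for the
// end-of-log positions is what this PR proposes. Without the resume, the
// 0.10.1.0 client keeps the partitions paused.
def latestOffsets(consumer: Consumer[Array[Byte], Array[Byte]]): Map[TopicPartition, Long] = {
  val parts = consumer.assignment()   // all currently assigned partitions
  consumer.resume(parts)              // the proposed fix: undo the earlier pause()
  consumer.seekToEnd(parts)           // move each partition to its latest offset
  parts.asScala.map(tp => tp -> consumer.position(tp)).toMap
}
```

In the real DStream these positions are then compared against the current offsets to form each batch's offset ranges.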

How was this patch tested?

The spark-kafka test suite was run to check that no regressions were introduced. I have verified that messages are read from all partitions with both the 0.10.0.1 and the 0.10.1.0 Kafka client.
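As a rough sketch of the manual check described above (broker address, topic name, and group id are placeholders), counting records per partition in each batch makes the symptom easy to spot: with the bug, only one partition ever shows traffic.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Counts records per Kafka partition for every 5-second batch. If the
// pause/resume bug is present, only one partition appears in the output.
object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-check").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",   // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "partition-check",
      "auto.offset.reset" -> "earliest")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("test-topic"), kafkaParams))
    stream.foreachRDD { rdd =>
      rdd.map(r => (r.partition, 1L)).reduceByKey(_ + _).collect().foreach(println)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```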

@AmplabJenkins

Can one of the admins verify this patch?

@zsxwing commented Dec 14, 2016

"pause/resume logic is changed in the latest kafka version"

Could you post the Kafka JIRA for this change? Just want to understand the issue.

@HyukjinKwon

(@pnakhe gentle ping, I am curious too)

@pnakhe commented Feb 9, 2017

@HyukjinKwon Well, the issue was not with Spark after all. It was a regression in the Kafka client between 0.10.1.0 and 0.10.1.1, fixed as part of https://issues.apache.org/jira/browse/KAFKA-4547

I have updated the defect accordingly.

@HyukjinKwon

Aha, thanks for the details, then is this PR/JIRA closable maybe?

srowen added a commit to srowen/spark that referenced this pull request Mar 22, 2017
@srowen srowen mentioned this pull request Mar 22, 2017
@asfgit asfgit closed this in b70c03a Mar 23, 2017