[SPARK-18779][STREAMING][KAFKA] Messages being received only from one partition when using Spark Streaming integration for Kafka 0.10 with kafka client library at 0.10.1 #16278
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This pull request is to fix for SPARK-18779. When using kafka 0.10.1.0 messages are being read only from one partition. The current kafka-spark 0.10 integration ships with kafka 0.10.0.1 where messages are read from all partitions but using kafka client 0.10.1.0 client, messages are read from only one partition.
In the ConsumerStrategy class there is a pause on the consumer. We never resume the consumer and that seems to causing the issue. The KafkaConsumer implementation has changed between 10.0.1 and 10.1.0 which has exposed this issue. The solution to this issue is to resume the consumer before we find the position in DirectKafkaInputDStream class in the latestOffsets method.The reason the issue is not seen in the current setup is because pause/resume logic is changed in the latest kafka version. We dont seem to have a resume for the pause and hence this fix is necessary.
This patch fixes the issue.
How was this patch tested?
The spark-kafka test cases were run to check no regressions were caused. I have checked that messages are being read from all partitions for both 0.10.0.1 kafka client and 0.10.1.0 client.