
[SPARK-22056][Streaming] Add subconcurrency for KafkaRDDPartition #19274

Closed
wants to merge 2 commits

Conversation

@fhan688 fhan688 commented Sep 19, 2017

JIRA issue: https://issues.apache.org/jira/browse/SPARK-22056

When Spark Streaming consumes data from Kafka in the direct way, there is currently a bijection between partitions in Kafka and KafkaRDDPartitions in Spark Streaming. To enhance the computing capacity of Spark Streaming, we usually increase the number of partitions in Kafka, but increasing them too much can cause problems in Kafka, such as leader election issues.
So we introduce a new mechanism that changes the bijection into a one-to-many mapping, controlled by a new parameter named "topic.partition.subconcurrency". This mechanism divides one KafkaRDDPartition into many on the Spark Streaming side according to the parameter, which lets Spark Streaming use computing resources more efficiently and avoids the problems caused by increasing the number of Kafka partitions.
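The split the proposal describes could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual diff; the `OffsetRange` case class, the `split` helper, and the ceiling-division policy are all assumptions made for the sketch.

```scala
// Illustrative sketch only: split one Kafka partition's offset range
// [fromOffset, untilOffset) into `subconcurrency` contiguous sub-ranges,
// each of which would back its own Spark partition.
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

def split(range: OffsetRange, subconcurrency: Int): Seq[OffsetRange] = {
  val total = range.untilOffset - range.fromOffset
  // Ceiling division so every offset is covered exactly once.
  val step = math.max(1L, math.ceil(total.toDouble / subconcurrency).toLong)
  (range.fromOffset until range.untilOffset by step).map { start =>
    range.copy(fromOffset = start,
               untilOffset = math.min(start + step, range.untilOffset))
  }
}
```

For example, with "topic.partition.subconcurrency" set to 3, a range of 10 offsets would be split into contiguous sub-ranges of sizes 4, 4, and 2, each of which could be processed by a different task.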

We tested this in production, and the processing capacity of Spark Streaming improved noticeably.

@bjkonglu

I tried this method. It worked well.

@fhan688 fhan688 changed the title [SPARK-22056] Add subconcurrency for KafkaRDDPartition [SPARK-22056][Streaming] Add subconcurrency for KafkaRDDPartition Sep 20, 2017
@jerryshao
Contributor

jerryshao commented Sep 21, 2017

Will this break the assumption that one Kafka partition only maps to one Spark partition?

@fhan688
Author

fhan688 commented Sep 21, 2017

Yes. One Kafka partition will map to many Spark partitions, so that more executors can be used.

@jerryshao
Contributor

Hi @loneknightpy , thinking a bit about your PR, I believe this can also be done on the user side. Users could create several threads in one task (RDD#mapPartitions) to consume the records concurrently, so such a feature may not be necessary to land in Spark's code. What do you think?
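The user-side alternative described here could look roughly like the sketch below. It is illustrative only; `processConcurrently` is not an existing Spark API, and the pool size and helper names are assumptions.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Process one partition's records with a small thread pool inside a single
// task, e.g. rdd.mapPartitions(it => processConcurrently(it, 4)(handleRecord)).
def processConcurrently[A, B](records: Iterator[A], poolSize: Int)(process: A => B): Iterator[B] = {
  val pool = Executors.newFixedThreadPool(poolSize)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    // Materialize the futures first so all threads start working,
    // then block for the results in the original record order.
    val futures = records.map(r => Future(process(r))).toList
    futures.map(f => Await.result(f, Duration.Inf)).iterator
  } finally pool.shutdown()
}
```

This raises per-record parallelism inside a task without any Spark-side change, though side effects within the partition are no longer executed sequentially, so it has the same ordering caveat discussed later in this thread.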

@fhan688
Author

fhan688 commented Sep 26, 2017

lonelytrooper... : P Will more executors be used with the RDD#mapPartitions approach? I'll try that later to see if it works. I think that if Spark provided a convenient way to do this, it would help users a lot and reduce their work, so it still makes sense. LOL
Besides, this feature achieved a very good performance improvement in our production environment.

@jerryshao
Contributor

Yes, I understand your scenario, but my concern is that your proposal is quite scenario-specific: it may serve your scenario well, but it somewhat breaks the design purpose of KafkaRDD. From my understanding, many users use repartition or coalesce to increase parallelism, so your problem can partly be solved that way.

@fhan688
Author

fhan688 commented Sep 27, 2017

Hi Jerry, thank you so much for the discussion! Actually, we tried 'repartition' before introducing this feature and gave it up for two reasons. First, it leads to a shuffle, which can hurt real-time applications a lot. Second, the performance improvement from 'repartition' is quite limited. You mentioned earlier the assumption that one Kafka partition maps to one Spark partition; I wonder why this assumption is so vital?

@jerryshao
Contributor

jerryshao commented Sep 27, 2017

This is because it is the only way to guarantee that the ordering of data within a Kafka partition is preserved in the mapped Spark partition. Some other users may have written code that relies on this assumption.

Let's see others' feedback. Ping @zsxwing @koeninger, would you please weigh in on this PR? Thanks!

@fhan688
Author

fhan688 commented Sep 27, 2017

I guessed that. It is true that this feature cannot ensure the ordering of data within one Kafka partition, but quite a few applications (like log processing) do not need a strict ordering guarantee within a Kafka partition. If they do, they can simply not use this feature; otherwise, it achieves a good performance improvement. So I think this feature may not be so scenario-specific. : P

@fhan688
Author

fhan688 commented Sep 27, 2017

Thank you so much for inviting more discussion!

@koeninger
Contributor

Search Jira and the mailing list; this idea has been brought up multiple times. I don't think breaking fundamental assumptions of Kafka (one consumer thread per group per partition) is a good idea.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

ping @lonelytrooper for @koeninger's comment. Otherwise, let me propose to close this for now.
