[improvement]Add an option to set the partition size of the final write stage #60

Merged
2 commits merged into apache:master on Dec 20, 2022

lexluo09
Contributor

Proposed changes

Add an option to set the partition size of the final write stage

  1. The computation stages can run with high parallelism while the final write to Doris uses fewer partitions, which reduces compaction pressure on the Doris side.
  2. After a Spark RDD is filtered, it often has many partitions with only a few records each. Writing them as-is means frequent small writes and wasted resources.
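
The second point can be made concrete with a little arithmetic. The sketch below is a toy model, not connector code: it assumes each partition sends its rows in separate load requests of at most `batch_size` rows, and the function name `load_requests`, the batch size of 1000, and the partition counts are made-up values for illustration.

```python
import math
from typing import List


def load_requests(rows_per_partition: List[int], batch_size: int) -> int:
    """Count load requests if every non-empty partition flushes its rows
    in batches of at most `batch_size` rows (an assumption of this model)."""
    return sum(math.ceil(rows / batch_size) for rows in rows_per_partition if rows > 0)


# After a filter: 200 partitions with 30 rows each (6000 rows in total).
many_small = [30] * 200
# The same 6000 rows gathered into 4 partitions before the write.
few_large = [1500] * 4

print(load_requests(many_small, batch_size=1000))  # 200
print(load_requests(few_large, batch_size=1000))   # 8
```

Under these assumptions the same data needs 200 requests when written partition by partition, but only 8 once the partitions are merged before the write.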

Problem Summary:

Describe the overview of changes.

Checklist(Required)

  1. Does it affect the original behavior: (Yes/No/I Don't know)
  2. Has unit tests been added: (Yes/No/No Need)
  3. Has document been added or modified: (Yes/No/No Need)
  4. Does it need to update dependencies: (Yes/No)
  5. Are there any changes that cannot be rolled back: (Yes/No)

Further comments

Before:
[image]

After:
[image]

[image]

@lexluo09
Contributor Author

@hf200012 would you mind taking a look?

@gnehil
Contributor

gnehil commented Dec 19, 2022

LGTM, and please update the docs on the official website to explain that doris.sink.task.partition.size is required, and the difference between setting doris.sink.task.use.repartition to true or false.
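
To illustrate the difference the docs are asked to explain, here is a toy Python model, not the connector's code. It assumes that `doris.sink.task.use.repartition=false` corresponds to a coalesce-style merge of neighbouring partitions (no shuffle, sizes may stay skewed) and that `true` corresponds to a repartition-style redistribution (a shuffle, roughly even sizes). That mapping, the helper names `coalesce_like`, `repartition_like` and `plan_write`, and the option values are all assumptions made for this sketch.

```python
from typing import Dict, List

Partition = List[int]

# Option names come from the review comment; the values are illustrative.
sink_options: Dict[str, str] = {
    "doris.sink.task.partition.size": "2",
    "doris.sink.task.use.repartition": "false",
}


def coalesce_like(parts: List[Partition], n: int) -> List[Partition]:
    """Merge neighbouring partitions into n groups without moving single rows."""
    n = max(1, min(n, len(parts)))
    out: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        out[i * n // len(parts)].extend(part)
    return out


def repartition_like(parts: List[Partition], n: int) -> List[Partition]:
    """Redistribute every row round-robin over n partitions (models a shuffle)."""
    n = max(1, n)
    out: List[Partition] = [[] for _ in range(n)]
    rows = [row for part in parts for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


def plan_write(parts: List[Partition], options: Dict[str, str]) -> List[Partition]:
    """Pick a strategy from the (assumed) meaning of the two sink options."""
    n = int(options["doris.sink.task.partition.size"])
    use_repartition = options.get("doris.sink.task.use.repartition", "false").lower() == "true"
    return repartition_like(parts, n) if use_repartition else coalesce_like(parts, n)


# Skewed input: one large partition, one tiny one, two empty ones.
skewed = [list(range(9)), [], [9], []]
coalesced = plan_write(skewed, sink_options)
shuffled = plan_write(skewed, {**sink_options, "doris.sink.task.use.repartition": "true"})

print([len(p) for p in coalesced])  # [9, 1]
print([len(p) for p in shuffled])   # [5, 5]
```

In this model the coalesce-style path keeps the skew but moves no individual rows, while the repartition-style path evens out the partitions at the cost of touching every row.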

Contributor

@hf200012 hf200012 left a comment

LGTM

@hf200012 hf200012 merged commit 45fe88a into apache:master Dec 20, 2022
@lexluo09
Contributor Author

LGTM, and please update the docs on the official website to explain that doris.sink.task.partition.size is required, and the difference between setting doris.sink.task.use.repartition to true or false.

Ok, thank you very much for your advice

@lexluo09 lexluo09 deleted the add_sink_partition_size branch December 20, 2022 07:04