[Improvement] Read HDFS data files with random sequence to distribute pressure #451

Closed
3 tasks done
zuston opened this issue Dec 29, 2022 · 5 comments · Fixed by #452

Comments

@zuston
Member

zuston commented Dec 29, 2022

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

PR #396 added support for concurrently writing a single partition's data into multiple HDFS files. Given that, it would be better for the client side to read those HDFS data files in a random sequence, so that the read pressure is distributed.
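
A minimal sketch of the idea, not the actual Uniffle implementation: each reader shuffles its own copy of the partition's data file list before reading. The class name, method name, and file names below are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ShuffledReadOrder {
    // Returns a shuffled copy of the data file paths; the input list is left untouched.
    static List<String> shuffledReadOrder(List<String> dataFiles, Random random) {
        List<String> order = new ArrayList<>(dataFiles);
        Collections.shuffle(order, random);
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical file names for one partition written as multiple HDFS files.
        List<String> files = Arrays.asList("part-0_0.data", "part-0_1.data", "part-0_2.data");
        for (String file : shuffledReadOrder(files, new Random())) {
            System.out.println("reading " + file);
        }
    }
}
```

Every file is still read exactly once per reader; only the sequence changes.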

How should we improve?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@zuston changed the title from "[Improvement] Randomly read single HDFS data files to distribute stress" to "[Improvement] Randomly read HDFS data files to distribute stress" on Dec 29, 2022
@zuston
Member Author

zuston commented Dec 29, 2022

PTAL @advancedxy @jerqi

@zuston changed the title from "[Improvement] Randomly read HDFS data files to distribute stress" to "[Improvement] Randomly read HDFS data files to distribute pressure" on Dec 29, 2022
@zuston changed the title from "[Improvement] Randomly read HDFS data files to distribute pressure" to "[Improvement] Read HDFS data files with random sequence to distribute pressure" on Dec 29, 2022
@jerqi
Contributor

jerqi commented Dec 29, 2022

I'm not sure that it will bring much performance improvement.

@zuston
Member Author

zuston commented Dec 29, 2022

> I'm not sure that it will bring much performance improvement.

To reduce the DataNode pressure from multiple readers, especially when there is only 1 replica.

@advancedxy
Contributor

> To reduce the DataNode pressure from multiple readers, especially when there is only 1 replica.

Normally, shouldn't there be only one reader reading one or more partition files?

Have you encountered this case in production?

@zuston
Member Author

zuston commented Dec 30, 2022

If a partition is huge and heavily skewed, many readers will be handling that same partition.
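
A rough way to picture this, as a simplified sketch rather than a measurement: assume every reader of the skewed partition starts at the same time, and count how many of them open the same file first. The class and method names are hypothetical, and real readers will not run in lockstep like this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ReadPressureSketch {
    // Largest number of readers that pick the same file as their first read.
    static int maxReadersOnFirstFile(int readers, int files, boolean shuffle, Random random) {
        int[] hits = new int[files];
        for (int r = 0; r < readers; r++) {
            // Each reader starts from the same fixed file order 0..files-1.
            List<Integer> order = new ArrayList<>();
            for (int f = 0; f < files; f++) {
                order.add(f);
            }
            if (shuffle) {
                Collections.shuffle(order, random);
            }
            hits[order.get(0)]++;
        }
        int max = 0;
        for (int h : hits) {
            max = Math.max(max, h);
        }
        return max;
    }

    public static void main(String[] args) {
        Random random = new Random();
        // With a fixed order, all 100 readers open file 0 first.
        System.out.println("fixed order:    " + maxReadersOnFirstFile(100, 10, false, random));
        // With a shuffled order, the first reads are spread over the 10 files.
        System.out.println("shuffled order: " + maxReadersOnFirstFile(100, 10, true, random));
    }
}
```

With a fixed order, the DataNode holding the first file serves every reader at once, which matters most when that file has a single replica.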
