ORC-1172: Add row count limit config in one stripe #1118
Thank you for making a PR, @dengweisysu. Please create an issue at https://issues.apache.org/jira/projects/ORC/issue and then bind the jira issue ID to the title.
@dengweisysu Thanks for the update. Could you add a unit test to verify this feature?
Thank you for making a PR, @dengweisysu. And +1 for @guiyanakuang's comments. BTW, it seems that you removed the PR template. Please restore the PR template.
@dongjoon-hyun Done with the PR template.
The description of this config gives the impression that the number of stripe rows can be controlled at (0, @dengweisysu Could you change the description to make it more accurate? I have no further review comments.
+1, LGTM. Thank you, @dengweisysu and @guiyanakuang.
Thank you for leading the review. Feel free to merge this, @guiyanakuang.
Merged to main.
We need to add a new contributor to Apache ORC.
And, lastly, you had better close his original GitHub issue in this PR.
Let me know if it still doesn't work. I can help you.
I have resolved the issue and assigned it to @dengweisysu in jira. It looks like I missed the way to close the issue by keyword in the commit message. Is there any way I can make up for it now? @dongjoon-hyun
You should close #1117 manually in that case.
### What changes were proposed in this pull request?

Add a row count limit config, "orc.stripe.row.count", to limit the number of rows in one stripe.

### Why are the changes needed?

For query engines such as Presto, the stripe is the basic unit of query concurrency: one stripe can only be processed by one split.
In the current implementation of the ORC writer, the only config that influences the row count of a stripe is "orc.stripe.size".
But across different kinds of tables, a byte size is a poor way to control the row count:
for a table with many columns (e.g. 100 columns), 64MB may contain 5000 rows;
for a table with few columns (e.g. 5 columns), 64MB may contain 100000 rows.

For Presto, a typical OLAP query reads only a subset of the table's columns, so the row count is the key factor in query performance. If one stripe contains too many rows, query performance may suffer.

So, besides "orc.stripe.size", we need another config such as "orc.stripe.row.count" to control the row count of one stripe.

A similar config has been introduced in cudf (a GPU DataFrame library based on Apache Arrow): [rapidsai/cudf#9261](rapidsai/cudf#9261)

### How was this patch tested?

testStripeRowCountLimit was added. It can be run with the commands below:

```
cd java
./mvnw -Dtest=TestWriterImpl test
```

(cherry picked from commit 7facf81)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
I landed this to branch-1.8.
What changes were proposed in this pull request?
Add a row count limit config, "orc.stripe.row.count", to limit the number of rows in one stripe.
Why are the changes needed?
For query engines such as Presto, the stripe is the basic unit of query concurrency: one stripe can only be processed by one split.
In the current implementation of the ORC writer, the only config that influences the row count of a stripe is "orc.stripe.size".
But across different kinds of tables, a byte size is a poor way to control the row count.
For a table with many columns (e.g. 100 columns), 64MB may contain 5000 rows.
For a table with few columns (e.g. 5 columns), 64MB may contain 100000 rows.
For Presto, a typical OLAP query reads only a subset of the table's columns, so the row count is the key factor in query performance. If one stripe contains too many rows, query performance may suffer.
So, besides "orc.stripe.size", we need another config such as "orc.stripe.row.count" to control the row count of one stripe.
A similar config has been introduced in cudf (a GPU DataFrame library based on Apache Arrow): rapidsai/cudf#9261
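To illustrate the motivation, here is a minimal, self-contained sketch of a two-threshold stripe cut: a stripe is closed when either its buffered bytes reach the size limit or its row count reaches the row limit. This is not the actual ORC writer code. The class `StripeRowLimitSketch`, the method `planStripes`, the fixed bytes-per-row model, and the 700-byte row width are all assumptions made up for the example; the real writer estimates memory differently and may check its limits at a coarser granularity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the ORC WriterImpl logic. */
public class StripeRowLimitSketch {

    /**
     * Returns the row count of each stripe when a stripe is closed as soon as
     * its bytes reach stripeSizeBytes OR its rows reach maxRowsPerStripe.
     * Assumes every row costs exactly bytesPerRow (a simplification).
     */
    static List<Long> planStripes(long totalRows, long bytesPerRow,
                                  long stripeSizeBytes, long maxRowsPerStripe) {
        List<Long> stripes = new ArrayList<>();
        long rowsInStripe = 0;
        long bytesInStripe = 0;
        for (long r = 0; r < totalRows; r++) {
            rowsInStripe++;
            bytesInStripe += bytesPerRow;
            // Either limit being hit ends the current stripe.
            if (bytesInStripe >= stripeSizeBytes || rowsInStripe >= maxRowsPerStripe) {
                stripes.add(rowsInStripe);
                rowsInStripe = 0;
                bytesInStripe = 0;
            }
        }
        if (rowsInStripe > 0) {
            stripes.add(rowsInStripe); // trailing partial stripe
        }
        return stripes;
    }

    public static void main(String[] args) {
        long stripeSize = 64L * 1024 * 1024; // plays the role of "orc.stripe.size"
        long narrowRowBytes = 700;           // assumed width of a 5-column row
        long totalRows = 200_000;

        // Size limit only: a narrow table packs many rows into each stripe.
        List<Long> sizeOnly = planStripes(totalRows, narrowRowBytes, stripeSize, Long.MAX_VALUE);
        // Adding a row limit (the role of "orc.stripe.row.count") yields more, smaller stripes.
        List<Long> withRowLimit = planStripes(totalRows, narrowRowBytes, stripeSize, 10_000);

        System.out.println("size limit only: " + sizeOnly.size() + " stripes");
        System.out.println("with row limit:  " + withRowLimit.size() + " stripes");
    }
}
```

In this model the row limit only ever makes stripes smaller, since the size limit still applies. More stripes for the same data means more splits a Presto-style engine can process concurrently.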
How was this patch tested?
testStripeRowCountLimit was added.
It can be run with `cd java` followed by `./mvnw -Dtest=TestWriterImpl test`.
closed #1117