Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1172: Add row count limit config in one stripe #1118

Merged
merged 2 commits into from
May 24, 2022

Conversation

dengweisysu
Copy link
Contributor

@dengweisysu dengweisysu commented May 17, 2022

What changes were proposed in this pull request?

add row count limit config "orc.stripe.row.count" to limit row count in one stripe.

Why are the changes needed?

for query engine like presto,stripe is the base unit for query concurrency, one stripe can only be processed by one split.
In current implement of orc writer, the only config which can control row count in stripe is the "orc.stripe.size".
But for different kind of table, the row count is difficult to use.

for table with much columns( eg. 100 columns), 64MB may contain 5000 rows.
for table with less columns(eg. 5 columns), 64MB may contain 100000 rows.
for presto, normal olap query only read a subset of table columns, the row count is the key factor of query performance. If one stripe contain much rows, the query performance may become too low.

So, besides the config "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe.
The similar config has been introduced to cudf ( a GPU DataFrame library base on apache arrow): rapidsai/cudf#9261

How was this patch tested?

testStripeRowCountLimit added.
can be test by command below:

cd java
./mvnw -Dtest=TestWriterImpl test

closed #1117

@github-actions github-actions bot added the JAVA label May 17, 2022
@guiyanakuang
Copy link
Member

Thank you for making a PR, @dengweisysu.

Please create an issue at https://issues.apache.org/jira/projects/ORC/issue and then bind the jira issue ID to the title.

@dengweisysu dengweisysu changed the title add row count limit config in one stripe [ORC-1172]add row count limit config in one stripe May 18, 2022
@dongjoon-hyun dongjoon-hyun changed the title [ORC-1172]add row count limit config in one stripe ORC-1172: Add row count limit config in one stripe May 18, 2022
@dengweisysu dengweisysu force-pushed the feature/stripe_row_count_limit branch from 2c6f936 to a03a881 Compare May 19, 2022 02:32
@guiyanakuang
Copy link
Member

@dengweisysu Thanks for the update, could you add a unit test to verify this feature?

@dongjoon-hyun
Copy link
Member

Thank you for making a PR, @dengweisysu . And +1 for @guiyanakuang 's comments.

BTW, it seems that you removed the PR template. Please recover the PR template.

@dengweisysu dengweisysu force-pushed the feature/stripe_row_count_limit branch 2 times, most recently from eb32391 to 3b53641 Compare May 20, 2022 11:15
@dengweisysu
Copy link
Contributor Author

@dongjoon-hyun done with PR template.

@guiyanakuang
Copy link
Member

The description of this config gives the impression that the number of stripe rows can be controlled at (0, STRIPE_ROW_COUNT]. In fact, the number of stripe rows should be in the range [STRIPE_ROW_COUNT, STRIPE_ROW_COUNT + max(batchSize, ROWS_PER_CHECK)) if there is enough data to be written.

@dengweisysu Could you change the description to make it more accurate. I have no further review views.
cc @dongjoon-hyun @wgtmac

@dengweisysu dengweisysu force-pushed the feature/stripe_row_count_limit branch from 3b53641 to d8261d3 Compare May 24, 2022 01:54
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @dengweisysu and @guiyanakuang .

@dongjoon-hyun
Copy link
Member

Thank you for leading the review. Feel free to merge this, @guiyanakuang .

@guiyanakuang guiyanakuang merged commit 7facf81 into apache:main May 24, 2022
@guiyanakuang
Copy link
Member

Merged to main

@dongjoon-hyun dongjoon-hyun added this to the 1.9.0 milestone May 24, 2022
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 24, 2022

@guiyanakuang
Copy link
Member

image
@dongjoon-hyun. It looks like I can't search for him, can you give some tips ?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 24, 2022

We need to add a new contributor to Apache ORC Contributors group at the following link in order to assign an issue to them. Please try because you are in Administrators group now.

@guiyanakuang
Copy link
Member

image
Please help update my permission. : )

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 24, 2022

You are in the Admin. Could you try again?

Screen Shot 2022-05-24 at 12 41 45 AM

Screen Shot 2022-05-24 at 12 41 09 AM

@dongjoon-hyun
Copy link
Member

And, lastly, you had better close his original GitHub Issue in this PR.

@dongjoon-hyun
Copy link
Member

Let me know if it doesn't work still. I can help you.

@guiyanakuang guiyanakuang linked an issue May 24, 2022 that may be closed by this pull request
@guiyanakuang
Copy link
Member

I have resolved and assigned to @dengweisysu in jira.

It looks like I missed the way to close the issue by keyword in the commit messge, is there any way I can make up for it now? @dongjoon-hyun

@dongjoon-hyun
Copy link
Member

You should close #1117 manually in that case.

@dongjoon-hyun dongjoon-hyun modified the milestones: 1.9.0, 1.8.0 May 24, 2022
dongjoon-hyun pushed a commit that referenced this pull request May 24, 2022
### What changes were proposed in this pull request?
add row count limit config "orc.stripe.row.count" to limit row count in one stripe.

### Why are the changes needed?
for query engine like presto,stripe is the base unit for query concurrency, one stripe can only be processed by one split.
In current implement of orc writer, the only config which can control row count in stripe is the "orc.stripe.size".
But for different kind of table, the row count is difficult to use.

for table with much columns( eg. 100 columns), 64MB may contain 5000 rows.
for table with less columns(eg. 5 columns), 64MB may contain 100000 rows.
for presto, normal olap query only read a subset of table columns, the row count is the key factor of query performance. If one stripe contain much rows, the query performance may become too low.

So, besides the config "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe.
The similar config has been introduced to cudf ( a GPU DataFrame library base on apache arrow): [rapidsai/cudf#9261](rapidsai/cudf#9261)

### How was this patch tested?
testStripeRowCountLimit added.
can be test by command below:
```
cd java
./mvnw -Dtest=TestWriterImpl test
```

(cherry picked from commit 7facf81)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Copy link
Member

I landed this to branch-1.8.

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
### What changes were proposed in this pull request?
add row count limit config "orc.stripe.row.count" to limit row count in one stripe.

### Why are the changes needed?
for query engine like presto,stripe is the base unit for query concurrency, one stripe can only be processed by one split.
In current implement of orc writer, the only config which can control row count in stripe is the "orc.stripe.size".
But for different kind of table, the row count is difficult to use.

for table with much columns( eg. 100 columns), 64MB may contain 5000 rows.
for table with less columns(eg. 5 columns), 64MB may contain 100000 rows.
for presto, normal olap query only read a subset of table columns, the row count is the key factor of query performance. If one stripe contain much rows, the query performance may become too low.

So, besides the config "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe.
The similar config has been introduced to cudf ( a GPU DataFrame library base on apache arrow): [rapidsai/cudf#9261](rapidsai/cudf#9261)

### How was this patch tested?
testStripeRowCountLimit added.
can be test by command below: 
```
cd java
./mvnw -Dtest=TestWriterImpl test
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ORC-1172: Add row count limit config for one stripe
3 participants