ORC-1667: Add `check` tool to check the index of the specified column #1862
site/_docs/java-tools.md
Outdated
@@ -11,6 +11,7 @@ supports both the local file system and HDFS.

The subcommands for the tools are:

* bloom-filter (since ORC 2.1) - check the bloom filter of the specified column
- Since we can backport the `tools` module patch, let's change this to `ORC 2.0.1`.
- The command name, `bloom-filter`, is a little ambiguous. Could you provide a more intuitive name?
> let's change this to ORC 2.0.1.

OK

> The command name, bloom-filter, is a little ambiguous.

How about using `check-bloom-filter`?
Sounds better to me for this context.
BTW, if you plan to extend this command further, you can use `check` as a subcommand or argument, too.
Thank you very much for your suggestion.
Let us use the `check` command. Based on this, we can add min and max statistics to check later.
`bloom-filter` tool to check the bloom filter of the specified column → `check` tool to check the index of the specified column
+1, LGTM. Thank you!
### What changes were proposed in this pull request?
This PR aims to check the index of the specified column.

We can test the filtering effect by specifying different types.

* `check --type stat` - Only use column statistics.
* `check --type bloom-filter` - Only use bloom filter.
* `check --type predicate` - Used in combination with column statistics and bloom filter.

### Why are the changes needed?
ORC supports specifying multiple columns to generate bloom filter indexes, but it lacks a convenient tool to verify the effect of bloom filter.

Parquet also has similar commands.
[PARQUET-2138](https://issues.apache.org/jira/browse/PARQUET-2138): Add ShowBloomFilterCommand to parquet-cli

### How was this patch tested?
Add UT

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1862 from cxzl25/ORC-1667.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3b5b2a6)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
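
To make the difference between the three `--type` modes in the commit message above concrete, here is a minimal, self-contained sketch. It is not ORC's `check` implementation or its bloom filter: the class `IndexCheckSketch`, its methods, and the deliberately weak hash functions are all hypothetical, and treating `predicate` as "statistics AND bloom filter" is an assumption based only on the description "used in combination".

```java
import java.util.BitSet;

/**
 * Toy model of the three check modes. All names are hypothetical;
 * this is not ORC's CheckTool or its bloom filter implementation.
 */
class IndexCheckSketch {
  static final int NUM_BITS = 1024;

  final long min;                            // column statistics: minimum
  final long max;                            // column statistics: maximum
  final BitSet bits = new BitSet(NUM_BITS);  // toy bloom filter with two hashes

  IndexCheckSketch(long[] values) {
    long lo = Long.MAX_VALUE;
    long hi = Long.MIN_VALUE;
    for (long v : values) {
      lo = Math.min(lo, v);
      hi = Math.max(hi, v);
      bits.set(hash1(v));
      bits.set(hash2(v));
    }
    min = lo;
    max = hi;
  }

  // Deliberately weak hashes so that collisions are easy to reason about.
  static int hash1(long v) {
    return (int) Math.floorMod(v, (long) NUM_BITS);
  }

  static int hash2(long v) {
    return (int) Math.floorMod(v * 31 + 7, (long) NUM_BITS);
  }

  /** Analogue of "stat": can the min/max range rule the value out? */
  boolean statMightContain(long v) {
    return v >= min && v <= max;
  }

  /** Analogue of "bloom-filter": false means definitely absent. */
  boolean bloomMightContain(long v) {
    return bits.get(hash1(v)) && bits.get(hash2(v));
  }

  /** Analogue of "predicate" (assumed): the value must pass both checks. */
  boolean predicateMightContain(long v) {
    return statMightContain(v) && bloomMightContain(v);
  }

  public static void main(String[] args) {
    IndexCheckSketch index = new IndexCheckSketch(new long[] {10, 20, 1000});
    for (long probe : new long[] {20, 500, 1034, 5000}) {
      System.out.printf("value=%d stat=%b bloom=%b predicate=%b%n",
          probe,
          index.statMightContain(probe),
          index.bloomMightContain(probe),
          index.predicateMightContain(probe));
    }
  }
}
```

In this toy data set, 500 lies inside the min/max range but is rejected by the bloom filter, while 1034 collides with 10 in the bloom filter but is rejected by the statistics. Only the combined check filters out both, which is the kind of effect the different types are meant to let you compare.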