ORC-1667: Add check tool to check the index of the specified column #1862

Closed
cxzl25 wants to merge 4 commits

@cxzl25 cxzl25 commented Mar 27, 2024

What changes were proposed in this pull request?

This PR aims to check the index of the specified column.

We can test the filtering effect by specifying different types.

`check --type stat` - Use only column statistics.
`check --type bloom-filter` - Use only the bloom filter.
`check --type predicate` - Use column statistics and the bloom filter together.
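
To make the difference between the three types concrete, here is a minimal conceptual sketch in Python. It is not ORC's implementation: `TinyBloomFilter`, `build_row_group`, and `might_match` are hypothetical names, the hashing scheme is a toy, and treating `predicate` as "statistics and bloom filter must both allow the value" is an assumption based on the description above.

```python
import hashlib


class TinyBloomFilter:
    """Toy bloom filter for illustration only; ORC's real filter differs."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a Python int

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_row_group(values):
    """Collect min/max statistics and a bloom filter for one group of values."""
    bloom = TinyBloomFilter()
    for v in values:
        bloom.add(v)
    return {"min": min(values), "max": max(values), "bloom": bloom}


def might_match(group, value, check_type):
    """Return True if the group cannot be skipped for an equality lookup."""
    in_range = group["min"] <= value <= group["max"]
    in_bloom = group["bloom"].might_contain(value)
    if check_type == "stat":
        return in_range
    if check_type == "bloom-filter":
        return in_bloom
    if check_type == "predicate":
        return in_range and in_bloom  # assumed combination of both indexes
    raise ValueError(f"unknown check type: {check_type}")


group = build_row_group([10, 20, 30])
for check_type in ("stat", "bloom-filter", "predicate"):
    print(check_type, might_match(group, 20, check_type))
```

A value that lies inside the min/max range but was never written (15 in this group) always passes the `stat` check, whereas the bloom filter will usually reject it. That gap is the filtering effect the different types are meant to expose.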

Why are the changes needed?

ORC supports generating bloom filter indexes for multiple specified columns, but it lacks a convenient tool for verifying how effective those bloom filters are.

Parquet also has similar commands.
PARQUET-2138: Add ShowBloomFilterCommand to parquet-cli

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@@ -11,6 +11,7 @@ supports both the local file system and HDFS.

The subcommands for the tools are:

* bloom-filter (since ORC 2.1) - check the bloom filter of the specified column
Member

  • Since we can backport tools module patches, let's change this to ORC 2.0.1.
  • The command name, bloom-filter, is a little ambiguous. Could you provide a more intuitive name?

Contributor Author

> let's change this to ORC 2.0.1.

OK

> The command name, bloom-filter, is a little ambiguous.

How about using check-bloom-filter?

Member

Sounds better to me for this context.

BTW, if you plan to extend this command further, you can also use check as a subcommand or argument.

Contributor Author

Thank you very much for your suggestion.
Let's use the check command. Building on it, we can add min and max statistics checks later.

@cxzl25 cxzl25 changed the title from "ORC-1667: Add bloom-filter tool to check the bloom filter of the specified column" to "ORC-1667: Add check tool to check the index of the specified column" Mar 29, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you!

@dongjoon-hyun dongjoon-hyun added this to the 2.0.1 milestone Mar 29, 2024
dongjoon-hyun pushed a commit that referenced this pull request Mar 29, 2024
Closes #1862 from cxzl25/ORC-1667.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3b5b2a6)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>