ORC-1017: Add sizes tool to determine and display the sizes of each column in a set of files. #925

Merged (2 commits) on Oct 5, 2021

@omalley omalley commented Oct 1, 2021

What changes were proposed in this pull request?

This patch adds a new tool that accounts for the total size of a set of ORC files. For files written by ORC 1.5 or later, you get a per-column breakdown of the sizes. The breakdown also includes some virtual columns:

  • _index: the indexes that are used for skipping inside the stripe
  • _data: the data in files written prior to ORC 1.5
  • _stripe_footer: the stripe metadata
  • _file_footer: the file metadata
  • _padding: padding added to align stripes to HDFS block boundaries
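
To make the accounting concrete, here is a minimal sketch of how byte counts for real and virtual columns could be summed across files and reported as shares of the total. It is not the tool's code: the class `SizeAccountingSketch`, its methods, the column name `user.name`, and all the numbers are hypothetical, and it assumes the per-column byte counts have already been read from each file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the implementation of the ORC sizes tool.
public class SizeAccountingSketch {
  // Running byte totals keyed by column name (real or virtual).
  private final Map<String, Long> bytesByColumn = new HashMap<>();
  private long totalBytes = 0;

  // Add the bytes that one file spends on one column.
  void add(String column, long bytes) {
    bytesByColumn.merge(column, bytes, Long::sum);
    totalBytes += bytes;
  }

  long total() {
    return totalBytes;
  }

  long bytesFor(String column) {
    return bytesByColumn.getOrDefault(column, 0L);
  }

  // One line per column, largest first, with its share of the total.
  List<String> report() {
    List<Map.Entry<String, Long>> entries =
        new ArrayList<>(bytesByColumn.entrySet());
    entries.sort((x, y) -> Long.compare(y.getValue(), x.getValue()));
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, Long> e : entries) {
      double percent =
          totalBytes == 0 ? 0.0 : 100.0 * e.getValue() / totalBytes;
      lines.add(String.format(Locale.ROOT, "%-16s %12d %6.2f%%",
          e.getKey(), e.getValue(), percent));
    }
    return lines;
  }

  public static void main(String[] args) {
    SizeAccountingSketch sizes = new SizeAccountingSketch();
    // Made-up numbers standing in for a first file...
    sizes.add("user.name", 4_000);
    sizes.add("_index", 500);
    sizes.add("_stripe_footer", 120);
    sizes.add("_file_footer", 80);
    // ...and a second file, merged into the same totals.
    sizes.add("user.name", 6_000);
    sizes.add("_index", 700);
    sizes.add("_padding", 600);
    for (String line : sizes.report()) {
      System.out.println(line);
    }
  }
}
```

Keeping the virtual columns in the same map as the real ones means every byte of every file is attributed to exactly one name, so the shares add up to the whole.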

I also added a new method on TypeDescription, getFullFieldName, which returns the full field name of a type and is the inverse of findSubtype.
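
The inverse relationship can be illustrated with a small self-contained sketch. The `Node` class below is a hypothetical stand-in for a schema node, not ORC's `TypeDescription`: it handles only nested structs with dotted names and ignores lists, maps, and unions. The method names mirror the ones mentioned above purely for readability.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldNameSketch {
  /** Hypothetical struct-only schema node; not the ORC class. */
  static final class Node {
    final Node parent;
    final String name; // field name within the parent, null for the root
    final Map<String, Node> children = new LinkedHashMap<>();

    Node(Node parent, String name) {
      this.parent = parent;
      this.name = name;
    }

    Node addField(String fieldName) {
      Node child = new Node(this, fieldName);
      children.put(fieldName, child);
      return child;
    }

    /** Walk down from this node along a dotted path such as "b.c". */
    Node findSubtype(String path) {
      Node current = this;
      for (String part : path.split("\\.")) {
        current = current.children.get(part);
        if (current == null) {
          throw new IllegalArgumentException(
              "Unknown field " + part + " in " + path);
        }
      }
      return current;
    }

    /** Walk up to the root collecting names, then join them with dots. */
    String getFullFieldName() {
      List<String> parts = new ArrayList<>();
      for (Node n = this; n.parent != null; n = n.parent) {
        parts.add(n.name);
      }
      Collections.reverse(parts);
      return String.join(".", parts);
    }
  }

  public static void main(String[] args) {
    // Schema shaped like struct<a, b:struct<c>>.
    Node root = new Node(null, null);
    root.addField("a");
    root.addField("b").addField("c");

    // Looking up a path and asking for its name round-trips.
    Node c = root.findSubtype("b.c");
    System.out.println(c.getFullFieldName());
  }
}
```

In this sketch, `root.findSubtype(name).getFullFieldName()` returns `name` for any valid dotted path, which is the sense in which the two operations are inverses.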

Why are the changes needed?

The tool helps diagnose the compression of a set of files.

How was this patch tested?

I added a test of the new TypeDescription.getFullFieldName. I ran the tool over some of the examples and some multiple-terabyte directories of production ORC files.

@github-actions github-actions bot added the JAVA label Oct 1, 2021
getting nailed on each patch when CI tests it patch.
@github-actions github-actions bot added the BUILD label Oct 1, 2021
@dongjoon-hyun

Let me take a look at the CI failure.

@dongjoon-hyun

I re-triggered the GitHub Action.

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @omalley.
I reviewed and manually tested this.

@dongjoon-hyun dongjoon-hyun merged commit be0762b into apache:main Oct 5, 2021
@dongjoon-hyun dongjoon-hyun added this to the 1.8.0 milestone Nov 2, 2021
@dongjoon-hyun

dongjoon-hyun commented Dec 15, 2021

Since this is a mostly tools-only change, I backported it to branch-1.7.
The TypeDescription.java changes are addition-only.

dongjoon-hyun pushed a commit that referenced this pull request Dec 16, 2021
…olumn in a set of files. (#925)

(cherry picked from commit be0762b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun dongjoon-hyun modified the milestones: 1.8.0, 1.7.2 Dec 16, 2021