ORC-1631: Support summary output in sizes command #1816

Closed
wants to merge 2 commits into from

Conversation

cxzl25
Contributor

@cxzl25 cxzl25 commented Feb 26, 2024

What changes were proposed in this pull request?

Add a -s/--summary option to the sizes command that reports the total number of files, the total file size, and the total number of rows.
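
As a rough illustration of the aggregation this option implies (a sketch, not the code from this patch): `SizesSummary` and `FileStat` are hypothetical names, and the per-file byte and row counts are assumed to have been collected already. Only the three output labels are taken from the sample output in this PR.

```java
import java.util.Arrays;
import java.util.List;

public class SizesSummary {
    /** Hypothetical holder for one file's size in bytes and its row count. */
    static final class FileStat {
        final long bytes;
        final long rows;

        FileStat(long bytes, long rows) {
            this.bytes = bytes;
            this.rows = rows;
        }
    }

    /** Returns {fileCount, totalBytes, totalRows} for the given files. */
    static long[] summarize(List<FileStat> stats) {
        long totalBytes = 0;
        long totalRows = 0;
        for (FileStat s : stats) {
            totalBytes += s.bytes;
            totalRows += s.rows;
        }
        return new long[] {stats.size(), totalBytes, totalRows};
    }

    /** Formats the totals using the labels shown in the sample output. */
    static String format(long[] summary) {
        return "Total Files: " + summary[0] + "\n"
            + "Total Sizes: " + summary[1] + "\n"
            + "Total Rows: " + summary[2] + "\n";
    }

    public static void main(String[] args) {
        // Made-up inputs, purely for illustration.
        List<FileStat> stats = Arrays.asList(
            new FileStat(1000L, 10L),
            new FileStat(2500L, 40L));
        System.out.print(format(summarize(stats)));
    }
}
```

Using `long` for both totals matters here: the sample run in this PR reports a total size above 4.8 billion bytes, which would overflow an `int`.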

Why are the changes needed?

The sizes command currently reports, for each field, only its percentage of the total size and its average bytes per row. It does not report the overall totals, so absolute sizes cannot be derived from the output.

How was this patch tested?

Tested locally:

java -jar orc-tools-2.1.0-SNAPSHOT-uber.jar sizes -h
usage: sizes
 -h,--help              Print help message
 -i,--ignoreExtension   Ignore ORC file extension
 -s,--summary           Summarize the number of files, file sizes, and
                        file rows
java -jar orc-tools-2.1.0-SNAPSHOT-uber.jar sizes -s
Total Files: 5
Total Sizes: 4803687270
Total Rows: 39820045
Percent  Bytes/Row  Name
  26.41  31.86
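
With the totals available, an absolute size per field can be estimated from the existing columns. The sketch below assumes that Bytes/Row is the field's bytes divided by the total row count and that Percent is relative to the total size; `ColumnBytes` and its methods are hypothetical helpers, not part of orc-tools. Because the printed values are rounded to two decimals, the results are approximate.

```java
public class ColumnBytes {
    /** Estimates a field's total bytes from its average bytes per row. */
    static long estimateBytes(double bytesPerRow, long totalRows) {
        return Math.round(bytesPerRow * totalRows);
    }

    /** Share of the total size, in percent. */
    static double percentOfTotal(long fieldBytes, long totalBytes) {
        return 100.0 * fieldBytes / totalBytes;
    }

    public static void main(String[] args) {
        // Values copied from the sample output above.
        long totalBytes = 4803687270L;
        long totalRows = 39820045L;
        double bytesPerRow = 31.86;

        long fieldBytes = estimateBytes(bytesPerRow, totalRows);
        System.out.println("Estimated field bytes: " + fieldBytes);
        System.out.printf("Percent of total: %.2f%n",
            percentOfTotal(fieldBytes, totalBytes));
    }
}
```

For the sample row, the estimate comes to roughly 1.27 billion bytes, and the recomputed percentage agrees with the printed 26.41 to two decimals, which is consistent with the assumed meaning of the two columns.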

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the JAVA label Feb 26, 2024
@dongjoon-hyun dongjoon-hyun changed the title ORC-1631: Supports summary output in sizes command ORC-1631: Support summary output in sizes command Feb 26, 2024
@dongjoon-hyun dongjoon-hyun added this to the 2.0.0 milestone Feb 26, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @cxzl25 .
Merged to main/2.0.

dongjoon-hyun pushed a commit that referenced this pull request Feb 26, 2024
### What changes were proposed in this pull request?
Add a `-s`/`--summary` option to the `sizes` command that reports the total number of files, the total file size, and the total number of rows.

### Why are the changes needed?
The `sizes` command currently reports, for each field, only its percentage of the total size and its average bytes per row. It does not report the overall totals, so absolute sizes cannot be derived from the output.

### How was this patch tested?
Tested locally:

```bash
java -jar orc-tools-2.1.0-SNAPSHOT-uber.jar sizes -h
usage: sizes
 -h,--help              Print help message
 -i,--ignoreExtension   Ignore ORC file extension
 -s,--summary           Summarize the number of files, file sizes, and
                        file rows
```

```bash
java -jar orc-tools-2.1.0-SNAPSHOT-uber.jar sizes -s
```

```
Total Files: 5
Total Sizes: 4803687270
Total Rows: 39820045
Percent  Bytes/Row  Name
  26.41  31.86
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1816 from cxzl25/ORC-1631.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit f46e55a)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>