
ORC-1567: Support -ignoreExtension option at sizes and count commands of orc-tools #1722

Closed
wants to merge 4 commits

Conversation

cxzl25
Contributor

@cxzl25 cxzl25 commented Jan 3, 2024

What changes were proposed in this pull request?

Add the --ignoreExtension option.

java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar sizes --ignoreExtension path
java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar count --ignoreExtension path

Why are the changes needed?

The count and sizes commands provided by orc-tools currently require files to have an orc suffix.
However, ORC files do not necessarily carry that suffix, which makes the tools inconvenient to use.
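In behavioral terms, the option relaxes the file filter that the two commands apply when scanning a path. A minimal Python sketch of that filtering logic (illustrative only; the actual orc-tools implementation is in Java, and the function name here is invented):

```python
from pathlib import Path

def list_candidate_files(root, ignore_extension=False):
    # By default, keep only files whose name ends in ".orc";
    # with ignore_extension=True, consider every regular file.
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    if ignore_extension:
        return files
    return [p for p in files if p.suffix == ".orc"]
```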

How was this patch tested?

@github-actions github-actions bot added the JAVA label Jan 3, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for making a PR, @cxzl25 . However, this looks risky to me because it could end up reading arbitrary garbage files, and I'm not sure we have any preventive logic. This could be a regression.

In addition, the following could be misleading. Could you elaborate a little more about your Spark case?

the orc files generated by Hive and Spark do not necessarily have an orc suffix

FYI, in the following case, Spark successfully generates ORC files with the .orc extension. And, you can see that _SUCCESS, ._SUCCESS.crc, and *.crc files exist in the same directory.

$ bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/03 05:44:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as 'sc' (master = local[64], app id = local-1704289474929).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.9)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(1).write.orc("/tmp/o")

scala> :quit
max spark-3.5.0-bin-hadoop3:$ ls -al /tmp/o
total 40
drwxr-xr-x   8 dongjoon  wheel  256 Jan  3 05:44 .
drwxrwxrwt  21 root      wheel  672 Jan  3 05:44 ..
-rw-r--r--   1 dongjoon  wheel    8 Jan  3 05:44 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Jan  3 05:44 .part-00000-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel   12 Jan  3 05:44 .part-00063-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Jan  3 05:44 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  116 Jan  3 05:44 part-00000-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  237 Jan  3 05:44 part-00063-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc

@deshanxiao
Contributor

+1 for @dongjoon-hyun's concern. In fact, I think using the file suffix is not accurate either. Maybe we could check whether the file starts with the magic number "ORC" to judge whether it is an ORC file. This would have no performance impact, since it is just the orc tool.
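The ORC format does begin every file with the 3-byte magic sequence "ORC", so a content-based check could look roughly like this (an illustrative Python sketch, not the orc-tools Java code; the helper name is made up):

```python
def looks_like_orc(path):
    # Read the first 3 bytes and compare them to the ORC magic.
    with open(path, "rb") as f:
        return f.read(3) == b"ORC"
```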

@cxzl25
Contributor Author

cxzl25 commented Jan 4, 2024

If we don't use the ORC data source, the Hive implementation will generate file names without an orc suffix.

./bin/spark-sql
set spark.sql.hive.convertMetastoreCtas=false;
create table tmp_orc stored as orcfile as select id from range(1);
ls -al spark-warehouse/tmp_orc
total 24
drwxr-xr-x@ 6 csy  staff  192  1  4 23:34 .
drwxr-xr-x@ 4 csy  staff  128  1  4 23:34 ..
-rwxr-xr-x@ 1 csy  staff    8  1  4 23:34 .part-00000-5888ab4a-8773-4376-a320-0cd6f4df5889-c000.crc
-rwxr-xr-x@ 1 csy  staff   12  1  4 23:34 .part-00011-5888ab4a-8773-4376-a320-0cd6f4df5889-c000.crc
-rwxr-xr-x@ 1 csy  staff    0  1  4 23:34 part-00000-5888ab4a-8773-4376-a320-0cd6f4df5889-c000
-rwxr-xr-x@ 1 csy  staff  202  1  4 23:34 part-00011-5888ab4a-8773-4376-a320-0cd6f4df5889-c000

Could we do this: if the input is only a single file, ignore the orc suffix?
That behavior would be similar to FileDump, which does not require the file to have an orc suffix.

@dongjoon-hyun
Member

To @cxzl25 , Apache Spark has three ORC code paths.

It seems that you are using HiveFileFormat. It follows Hive's behavior whose file extension is determined by hive.output.file.extension. Please try the following from the beginning.

$ bin/spark-sql -c hive.output.file.extension=orc
...
spark-sql (default)> create table t stored as orc as select id from range(1);
Time taken: 1.706 seconds

$ ls -al spark-warehouse/t
total 40
drwxr-xr-x  8 dongjoon  staff  256 Jan  4 13:50 .
drwxr-xr-x  3 dongjoon  staff   96 Jan  4 13:50 ..
-rw-r--r--  1 dongjoon  staff    8 Jan  4 13:50 ._SUCCESS.crc
-rw-r--r--  1 dongjoon  staff   12 Jan  4 13:50 .part-00000-492978c5-6833-4dd3-a417-eba7a49b9d44-c000.snappy.orc.crc
-rw-r--r--  1 dongjoon  staff   12 Jan  4 13:50 .part-00009-492978c5-6833-4dd3-a417-eba7a49b9d44-c000.snappy.orc.crc
-rw-r--r--  1 dongjoon  staff    0 Jan  4 13:50 _SUCCESS
-rw-r--r--  1 dongjoon  staff  116 Jan  4 13:50 part-00000-492978c5-6833-4dd3-a417-eba7a49b9d44-c000.snappy.orc
-rw-r--r--  1 dongjoon  staff  237 Jan  4 13:50 part-00009-492978c5-6833-4dd3-a417-eba7a49b9d44-c000.snappy.orc

@dongjoon-hyun
Member

For this PR, could you try to add a new additional configuration to ignore the extension while keeping the existing default behavior, @cxzl25 ? For the additional feature, we can accept that while avoiding any breaking change.

@cxzl25
Contributor Author

cxzl25 commented Jan 5, 2024

It seems that you are using HiveFileFormat. It follows Hive's behavior whose file extension is determined by hive.output.file.extension

Thanks for the reminder. Because Spark converts the CTAS statement, I corrected the command slightly (adding the .orc extension).

./bin/spark-sql -c spark.hive.output.file.extension=.orc -c spark.sql.hive.convertMetastoreCtas=false
create table tmp_orc stored as orcfile as select id from range(1);
ls -al spark-warehouse/tmp_orc
-rwxr-xr-x@ 1 csy  staff    8  1  5 11:28 .part-00000-d2fb8ac6-e34b-4b1b-9a08-6c3580e4afc6-c000.orc.crc
-rwxr-xr-x@ 1 csy  staff   12  1  5 11:28 .part-00011-d2fb8ac6-e34b-4b1b-9a08-6c3580e4afc6-c000.orc.crc
-rwxr-xr-x@ 1 csy  staff    0  1  5 11:28 part-00000-d2fb8ac6-e34b-4b1b-9a08-6c3580e4afc6-c000.orc
-rwxr-xr-x@ 1 csy  staff  202  1  5 11:28 part-00011-d2fb8ac6-e34b-4b1b-9a08-6c3580e4afc6-c000.orc

could you try to add a new additional configuration to ignore the extension while keeping the existing default behavior

Thanks, let me do it.

@cxzl25 cxzl25 changed the title ORC-1567: Remove orc-tools restriction on orc suffix name ORC-1567: Add the -ignoreExtension configuration to the sizes and count commands of orc-tools Jan 5, 2024
@dongjoon-hyun dongjoon-hyun changed the title ORC-1567: Add the -ignoreExtension configuration to the sizes and count commands of orc-tools ORC-1567: Support -ignoreExtension option at sizes and count commands of orc-tools Jan 5, 2024
@dongjoon-hyun dongjoon-hyun modified the milestones: 1.9.3, 2.0.0 Jan 5, 2024
@dongjoon-hyun
Member

dongjoon-hyun commented Jan 5, 2024

I set the milestone, 2.0.0.

@dongjoon-hyun
Member

Please update the PR description, @cxzl25 . It looks outdated.

@cxzl25
Contributor Author

cxzl25 commented Jan 5, 2024

Please update the PR description

Thanks. Updated.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Well, this has an unexpected side effect because Apache ORC allows empty ORC files whose size is zero. Please see _SUCCESS in the following output.

$ java -jar tools/target/orc-tools-2.0.0-SNAPSHOT-uber.jar count -i /tmp/o
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
file:/tmp/o/part-00063-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc 1
file:/tmp/o/_SUCCESS 0
file:/tmp/o/part-00000-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc 0
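The zero-byte _SUCCESS marker slips through the extension-ignoring scan exactly because empty files are valid ORC. A caller who wants to avoid the noise could filter such files out before counting (a hedged sketch; skip_empty_files is a hypothetical helper, not part of orc-tools):

```python
import os

def skip_empty_files(paths):
    # Drop zero-byte entries (e.g. _SUCCESS markers), which ORC
    # otherwise accepts as valid empty files.
    return [p for p in paths if os.path.getsize(p) > 0]
```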

Member

@dongjoon-hyun dongjoon-hyun left a comment


Although this PR has a clear limitation as shown above, I believe it's acceptable to deliver because users take on that risk when they use this option.

@dongjoon-hyun
Member

Merged to main for Apache ORC 2.0.0. Thank you, @cxzl25 and @deshanxiao .

@cxzl25
Contributor Author

cxzl25 commented Jan 5, 2024

Thank you for your help and review! @dongjoon-hyun @deshanxiao

cxzl25 added a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…mmands of orc-tools

### What changes were proposed in this pull request?
Add the `--ignoreExtension` option.

```bash
java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar sizes --ignoreExtension path
java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar count --ignoreExtension path
```

### Why are the changes needed?
The `count` and `sizes` commands provided by `orc-tools` currently require files to have an orc suffix.
However, ORC files do not necessarily carry that suffix, which makes the tools inconvenient to use.

### How was this patch tested?

Closes apache#1722 from cxzl25/ORC-1567.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>