
ORC-1567: Support -ignoreExtension option at sizes and count commands of orc-tools #1722

Closed
wants to merge 4 commits

Conversation

cxzl25
Contributor

@cxzl25 cxzl25 commented Jan 3, 2024

What changes were proposed in this pull request?

Add the --ignoreExtension option.

java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar sizes --ignoreExtension path
java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar count --ignoreExtension path

Why are the changes needed?

The count and sizes commands provided by orc-tools currently require files to have an orc suffix.
However, ORC files do not necessarily carry that suffix, which makes the tools inconvenient to use.
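In behavioral terms, the option relaxes the file filter that the two commands apply when scanning a path. A minimal Python sketch of that filtering logic (illustrative only; the actual orc-tools implementation is in Java, and the function name here is invented):

```python
from pathlib import Path

def list_candidate_files(root, ignore_extension=False):
    # By default, keep only files whose name ends in ".orc";
    # with ignore_extension=True, consider every regular file.
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    if ignore_extension:
        return files
    return [p for p in files if p.suffix == ".orc"]
```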

How was this patch tested?

@github-actions github-actions bot added the JAVA label Jan 3, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for making a PR, @cxzl25 . However, this looks risky to me because it could end up reading arbitrary garbage files, and I'm not sure we have any preventive logic. This could be a regression.

In addition, the following could be misleading. Could you elaborate a little more about your Spark case?

the orc files generated by Hive and Spark do not necessarily have an orc suffix

FYI, in the following case, Spark successfully generates ORC files with the .orc extension. And, you can see that _SUCCESS, ._SUCCESS.crc, and *.crc files exist in the same directory.

$ bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/03 05:44:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as 'sc' (master = local[64], app id = local-1704289474929).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.9)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(1).write.orc("/tmp/o")

scala> :quit
max spark-3.5.0-bin-hadoop3:$ ls -al /tmp/o
total 40
drwxr-xr-x   8 dongjoon  wheel  256 Jan  3 05:44 .
drwxrwxrwt  21 root      wheel  672 Jan  3 05:44 ..
-rw-r--r--   1 dongjoon  wheel    8 Jan  3 05:44 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Jan  3 05:44 .part-00000-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel   12 Jan  3 05:44 .part-00063-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Jan  3 05:44 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  116 Jan  3 05:44 part-00000-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  237 Jan  3 05:44 part-00063-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc

@deshanxiao
Contributor

+1 for @dongjoon-hyun's concern. In fact, I think using the file suffix is not accurate either. Maybe we could check whether the file starts with the magic number "ORC" to judge whether it is an ORC file. This would have no performance impact, since it is just the orc tool.
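The ORC format does begin every file with the 3-byte magic sequence "ORC", so a content-based check could look roughly like this (an illustrative Python sketch, not the orc-tools Java code; the helper name is made up):

```python
def looks_like_orc(path):
    # Read the first 3 bytes and compare them to the ORC magic.
    with open(path, "rb") as f:
        return f.read(3) == b"ORC"
```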

@cxzl25
Contributor Author

cxzl25 commented Jan 4, 2024

If we don't use the ORC data source, the Hive implementation will generate file names without an orc suffix.

./bin/spark-sql
set spark.sql.hive.convertMetastoreCtas=false;
create table tmp_orc stored as orcfile as select id from range(1);
ls -al spark-warehouse/tmp_orc
total 24
drwxr-xr-x@ 6 csy  staff  192  1  4 23:34 .
drwxr-xr-x@ 4 csy  staff  128  1  4 23:34 ..
-rwxr-xr-x@ 1 csy  staff    8  1  4 23:34 .part-00000-5888ab4a-8773-4376-a320-0cd6f4df5889-c000.crc
-rwxr-xr-x@ 1 csy  staff   12  1  4 23:34 .part-00011-5888ab4a-8773-4376-a320-0cd6f4df5889-c000.crc
-rwxr-xr-x@ 1 csy  staff    0  1  4 23:34 part-00000-5888ab4a-8773-4376-a320-0cd6f4df5889-c000
-rwxr-xr-x@ 1 csy  staff  202  1  4 23:34 part-00011-5888ab4a-8773-4376-a320-0cd6f4df5889-c000

Could we do this: if the input is only a single file, ignore the orc suffix?
That behavior would be similar to FileDump, which does not require the file to have an orc suffix.

@dongjoon-hyun
Member

To @cxzl25 , Apache Spark has three ORC code paths.

It seems that you are using HiveFileFormat. It follows Hive's behavior whose file extension is determined by hive.output.file.extension. Please try the following from the beginning.

$ bin/spark-sql -c hive.output.file.extension=orc
...
spark-sql (default)> create table t stored as orc as select id from range(1);
Time taken: 1.706 seconds

$ ls -al spark-warehouse/t
total 40
drwxr-xr-x  8 dongjoon  staff  256 Jan  4 13:50 .
drwxr-xr-x  3 dongjoon  staff   96 Jan  4 13:50 ..
-rw-r--r--  1 dongjoon  staff    8 Jan  4 13:50 ._SUCCESS.crc
-rw-r--r--  1 dongjoon  staff   12 Jan  4 13:50 .part-00000-492978c5-6833-4dd3-a417-eba7a49b9d44-c000.snappy.orc.crc
-rw-r--r--  1 dongjoon  staff   12 Jan  4 13:50 .part-00009-492978c5-6833-4dd3-a417-eba7a49b9d44-c000.snappy.orc.crc
-rw-r--r--  1 dongjoon  staff    0 Jan  4 13:50 _SUCCESS
-rw-r--r--  1 dongjoon  staff  116 Jan  4 13:50 part-00000-492978c5-6833-4dd3-a417-eba7a49b9d44-c000.snappy.orc
-rw-r--r--  1 dongjoon  staff  237 Jan  4 13:50 part-00009-492978c5-6833-4dd3-a417-eba7a49b9d44-c000.snappy.orc

@dongjoon-hyun
Member

For this PR, could you try to add a new additional configuration to ignore the extension while keeping the existing default behavior, @cxzl25 ? For the additional feature, we can accept that while avoiding any breaking change.

@cxzl25
Contributor Author

cxzl25 commented Jan 5, 2024

It seems that you are using HiveFileFormat. It follows Hive's behavior whose file extension is determined by hive.output.file.extension

Thanks for the reminder. Because Spark converts the CTAS statement, I corrected the command slightly (adding the .orc extension).

./bin/spark-sql -c spark.hive.output.file.extension=.orc -c spark.sql.hive.convertMetastoreCtas=false
create table tmp_orc stored as orcfile as select id from range(1);
ls -al spark-warehouse/tmp_orc
-rwxr-xr-x@ 1 csy  staff    8  1  5 11:28 .part-00000-d2fb8ac6-e34b-4b1b-9a08-6c3580e4afc6-c000.orc.crc
-rwxr-xr-x@ 1 csy  staff   12  1  5 11:28 .part-00011-d2fb8ac6-e34b-4b1b-9a08-6c3580e4afc6-c000.orc.crc
-rwxr-xr-x@ 1 csy  staff    0  1  5 11:28 part-00000-d2fb8ac6-e34b-4b1b-9a08-6c3580e4afc6-c000.orc
-rwxr-xr-x@ 1 csy  staff  202  1  5 11:28 part-00011-d2fb8ac6-e34b-4b1b-9a08-6c3580e4afc6-c000.orc

could you try to add a new additional configuration to ignore the extension while keeping the existing default behavior

Thanks, let me do it.

@cxzl25 cxzl25 changed the title ORC-1567: Remove orc-tools restriction on orc suffix name ORC-1567: Add the -ignoreExtension configuration to the sizes and count commands of orc-tools Jan 5, 2024
@dongjoon-hyun dongjoon-hyun changed the title ORC-1567: Add the -ignoreExtension configuration to the sizes and count commands of orc-tools ORC-1567: Support -ignoreExtension option at sizes and count commands of orc-tools Jan 5, 2024
@dongjoon-hyun dongjoon-hyun modified the milestones: 1.9.3, 2.0.0 Jan 5, 2024
@dongjoon-hyun
Member

dongjoon-hyun commented Jan 5, 2024

I set the milestone, 2.0.0.

@dongjoon-hyun
Member

Please update the PR description, @cxzl25 . It looks outdated.

@cxzl25
Contributor Author

cxzl25 commented Jan 5, 2024

Please update the PR description

Thanks. Updated.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Well, this has an unexpected side effect because Apache ORC allows empty ORC files whose size is zero. Please see _SUCCESS in the following output.

$ java -jar tools/target/orc-tools-2.0.0-SNAPSHOT-uber.jar count -i /tmp/o
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
file:/tmp/o/part-00063-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc 1
file:/tmp/o/_SUCCESS 0
file:/tmp/o/part-00000-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc 0
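The zero-byte _SUCCESS marker slips through the extension-ignoring scan exactly because empty files are valid ORC. A caller who wants to avoid the noise could filter such files out before counting (a hedged sketch; skip_empty_files is a hypothetical helper, not part of orc-tools):

```python
import os

def skip_empty_files(paths):
    # Drop zero-byte entries (e.g. _SUCCESS markers), which ORC
    # otherwise accepts as valid empty files.
    return [p for p in paths if os.path.getsize(p) > 0]
```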

Member

@dongjoon-hyun dongjoon-hyun left a comment


Although this PR has a clear limitation as shown above, I believe it's acceptable to deliver because users take on that risk when they use this option.

@dongjoon-hyun
Member

Merged to main for Apache ORC 2.0.0. Thank you, @cxzl25 and @deshanxiao .

@cxzl25
Contributor Author

cxzl25 commented Jan 5, 2024

Thank you for your help and review! @dongjoon-hyun @deshanxiao

cxzl25 added a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…mmands of orc-tools

### What changes were proposed in this pull request?
Add the `--ignoreExtension` option.

```bash
java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar sizes --ignoreExtension path
java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar count --ignoreExtension path
```

### Why are the changes needed?
The `count` and `sizes` commands provided by `orc-tools` currently require files to have an orc suffix.
However, ORC files do not necessarily carry that suffix, which makes the tools inconvenient to use.

### How was this patch tested?

Closes apache#1722 from cxzl25/ORC-1567.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>