ORC-1567: Support -ignoreExtension option at sizes and count commands of orc-tools #1722
Conversation
Thank you for making a PR, @cxzl25. However, this looks risky to me because it can read all other garbage files, and I'm not sure we have preventive logic. This could be a regression.
In addition, the following could be misleading. Could you elaborate a little more about your Spark case?

> the orc files generated by Hive and Spark do not necessarily have an orc suffix

FYI, in the following case, Spark generates ORC files with the `.orc` extension successfully. And you can see that `_SUCCESS`, `._SUCCESS.crc`, and `*.crc` files exist in the same directory.
$ bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/03 05:44:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as 'sc' (master = local[64], app id = local-1704289474929).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.0
/_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.9)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.range(1).write.orc("/tmp/o")
scala> :quit
max spark-3.5.0-bin-hadoop3:$ ls -al /tmp/o
total 40
drwxr-xr-x 8 dongjoon wheel 256 Jan 3 05:44 .
drwxrwxrwt 21 root wheel 672 Jan 3 05:44 ..
-rw-r--r-- 1 dongjoon wheel 8 Jan 3 05:44 ._SUCCESS.crc
-rw-r--r-- 1 dongjoon wheel 12 Jan 3 05:44 .part-00000-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 12 Jan 3 05:44 .part-00063-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 0 Jan 3 05:44 _SUCCESS
-rw-r--r-- 1 dongjoon wheel 116 Jan 3 05:44 part-00000-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc
-rw-r--r-- 1 dongjoon wheel 237 Jan 3 05:44 part-00063-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc
+1 for @dongjoon-hyun's concern. In fact, I think using the file suffix does not seem accurate either. Maybe we can check for the magic number "ORC" at the start of the file to judge whether it is an ORC file. This would not have any performance impact, as it is just the orc tool.
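The magic-number idea above can be sketched as follows. This is a minimal, hypothetical illustration (the class and method names are mine, not orc-tools code): it only checks the leading 3-byte "ORC" header, whereas a robust check would also validate the postscript at the end of the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class OrcMagicCheck {
    // Hypothetical helper: ORC files start with the 3-byte magic "ORC".
    // A zero-length marker file like _SUCCESS fails this check immediately.
    static boolean hasOrcMagic(String path) {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            byte[] magic = new byte[3];
            if (f.length() < 3 || f.read(magic) != 3) {
                return false; // too short to carry the magic bytes
            }
            return "ORC".equals(new String(magic, StandardCharsets.US_ASCII));
        } catch (IOException e) {
            return false; // unreadable files are treated as non-ORC
        }
    }
}
```

Such a check reads only the first few bytes of each candidate file, which is why it should be cheap even over large directories.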
If we don't use the ORC datasource, using the Hive implementation will generate file names without an orc suffix.

$ ./bin/spark-sql
set spark.sql.hive.convertMetastoreCtas=false;
create table tmp_orc stored as orcfile as select id from range(1);

$ ls -al spark-warehouse/tmp_orc
total 24
drwxr-xr-x@ 6 csy staff 192 1 4 23:34 .
drwxr-xr-x@ 4 csy staff 128 1 4 23:34 ..
-rwxr-xr-x@ 1 csy staff 8 1 4 23:34 .part-00000-5888ab4a-8773-4376-a320-0cd6f4df5889-c000.crc
-rwxr-xr-x@ 1 csy staff 12 1 4 23:34 .part-00011-5888ab4a-8773-4376-a320-0cd6f4df5889-c000.crc
-rwxr-xr-x@ 1 csy staff 0 1 4 23:34 part-00000-5888ab4a-8773-4376-a320-0cd6f4df5889-c000
-rwxr-xr-x@ 1 csy staff 202 1 4 23:34 part-00011-5888ab4a-8773-4376-a320-0cd6f4df5889-c000

Can we do this? If the input is only one file, ignore the orc suffix?
To @cxzl25, Apache Spark has three ORC code paths.
It seems that you are using
For this PR, could you try to add a new configuration to ignore the extension while keeping the existing default behavior, @cxzl25? As an additional feature, we can accept that while avoiding any breaking change.
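The default-preserving behavior requested above can be sketched as a simple predicate. The names here are hypothetical (not the actual orc-tools implementation): the extension filter stays on unless the user explicitly opts out.

```java
public class ExtensionFilter {
    // Hypothetical sketch of the requested behavior: by default only
    // *.orc files are scanned (the existing behavior); passing the new
    // flag opts in to scanning every file, at the user's own risk.
    static boolean shouldScan(String fileName, boolean ignoreExtension) {
        if (ignoreExtension) {
            return true;
        }
        return fileName.endsWith(".orc");
    }
}
```

Because the flag defaults to false, existing scripts that rely on the suffix filter see no change in behavior.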
Thanks for the reminder. Because Spark will convert the CTAS statement, I corrected the command slightly:

./bin/spark-sql -c spark.hive.output.file.extension=.orc -c spark.sql.hive.convertMetastoreCtas=false
create table tmp_orc stored as orcfile as select id from range(1);
ls -al spark-warehouse/tmp_orc
Thanks, let me do it.
This reverts commit 530b7c0.
Add -ignoreExtension configuration to the sizes and count commands of orc-tools

Support -ignoreExtension option at sizes and count commands of orc-tools
I set the milestone.
Please update the PR description, @cxzl25. It looks outdated.
Thanks. Updated.
Well, this has an unexpected side effect because Apache ORC allows empty ORC files whose size is zero. Please see the following `_SUCCESS` case.
$ java -jar tools/target/orc-tools-2.0.0-SNAPSHOT-uber.jar count -i /tmp/o
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
file:/tmp/o/part-00063-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc 1
file:/tmp/o/_SUCCESS 0
file:/tmp/o/part-00000-51779dfb-3058-4572-88ea-3568399b7ab0-c000.snappy.orc 0
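One way a caller could mitigate the limitation shown above is to skip zero-length files before handing them to the tool. This guard is hypothetical and not part of the PR; note it only filters truly empty files like `_SUCCESS`, since a valid "empty" ORC file still contains a nonzero-length footer.

```java
import java.io.File;

public class EmptyFileGuard {
    // Hypothetical guard, not part of this PR: a zero-byte file such as
    // _SUCCESS cannot contain an ORC footer, so it can be skipped before
    // any read is attempted when the extension filter is disabled.
    static boolean worthReading(File f) {
        return f.isFile() && f.length() > 0;
    }
}
```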
Although this PR has a clear limitation like the above, I believe it's good enough to deliver because users take the risk when they use this option.
Merged to main for Apache ORC 2.0.0. Thank you, @cxzl25 and @deshanxiao.
Thank you for your help and review! @dongjoon-hyun @deshanxiao |
…mmands of orc-tools

### What changes were proposed in this pull request?
Add the `--ignoreExtension` option.
```bash
java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar sizes --ignoreExtension path
java -jar orc-tools-2.0.0-SNAPSHOT-uber.jar count --ignoreExtension path
```

### Why are the changes needed?
The `count` and `sizes` commands provided by `orc-tools` currently require that the file have an orc suffix. However, files in ORC format do not necessarily have the orc suffix, which is inconvenient to use.

### How was this patch tested?

Closes apache#1722 from cxzl25/ORC-1567.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>