[SPARK-20728][SQL] Make OrcFileFormat configurable between sql/hive and sql/core #19871
Conversation
@@ -363,6 +363,11 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")

+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
How about spark.sql.orc.useNewVersion? Also let's make it an internal config and enable it by default.
Sure!
@@ -568,8 +570,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
Instead of passing the SparkSession, I think we only need SQLConf.
Yep.
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+    if (Seq("orc", "org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) &&
"org.apache.spark.sql.hive.orc.OrcFileFormat" should still point to the old implementation
I see.
test("should fail to load ORC without Hive Support") { | ||
val e = intercept[AnalysisException] { | ||
spark.read.format("orc").load() | ||
test("should fail to load ORC only if spark.sql.orc.enabled=false and without Hive Support") { |
I think this test is replaced by https://github.com/apache/spark/pull/19871/files#diff-5a2e7f03d14856c8769fd3ddea8742bdR2788
Ur, those tests cover different cases.
- In this test: true -> use the new OrcFileFormat, false -> throw an exception (the existing behavior).
- In that test: true -> use the new OrcFileFormat, false -> use the old OrcFileFormat (the existing behavior).
That test also checks the exception: https://github.com/apache/spark/pull/19871/files#diff-5a2e7f03d14856c8769fd3ddea8742bdR2790
Oh, I confused it with SQLQuerySuite.scala in hive. Sorry, I'll remove this.
@@ -363,6 +363,11 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")

+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
+    .doc("When true, use OrcFileFormat in sql/core module instead of the one in sql/hive module.")
The description should include the major difference between these two ORC versions.
Yep. I'll elaborate more.
  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
    if (Seq("orc", "org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) &&
        sparkSession.conf.get(SQLConf.ORC_ENABLED)) {
Shouldn't we get the conf from sessionState?
It's done.
Test build #84409 has finished for PR 19871 at commit
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+    if (Seq("orc").contains(provider1.toLowerCase) && conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
"orc".equalsIgnoreCase(...)
Oh. Yep.
Thank you for review, @cloud-fan and @jiangxb1987.
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
Instead of using var, you can use a pattern match.
Also add the mappings for the new ORC format to backwardCompatibilityMap.
Thanks. Sure.
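Below is a self-contained Scala sketch of the "pattern match instead of var" idea. The object name, the map contents, and the useNativeOrc flag are simplified stand-ins for illustration, not Spark's actual DataSource code:

  // Simplified stand-in for DataSource.lookupDataSource: the map entry and the
  // useNativeOrc flag are illustrative, not the real backwardCompatibilityMap or SQLConf.
  object LookupSketch {
    private val backwardCompatibilityMap: Map[String, String] = Map(
      "org.apache.spark.sql.hive.orc.DefaultSource" -> "org.apache.spark.sql.hive.orc.OrcFileFormat")

    def resolveProvider(provider: String, useNativeOrc: Boolean): String = {
      // No mutable variable: the compatibility lookup feeds straight into a pattern match.
      backwardCompatibilityMap.getOrElse(provider, provider) match {
        case name if name.equalsIgnoreCase("orc") && useNativeOrc =>
          "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
        case name => name
      }
    }

    def main(args: Array[String]): Unit = {
      println(resolveProvider("orc", useNativeOrc = true))   // native ORC class name
      println(resolveProvider("orc", useNativeOrc = false))  // left as "orc" for further resolution
    }
  }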
Test build #84413 has finished for PR 19871 at commit
Test build #84412 has finished for PR 19871 at commit
Test build #84411 has finished for PR 19871 at commit
@@ -2153,4 +2153,21 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }

+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
Move it to OrcQuerySuite
Yep.
  sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    relation,
    options,
    classOf[org.apache.spark.sql.hive.orc.OrcFileFormat], "orc")
indents.
Done.
Test build #84421 has finished for PR 19871 at commit
Could we enable ignore("LZO compression options for writing to an ORC file not supported in Hive 1.2.1") in OrcQuerySuite too?
@@ -85,7 +87,8 @@ case class DataSource(

 case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])

-  lazy val providingClass: Class[_] = DataSource.lookupDataSource(className)
+  lazy val providingClass: Class[_] =
+    DataSource.lookupDataSource(sparkSession.sessionState.conf, className)
I'd put this conf as the last argument actually if you wouldn't mind ..
Sure!
  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
      "more stable and faster.")
Tiny nit: let's take out "more stable".
Thank you for the review, @HyukjinKwon.
Do you mean the Apache ORC library is more stable, but the new OrcFileFormat is not because it's newly introduced?
Actually, that's true from Spark's viewpoint, but the new OrcFileFormat contains more bug fixes and new features too. If you allow, I want to keep this. :)
@@ -568,8 +574,12 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
After more thinking, I don't think it's worth passing the whole SQLConf into this function; we just need to know whether SQLConf.ORC_USE_NEW_VERSION is enabled. WDYT @cloud-fan @gatorsmile?
So, are you suggesting lookupDataSource(provider, useNewOrc=true), @jiangxb1987?
@HyukjinKwon, for enabling the following test, I'm restructuring ORC tests now. I'll make a PR today for that.
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")

+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
spark.sql.orc.impl
No problem to change to it. But since the name was given by @cloud-fan before, ping @cloud-fan.
"Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " + | ||
"more stable and faster.") | ||
.internal() | ||
.booleanConf |
.checkValues(Set("hive", "native"))
.createWithDefault("native")
Yep.
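Putting the suggestions in this thread together, the config definition might end up looking roughly like the sketch below (inside object SQLConf; stringConf is assumed here because the allowed values are strings, and the exact doc wording may differ in the merged code):

  val ORC_IMPLEMENTATION = buildConf("spark.sql.orc.impl")
    .doc("When native, use the native version of ORC support instead of the ORC library in " +
      "Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.")
    .internal()
    .stringConf
    .checkValues(Set("hive", "native"))
    .createWithDefault("native")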
  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
      "more stable and faster.")
When native, use the native version of ORC support instead of the ORC library in Hive 1.2.1. It is hive by default prior to Spark 2.3.
Thanks!
@@ -537,6 +540,7 @@ object DataSource extends Logging {
     val csv = classOf[CSVFileFormat].getCanonicalName
     val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
     val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
+    val newOrc = classOf[OrcFileFormat].getCanonicalName
Please do not use a name like newXYZ. When a newer one is added, the name will be confusing.
How about nativeOrc?
Yep. It sounds better.
Test build #84436 has finished for PR 19871 at commit
Test build #84439 has finished for PR 19871 at commit
Could you review this again, @cloud-fan and @gatorsmile?
      case name if name.equalsIgnoreCase("orc") &&
          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
        classOf[OrcFileFormat].getCanonicalName
      case name => name
If ORC_IMPLEMENTATION is hive, we leave the provider as it was, which may be orc. Then we will hit the Multiple sources found issue, won't we? Both the old and new ORC have the same short name orc.
I was looking at the exact same path. It seems not, because it's not registered to ServiceLoader (src/main/resources/org.apache.spark.sql.sources.DataSourceRegister). So, the short name for the newer ORC source would not be used here.
@cloud-fan, to avoid that issue, the new OrcFileFormat is intentionally not registered.
@HyukjinKwon's comment is correct.
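For context, lookupDataSource discovers short names through Java's ServiceLoader over the DataSourceRegister service file, so only registered sources can answer to "orc". Below is a hedged, illustrative sketch (not code from this PR) of inspecting that from a Spark driver; the shortName filtering mirrors what lookupDataSource does:

  import java.util.ServiceLoader
  import scala.collection.JavaConverters._
  import org.apache.spark.sql.sources.DataSourceRegister

  // List every registered data source whose short name answers to "orc".
  // If both ORC implementations were registered, lookupDataSource would report
  // "Multiple sources found" for that short name.
  val orcProviders = ServiceLoader.load(classOf[DataSourceRegister]).asScala
    .filter(_.shortName().equalsIgnoreCase("orc"))
    .map(_.getClass.getName)
  orcProviders.foreach(println)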
This sounds counter-intuitive; I think we should register the new ORC instead of the old one.
and also add comments here.
+1 for ^
I agree with both of you.
Just for explanation: the original design completely preserves the previous behavior. Without the SQLConf.ORC_IMPLEMENTATION option, Spark doesn't know about the new OrcFileFormat. So, without Hive support, creating a data source with "orc" will fail with an unknown data source error.
Anyway, I'm happy to update according to your advice. :)
So, there is no more "The ORC data source must be used with Hive support enabled". If the hive impl is requested in sql/core, it will show a more proper message.
sounds good
And for here, I added the following to prevent Multiple sources found. Last time, I missed this way. My bad.
+ case name if name.equalsIgnoreCase("orc") &&
+ conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "hive" =>
+ "org.apache.spark.sql.hive.orc.OrcFileFormat"
Test build #84461 has finished for PR 19871 at commit
Test build #84460 has finished for PR 19871 at commit
Retest this please.
LGTM
LGTM too
@@ -587,7 +601,8 @@ object DataSource extends Logging {
     if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
         provider1.startsWith("org.apache.spark.sql.hive.orc")) {
       throw new AnalysisException(
-        "The ORC data source must be used with Hive support enabled")
+        "Hive-based ORC data source must be used with Hive support enabled. " +
+        "Please use native ORC data source instead")
I think we should make this more actionable, saying spark.sql.orc.impl should be set to native explicitly.
@HyukjinKwon, for this one, I made #19903.
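A hypothetical sketch of what a more actionable message could look like (the actual wording is settled in #19903, not here):

      throw new AnalysisException(
        "Hive built-in ORC data source must be used with Hive support enabled. " +
        "Please use the native ORC data source by setting 'spark.sql.orc.impl' to 'native'")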
test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") { | ||
Seq( | ||
("native", classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat]), | ||
("hive", classOf[org.apache.spark.sql.hive.orc.OrcFileFormat])).foreach { case (i, format) => |
nit: i => orcImpl
For this one, I will update #19882. I updated it locally and am running some tests.
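With the rename applied, the test might read roughly as below; the body inside the loop is an assumption sketched from the diff above, which cuts off before the assertion:

  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
    Seq(
      ("native", classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat]),
      ("hive", classOf[org.apache.spark.sql.hive.orc.OrcFileFormat])).foreach { case (orcImpl, format) =>
      withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl) {
        withTable("spark_20728") {
          sql("CREATE TABLE spark_20728(a INT) USING ORC")
          // Check which FileFormat class actually backs the relation.
          val fileFormat = sql("SELECT a FROM spark_20728").queryExecution.analyzed.collectFirst {
            case l: LogicalRelation => l.relation.asInstanceOf[HadoopFsRelation].fileFormat.getClass
          }
          assert(fileFormat == Some(format))
        }
      }
    }
  }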
Test build #84464 has finished for PR 19871 at commit
Since this PR blocks the test-moving PR, I'm merging to master now. @dongjoon-hyun please address @HyukjinKwon's comments in your next PR. Thanks!
Thank you, @cloud-fan and @HyukjinKwon. I'll address that in the next PR.
@@ -553,6 +557,8 @@ object DataSource extends Logging {
     "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
     "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
     "org.apache.spark.sql.hive.orc" -> orc,
+    "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+    "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
This map is for backward compatibility in case we move data sources around. I think this datasources.orc is newly added. Why do we need to add them here?
Ah, good catch! Sounds like we don't need a compatibility rule for the new ORC.
Like USING org.apache.spark.sql.hive.orc, we want to use USING org.apache.spark.sql.execution.datasources.orc, don't we?
When I received the advice, I thought it was for consistency.
This is for safety. We also do it for parquet
For parquet, this is for historical reasons, see #13311. Previously you could use parquet via org.apache.spark.sql.execution.datasources.parquet and org.apache.spark.sql.execution.datasources.parquet.DefaultSource, so it is also for backward compatibility. For this new ORC, it is not the same case.
I think we should rename the variable and/or fix the comments there when we touch the code around there, to prevent confusion next time though.
The org.apache.spark.sql.execution.* path is meant to be private too. But I think it's okay to leave it for practical use cases, with some comments.
These changes are pretty safe. In case we move the ORC source to another location, it will still refer to the right one.
@@ -587,7 +601,8 @@ object DataSource extends Logging {
     if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
provider1 can't be "orc" anymore. It can only be classOf[OrcFileFormat].getCanonicalName or "org.apache.spark.sql.hive.orc.OrcFileFormat".
Yep. I'll remove the provider1.toLowerCase(Locale.ROOT) == "orc" || part.
What changes were proposed in this pull request?
This PR aims to provide a configuration to choose the default OrcFileFormat from the legacy sql/hive module or the new sql/core module. For example, this configuration will affect the following operations.
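As one illustrative way to exercise the switch (the path below is a placeholder, not from this PR):

  // Native (sql/core) implementation, the default after this PR:
  spark.conf.set("spark.sql.orc.impl", "native")
  spark.read.format("orc").load("/tmp/example_orc_path").show()

  // Fall back to the Hive 1.2.1-based implementation in sql/hive:
  spark.conf.set("spark.sql.orc.impl", "hive")
  spark.read.format("orc").load("/tmp/example_orc_path").show()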
How was this patch tested?
Pass the Jenkins with new test suites.