
[SPARK-25129][SQL] Make the mapping of com.databricks.spark.avro to built-in module configurable #22133

Closed

Conversation

@gengliangwang (Member) commented Aug 17, 2018

What changes were proposed in this pull request?

In SPARK-24924 (https://issues.apache.org/jira/browse/SPARK-24924), the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro.

As per the discussion in the Jira and PR #22119, we should make the mapping configurable.

This PR also improves the error message shown when the Avro or Kafka data source is not found.
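
A minimal sketch of how the new behavior could be exercised (the configuration key comes from this PR; the session setup, application name, and file path are illustrative placeholders, and the Avro module still needs to be on the classpath, e.g. via --packages):

import org.apache.spark.sql.SparkSession

// Illustrative sketch only: the config key is introduced by this PR;
// the application name and path are placeholders.
val spark = SparkSession.builder()
  .appName("legacy-avro-mapping-demo")
  // Default is true: com.databricks.spark.avro is redirected to the built-in Avro source.
  .config("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")
  .getOrCreate()

// With the mapping enabled, code written against the old provider name keeps working,
// provided the built-in (but external) Avro module is on the classpath.
val df = spark.read.format("com.databricks.spark.avro").load("/path/to/users.avro")
df.show()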

How was this patch tested?

Unit test

throw new AnalysisException(
  s"Failed to find data source: $provider1. Avro is built-in data source since " +
  "Spark 2.4. Please deploy the application as per " +
  s"$latestDocsURL/avro-data-source-guide.html#deploying")
Member Author

Please merge #22121 before this one is merged.
But first we need to have agreement on the configuration name, since it is also mentioned in the documentation.

Member

We can update the message later. No need to be blocked by that.

Contributor

Perhaps we should say "built-in but external module". If it's built-in, I would expect it to automatically be there.

Member Author

@tgravescs Makes sense, thank you.

SparkQA commented Aug 17, 2018

Test build #94885 has finished for PR 22133 at commit 0d6ba2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1474,6 +1474,12 @@ object SQLConf {
.checkValues((1 to 9).toSet + Deflater.DEFAULT_COMPRESSION)
.createWithDefault(Deflater.DEFAULT_COMPRESSION)

val ENABLE_AVRO_BACKWARD_COMPATIBILITY = buildConf("spark.sql.avro.backwardCompatibility")
Member

Thank you for working on this. I'm wondering if we can provide this in a more general way.

SparkQA commented Aug 17, 2018

Test build #94894 has finished for PR 22133 at commit 400a7f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 20, 2018

Test build #94940 has finished for PR 22133 at commit be69aec.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)
retest this please

SparkQA commented Aug 20, 2018

Test build #94950 has finished for PR 22133 at commit be69aec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val LEGACY_REPLACE_DATABRICKS_SPARK_AVRO_ENABLED =
  buildConf("spark.sql.legacy.replaceDatabricksSparkAvro.enabled")
    .doc("If it is set to true, the data source provider com.databricks.spark.avro is mapped " +
      "to the built-in Avro data source module for backward compatibility.")
Contributor

Do we want to give more details here about this being for Hive table provider compatibility?

I think it would also be nice to put more details about the compatibility in the Avro doc, either in this PR or in the other doc PR.

Member Author

Thanks, I will do it in #22121

SparkQA commented Aug 20, 2018

Test build #94960 has finished for PR 22133 at commit d617db0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -593,7 +592,6 @@ object DataSource extends Logging {
"org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
"org.apache.spark.ml.source.libsvm" -> libsvm,
"com.databricks.spark.csv" -> csv,
"com.databricks.spark.avro" -> avro,
Member

@gengliangwang, not a big deal, but how about adding the entry at line 618 here conditionally, since this is called the backward compatibility map?

@gengliangwang (Member Author) commented Aug 21, 2018

@HyukjinKwon I did add it to the backwardCompatibilityMap at first, but later found that the configuration wouldn't take effect at runtime, since backwardCompatibilityMap is a val. (We could change backwardCompatibilityMap to a method to resolve that.)

Also, the code would look ugly:

val retMap = Map(...)
if (...) {
  retMap + (k -> v)
} else {
  retMap
}
// it would be worse if we have more configurations
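
For illustration, a minimal, self-contained Scala sketch of the val-versus-method distinction being discussed (MappingDemo and legacyMappingEnabled are made-up names, not the actual Spark code):

object MappingDemo {
  // Stand-in for a runtime-changeable configuration flag; illustrative only.
  var legacyMappingEnabled: Boolean = true

  // A val is computed once when the object is initialized,
  // so later configuration changes are not reflected.
  val mappingAsVal: Map[String, String] =
    if (legacyMappingEnabled) Map("com.databricks.spark.avro" -> "avro") else Map.empty

  // A method re-evaluates the condition on every call,
  // so it always sees the current configuration.
  def mappingAsDef: Map[String, String] =
    if (legacyMappingEnabled) Map("com.databricks.spark.avro" -> "avro") else Map.empty

  def main(args: Array[String]): Unit = {
    legacyMappingEnabled = false
    println(mappingAsVal.contains("com.databricks.spark.avro"))  // true: stale snapshot
    println(mappingAsDef.contains("com.databricks.spark.avro"))  // false: reflects the change
  }
}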

Member

Ah, okay, makes sense if there's a reason.

@@ -626,6 +626,7 @@ object DataSource extends Logging {
serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {
// the provider format did not match any given registered aliases
case Nil =>
val latestDocsURL = "https://spark.apache.org/docs/latest"
Member

I would actually avoid leaving the explicit doc link, because we will have to fix it for every release. Just prose should be good enough.

Member Author

This is the link for the latest doc. I think it should be ok.

Member

I mean, if we happen to have Spark 3.0.0, then this link will be stale in 2.4.0, no?

Member Author

The doc will look like this:
https://github.com/apache/spark/pull/22121/files#diff-acdddc6cbd45ccd226bf151564b9cc40R11

It is about loading the module with --packages.

Member

But what if we happen to have more information specific to newer versions? For example, we could have a different group name rule, etc., in the future. Avoiding pointing to the latest docs shouldn't be too difficult; you could just say "Please refer to the deployment section in the Apache Avro data source guide." If we happen to have other new changes specific to newer versions there, we would strictly speaking have to go and fix the links in all the branches.

Member Author

Ah, makes sense. Thank you!

@HyukjinKwon (Member)
Seems fine otherwise.

SparkQA commented Aug 21, 2018

Test build #94996 has finished for PR 22133 at commit e57b232.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)
retest this please.

SparkQA commented Aug 21, 2018

Test build #95001 has finished for PR 22133 at commit e57b232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 21, 2018

Test build #95024 has finished for PR 22133 at commit b22834f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)
+1

@gatorsmile (Member) left a comment

LGTM

Thanks! Merged to master.

@asfgit closed this in ac0174e on Aug 21, 2018
otterc pushed a commit to linkedin/spark that referenced this pull request on Mar 22, 2023:
[SPARK-25129][SQL] Make the mapping of com.databricks.spark.avro to built-in module configurable

In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro.

As per the discussion in the [Jira](https://issues.apache.org/jira/browse/SPARK-24924) and PR apache#22119, we should make the mapping configurable.

This PR also improves the error message when the Avro/Kafka data source is not found.

Unit test

Closes apache#22133 from gengliangwang/configurable_avro_mapping.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
(cherry picked from commit ac0174e)

RB=1526614
BUG=LIHADOOP-43392
R=fli,mshen,yezhou,edlu
A=fli