
[SPARK-25129][SQL] Make the mapping of com.databricks.spark.avro to built-in module configurable #22133

Closed

Conversation

@gengliangwang (Member) commented Aug 17, 2018

What changes were proposed in this pull request?

In SPARK-24924 (https://issues.apache.org/jira/browse/SPARK-24924), the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro.

As per the discussion in the Jira and PR #22119, we should make the mapping configurable.

This PR also improves the error message shown when the Avro or Kafka data source is not found.
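
A minimal sketch of how the new behavior could be exercised (the configuration key comes from this PR; the session setup, application name, and file path are illustrative placeholders, and the Avro module still needs to be on the classpath, e.g. via --packages):

import org.apache.spark.sql.SparkSession

// Illustrative sketch only: the config key is introduced by this PR;
// the application name and path are placeholders.
val spark = SparkSession.builder()
  .appName("legacy-avro-mapping-demo")
  // Default is true: com.databricks.spark.avro is redirected to the built-in Avro source.
  .config("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")
  .getOrCreate()

// With the mapping enabled, code written against the old provider name keeps working,
// provided the built-in (but external) Avro module is on the classpath.
val df = spark.read.format("com.databricks.spark.avro").load("/path/to/users.avro")
df.show()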

How was this patch tested?

Unit test

throw new AnalysisException(
  s"Failed to find data source: $provider1. Avro is built-in data source since " +
  "Spark 2.4. Please deploy the application as per " +
  s"$latestDocsURL/avro-data-source-guide.html#deploying")
Member Author

Please merge #22121 before this one is merged.
But first we need to have agreement on the configuration name, since it is also mentioned in the documentation.

Member

We can update the message later. No need to be blocked by that.

Contributor

Perhaps we should say "built-in but external module". If it's built-in, I would expect it to automatically be there.

Member Author

@tgravescs Makes sense, thank you.

SparkQA commented Aug 17, 2018

Test build #94885 has finished for PR 22133 at commit 0d6ba2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1474,6 +1474,12 @@ object SQLConf {
.checkValues((1 to 9).toSet + Deflater.DEFAULT_COMPRESSION)
.createWithDefault(Deflater.DEFAULT_COMPRESSION)

val ENABLE_AVRO_BACKWARD_COMPATIBILITY = buildConf("spark.sql.avro.backwardCompatibility")
Member

Thank you for working on this. I'm wondering if we can provide this in a more general way.

SparkQA commented Aug 17, 2018

Test build #94894 has finished for PR 22133 at commit 400a7f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 20, 2018

Test build #94940 has finished for PR 22133 at commit be69aec.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)
retest this please

SparkQA commented Aug 20, 2018

Test build #94950 has finished for PR 22133 at commit be69aec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val LEGACY_REPLACE_DATABRICKS_SPARK_AVRO_ENABLED =
  buildConf("spark.sql.legacy.replaceDatabricksSparkAvro.enabled")
    .doc("If it is set to true, the data source provider com.databricks.spark.avro is mapped " +
      "to the built-in Avro data source module for backward compatibility.")
Contributor

Do we want to give more details here about this being for Hive table provider compatibility?

I think it would also be nice to put more details about the compatibility in the Avro doc, either in this PR or in the other doc PR.

Member Author

Thanks, I will do it in #22121

SparkQA commented Aug 20, 2018

Test build #94960 has finished for PR 22133 at commit d617db0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -593,7 +592,6 @@ object DataSource extends Logging {
"org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
"org.apache.spark.ml.source.libsvm" -> libsvm,
"com.databricks.spark.csv" -> csv,
"com.databricks.spark.avro" -> avro,
Member

@gengliangwang, not a big deal, but how about adding the entry at line 618 here conditionally, since this is called the backward compatibility map?

@gengliangwang (Member Author) commented Aug 21, 2018

@HyukjinKwon I did add it to the backwardCompatibilityMap at first, but later found that the configuration wouldn't take effect at runtime, since backwardCompatibilityMap is a val. (We could change backwardCompatibilityMap to a method to resolve that.)

Also, the code would look ugly:

val retMap = Map(...)
if (...) {
  retMap + (k -> v)
} else {
  retMap
}
// it would be worse if we have more configurations
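
For illustration, a minimal, self-contained Scala sketch of the val-versus-method distinction being discussed (MappingDemo and legacyMappingEnabled are made-up names, not the actual Spark code):

object MappingDemo {
  // Stand-in for a runtime-changeable configuration flag; illustrative only.
  var legacyMappingEnabled: Boolean = true

  // A val is computed once when the object is initialized,
  // so later configuration changes are not reflected.
  val mappingAsVal: Map[String, String] =
    if (legacyMappingEnabled) Map("com.databricks.spark.avro" -> "avro") else Map.empty

  // A method re-evaluates the condition on every call,
  // so it always sees the current configuration.
  def mappingAsDef: Map[String, String] =
    if (legacyMappingEnabled) Map("com.databricks.spark.avro" -> "avro") else Map.empty

  def main(args: Array[String]): Unit = {
    legacyMappingEnabled = false
    println(mappingAsVal.contains("com.databricks.spark.avro"))  // true: stale snapshot
    println(mappingAsDef.contains("com.databricks.spark.avro"))  // false: reflects the change
  }
}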

Member

Ah, okay, makes sense if there's a reason.

@@ -626,6 +626,7 @@ object DataSource extends Logging {
serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {
// the provider format did not match any given registered aliases
case Nil =>
val latestDocsURL = "https://spark.apache.org/docs/latest"
Member

I would actually avoid leaving the explicit doc link, because we will have to fix it for every release. Just prose should be good enough.

Member Author

This is the link for the latest doc. I think it should be ok.

Member

I mean, if we happen to have Spark 3.0.0, then this link will be stale in 2.4.0, no?

Member Author

The doc will look like this:
https://github.com/apache/spark/pull/22121/files#diff-acdddc6cbd45ccd226bf151564b9cc40R11

It is about loading the module with --packages.

Member

But what if we happen to have more information specific to newer versions? For example, we could have a different group name rule, etc., in the future. Avoiding pointing to the latest docs shouldn't be too difficult; you could just say "Please refer to the deployment section in the Apache Avro data source guide." If we happen to have other new changes specific to newer versions there, we would strictly speaking have to go and fix the links in all the branches.

Member Author

Ah, makes sense. Thank you!

@HyukjinKwon (Member)
Seems fine otherwise.

SparkQA commented Aug 21, 2018

Test build #94996 has finished for PR 22133 at commit e57b232.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)
retest this please.

SparkQA commented Aug 21, 2018

Test build #95001 has finished for PR 22133 at commit e57b232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 21, 2018

Test build #95024 has finished for PR 22133 at commit b22834f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)
+1

@gatorsmile (Member) left a comment

LGTM

Thanks! Merged to master.

@asfgit closed this in ac0174e on Aug 21, 2018
otterc pushed a commit to linkedin/spark that referenced this pull request on Mar 22, 2023:
[SPARK-25129][SQL] Make the mapping of com.databricks.spark.avro to built-in module configurable

In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro.

As per the discussion in the [Jira](https://issues.apache.org/jira/browse/SPARK-24924) and PR apache#22119, we should make the mapping configurable.

This PR also improves the error message when the Avro/Kafka data source is not found.

Unit test

Closes apache#22133 from gengliangwang/configurable_avro_mapping.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
(cherry picked from commit ac0174e)

RB=1526614
BUG=LIHADOOP-43392
R=fli,mshen,yezhou,edlu
A=fli