[SPARK-25129][SQL] Make the mapping of com.databricks.spark.avro to built-in module configurable #22133
Conversation
```scala
throw new AnalysisException(
  s"Failed to find data source: $provider1. Avro is built-in data source since " +
  "Spark 2.4. Please deploy the application as per " +
  s"$latestDocsURL/avro-data-source-guide.html#deploying")
```
Please merge #22121 before this one is merged.
But first we need to have agreement on the configuration name, since it is also mentioned in the documentation.
We can update the message later. No need to be blocked by that.
Perhaps we should say "built-in but external module". If it's built-in, I would expect it to automatically be there.
@tgravescs Makes sense, thank you.
Test build #94885 has finished for PR 22133 at commit
```diff
@@ -1474,6 +1474,12 @@ object SQLConf {
     .checkValues((1 to 9).toSet + Deflater.DEFAULT_COMPRESSION)
     .createWithDefault(Deflater.DEFAULT_COMPRESSION)

+  val ENABLE_AVRO_BACKWARD_COMPATIBILITY = buildConf("spark.sql.avro.backwardCompatibility")
```
Thank you for working on this. I'm wondering if we can provide this in a more general way.
Test build #94894 has finished for PR 22133 at commit

Test build #94940 has finished for PR 22133 at commit

retest this please

Test build #94950 has finished for PR 22133 at commit
```scala
val LEGACY_REPLACE_DATABRICKS_SPARK_AVRO_ENABLED =
  buildConf("spark.sql.legacy.replaceDatabricksSparkAvro.enabled")
    .doc("If it is set to true, the data source provider com.databricks.spark.avro is mapped " +
      "to the built-in Avro data source module for backward compatibility.")
```
Do we want to give more details here about this being for Hive table provider compatibility?
I think it would also be nice to put more details about the compatibility in the Avro doc, either in this PR or in the other doc one.
Thanks, I will do it in #22121
Test build #94960 has finished for PR 22133 at commit
```diff
@@ -593,7 +592,6 @@ object DataSource extends Logging {
     "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
     "org.apache.spark.ml.source.libsvm" -> libsvm,
     "com.databricks.spark.csv" -> csv,
-    "com.databricks.spark.avro" -> avro,
```
@gengliangwang, not a big deal, but how about adding the entry at line 618 conditionally, since this is called the backward-compatibility map?
@HyukjinKwon I did add it in the `backwardCompatibilityMap` at first. But later I found that the configuration won't take effect at runtime, since `backwardCompatibilityMap` is a `val`. (We could change `backwardCompatibilityMap` to a method to resolve that.) Also, the code looks ugly:

```scala
val retMap = Map(...)
if (...) {
  retMap + (k -> v)
} else {
  retMap
}
// it would be worse if we have more configurations.
```
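A hedged sketch of the alternative mentioned above: turning `backwardCompatibilityMap` into a `def` so the flag is re-read on every lookup. The `csv`/`avro` values stand for the provider class names already defined in `DataSource`; this is an illustration, not the code the PR merged:

```scala
// Re-evaluated on each call, so a runtime change to the SQLConf flag takes
// effect; a val would bake the decision in at class-initialization time.
private def backwardCompatibilityMap: Map[String, String] = {
  val base = Map(
    "com.databricks.spark.csv" -> csv)
  if (SQLConf.get.getConf(SQLConf.LEGACY_REPLACE_DATABRICKS_SPARK_AVRO_ENABLED)) {
    base + ("com.databricks.spark.avro" -> avro)
  } else {
    base
  }
}
```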
Ah okie makes sense if there's a reason.
```diff
@@ -626,6 +626,7 @@ object DataSource extends Logging {
     serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {
       // the provider format did not match any given registered aliases
       case Nil =>
+        val latestDocsURL = "https://spark.apache.org/docs/latest"
```
I would actually avoid leaving the explicit doc link, because we will have to fix it for every release. Just prose should be good enough.
This is the link for the latest doc. I think it should be ok.
I mean, if we happen to release Spark 3.0.0, then this link will be stale in 2.4.0, no?
The doc will look like https://github.com/apache/spark/pull/22121/files#diff-acdddc6cbd45ccd226bf151564b9cc40R11. It is about loading the module with `--packages`.
But what if we happen to have more information specific to newer versions? For example, we could have a different group-name rule, etc. in the future. Avoiding pointing to the latest docs shouldn't be too difficult: you could just say "Please refer to the deployment section in the Apache Avro data source guide." If we happen to add other changes specific to newer versions there, strictly speaking we would have to go back and fix the links in all the branches.
Ah, makes sense. Thank you!
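Putting the two review threads together (say "built-in but external module", and avoid a hard-coded versioned URL), the reworked message would look roughly like the sketch below; the exact wording that landed may differ:

```scala
throw new AnalysisException(
  s"Failed to find data source: $provider1. Avro is built-in but external data " +
  "source module since Spark 2.4. Please deploy the application as per " +
  "the deployment section of \"Apache Avro Data Source Guide\".")
```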
Seems fine otherwise.

Test build #94996 has finished for PR 22133 at commit

retest this please.

Test build #95001 has finished for PR 22133 at commit
Force-pushed from e57b232 to b22834f.
Test build #95024 has finished for PR 22133 at commit

+1
LGTM
Thanks! Merged to master.
…uilt-in module configurable

In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro. As per the discussion in the [Jira](https://issues.apache.org/jira/browse/SPARK-24924) and PR apache#22119, we should make the mapping configurable. This PR also improves the error message shown when the Avro/Kafka data source is not found.

Unit test

Closes apache#22133 from gengliangwang/configurable_avro_mapping.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
(cherry picked from commit ac0174e)
What changes were proposed in this pull request?
In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro.
As per the discussion in the Jira and PR #22119, we should make the mapping configurable.
This PR also improves the error message shown when the Avro/Kafka data source is not found.
How was this patch tested?
Unit test
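A hedged sketch of what such a unit test could look like, assuming Spark's `withSQLConf` test helper and the `DataSource.lookupDataSource(provider, conf)` entry point; the actual test in the patch may differ:

```scala
test("SPARK-25129: com.databricks.spark.avro mapping is configurable") {
  // Flag on (the default): the legacy provider name resolves to the
  // built-in Avro data source.
  withSQLConf(SQLConf.LEGACY_REPLACE_DATABRICKS_SPARK_AVRO_ENABLED.key -> "true") {
    val cls = DataSource.lookupDataSource("com.databricks.spark.avro", spark.sessionState.conf)
    assert(cls.getName.startsWith("org.apache.spark.sql.avro"))
  }
  // Flag off: the lookup fails with the improved error message.
  withSQLConf(SQLConf.LEGACY_REPLACE_DATABRICKS_SPARK_AVRO_ENABLED.key -> "false") {
    val e = intercept[AnalysisException] {
      DataSource.lookupDataSource("com.databricks.spark.avro", spark.sessionState.conf)
    }
    assert(e.getMessage.contains("Failed to find data source"))
  }
}
```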