[SPARK-25143][SQL] Support data source name mapping configuration #22134
Conversation
Ping, @gengliangwang. I took a different approach from #22133. This will be more general.
Could you review this please, @tgravescs, @cloud-fan, @gatorsmile, @HyukjinKwon?
conf.getAllConfs
  .filter(_._1.startsWith("spark.sql.datasource.map"))
  .map { case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
val compatibilityMap = backwardCompatibilityMap ++ customBackwardCompatibilityMap
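A minimal self-contained sketch of the merge logic above, with assumed map contents (the AvroFileFormat class name is only illustrative); it shows that user-supplied spark.sql.datasource.map.* entries win because they are concatenated last:

object MappingSketch extends App {
  // Stand-in for the built-in table; the entry is illustrative.
  val backwardCompatibilityMap = Map(
    "com.databricks.spark.avro" -> "org.apache.spark.sql.avro.AvroFileFormat")
  // Stand-in for conf.getAllConfs with one user-supplied override.
  val allConfs = Map(
    "spark.sql.datasource.map.com.databricks.spark.avro" -> "com.databricks.spark.avro")
  // Same filter/strip-prefix steps as the PR snippet above.
  val customBackwardCompatibilityMap = allConfs
    .filter(_._1.startsWith("spark.sql.datasource.map"))
    .map { case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
  // ++ keeps the right-hand entry on duplicate keys, so the override wins.
  val compatibilityMap = backwardCompatibilityMap ++ customBackwardCompatibilityMap
  println(compatibilityMap("com.databricks.spark.avro")) // com.databricks.spark.avro
}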
So at this point we are leaving com.databricks.spark.avro -> internal Avro as the default, and users have to set it back to com.databricks.spark.avro, or do they set it to empty? If it is set to empty, I think it will return empty below, which will cause an issue.
We should have a test case for the empty value and perhaps a check for it below in that case.
What about documentation? Is there a JIRA for documenting all the Avro stuff? If we do leave it as the default, we definitely want a release note on the change in behavior.
Answering one of my own questions: found the Avro docs here: #22121
If this is merged, we can remove the internal Avro mapping code and move it into the documentation before the branch cut.

> so at this point we are leaving com.databricks.spark.avro -> internal avro as the default and users have to set it back to com.databricks.spark.avro or do they set it empty?

This will be better than an alternative like the following Avro-specific option:
val ENABLE_AVRO_BACKWARD_COMPATIBILITY =
  buildConf("spark.sql.avro.backwardCompatibility")
    .booleanConf.createWithDefault(true) // sketch only; not actually proposed
I have the same concern as @tgravescs. It seems tricky to unset the default mapping. For example, if by default we map com.databricks.spark.avro to the internal Avro, then to unset it we have to set spark.sql.datasource.map.com.databricks.spark.avro -> com.databricks.spark.avro.
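A hedged sketch of what that unset would look like at runtime (the spark session value is assumed):

// Under this proposal, restoring the external Databricks Avro source means
// mapping the class name back to itself via configuration.
spark.conf.set(
  "spark.sql.datasource.map.com.databricks.spark.avro",
  "com.databricks.spark.avro")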
Currently we only have to deal with Avro and CSV, so I think it is ok to have one single straightforward configuration like #22133 proposed.
Test build #94889 has finished for PR 22134 at commit
Retest this please.
Do we need this at the current stage? Regarding UX, it looks complex to end users. I am unable to remember the names, and it is very easy to provide a wrong class name.
@gengliangwang and @gatorsmile. First of all, the internal 3rd-party mapping should be removed in Apache Spark 3.0; please consider that. Also, with this PR, we can remove the internal mapping, and we don't need to
I don't think such a RESET is user-friendly.
@gengliangwang. As of now, it's not possible to use that full class name for the Hive table saved as
Test build #94902 has finished for PR 22134 at commit
I think it's premature to introduce this. The extra layer of abstraction actually makes it more difficult to reason about what's going on. We don't have that many data sources that require flexibility here, and we can always add the flexibility if needed later.
I got it. I'll close this approach.
What changes were proposed in this pull request?
Currently, for better UX, Apache Spark provides data source backward compatibility via the predefined built-in `backwardCompatibilityMap`. Although this mapping table is maintained carefully, it is not flexible when we have multiple implementations like Avro/CSV/ORC. This has introduced many additional options to decide which implementation will be used for a registered `shortName` (like `csv`, `orc`, `avro`).

This PR aims to extend Spark's backward-compatibility mapping capability by letting users provide a custom mapping, in a general manner, as configuration. If a user hits a problem with the provided mapping, he/she can override it via configuration. This also reduces complexity in the code.
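As an illustration only (the target class name below is an assumption about one possible mapping, not something this PR prescribes), a user-supplied override could be set at session startup:

import org.apache.spark.sql.SparkSession

// Sketch: map the legacy CSV package name to a chosen implementation via the
// spark.sql.datasource.map.* namespace introduced by this PR.
val spark = SparkSession.builder()
  .appName("custom-datasource-mapping")
  .config("spark.sql.datasource.map.com.databricks.spark.csv",
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .getOrCreate()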
How was this patch tested?
Passed Jenkins with a newly added test case.