
[SPARK-25143][SQL] Support data source name mapping configuration #22134

Closed
wants to merge 1 commit

Conversation

@dongjoon-hyun (Member) commented Aug 17, 2018

What changes were proposed in this pull request?

Currently, for better UX, Apache Spark provides data source backward compatibility by using the predefined built-in backwardCompatibilityMap.

Although this mapping table is maintained carefully, it is not flexible when we have multiple implementations like Avro/CSV/ORC. That has introduced many additional options to decide which implementation is used for a registered short name (like csv, orc, avro).

This PR aims to extend Spark's backward-compatibility-mapping capability by allowing users to provide a custom mapping, in a general manner, as configuration. If a user hits an issue with the provided mapping, they can override it via configuration. This also reduces complexity in the code.
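
As a usage sketch (the spark.sql.datasource.map. prefix comes from this patch; the session-level set call and the path are only illustrative), a user who still wants the external Databricks Avro package could override the built-in mapping like this:

// Illustrative only: redirect the legacy name back to the external package,
// overriding the built-in backwardCompatibilityMap entry for that name.
spark.conf.set(
  "spark.sql.datasource.map.com.databricks.spark.avro",
  "com.databricks.spark.avro")

// Reads using the legacy name then resolve through the custom mapping.
val df = spark.read.format("com.databricks.spark.avro").load("/path/to/avro")  // placeholder path

Internally, any key under the prefix overrides the corresponding entry in the built-in backwardCompatibilityMap, as the reviewed snippet later in this conversation shows.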

How was this patch tested?

Passes Jenkins with a newly added test case.

@dongjoon-hyun (Member, Author)

Ping, @gengliangwang. I took a different approach from #22133; this one is more general.

@dongjoon-hyun (Member, Author) commented Aug 17, 2018

Could you review this please, @tgravescs, @cloud-fan, @gatorsmile, @HyukjinKwon?

// Collect user-provided "spark.sql.datasource.map.<name>" entries and strip the prefix.
val customBackwardCompatibilityMap = conf.getAllConfs
  .filter(_._1.startsWith("spark.sql.datasource.map"))
  .map { case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
// Custom entries override the built-in ones.
val compatibilityMap = backwardCompatibilityMap ++ customBackwardCompatibilityMap
Contributor:

So at this point we are leaving com.databricks.spark.avro -> internal Avro as the default, and users have to set it back to com.databricks.spark.avro, or do they set it to empty? If it is set to empty, I think it will return an empty value below, which will cause an issue.

We should have a test case for the empty value, and perhaps a check for it below in that case.
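
A minimal sketch of such a check (illustrative only, not taken from the patch), assuming empty mapping values should simply be ignored rather than allowed to shadow a built-in entry:

// Illustrative sketch: drop entries whose value is empty so an empty
// "spark.sql.datasource.map.*" setting cannot produce an empty target class.
val customBackwardCompatibilityMap = conf.getAllConfs
  .filter(_._1.startsWith("spark.sql.datasource.map"))
  .map { case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
  .filter { case (_, v) => v.trim.nonEmpty }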

What about documentation? Is there a JIRA for documenting all the Avro stuff? If we do leave this as the default, we definitely want a release note about the change in behavior.

Contributor:

Answering one of my own questions: I found the Avro docs here: #22121

Member Author:

If this is merged, we can remove the internal Avro mapping code and put that into the documentation before the branch cut.

So at this point we are leaving com.databricks.spark.avro -> internal Avro as the default, and users have to set it back to com.databricks.spark.avro, or do they set it to empty?

This will be better than adding another Avro-specific option like the following:

val ENABLE_AVRO_BACKWARD_COMPATIBILITY = 
  buildConf("spark.sql.avro.backwardCompatibility")

Member:

I have the same concern as @tgravescs. It seems tricky to unset the default mapping.

For example, if by default we map com.databricks.spark.avro to internal Avro, then to unset it we have to set spark.sql.datasource.map.com.databricks.spark.avro -> com.databricks.spark.avro.

Currently we only have to deal with Avro and CSV, so I think it is ok to have one single straightforward configuration like #22133 proposed.

@SparkQA commented Aug 17, 2018

Test build #94889 has finished for PR 22134 at commit e477a8e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author)

Retest this please.

@gatorsmile (Member)

Do we need this at the current stage? Regarding UX, it looks complex to end users. I am unable to remember the names, and it is very easy to provide a wrong class name.

"spark.sql.datasource.map.org.apache.spark.sql.avro" -> testDataSource.getCanonicalName,
"spark.sql.datasource.map.com.databricks.spark.csv" -> testDataSource.getCanonicalName,
"spark.sql.datasource.map.com.databricks.spark.avro" -> testDataSource.getCanonicalName) 

@dongjoon-hyun (Member, Author) commented Aug 17, 2018

@gengliangwang and @gatorsmile.

First of all, the internal 3rd-party mapping should be removed in Apache Spark 3.0; please consider that. Also, with this PR, we can remove the internal com.databricks.spark.avro mapping in Spark 2.4, too. If you want, I can remove it in this PR, but I want to keep this PR as simple as possible.

And we don't need to unset anything here. The purpose of this PR is to allow overriding (resetting) the built-in backwardCompatibilityMap. Before this PR, we don't have any control over the built-in mapping; we need to give users full control.

@gengliangwang (Member)

The purpose of this PR is to allow overriding (resetting) the built-in backwardCompatibilityMap. Before this PR, we don't have any control over the built-in mapping; we need to give users full control.

I don't think such a reset is user friendly.
Also, besides the Databricks avro/csv repos, users can just provide the full class name when specifying the file format, or the short name if it doesn't conflict with built-in ones. I don't think they need such control.

@dongjoon-hyun (Member, Author)

@gengliangwang. As of now, it's not possible to use that full class name for a Hive table saved as com.databricks.spark.avro. So we are trying to find a way, either this one (this PR) or another (maybe your PR).

Also, besides the Databricks avro/csv repos, users can just provide the full class name when specifying the file format, or the short name if it doesn't conflict with built-in ones. I don't think they need such control.

@SparkQA commented Aug 17, 2018

Test build #94902 has finished for PR 22134 at commit e477a8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Aug 17, 2018

I think it's premature to introduce this. The extra layer of abstraction actually makes it more difficult to reason about what's going on. We don't have that many data sources that require flexibility here, and we can always add the flexibility if needed later.

@dongjoon-hyun (Member, Author)

I got it. I'll close this approach.

@dongjoon-hyun deleted the SPARK-25143 branch August 17, 2018 23:14