
[SPARK-25143][SQL] Support data source name mapping configuration #22134

Closed
wants to merge 1 commit

Conversation

@dongjoon-hyun (Member) commented Aug 17, 2018

What changes were proposed in this pull request?

Currently, for better UX, Apache Spark provides data source backward compatibility by using the predefined built-in backwardCompatibilityMap.

Although this mapping table is maintained carefully, it is not flexible when we have multiple implementations like Avro/CSV/ORC. That has introduced many additional options to decide which implementation is used for a registered short name (like csv, orc, avro).

This PR aims to extend Spark's backward-compatibility-mapping capability by allowing users to provide a custom mapping, in a general manner, as configuration. If a user hits an issue with the provided mapping, they can override it via configuration. This also reduces complexity in the code.
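
As a usage sketch (the spark.sql.datasource.map. prefix comes from this patch; the session-level set call and the path are only illustrative), a user who still wants the external Databricks Avro package could override the built-in mapping like this:

// Illustrative only: redirect the legacy name back to the external package,
// overriding the built-in backwardCompatibilityMap entry for that name.
spark.conf.set(
  "spark.sql.datasource.map.com.databricks.spark.avro",
  "com.databricks.spark.avro")

// Reads using the legacy name then resolve through the custom mapping.
val df = spark.read.format("com.databricks.spark.avro").load("/path/to/avro")  // placeholder path

Internally, any key under the prefix overrides the corresponding entry in the built-in backwardCompatibilityMap, as the reviewed snippet later in this conversation shows.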

How was this patch tested?

Passes Jenkins with a newly added test case.

@dongjoon-hyun (Member, Author)

Ping, @gengliangwang. I took a different approach from #22133; this one is more general.

@dongjoon-hyun (Member, Author) commented Aug 17, 2018

Could you review this please, @tgravescs, @cloud-fan, @gatorsmile, @HyukjinKwon?

// Collect user-provided "spark.sql.datasource.map.<name>" entries and strip the prefix.
val customBackwardCompatibilityMap = conf.getAllConfs
  .filter(_._1.startsWith("spark.sql.datasource.map"))
  .map { case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
// Custom entries override the built-in ones.
val compatibilityMap = backwardCompatibilityMap ++ customBackwardCompatibilityMap
Contributor:

So at this point we are leaving com.databricks.spark.avro -> internal Avro as the default, and users have to set it back to com.databricks.spark.avro, or do they set it to empty? If it is set to empty, I think it will return an empty value below, which will cause an issue.

We should have a test case for the empty value, and perhaps a check for it below in that case.
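
A minimal sketch of such a check (illustrative only, not taken from the patch), assuming empty mapping values should simply be ignored rather than allowed to shadow a built-in entry:

// Illustrative sketch: drop entries whose value is empty so an empty
// "spark.sql.datasource.map.*" setting cannot produce an empty target class.
val customBackwardCompatibilityMap = conf.getAllConfs
  .filter(_._1.startsWith("spark.sql.datasource.map"))
  .map { case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
  .filter { case (_, v) => v.trim.nonEmpty }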

What about documentation? Is there a JIRA for documenting all the Avro stuff? If we do leave this as the default, we definitely want a release note about the change in behavior.

Contributor:

Answering one of my own questions: I found the Avro docs here: #22121

Member Author:

If this is merged, we can remove the internal Avro mapping code and put that into the documentation before the branch cut.

So at this point we are leaving com.databricks.spark.avro -> internal Avro as the default, and users have to set it back to com.databricks.spark.avro, or do they set it to empty?

This will be better than adding another Avro-specific option like the following:

val ENABLE_AVRO_BACKWARD_COMPATIBILITY = 
  buildConf("spark.sql.avro.backwardCompatibility")

Member:

I have the same concern as @tgravescs. It seems tricky to unset the default mapping.

For example, if by default we map com.databricks.spark.avro to internal Avro, then to unset it we have to set spark.sql.datasource.map.com.databricks.spark.avro -> com.databricks.spark.avro.

Currently we only have to deal with Avro and CSV, so I think it is ok to have one single straightforward configuration like #22133 proposed.

@SparkQA commented Aug 17, 2018

Test build #94889 has finished for PR 22134 at commit e477a8e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author)

Retest this please.

@gatorsmile (Member)

Do we need this at the current stage? Regarding UX, it looks complex to end users. I am unable to remember the names, and it is very easy to provide a wrong class name.

"spark.sql.datasource.map.org.apache.spark.sql.avro" -> testDataSource.getCanonicalName,
"spark.sql.datasource.map.com.databricks.spark.csv" -> testDataSource.getCanonicalName,
"spark.sql.datasource.map.com.databricks.spark.avro" -> testDataSource.getCanonicalName) 

@dongjoon-hyun (Member, Author) commented Aug 17, 2018

@gengliangwang and @gatorsmile.

First of all, the internal 3rd-party mapping should be removed in Apache Spark 3.0; please consider that. Also, with this PR, we can remove the internal com.databricks.spark.avro mapping in Spark 2.4, too. If you want, I can remove it in this PR, but I want to keep this PR as simple as possible.

And we don't need to unset anything here. The purpose of this PR is to allow overriding (resetting) the built-in backwardCompatibilityMap. Before this PR, we don't have any control over the built-in mapping; we need to give users full control.

@gengliangwang (Member)

The purpose of this PR is to allow overriding (resetting) the built-in backwardCompatibilityMap. Before this PR, we don't have any control over the built-in mapping; we need to give users full control.

I don't think such a reset is user friendly.
Also, besides the Databricks avro/csv repos, users can just provide the full class name when specifying the file format, or the short name if it doesn't conflict with built-in ones. I don't think they need such control.

@dongjoon-hyun (Member, Author)

@gengliangwang. As of now, it's not possible to use that full class name for a Hive table saved as com.databricks.spark.avro. So we are trying to find a way, either this one (this PR) or another (maybe your PR).

Also, besides the Databricks avro/csv repos, users can just provide the full class name when specifying the file format, or the short name if it doesn't conflict with built-in ones. I don't think they need such control.

@SparkQA commented Aug 17, 2018

Test build #94902 has finished for PR 22134 at commit e477a8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Aug 17, 2018

I think it's premature to introduce this. The extra layer of abstraction actually makes it more difficult to reason about what's going on. We don't have that many data sources that require flexibility here, and we can always add the flexibility if needed later.

@dongjoon-hyun (Member, Author)

I got it. I'll close this approach.

@dongjoon-hyun deleted the SPARK-25143 branch August 17, 2018 23:14