
[SPARK-10195] [SQL] Data sources Filter should not expose internal types #8403

Conversation

JoshRosen
Contributor

Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third parties.

This issue caused incompatibilities when upgrading our spark-redshift library to work against Spark 1.5.0. To avoid these issues in the future, we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions.
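
To make the problem concrete, here is a minimal sketch, assuming a hypothetical helper `buildWhereClause` (not part of this patch), of how a third-party source typically consumes pushed-down filters and why a leaked `UTF8String` breaks it:

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Hypothetical third-party source translating pushed-down filters into a
// remote WHERE clause, similar in spirit to what spark-redshift does.
def buildWhereClause(filters: Array[Filter]): String =
  filters.collect {
    // Before this patch, `value` could arrive as a Catalyst-internal
    // UTF8String rather than a java.lang.String, so this case would not
    // match and the filter was silently dropped or mishandled.
    case EqualTo(attribute, value: String) => s"$attribute = '$value'"
  }.mkString(" AND ")
```

After this patch, Spark converts filter values to public Scala types before constructing the Filter objects (roughly `CatalystTypeConverters.convertToScala(value, dataType)`), so `value` above is a plain String.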

@SparkQA

SparkQA commented Aug 24, 2015

Test build #41482 has finished for PR 8403 at commit 6af0a45.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 24, 2015

Test build #41486 has finished for PR 8403 at commit 1a3d053.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 25, 2015

Test build #41490 has finished for PR 8403 at commit c3fb4eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Like buildScan in the data sources API, we should not expose internal types outside Spark SQL, but we also need to provide the ability to build efficient data sources that use internal types directly (no conversions), for our built-in data sources or advanced users.

There may be some cases where users need the internal types in Filter to avoid conversions and speed up operations; I think we need to improve our data source API to make this more flexible.
cc @liancheng
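
A rough sketch of the trade-off discussed above (the helper functions are hypothetical, and `UTF8String` is a Catalyst-internal type, not a stable API):

```scala
import org.apache.spark.unsafe.types.UTF8String

// Public-type path: each comparison materializes a java.lang.String
// from the source's raw UTF-8 bytes before comparing.
def matchesPublic(raw: Array[Byte], needle: String): Boolean =
  UTF8String.fromBytes(raw).toString == needle

// Internal-type path: compare UTF8String values directly, skipping the
// String materialization; faster, but it couples the data source to
// Catalyst's internal, unstable types.
def matchesInternal(raw: Array[Byte], needle: UTF8String): Boolean =
  UTF8String.fromBytes(raw).equals(needle)
```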

@rxin
Contributor

rxin commented Aug 25, 2015

This happens once per query, doesn't it? It'd make sense to specialize the input, but I don't think it's worth it for filter pushdowns.

@liancheng
Contributor

This PR LGTM.

@cloud-fan Same opinion as @rxin. Filter push-down itself isn't on a critical path.

@rxin
Contributor

rxin commented Aug 25, 2015

I've merged this.

asfgit pushed a commit that referenced this pull request Aug 25, 2015

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8403 from JoshRosen/datasources-internal-vs-external-types.

(cherry picked from commit 7bc9a8c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 7bc9a8c Aug 25, 2015
@JoshRosen JoshRosen deleted the datasources-internal-vs-external-types branch January 15, 2016 02:34