Skip to content

[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM#56092

Open
pan3793 wants to merge 2 commits into
apache:masterfrom
pan3793:SPARK-57040
Open

[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM#56092
pan3793 wants to merge 2 commits into
apache:masterfrom
pan3793:SPARK-57040

Conversation

@pan3793
Copy link
Copy Markdown
Member

@pan3793 pan3793 commented May 24, 2026

What changes were proposed in this pull request?

This PR contains 3 parts:

  1. JdbcDialect API change - deprecate def supportsTableSample: Boolean and def getTableSample(sample: TableSampleInfo): String, introduce a def compileTableSample(sample: TableSampleInfo): Option[String] as the replacement.
    1.1. this is a correctness fix - PostgreSQL does not support withReplacement = true (actually, it seems no mainstream RDBMS or OLAP database supports TABLESAMPLE with replacement, Spark only supports it in DataFrame API), but the current implementation ignores withReplacement and always treats it as withReplacement = false, which is incorrect.
    1.2. it's a pre-step to support TABLESAMPLE SYSTEM as some RDBMSs may only support TABLESAMPLE BERNOULLI but not TABLESAMPLE SYSTEM.

  2. Mark the old pushTableSample method as deprecated and suggest the new API, connectors are suggested to only implement the new pushTableSample that has a sampleMethod parameter.

  3. Extend the built-in JDBC connector to support pushdown TABLESAMPLE SYSTEM, for now, this is only applicable to the PostgreSQL dialect.

Why are the changes needed?

Correctness fix - TABLESAMPLE withReplacement = true should not push down to PostgreSQL and Databricks JDBC connector.

Make JdbcDialects API more flexible - RDBMS may partially support TABLESAMPLE, now they can answer whether the expression can be pushed down by testing the real TABLESAMPLE expression instead of blindly answering yes/no without knowing the input.

Extend the built-in JDBC connector feature.

Does this PR introduce any user-facing change?

Yes, see the above section for details.

How was this patch tested?

New UT/IT cases are added.

Was this patch authored or co-authored using generative AI tooling?

Contains Content Generated-by: MiMo-V2.5-Pro

*/
def withTableSample(sample: TableSampleInfo): JdbcSQLQueryBuilder = {
tableSampleClause = dialect.getTableSample(sample)
def withTableSampleClause(clause: String): JdbcSQLQueryBuilder = {
Copy link
Copy Markdown
Member Author

@pan3793 pan3793 May 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MIMA complains about it - this changes the propagation from the uncompiled TableSampleInfo to the compiled TABLESAMPLE ... clause, to avoid calling the JdbcDialect.compileTableSample again at the end.

override def getTableSample(sample: TableSampleInfo): String = {
s"TABLESAMPLE (${(sample.upperBound - sample.lowerBound) * 100}) REPEATABLE (${sample.seed})"
override def compileTableSample(sample: TableSampleInfo): Option[String] = {
if (sample.withReplacement || sample.sampleMethod == SampleMethod.System) return None
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-sampling, Databricks SQL does not support TABLESAMPLE SYSTEM

* Pushes down BERNOULLI (row-level) SAMPLE to the data source.
*/
boolean pushTableSample(
@Deprecated(since = "4.2.0")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact, this method should be treated as deprecated after SPARK-55978 (4.2.0), connectors do not need to implement this method if they already implement the new pushTableSample method that has an additional sampleMethod parameter.

@pan3793
Copy link
Copy Markdown
Member Author

pan3793 commented May 24, 2026

@stanyao @szehon-ho @cloud-fan @huaxingao could you please take a look? this might need to be included in 4.2 (entirely or partially)

MIMA failures will be fixed after deciding on #56092 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant