[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM#56092
[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM#56092pan3793 wants to merge 2 commits into
Conversation
| */ | ||
| def withTableSample(sample: TableSampleInfo): JdbcSQLQueryBuilder = { | ||
| tableSampleClause = dialect.getTableSample(sample) | ||
| def withTableSampleClause(clause: String): JdbcSQLQueryBuilder = { |
There was a problem hiding this comment.
MIMA complains about it - this changes the propagation from the uncompiled TableSampleInfo to the compiled TABLESAMPLE ... clause, to avoid calling the JdbcDialect.compileTableSample again at the end.
| override def getTableSample(sample: TableSampleInfo): String = { | ||
| s"TABLESAMPLE (${(sample.upperBound - sample.lowerBound) * 100}) REPEATABLE (${sample.seed})" | ||
| override def compileTableSample(sample: TableSampleInfo): Option[String] = { | ||
| if (sample.withReplacement || sample.sampleMethod == SampleMethod.System) return None |
There was a problem hiding this comment.
according to https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-sampling, Databricks SQL does not support TABLESAMPLE SYSTEM
| * Pushes down BERNOULLI (row-level) SAMPLE to the data source. | ||
| */ | ||
| boolean pushTableSample( | ||
| @Deprecated(since = "4.2.0") |
There was a problem hiding this comment.
in fact, this method should be treated as deprecated after SPARK-55978 (4.2.0), connectors do not need to implement this method if they already implement the new pushTableSample method that has an additional sampleMethod parameter.
|
@stanyao @szehon-ho @cloud-fan @huaxingao could you please take a look? this might need to be included in 4.2 (entirely or partially) MIMA failures will be fixed after deciding on #56092 (comment) |
What changes were proposed in this pull request?
This PR contains 3 parts:
JdbcDialectAPI change - deprecatedef supportsTableSample: Booleananddef getTableSample(sample: TableSampleInfo): String, introduce adef compileTableSample(sample: TableSampleInfo): Option[String]as the replacement.1.1. this is a correctness fix - PostgreSQL does not support
withReplacement = true(actually, it seems no mainstream RDBMS or OLAP database supports TABLESAMPLE with replacement, Spark only supports it in DataFrame API), but the current implementation ignoreswithReplacementand always treats it aswithReplacement = false, which is incorrect.1.2. it's a pre-step to support
TABLESAMPLE SYSTEMas some RDBMSs may only supportTABLESAMPLE BERNOULLIbut notTABLESAMPLE SYSTEM.Mark the old
pushTableSamplemethod as deprecated and suggest the new API, connectors are suggested to only implement the newpushTableSamplethat has asampleMethodparameter.Extend the built-in JDBC connector to support pushdown
TABLESAMPLE SYSTEM, for now, this is only applicable to the PostgreSQL dialect.Why are the changes needed?
Correctness fix - TABLESAMPLE
withReplacement = trueshould not push down to PostgreSQL and Databricks JDBC connector.Make
JdbcDialectsAPI more flexible - RDBMS may partially support TABLESAMPLE, now they can answer whether the expression can be pushed down by testing the real TABLESAMPLE expression instead of blindly answering yes/no without knowing the input.Extend the built-in JDBC connector feature.
Does this PR introduce any user-facing change?
Yes, see the above section for details.
How was this patch tested?
New UT/IT cases are added.
Was this patch authored or co-authored using generative AI tooling?
Contains Content Generated-by: MiMo-V2.5-Pro