[SPARK-39607][SQL][DSV2] Distribution and ordering support V2 function in writing #36995
```diff
@@ -17,22 +17,33 @@
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.catalyst.analysis.{AnsiTypeCoercion, TypeCoercion}
+import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, SortOrder, TransformExpression, V2ExpressionUtils}
 import org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils._
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, RebalancePartitions, RepartitionByExpression, Sort}
+import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.connector.catalog.FunctionCatalog
+import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
 import org.apache.spark.sql.connector.distributions._
 import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}
 import org.apache.spark.sql.errors.QueryCompilationErrors
 
 object DistributionAndOrderingUtils {
 
-  def prepareQuery(write: Write, query: LogicalPlan): LogicalPlan = write match {
+  def prepareQuery(
+      write: Write,
+      query: LogicalPlan,
+      funCatalogOpt: Option[FunctionCatalog]): LogicalPlan = write match {
     case write: RequiresDistributionAndOrdering =>
       val numPartitions = write.requiredNumPartitions()
 
       val distribution = write.requiredDistribution match {
-        case d: OrderedDistribution => toCatalystOrdering(d.ordering(), query)
-        case d: ClusteredDistribution => d.clustering.map(e => toCatalyst(e, query)).toSeq
+        case d: OrderedDistribution =>
+          toCatalystOrdering(d.ordering(), query, funCatalogOpt)
+            .map(e => resolveTransformExpression(e).asInstanceOf[SortOrder])
+        case d: ClusteredDistribution =>
+          d.clustering.map(e => toCatalyst(e, query, funCatalogOpt))
+            .map(e => resolveTransformExpression(e)).toSeq
         case _: UnspecifiedDistribution => Seq.empty[Expression]
       }
@@ -53,16 +64,33 @@ object DistributionAndOrderingUtils {
         query
       }
 
-      val ordering = toCatalystOrdering(write.requiredOrdering, query)
+      val ordering = toCatalystOrdering(write.requiredOrdering, query, funCatalogOpt)
       val queryWithDistributionAndOrdering = if (ordering.nonEmpty) {
-        Sort(ordering, global = false, queryWithDistribution)
+        Sort(
+          ordering.map(e => resolveTransformExpression(e).asInstanceOf[SortOrder]),
+          global = false,
+          queryWithDistribution)
       } else {
         queryWithDistribution
       }
 
-      queryWithDistributionAndOrdering
+      // Apply typeCoercionRules since the converted expression from TransformExpression
+      // implements ImplicitCastInputTypes
+      typeCoercionRules.foldLeft(queryWithDistributionAndOrdering)((plan, rule) => rule(plan))
     case _ =>
       query
   }
 
+  private def resolveTransformExpression(expr: Expression): Expression = expr.transform {
+    case TransformExpression(scalarFunc: ScalarFunction[_], arguments, Some(numBuckets)) =>
+      V2ExpressionUtils.resolveScalarFunction(scalarFunc, Seq(Literal(numBuckets)) ++ arguments)
+    case TransformExpression(scalarFunc: ScalarFunction[_], arguments, None) =>
+      V2ExpressionUtils.resolveScalarFunction(scalarFunc, arguments)
+  }
+
+  private def typeCoercionRules: List[Rule[LogicalPlan]] = if (conf.ansiEnabled) {
+    AnsiTypeCoercion.typeCoercionRules
+  } else {
+    TypeCoercion.typeCoercionRules
+  }
 }
```

Inline comment on the `OrderedDistribution` case:

Comment: What was the behavior before this PR? Do we fail to translate the v2 function, or do we fail at runtime complaining that some expression can't be evaluated?

Reply: The former; the query is going to fail w/ …

Inline comments on `resolveTransformExpression`:

Comment: I can't recall the details now. When do we need to translate a v2 transform to …

Reply: SPARK-37377 (#35657) changes the common code used by V2 reading and writing, converting the …
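The new `resolveTransformExpression` helper rewrites a `TransformExpression` into a direct scalar-function invocation, prepending the bucket count as a literal first argument when one is present. A rough Python model of that rewrite (the `Transform`, `resolve`, and `bucket` names here are stand-ins for illustration, not Spark APIs):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Transform:
    # Stand-in for TransformExpression: a bound scalar function, its
    # arguments, and an optional bucket count (Some(n) for bucket transforms).
    func: Callable
    args: List
    num_buckets: Optional[int] = None

def resolve(t: Transform):
    # Mirrors the two cases of resolveTransformExpression: when a bucket
    # count is present, it is prepended as a literal argument before the
    # scalar function is invoked directly.
    if t.num_buckets is not None:
        return t.func(t.num_buckets, *t.args)
    return t.func(*t.args)

def bucket(num_buckets: int, value: int) -> int:
    # Toy bucket function: hash, then reduce modulo the bucket count.
    return hash(value) % num_buckets

bucket_id = resolve(Transform(bucket, [42], num_buckets=16))
print(bucket_id)
```

The point of the rewrite is that, after it runs, the plan contains an ordinary evaluable function call rather than an unresolvable `TransformExpression` placeholder.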
Review thread:

Comment: Hmm, I wonder how the write works with transforms such as bucket. For example, suppose the required distribution is `bucket(col, 100)`. Spark currently computes the partition (bucket) ID by `murmur_hash(bucket(col, 100)) pmod 100`, so the value of `col` is essentially hashed twice. I'm not sure whether this breaks any assumption on the V2 data source side, or whether it has any effect on the hash key distribution.

Reply: That only happens on the V1 write path. In V2, the bucket is resolved as a `BucketTransform` in the analysis phase and converted to an evaluable Catalyst expression (`ApplyFunctionExpression`/`Invoke`/`StaticInvoke`) here, so I don't think your concern applies.

Reply: Hmm, yes, it will be converted to `ApplyFunctionExpression` etc. But on the V2 write path these partition transforms are used in `RepartitionByExpression`, and then converted into `ShuffleExchangeExec` during physical planning, right? Checking the code path: `ApplyFunctionExpression`/`Invoke`/`StaticInvoke` are first passed to `RepartitionByExpression` as the field `partitionExpressions`, then to `HashPartitioning`, `RangePartitioning`, etc. in `HasPartitionExpressions`, and eventually used in places like `HashPartitioning.partitionIdExpression`, where the partition-ID computation I mentioned above happens.
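The double-hashing effect under discussion can be sketched concretely (a Python toy with stand-in hash functions, not Spark's actual murmur3 or bucket transform): the shuffle computes roughly `pmod(hash(bucket(col, N)), numPartitions)`, so `col` goes through two hashes, yet any two rows with equal bucket IDs still receive equal partition IDs.

```python
def bucket(value: int, num_buckets: int) -> int:
    # First hash: the data source's bucket transform (stand-in algorithm).
    return (value * 2654435761 % 2**32) % num_buckets

def partition_id(value: int, num_buckets: int, num_partitions: int) -> int:
    # Second hash: HashPartitioning hashes the bucket ID again and takes
    # it modulo the shuffle parallelism (stand-in for murmur3 pmod).
    return hash(bucket(value, num_buckets)) % num_partitions

# Double hashing never splits a bucket across shuffle partitions:
# equal bucket IDs imply equal partition IDs, so co-location is preserved.
for a in range(200):
    for b in range(200):
        if bucket(a, 100) == bucket(b, 100):
            assert partition_id(a, 100, 10) == partition_id(b, 100, 10)
print("same bucket => same partition holds")
```

This is why the extra hash can change which task a row lands in compared to hashing `col` once, but cannot scatter one bucket's rows across multiple tasks.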
Reply: Thanks for explaining, educated. After reading the code, I think you are right that "the value of `col` is essentially hashed twice", but I don't think it will cause correctness issues, because it still guarantees that the same values are clustered into the same partition. One example is Hive bucketing: in `V1WritesUtils#getWriterBucketSpec`, both `HiveHash` and `HashPartitioning#partitionIdExpression` can be used to construct the `bucketIdExpression`.

Reply: I'm just thinking about whether we should enforce stronger semantics on bucketing. Suppose we build bucketed-table support for the DSv2 file source on top of this mechanism: does that mean a particular row could be hashed to a different bucket ID than in V1? What if someone first writes a bucketed table using V2, then reads it back in V1, and perhaps does a bucket join with another V1 bucketed table? I wonder if that could cause incorrect results.

Reply: Given that different storage systems usually define their own hash functions and sharding (bucketing) rules, I don't think we can make the bucketed tables of all data sources fully compatible with each other; e.g. Spark, Hive, and Iceberg use entirely different hash algorithms for bucketing. We can make DSv1 file bucketed tables compatible with V2 file bucketed tables, and Hive bucketed tables compatible with V2 Hive bucketed tables, but DS file bucketed tables cannot be compatible with Hive (neither V1 nor V2), since they use different hash algorithms. In detail, since `V2SessionCatalog` extends `FunctionCatalog`, to make V2 file bucketed tables compatible with V1 we can introduce and register a `SparkBucket` V2 function in `V2SessionCatalog` that uses the same hash algorithm as V1. For Hive tables, just register a similar `HiveBucket` V2 function in `HiveCatalog` that uses the Hive hash algorithm.

Reply: I think @sunchao brings up a valid point that is easy to overlook. We have to make sure Spark writes to Hive tables the same way no matter whether the v1 or v2 path is used. Would it be correct to say we have this issue because `partitionIdExpression` in `HashPartitioning` is used both for generating bucket IDs in Hive tables and for producing partition IDs for write tasks? Can we use different mechanisms?

Reply: After SPARK-32709 (#33432), different hash expressions can be used for generating partition IDs and bucket IDs.
Reply: @sunchao, coming back to the use case you mentioned above. I think the bucket ID will always be the same as long as the task writer respects the table's bucket spec and Spark shuffles all records that are supposed to land in one bucket to one task. Like @pan3793 said, I suppose the V2 file source will request a distribution using a V2 function that wraps the internal Spark hash function. That should guarantee that all records for the same bucket land in one task. As long as the task writer uses the correct bucket expression (based on the table definition), we should be good, right? You are right that the partition/task ID during writes may differ between V1 and V2 for a particular row because of double hashing, but does that actually matter?
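The separation described in the last two comments, where the shuffle may route rows with one hash function while the task writer assigns bucket files with another, can be sketched as follows (stand-in hash functions, not Spark's or Hive's real algorithms):

```python
def spark_hash(x: int) -> int:
    # Stand-in for the hash used by HashPartitioning to route shuffle rows.
    return (x ^ (x >> 3)) * 0x9E3779B1 % 2**32

def hive_hash(x: int) -> int:
    # Stand-in for the hash the table's bucket spec mandates (e.g. HiveHash).
    return x * 31 % 2**32

NUM_BUCKETS, NUM_PARTITIONS = 8, 4

def shuffle_partition(value: int) -> int:
    # Shuffle routing: hash of the *bucket ID*, modulo shuffle parallelism.
    return spark_hash(hive_hash(value) % NUM_BUCKETS) % NUM_PARTITIONS

def writer_bucket(value: int) -> int:
    # Bucket ID written by the task: the table's own bucket expression.
    return hive_hash(value) % NUM_BUCKETS

# Correctness only needs "same bucket value => same task": every row of a
# given bucket reaches one task, and that task assigns the bucket from the
# table spec, so readers see consistent bucket files.
for v in range(100):
    same_bucket = [w for w in range(100) if writer_bucket(w) == writer_bucket(v)]
    assert all(shuffle_partition(w) == shuffle_partition(v) for w in same_bucket)
print("bucket co-location preserved")
```

Which shuffle hash is used can therefore change task assignment between V1 and V2, but not the bucket contents, which is the property bucket joins rely on.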
Reply: Sorry for the late reply here. Yes, I think it should be OK as long as the task writer uses the bucket expression to decide which bucket each input record goes to. There will be some extra work to achieve V1/V2 compatibility, but that's not really relevant to this PR now.