Skip to content

[SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition#37290

Closed
ulysses-you wants to merge 3 commits intoapache:masterfrom
ulysses-you:v1write
Closed

[SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition#37290
ulysses-you wants to merge 3 commits intoapache:masterfrom
ulysses-you:v1write

Conversation

@ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Jul 26, 2022

What changes were proposed in this pull request?

This is a rework for #34468, since we pull out v1write required ordering.

This prs add a new parameter numStaticPartitions to v1write and FileFormatWriter so we can skip unnecessary local sort for static partition write.

Why are the changes needed?

The v1 write requires ordering for dynamic partition, bucket expression and sort column during writing. The reason is the DynamicPartitionDataSingleWriter and DynamicPartitionDataConcurrentWriter assume the partition and bucket columns are continuous. Then if partition column is static, it's unnecessary to do the local sort.

For v1 write, InsertIntoHadoopFsRelationCommand is the only case which adds a local sort even if the partition column is static.

Does this PR introduce any user-facing change?

no, only improve performance

How was this patch tested?

add test

@github-actions github-actions bot added the SQL label Jul 26, 2022
@ulysses-you
Copy link
Contributor Author

cc @viirya @cloud-fan @c21

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM from my side with minor comments. Thanks @ulysses-you

options: Map[String, String],
numStaticPartitions: Int = 0)
: Set[String] = {
assert(partitionColumns.size >= numStaticPartitions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would require() be better?

options: Map[String, String]): Seq[SortOrder] = {
options: Map[String, String],
numStaticPartitions: Int = 0): Seq[SortOrder] = {
assert(partitionColumns.size >= numStaticPartitions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

*/
private[sql] var outputOrderingMatched: Boolean = false

// scalastyle:off argcount
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can pass in a wrapper class PartitionSpec(partitionColumns: Seq[Attribute], numStaticPartitions: Int) to avoid this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used a new parameter with default value is for compatible with downstream project as far as possible, does it make sense to you ?

|""".stripMargin)
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be good to have one more unit test for no static columns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous contains similar test, but I'd like to add it

statsTrackers: Seq[WriteJobStatsTracker],
options: Map[String, String])
options: Map[String, String],
numStaticPartitions: Int = 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numStaticPartitionCols?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

For v1 write, InsertIntoHadoopFsRelationCommand is the only case which adds a local sort even if the partition column is static.

Do you mean that InsertIntoHadoopFsRelationCommand is the only usecase for now for numStaticPartitions?

@ulysses-you
Copy link
Contributor Author

@viirya yes, the default of numStaticPartitionCols is 0 so it won't affect other code place. For now, only InsertIntoHadoopFsRelationCommand specifies it.

The reason is:

  • For hive write, it has already pass the dynamic partition cols for FileFormatWriter, so it's numStaticPartitionCols should always be 0
  • For ctas, we do not support specify static partition

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e187bb3 Jul 29, 2022
@ulysses-you ulysses-you deleted the v1write branch July 31, 2022 09:03
@ulysses-you
Copy link
Contributor Author

thank you all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants