
[SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases#35549

Closed
coalchan wants to merge 1 commit into apache:master from coalchan:parts

Conversation

@coalchan

What changes were proposed in this pull request?

Add a Spark conf so that we fetch only partition names instead of full partition details. This reduces the number of requests made to the Hive metastore.
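
For concreteness, a minimal sketch of how such a flag could be declared, assuming it is registered in org.apache.spark.sql.internal.SQLConf like other entries; the key name is this PR's proposal, and the default and doc string are assumptions:

    // Sketch only: the key is the one proposed by this PR; defaulting to true
    // preserves the current behavior of fetching full partition details.
    val HAS_CUSTOM_PARTITION_LOCATIONS =
      buildConf("spark.sql.hasCustomPartitionLocations")
        .doc("When false, assume no partition of the target table has a " +
          "custom location, so InsertIntoHadoopFsRelationCommand can fetch " +
          "partition names instead of full partition metadata.")
        .booleanConf
        .createWithDefault(true)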

Why are the changes needed?

  1. The method listPartitions is used to get the locations of partitions and to compute custom partition locations (the variable customPartitionLocations), but in most cases there are no custom partition locations.
  2. The method listPartitionNames fetches only partition names, so it reduces requests to the Hive metastore database (see the sketch below).
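
For contrast, a sketch of the two calls as they appear from inside the command, where sparkSession, catalogTable, and staticPartitions are the values InsertIntoHadoopFsRelationCommand already has in scope (both methods exist on SessionCatalog):

    // Full metadata: one CatalogTablePartition per partition, including the
    // storage descriptor (location, serde, ...) needed for custom locations.
    val parts = sparkSession.sessionState.catalog.listPartitions(
      catalogTable.get.identifier, Some(staticPartitions))
    val specs = parts.map(_.spec)

    // Names only: plain strings such as "dt=2022-02-17/country=US"; a much
    // lighter call against the metastore.
    val names = sparkSession.sessionState.catalog.listPartitionNames(
      catalogTable.get.identifier, Some(staticPartitions))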

Does this PR introduce any user-facing change?

Yes, users whose tables have no custom partition locations should set spark.sql.hasCustomPartitionLocations = false (see the usage sketch below).
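
A hypothetical usage sketch (the conf key is this PR's proposal, not an existing Spark conf; the table and partition names are illustrative):

    // Opt in to the cheaper metastore call when no partition has a custom location.
    spark.conf.set("spark.sql.hasCustomPartitionLocations", "false")
    spark.sql("INSERT INTO TABLE db.sales PARTITION (dt = '2022-02-17') SELECT * FROM staging")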

How was this patch tested?

  1. Recompiled InsertIntoHadoopFsRelationCommand.scala.
  2. Updated spark-sql_2.12-3.0.2.jar with the recompiled class.
  3. Ran INSERT INTO test cases (of the shape sketched below).
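
The inserts exercised were of the shape below (taken from the benchmark statement later in this thread; xxx, x1, and x2 are placeholder names):

    // Dynamic-partition insert against a heavily partitioned table.
    spark.sql("INSERT INTO TABLE xxx PARTITION (x1, x2) SELECT 1, 2")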

github-actions bot added the SQL label Feb 17, 2022
@AmplabJenkins

Can one of the admins verify this patch?


// When partitions are tracked by the catalog, compute all custom partition locations that
// may be relevant to the insertion job.
if (partitionsTrackedByCatalog) {
Contributor

Based on the code in HiveClientImpl.getTableOption and CatalogTable, it seems that we won't hit this code path when using the Hive metastore.

Author

Thanks for your response. tracksPartitionsInCatalog is set to true in HiveExternalCatalog.restoreHiveSerdeTable.

@jackylee-ch
Contributor

Could you fix the GitHub Actions failures?
Besides, is there any proof showing that this step takes too much time?

@coalchan
Author

coalchan commented Mar 9, 2022

@stczwd
Running insert into table xxx partition (x1, x2) select 1,2 in spark-sql:

Before optimization:
[screenshot: before]

After optimization:
[screenshot: after]

In this case, the table has 20k+ partitions.

@jackylee-ch
Contributor

jackylee-ch commented Mar 9, 2022

Hm, I see your point. After looking at the relevant logic, customPartitionLocations will only be used while overwriting a Hive static partition.
Thus, when partitionsTrackedByCatalog is true, we can use listPartitions if staticPartitions.size == partitionColumns.length and use listPartitionNames otherwise:

if (partitionsTrackedByCatalog) {
  if (staticPartitions.size == partitionColumns.length) {
    matchingPartitions = sparkSession.sessionState.catalog.listPartitions(
      catalogTable.get.identifier, Some(staticPartitions))
    initialMatchingPartitions = matchingPartitions.map(_.spec)
    customPartitionLocations = getCustomPartitionLocations(
      fs, catalogTable.get, qualifiedOutputPath, matchingPartitions)
  } else {
    // call listPartitionNames and derive initialMatchingPartitions from it
  }
}
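
A sketch of what the elided else branch could look like; listPartitionNames is real SessionCatalog API, while using PartitioningUtils.parsePathFragment (from org.apache.spark.sql.execution.datasources) to turn a name such as "x1=1/x2=2" back into a spec is an assumption, not part of the comment above:

    // Names only: avoids fetching each partition's storage descriptor.
    val partitionNames = sparkSession.sessionState.catalog.listPartitionNames(
      catalogTable.get.identifier, Some(staticPartitions))
    // parsePathFragment maps "k1=v1/k2=v2" to a TablePartitionSpec.
    initialMatchingPartitions = partitionNames.map(PartitioningUtils.parsePathFragment)
    // No listPartitions call, so customPartitionLocations stays empty, which
    // is fine here: it only matters for fully static overwrites.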

cc @cloud-fan @LuciferYang

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
