
[SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases#35549

Closed
coalchan wants to merge 1 commit into apache:master from coalchan:parts

Conversation

@coalchan

What changes were proposed in this pull request?

Add a Spark conf so that we fetch only partition names instead of full partition details. This reduces the number of requests made to the Hive metastore.
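
For concreteness, a minimal sketch of how such a flag could be declared, assuming it is registered in org.apache.spark.sql.internal.SQLConf like other entries; the key name is this PR's proposal, and the default and doc string are assumptions:

    // Sketch only: the key is the one proposed by this PR; defaulting to true
    // preserves the current behavior of fetching full partition details.
    val HAS_CUSTOM_PARTITION_LOCATIONS =
      buildConf("spark.sql.hasCustomPartitionLocations")
        .doc("When false, assume no partition of the target table has a " +
          "custom location, so InsertIntoHadoopFsRelationCommand can fetch " +
          "partition names instead of full partition metadata.")
        .booleanConf
        .createWithDefault(true)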

Why are the changes needed?

  1. The method listPartitions is used to get the locations of partitions and to compute custom partition locations (the variable customPartitionLocations), but in most cases there are no custom partition locations.
  2. The method listPartitionNames fetches only partition names, so it reduces requests to the Hive metastore database (see the sketch below).
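
For contrast, a sketch of the two calls as they appear from inside the command, where sparkSession, catalogTable, and staticPartitions are the values InsertIntoHadoopFsRelationCommand already has in scope (both methods exist on SessionCatalog):

    // Full metadata: one CatalogTablePartition per partition, including the
    // storage descriptor (location, serde, ...) needed for custom locations.
    val parts = sparkSession.sessionState.catalog.listPartitions(
      catalogTable.get.identifier, Some(staticPartitions))
    val specs = parts.map(_.spec)

    // Names only: plain strings such as "dt=2022-02-17/country=US"; a much
    // lighter call against the metastore.
    val names = sparkSession.sessionState.catalog.listPartitionNames(
      catalogTable.get.identifier, Some(staticPartitions))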

Does this PR introduce any user-facing change?

Yes, users whose tables have no custom partition locations should set spark.sql.hasCustomPartitionLocations = false (see the usage sketch below).
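
A hypothetical usage sketch (the conf key is this PR's proposal, not an existing Spark conf; the table and partition names are illustrative):

    // Opt in to the cheaper metastore call when no partition has a custom location.
    spark.conf.set("spark.sql.hasCustomPartitionLocations", "false")
    spark.sql("INSERT INTO TABLE db.sales PARTITION (dt = '2022-02-17') SELECT * FROM staging")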

How was this patch tested?

  1. Recompiled InsertIntoHadoopFsRelationCommand.scala.
  2. Updated spark-sql_2.12-3.0.2.jar with the recompiled class.
  3. Ran INSERT INTO test cases (of the shape sketched below).
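
The inserts exercised were of the shape below (taken from the benchmark statement later in this thread; xxx, x1, and x2 are placeholder names):

    // Dynamic-partition insert against a heavily partitioned table.
    spark.sql("INSERT INTO TABLE xxx PARTITION (x1, x2) SELECT 1, 2")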

github-actions bot added the SQL label Feb 17, 2022
@AmplabJenkins

Can one of the admins verify this patch?


// When partitions are tracked by the catalog, compute all custom partition locations that
// may be relevant to the insertion job.
if (partitionsTrackedByCatalog) {
Contributor

Based on the code in HiveClientImpl.getTableOption and CatalogTable, it seems that we won't hit this code path when using the Hive metastore.

Author

Thanks for your response. tracksPartitionsInCatalog is set to true in HiveExternalCatalog.restoreHiveSerdeTable.

@jackylee-ch
Contributor

Could you fix the GitHub Actions failures?
Besides, is there any proof showing that this step takes too much time?

@coalchan
Author

coalchan commented Mar 9, 2022

@stczwd
Running insert into table xxx partition (x1, x2) select 1,2 in spark-sql:

Before optimization:
[screenshot: before]

After optimization:
[screenshot: after]

In this case, the table has 20k+ partitions.

@jackylee-ch
Contributor

jackylee-ch commented Mar 9, 2022

Hm, I see your point. After looking at the relevant logic, customPartitionLocations will only be used while overwriting a Hive static partition.
Thus, when partitionsTrackedByCatalog is true, we can use listPartitions if staticPartitions.size == partitionColumns.length and use listPartitionNames otherwise:

if (partitionsTrackedByCatalog) {
  if (staticPartitions.size == partitionColumns.length) {
    matchingPartitions = sparkSession.sessionState.catalog.listPartitions(
      catalogTable.get.identifier, Some(staticPartitions))
    initialMatchingPartitions = matchingPartitions.map(_.spec)
    customPartitionLocations = getCustomPartitionLocations(
      fs, catalogTable.get, qualifiedOutputPath, matchingPartitions)
  } else {
    // call listPartitionNames and derive initialMatchingPartitions from it
  }
}
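
A sketch of what the elided else branch could look like; listPartitionNames is real SessionCatalog API, while using PartitioningUtils.parsePathFragment (from org.apache.spark.sql.execution.datasources) to turn a name such as "x1=1/x2=2" back into a spec is an assumption, not part of the comment above:

    // Names only: avoids fetching each partition's storage descriptor.
    val partitionNames = sparkSession.sessionState.catalog.listPartitionNames(
      catalogTable.get.identifier, Some(staticPartitions))
    // parsePathFragment maps "k1=v1/k2=v2" to a TablePartitionSpec.
    initialMatchingPartitions = partitionNames.map(PartitioningUtils.parsePathFragment)
    // No listPartitions call, so customPartitionLocations stays empty, which
    // is fine here: it only matters for fully static overwrites.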

cc @cloud-fan @LuciferYang

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
