[SPARK-31877][SQL]Avoid stats computation for Hive table #28686

Closed
wants to merge 9 commits into from

karuppayya
Contributor

@karuppayya karuppayya commented May 31, 2020

What changes were proposed in this pull request?

As part of the DetermineTableStats rule, we compute the stats for a HiveTableRelation, which can be an expensive operation, and it can happen multiple times for a single query (SPARK-31850).

In most cases (the flag for converting Parquet and ORC tables to datasource tables is enabled by default in the master branch), the RelationConversions rule converts the HiveTableRelation to a LogicalRelation.
When the conversion happens, the stats computed for the HiveTableRelation are not used.

In this change, the stats computation is avoided by performing the conversion before computing stats.
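
The effect of the rule order can be illustrated with a toy model. This is only a sketch: `HiveRel`, `LogicalRel`, and the two rule functions are hypothetical stand-ins named after the classes in this PR, not Spark's real implementations, and the size `1024L` is made up.

```scala
object RuleOrderSketch {
  // Hypothetical stand-ins for the plan nodes discussed in this PR; NOT Spark's classes.
  sealed trait Plan
  case class HiveRel(format: String, sizeInBytes: Option[Long] = None) extends Plan
  case class LogicalRel(format: String) extends Plan

  // Counts how often the "expensive" stats rule actually does work.
  var statsCalls = 0

  // Stand-in for DetermineTableStats: fills in a size for Hive relations that lack one.
  val determineTableStats: Plan => Plan = {
    case h @ HiveRel(_, None) =>
      statsCalls += 1
      h.copy(sizeInBytes = Some(1024L)) // pretend this listed files on HDFS
    case other => other
  }

  // Stand-in for RelationConversions: Parquet/ORC Hive relations become LogicalRel,
  // and in this toy model the previously computed size is simply dropped.
  val relationConversions: Plan => Plan = {
    case HiveRel(fmt, _) if fmt == "parquet" || fmt == "orc" => LogicalRel(fmt)
    case other => other
  }

  // Applies the rules left to right; returns the final plan and the number of stats computations.
  def run(rules: Seq[Plan => Plan], plan: Plan): (Plan, Int) = {
    statsCalls = 0
    val result = rules.foldLeft(plan)((p, rule) => rule(p))
    (result, statsCalls)
  }
}
```

For `HiveRel("parquet")`, running the stats rule first performs one stats computation whose result is discarded by the conversion, while running the conversion first performs none. A relation in another format (for example `HiveRel("text")`) still gets its stats computed in either order.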

Why are the changes needed?

With this change, stats for Hive tables are not computed unnecessarily.

Does this PR introduce any user-facing change?

No

How was this patch tested?

It was tested on a local machine and the behaviour was verified.

@karuppayya
Contributor Author

@viirya @dongjoon-hyun @maropu @gatorsmile Could any of you help review the change?
Thank you.

@maropu
Member

maropu commented Jun 1, 2020

ok to test

@maropu
Member

maropu commented Jun 1, 2020

Looks okay to me.

@SparkQA

SparkQA commented Jun 1, 2020

Test build #123362 has finished for PR 28686 at commit 5c782a7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 2, 2020

Test build #123404 has finished for PR 28686 at commit 3b3b4cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 2, 2020

Test build #123418 has finished for PR 28686 at commit a58a75f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@karuppayya
Contributor Author

The failure doesn't look genuine to me.
@maropu do I need to re-trigger the tests?

@viirya
Member

viirya commented Jun 2, 2020

retest this please

@SparkQA

SparkQA commented Jun 2, 2020

Test build #123443 has finished for PR 28686 at commit a58a75f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -654,7 +654,7 @@ case class HiveTableRelation(
     tableMeta: CatalogTable,
     dataCols: Seq[AttributeReference],
     partitionCols: Seq[AttributeReference],
-    tableStats: Option[Statistics] = None,
+    tableStats: Option[Statistics] = Some(Statistics(sizeInBytes = SQLConf.get.defaultSizeInBytes)),
Member

We still need to use Option?

@SparkQA

SparkQA commented Jun 3, 2020

Test build #123456 has finished for PR 28686 at commit d5e999f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2020

Test build #123457 has finished for PR 28686 at commit 88fefaf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -654,7 +654,8 @@ case class HiveTableRelation(
     tableMeta: CatalogTable,
     dataCols: Seq[AttributeReference],
     partitionCols: Seq[AttributeReference],
-    tableStats: Option[Statistics] = None,
+    tableStats: Option[Statistics] = Option(Statistics(sizeInBytes
+      = SQLConf.get.defaultSizeInBytes)),
Member

I meant tableStats: Statistics = Statistics(sizeInBytes = SQLConf.get.defaultSizeInBytes),
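
The difference between the two signatures can be sketched as follows. The `Stats` class and `DefaultSizeInBytes` are hypothetical stand-ins for Spark's `Statistics` and `SQLConf.get.defaultSizeInBytes`; this is an illustration of the suggestion, not the code in the PR.

```scala
object DefaultStatsSketch {
  // Hypothetical stand-in for Spark's Statistics class.
  case class Stats(sizeInBytes: BigInt)

  // Hypothetical stand-in for SQLConf.get.defaultSizeInBytes.
  val DefaultSizeInBytes: BigInt = BigInt(Long.MaxValue)

  // Variant in the diff: still an Option, although the default is never None.
  case class RelWithOption(tableStats: Option[Stats] = Some(Stats(DefaultSizeInBytes)))

  // Variant suggested in the review: a plain field with the same default.
  case class RelPlain(tableStats: Stats = Stats(DefaultSizeInBytes))
}
```

With the plain field, callers read `rel.tableStats.sizeInBytes` directly instead of unwrapping an `Option` whose empty case no longer carries meaning by default.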

@SparkQA

SparkQA commented Jun 3, 2020

Test build #123495 has finished for PR 28686 at commit 571319b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

RelationConversions(conf, catalog) +:
new DetermineTableStats(session) +:
Member

DetermineTableStats updates the statistics in HiveTableRelation in some cases. The updated statistics are propagated into HadoopFsRelation, so its sizeInBytes depends on them.

Because you moved DetermineTableStats after RelationConversions, the sizeInBytes of the converted HadoopFsRelation could change.

That is, if the user enables fallBackToHdfsForStatsEnabled to calculate the table size, this change will make it no longer work as before.

Contributor Author

@viirya Hive tables are converted to HiveTableScanExec instead of LogicalRelation (which in turn uses HadoopFsRelation). This happens in org.apache.spark.sql.hive.HiveStrategies.HiveTableScans. I think it will not affect HadoopFsRelation. Let me know if I am missing something.
Note: the conversion to HiveTableScanExec happens for Hive tables in any format other than Parquet and ORC.
For ORC and Parquet, it happens when the flag to convert to a datasource table is disabled.

Member

@viirya viirya Jun 3, 2020

I mean that previously DetermineTableStats calculated the table size and updated it in HiveTableRelation before RelationConversions ran. Then, in RelationConversions, this calculated table size was propagated into HadoopFsRelation and used by sizeInBytes.

Now that you have changed the rule order, even if users enable fallBackToHdfsForStatsEnabled, the HadoopFsRelation won't get the table size calculated in DetermineTableStats (because this rule now runs after RelationConversions).
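
The concern as stated here can be modelled with a small sketch. It assumes, as this comment does, that the conversion carries over whatever size is known at conversion time; whether the real conversion does so for every table kind is questioned later in this thread. All names and sizes below are hypothetical stand-ins, not Spark's classes.

```scala
object StatsPropagationSketch {
  // Assumed placeholder for an unknown table size.
  val DefaultSize: Long = Long.MaxValue

  // Hypothetical stand-ins for the plan nodes; NOT Spark's classes.
  sealed trait Plan
  case class HiveRel(sizeInBytes: Option[Long] = None) extends Plan
  case class ConvertedRel(sizeInBytes: Long) extends Plan

  // Stand-in for DetermineTableStats with the HDFS fallback enabled.
  val determineTableStats: Plan => Plan = {
    case HiveRel(None) => HiveRel(Some(4096L)) // pretend the size came from a file listing
    case other => other
  }

  // Stand-in for RelationConversions: carries over the size known at conversion time.
  val relationConversions: Plan => Plan = {
    case HiveRel(size) => ConvertedRel(size.getOrElse(DefaultSize))
    case other => other
  }

  // Applies the rules left to right to a fresh Hive relation without stats.
  def run(rules: Seq[Plan => Plan]): Plan =
    rules.foldLeft[Plan](HiveRel())((p, rule) => rule(p))
}
```

In this model the original order yields a converted relation with the computed size, while the reordered rules yield a converted relation that only has the default size, because the stats rule no longer sees a Hive relation to update.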

Contributor Author

@viirya Thanks for catching this. I think this reordering will not be useful, so I will decline this pull request.

Member

> Updated statistics will be propagated into HadoopFsRelation and so the sizeInBytes depends on it.

Ah, I see. I missed that code flow.

Member

@viirya viirya left a comment

Also, I think DetermineTableStats doesn't always calculate the table size: it does so only when fallBackToHdfsForStatsEnabled is set, and only for non-partitioned tables. A user who wants to avoid the computation should disable fallBackToHdfsForStatsEnabled.
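
A minimal sketch of turning the fallback off for a session, runnable in spark-shell or any program with Spark SQL on the classpath. The key `spark.sql.statistics.fallBackToHdfs` is assumed here to be the setting behind fallBackToHdfsForStatsEnabled; check it against the SQLConf of the Spark version in use.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session if one exists (as in spark-shell); otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("stats-fallback-sketch")
  .getOrCreate()

// Assumed key for fallBackToHdfsForStatsEnabled: with it disabled, table sizes that are
// missing from the metastore are not computed by listing files on HDFS.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "false")

println(spark.conf.get("spark.sql.statistics.fallBackToHdfs"))
```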

@SparkQA

SparkQA commented Jun 4, 2020

Test build #123497 has finished for PR 28686 at commit 9b4c18d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@karuppayya karuppayya closed this Jun 5, 2020
@karuppayya
Contributor Author

@viirya @maropu
I took another look at the code: the stats from the HiveTableRelation are propagated only for partitioned tables. Also, in DetermineTableStats we compute the stats only for non-partitioned tables. I think for a partitioned table the size will also be spark.sql.defaultSizeInBytes.
In the case of a non-partitioned table, the HadoopFsRelation that is created uses InMemoryFileIndex, which does not use the computed stats and does a separate listing to determine the stats.
Let me know if I am missing something here.

To show how this change is useful, I took the TPC-DS query q17.sql at scale 1000 on non-partitioned data.
Without this change, these are the metrics for the query planning phase:

scala> val df = sql(query)
scala> df.queryExecution.tracker.topRulesByTime(2).foreach(println)
(org.apache.spark.sql.hive.DetermineTableStats,RuleSummary(55677175448, 3, 3))
(org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions,RuleSummary(37485411305, 6, 0))

The time is also quite high because of SPARK-31850.

The computed stats are not used, so the computation can be avoided completely.
Let me know your thoughts.
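
For scale, the figures in the output above can be converted to seconds. This sketch assumes the three RuleSummary fields are total time in nanoseconds, number of invocations, and number of effective invocations; the values are copied from the output and the object name is made up.

```scala
object RuleTimingSketch {
  // (rule name, total time in ns, invocations, effective invocations), copied from the output above.
  val summaries: Seq[(String, Long, Long, Long)] = Seq(
    ("DetermineTableStats", 55677175448L, 3L, 3L),
    ("ResolveAggregateFunctions", 37485411305L, 6L, 0L)
  )

  // Converts nanoseconds to seconds.
  def seconds(ns: Long): Double = ns / 1e9

  // One line per rule: total time and average time per invocation.
  def report(): Seq[String] = summaries.map { case (name, ns, runs, _) =>
    f"$name: ${seconds(ns)}%.2f s total, ${seconds(ns) / runs}%.2f s per invocation"
  }
}
```

Under that assumption, DetermineTableStats accounts for roughly 55.7 seconds of planning time across its three invocations, which is about 18.6 seconds per invocation.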

@karuppayya karuppayya reopened this Jul 1, 2020
@@ -171,6 +171,32 @@ class HiveCatalogedDDLSuite extends DDLSuite with TestHiveSingleton with BeforeA
testDropTable(isDatasourceTable = false)
}

test("DetermineTableStats should not cause any plan changes" +
Contributor Author

What would be the right place to add this test? @viirya @maropu
I will move the table create/drop to beforeAll and afterAll when moving this test to the right file.

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124814 has finished for PR 28686 at commit ccba79b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2020

Test build #124813 has finished for PR 28686 at commit 9b4c18d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2020

Test build #124819 has finished for PR 28686 at commit 3ce6fbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@karuppayya karuppayya requested review from viirya and maropu July 6, 2020 19:15
@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125134 has finished for PR 28686 at commit 3ce6fbb.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@karuppayya
Contributor Author

@dongjoon-hyun The error doesn't seem to be related to the change. Can you take a look and, if it is intermittent, re-trigger the tests?

@karuppayya
Contributor Author

@viirya @maropu do you think this change would be helpful? Let me know if you see any other issues.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Nov 19, 2020
@github-actions github-actions bot closed this Nov 20, 2020