[SPARK-31850][SQL] Prevent DetermineTableStats from computing stats multiple times for same table #28662
Conversation
Can one of the admins verify this patch?
@dongjoon-hyun @cloud-fan @gatorsmile Can anyone please help review this change? Thanks.
Tagging a few more committers from the file's git history for review: @HeartSaVioR @holdenk @maropu
```scala
fs.getContentSummary(tablePath).getLength
val table = relation.tableMeta
val relationSizeMap = getRelationToSizeMap
if (relationSizeMap.contains(table.identifier)) {
```
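The idea in the diff above can be sketched in isolation: table sizes are computed once per table identifier and re-used on subsequent lookups. This is a minimal, self-contained illustration; `TableIdentifier`, `computeSizeFromFs`, and `sizeOf` are hypothetical stand-ins, not Spark's actual API.

```scala
import scala.collection.mutable

object RelationSizeCacheSketch {
  // Simplified stand-in for Spark's TableIdentifier.
  case class TableIdentifier(db: String, table: String)

  // Cache keyed by table identifier, mirroring relationSizeMap above.
  private val relationSizeMap = mutable.Map.empty[TableIdentifier, Long]
  var fsCalls = 0 // counts how often the expensive filesystem path runs

  // Stand-in for fs.getContentSummary(tablePath).getLength.
  private def computeSizeFromFs(id: TableIdentifier): Long = {
    fsCalls += 1
    42L
  }

  // Only the first lookup for an identifier hits the "filesystem".
  def sizeOf(id: TableIdentifier): Long =
    relationSizeMap.getOrElseUpdate(id, computeSizeFromFs(id))

  def main(args: Array[String]): Unit = {
    val t = TableIdentifier("default", "tbl")
    sizeOf(t); sizeOf(t); sizeOf(t)
    println(fsCalls) // the expensive computation ran only once
  }
}
```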
How does this rule pick up stats updates to the base relations in this approach?
I have now handled this in a more generic way in 0a1fb93.
The way to refresh a table in a Spark session is the `REFRESH TABLE <tblname>`
command or the corresponding DataFrame APIs (please let me know if this is not correct).
With this change, the cache invalidation happens whenever a refresh table is invoked.
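The invalidation behaviour described here can be sketched as follows. This is a hedged illustration of the mechanism, not Spark's code: `refreshTable`, `cachedSize`, and the plain string key are hypothetical simplifications of what `REFRESH TABLE <tblname>` triggers.

```scala
import scala.collection.mutable

object RefreshInvalidationSketch {
  private val cache = mutable.Map.empty[String, Long]
  var computes = 0 // counts how often the expensive size computation runs

  private def compute(table: String): Long = { computes += 1; 100L }

  // Lookup that fills the cache on first access.
  def cachedSize(table: String): Long =
    cache.getOrElseUpdate(table, compute(table))

  // Corresponds to REFRESH TABLE <tblname>: drop the stale entry so the
  // next lookup recomputes from the current table state.
  def refreshTable(table: String): Unit = cache.remove(table)

  def main(args: Array[String]): Unit = {
    cachedSize("t"); cachedSize("t") // one computation, one cache hit
    refreshTable("t")                // invalidate
    cachedSize("t")                  // recomputed after refresh
    println(computes)
  }
}
```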
Ur, that's a wrong number. As you said, I meant #28686.
Do you disable that option in your use case? I thought most users turn on the flag to convert it to a datasource table. Anyway, I just want to know the priority of fixing this issue.
@maropu In my example I took parquet as the data format. This can also happen with formats other than parquet/orc (like JSON, CSV, etc.).
Can you explain why `DetermineTableStats` will calculate the statistics multiple times? Do you mean it is caused by `df.queryExecution.analyzed`, like the PR description shows? Once you finish query analysis of a DataFrame, the analyzed plan is kept as `QueryExecution.analyzed`. Why would accessing it cause re-calculation?
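The problem the PR describes can be sketched without Spark: two independent query plans that reference the same table each carry their own relation node, so without a shared per-table cache the stats rule re-does the expensive scan for each plan. All names here (`Relation`, `determineStats`) are simplified stand-ins for illustration only.

```scala
object UnguardedStatsSketch {
  // Simplified stand-in for HiveTableRelation: per-plan node, per-plan stats.
  case class Relation(table: String, stats: Option[Long])
  var statScans = 0 // counts expensive stats computations

  // Stand-in for DetermineTableStats without any cross-plan cache:
  // it only sees the stats stored on this particular relation node.
  def determineStats(r: Relation): Relation =
    if (r.stats.isEmpty) { statScans += 1; r.copy(stats = Some(42L)) } else r

  def main(args: Array[String]): Unit = {
    // Two separate query plans over the same table "t": each analysis
    // triggers its own scan, because the stats live on the plan node,
    // not in a table-level cache.
    determineStats(Relation("t", None))
    determineStats(Relation("t", None))
    println(statScans)
  }
}
```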
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
```scala
// Hive table columns are always nullable.
table.dataSchema.asNullable.toAttributes,
table.partitionSchema.asNullable.toAttributes)
def readHiveTable(catalog: SessionCatalog, table: CatalogTable): HiveTableRelation = {
```
I can see `readHiveTable` is used by the write path (`InsertIntoStatement`) too. If we just get the cached plan, will it be dangerous if the cached plan is out-of-date and Spark writes with incorrect metadata?
I tried to reuse the cache of datasource tables for Hive tables. `InsertIntoStatement` for datasource tables also fetches from the same cache, and the cache invalidation has been taken care of.
From my reading, I didn't find any such cases. Let me know if you find any case that needs special handling; I will also check the code from this perspective again.
In the description, I had written the code to trigger the analysis phase. At the end of the analysis phase, …
I see, it is because calling … Btw, as we already calculate statistics and save them into the relation's `tableStats`, `DetermineTableStats` can guard on it:

```scala
class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    case relation: HiveTableRelation
        if DDLUtils.isHiveTable(relation.tableMeta) &&
          relation.tableMeta.stats.isEmpty && relation.tableStats.isEmpty =>
      hiveTableWithStats(relation)
    ...
  }
}
```

Then I think Spark doesn't re-calculate if `tableStats` is already defined.
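The guard pattern suggested above can be illustrated in isolation: once stats are attached to the relation, re-running the rule is a no-op, so repeated rule applications do not recompute. The types below are simplified, hypothetical stand-ins for the Spark classes in the snippet.

```scala
object GuardedRuleSketch {
  // Stand-in for HiveTableRelation with its tableStats field.
  case class Relation(tableStats: Option[Long])
  var computations = 0 // counts expensive stats computations

  // Stand-in for DetermineTableStats with the tableStats.isEmpty guard.
  def determineTableStats(r: Relation): Relation =
    if (r.tableStats.isEmpty) {
      computations += 1                 // expensive stats computation
      r.copy(tableStats = Some(42L))
    } else r                            // guard: stats already present, no-op

  def main(args: Array[String]): Unit = {
    val once = determineTableStats(Relation(None))
    determineTableStats(once) // rule applied again on the same relation
    println(computations)
  }
}
```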
The above condition is already present.
@viirya @maropu @HyukjinKwon Can you please help review this PR?
#28686 should handle most cases. Closing this PR.
What changes were proposed in this pull request?
Repro steps
Stacktrace indicating that stats collection happens multiple times:
Note:
There is no log line in `DetermineTableStats` to indicate that a stats computation happened; you need to add a log line or use a debugger to observe it.
The above can be reproduced with the first query on a newly created table.
Why are the changes needed?
Stats computation can be an expensive operation, especially for a large table.
Once stats are computed for a table, they can be re-used.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests added