[SPARK-18028][SQL] simplify TableFileCatalog #15568
Test build #67265 has finished for PR 15568 at commit
bd808af to c98c4f6
Test build #67266 has finished for PR 15568 at commit
@cloud-fan, I don't know what you mean in item 4 about
   */
  def listPartitionsByFilter(
      tableName: TableIdentifier,
      predicates: Seq[Expression]): Seq[CatalogTablePartition] = {
Thank you for adding this, @cloud-fan! I think I can use this in #15302.
@@ -102,6 +95,13 @@ class TableFileCatalog(
  }

  override def inputFiles: Array[String] = allPartitions.inputFiles

  override def equals(o: Any): Boolean = o match {
What is this needed for?
Under the Hive context, we cache the `LogicalRelation` for every data source table (including those converted from Hive), which means every table always has the same `TableFileCatalog` instance.
However, that's not true in sql core: we re-construct the `TableFileCatalog` and `LogicalRelation` every time we look up a table. Thus we may encounter a cache miss even if the table is cached, because different `TableFileCatalog` instances are never equal to each other.
Although it's not a real problem now, I think it's reasonable to follow `ListingFileCatalog` and add `equals` and `hashCode`.
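The cache-miss concern above can be illustrated with a minimal, self-contained sketch. It does not use Spark's classes: `IdentityCatalog`, `ValueCatalog`, and `CacheLookupSketch` are hypothetical stand-ins, and comparing by a table identifier string is an assumption made for illustration, not necessarily what the patch does.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a file catalog WITHOUT equals/hashCode:
// two instances are equal only if they are the same object.
class IdentityCatalog(val tableIdentifier: String)

// Hypothetical stand-in that compares by table identifier instead
// (an assumption for this sketch).
class ValueCatalog(val tableIdentifier: String) {
  override def equals(o: Any): Boolean = o match {
    case other: ValueCatalog => tableIdentifier == other.tableIdentifier
    case _ => false
  }
  override def hashCode(): Int = tableIdentifier.hashCode
}

object CacheLookupSketch {
  // Caches an entry under one catalog instance, then looks it up with a
  // freshly constructed instance, as happens when the catalog is rebuilt
  // on every table lookup.
  def hitWithFreshInstance[K](mk: () => K): Boolean = {
    val cache = mutable.HashMap.empty[K, String]
    cache.put(mk(), "cached data")
    cache.contains(mk())
  }

  // Reference equality: the fresh instance never matches the cached key.
  val identityHit: Boolean =
    hitWithFreshInstance(() => new IdentityCatalog("db.tbl"))

  // Value equality: the fresh instance matches the cached key.
  val valueHit: Boolean =
    hitWithFreshInstance(() => new ValueCatalog("db.tbl"))
}
```

With reference equality the second lookup misses; once `equals` and `hashCode` are defined on the identifying field, a rebuilt instance finds the cached entry.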
I see. Could we add a unit test for this and a comment here?
LGTM, but had a question on why we need to override equals().
@mallman , item 4 is a potential problem in the future. The current workflow is, we get the However, we may have a workflow which call BTW, we also have |
c98c4f6 to c4a906f
Test build #67444 has finished for PR 15568 at commit
lgtm, thanks for adding the test!
thanks for the review, merging to master!
## What changes were proposed in this pull request?

Simplify/cleanup TableFileCatalog:

1. Pass a `CatalogTable` instead of `databaseName` and `tableName` into `TableFileCatalog`, so that we don't need to fetch table metadata from the metastore again.
2. In `TableFileCatalog.filterPartitions0`, DO NOT set `PartitioningAwareFileCatalog.BASE_PATH_PARAM`. According to the [classdoc](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L189-L209), the default value of `basePath` already satisfies our need. What's more, if we set this parameter, we may break case 2, which is mentioned in the classdoc.
3. Add `equals` and `hashCode` to `TableFileCatalog`.
4. Add `SessionCatalog.listPartitionsByFilter`, which handles case sensitivity.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#15568 from cloud-fan/table-file-catalog.
      new PrunedTableFileCatalog(
        sparkSession, new Path(baseLocation.get), fileStatusCache, partitionSpec)
    } else {
      new ListingFileCatalog(sparkSession, rootPaths, table.storage.properties, None)
Hi @cloud-fan, I have a question here: why did we remove the `fileStatusCache` for `ListingFileCatalog`? Do we need to add it back?
seems it was a mistake. Can you send a PR to add it? thanks!
OK, sure. Thanks for your answer.
## What changes were proposed in this pull request?

Simplify/cleanup TableFileCatalog:

1. Pass a `CatalogTable` instead of `databaseName` and `tableName` into `TableFileCatalog`, so that we don't need to fetch table metadata from the metastore again.
2. In `TableFileCatalog.filterPartitions0`, DO NOT set `PartitioningAwareFileCatalog.BASE_PATH_PARAM`. According to the classdoc, the default value of `basePath` already satisfies our need. What's more, if we set this parameter, we may break case 2, which is mentioned in the classdoc.
3. Add `equals` and `hashCode` to `TableFileCatalog`.
4. Add `SessionCatalog.listPartitionsByFilter`, which handles case sensitivity.

## How was this patch tested?

Existing tests.
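
Item 1 of the description can be sketched with a toy model. Everything below is hypothetical and simplified for illustration: `TableMeta`, `CountingMetastore`, `NameBasedFileCatalog`, and `TableBasedFileCatalog` are not Spark classes, and the sketch only assumes that the caller has already resolved the table before constructing the file catalog.

```scala
// Hypothetical, simplified table metadata; not Spark's CatalogTable.
final case class TableMeta(database: String, name: String, location: String)

// Hypothetical metastore that counts how often metadata is fetched.
class CountingMetastore(tables: Map[(String, String), TableMeta]) {
  var lookups: Int = 0
  def getTable(db: String, name: String): TableMeta = {
    lookups += 1
    tables((db, name))
  }
}

// Before: the catalog receives names only, so it must fetch the metadata itself.
class NameBasedFileCatalog(metastore: CountingMetastore, db: String, name: String) {
  val table: TableMeta = metastore.getTable(db, name)
}

// After: the caller hands over the metadata it has already fetched.
class TableBasedFileCatalog(val table: TableMeta)

object MetastoreLookupSketch {
  private val meta = TableMeta("default", "events", "/warehouse/events")

  private def freshMetastore(): CountingMetastore =
    new CountingMetastore(Map(("default", "events") -> meta))

  // The caller resolves the table, then the catalog resolves it again.
  def lookupsWithNames(): Int = {
    val ms = freshMetastore()
    val resolved = ms.getTable("default", "events")
    new NameBasedFileCatalog(ms, resolved.database, resolved.name)
    ms.lookups
  }

  // The caller resolves the table once and passes the result along.
  def lookupsWithTable(): Int = {
    val ms = freshMetastore()
    val resolved = ms.getTable("default", "events")
    new TableBasedFileCatalog(resolved)
    ms.lookups
  }
}
```

In this toy model the name-based constructor costs one extra metadata fetch per catalog, which is the redundancy item 1 removes.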