Spark: Reduce unnecessary remote service requests in SparkSessionCatalog.invalidateTable #3861
Conversation
Could you help me take a look at the code? cc @RussellSpitzer @rdblue
```java
  }

  // We do not need to check whether the table exists and whether
  // it is an Iceberg table to reduce remote service requests.
  icebergCatalog.invalidateTable(ident);
```
So we always have to do at least one catalog load here. Since we have a caching catalog, this is at most a single remote lookup, but in most circumstances it should probably be zero because the reference is cached.
The SparkCatalog `invalidateTable` method starts by doing a load operation, so that call will either use a cached reference or do a remote lookup. If the exists check was remote, then the table reference would be cached and the invalidate call would use that cache. If the exists check wasn't remote, then the reference was already cached and the invalidate call should also use that cache. So the number of remote lookups should be at most one, both with this patch and without.
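For context, a minimal sketch of the load-based path being described, simplified from what SparkCatalog does (`buildIdentifier` stands in for the catalog's identifier conversion; details of the real code differ):

```java
// Simplified sketch of an invalidate that relies on a load call.
// With a CachingCatalog, loadTable either returns the cached reference
// or performs a single remote lookup; refresh() then forces the cached
// table metadata to be reloaded.
@Override
public void invalidateTable(Identifier ident) {
  try {
    icebergCatalog.loadTable(buildIdentifier(ident)).refresh();
  } catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
    // nothing to invalidate if the table does not exist
  }
}
```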
This change now also always calls the underlying session catalog's invalidate method, and I'm not sure what the consequences of that are, so I would probably recommend leaving this as is unless we have some evidence that it is causing an issue.
I think if we want to remove the remote calls, we first have to modify the invalidateTable method.
First, check whether the catalog is caching or not: for non-caching catalogs, or catalogs with no cache timeout, we can make it a no-op. For tables backed by a catalog with a cache, we would probably need to add a dropCache function to the Iceberg CachingCatalog API, or something like that, so we don't have to rely on calling load; a sketch of that idea follows.
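A rough sketch of that proposal; note that `dropCache` is hypothetical here, since it is exactly the CachingCatalog addition being suggested:

```java
// Hypothetical shape of the proposal, inside SparkCatalog.
@Override
public void invalidateTable(Identifier ident) {
  if (icebergCatalog instanceof CachingCatalog) {
    // proposed new CachingCatalog API: evict the entry without
    // having to call load first
    ((CachingCatalog) icebergCatalog).dropCache(buildIdentifier(ident));
  }
  // a non-caching catalog holds no cached state, so this is a no-op
}
```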
@RussellSpitzer Sorry, I forgot to link to #3837. In #3837, we added an invalidateTable method to the Catalog interface and removed the load calls in SparkCatalog.invalidateTable.
This change now also always calls the underlying session catalog's invalidate method and I'm not sure what the consequences ...
For this, I have no good idea at present.
Can we rely on the contract described in the Spark documentation for the invalidateTable interface?
What about updating invalidateTable to return true if a table was removed from the cache? Then we could use that to avoid calling getSessionCatalog().invalidateTable(ident) at least when there was a table loaded.
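Roughly, that suggestion could look like the following sketch. Since Spark's TableCatalog.invalidateTable returns void, the boolean would have to come from an Iceberg-side call; `invalidateIcebergTable` below is a hypothetical helper standing in for that:

```java
// Sketch of the intent inside SparkSessionCatalog: skip the session
// catalog invalidate when the Iceberg cache actually contained the
// table, since in that case it must be an Iceberg table.
@Override
public void invalidateTable(Identifier ident) {
  boolean wasCached = invalidateIcebergTable(ident); // hypothetical helper
  if (!wasCached) {
    getSessionCatalog().invalidateTable(ident);
  }
}
```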
I think I'm just being paranoid, so I'm ok with calling both ... I think it should be fine as long as no one breaks the contracts.
What about updating invalidateTable to return true if a table was removed from the cache?
I have updated Catalog.invalidateTable to return whether the table was cached. However, because Spark's TableCatalog.invalidateTable has no return value, I had to add an invalidateTableIfCached method to BaseCatalog to work around it.
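In outline, the workaround looks something like this sketch (names follow the discussion; the actual diff may differ in details such as field names and identifier conversion):

```java
// In BaseCatalog (sketch): a separate method is needed because Spark's
// TableCatalog.invalidateTable has no return value. It reports whether
// the table was evicted from the Iceberg-side cache.
public boolean invalidateTableIfCached(Identifier ident) {
  return icebergCatalog.invalidateTable(buildIdentifier(ident));
}

// Caller side in SparkSessionCatalog (sketch): fall back to the session
// catalog only when the Iceberg cache did not have the table.
@Override
public void invalidateTable(Identifier ident) {
  if (!icebergCatalog.invalidateTableIfCached(ident)) {
    getSessionCatalog().invalidateTable(ident);
  }
}
```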
Force-pushed `…alog.invalidateTable` from e3db375 to e62c1fa
```java
   * not cached, do nothing.
   *
   * @param identifier a table identifier
   * @return true if the table is cached, false otherwise
```
I think this should say the table "was in the cache", because "is" is misleading: the table is no longer cached after this call.
```java
   * @param ident a table identifier
   * @return true if the table is cached, false otherwise
   */
  public boolean invalidateTableIfCached(Identifier ident) {
```
I don't think that we need this method. We only need to check whether the Iceberg cache had the table. No need to change the Spark catalog methods.
Oh, I see why you added this. I don't think it's worth the change, though. Let's just call both invalidate methods instead.
@smallx, I think my suggestion to check whether the table was cached to control whether to call the session catalog invalidate was a bad one. It introduces a lot of code for not much benefit, like needing to look up whether a table was cached before evicting it. Let's just go with the original approach of invalidating everywhere. If @RussellSpitzer is okay with that, then so am I.
@rdblue I think so, too. I have reverted the code.
See also: https://github.com/apache/spark/blob/f051b4be1c17cd3d8789787e5dec25bfcd749442/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java#L132
Links to #3072 and #3837.