
Spark: Reduce unnecessary remote service requests in SparkSessionCatalog.invalidateTable #3861

Merged
rdblue merged 3 commits into apache:master from smallx:optimize-invalidateTable-for-SparkSessionCatalog on Jan 13, 2022

Conversation

@smallx (Contributor) commented Jan 8, 2022

@github-actions bot added the spark label on Jan 8, 2022
@smallx (Contributor, Author) commented Jan 8, 2022

Could you help take a look at the code? cc @RussellSpitzer @rdblue

}
// To reduce remote service requests, we do not need to check whether
// the table exists or whether it is an Iceberg table.
icebergCatalog.invalidateTable(ident);
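
For context, the check being removed presumably looked something like the sketch below. The exact prior code is not shown in this excerpt, so the guard and the fallback here are reconstructions, not quotes from the diff.

// Reconstructed sketch of the pre-change behavior (assumed): the existence
// check is what could cost an extra remote catalog request on every
// invalidation.
@Override
public void invalidateTable(Identifier ident) {
  if (icebergCatalog.tableExists(ident)) {  // may hit the remote service
    icebergCatalog.invalidateTable(ident);
  } else {
    getSessionCatalog().invalidateTable(ident);
  }
}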
@RussellSpitzer (Member) commented:

So we always have to do at least one catalog load here. Since we have a caching catalog, this is at most a single remote lookup, and in most circumstances it should be zero because the reference is already cached.

The SparkCatalog invalidateTable method starts by doing a load operation, so that call will either use a cached reference or do a remote lookup. If the exists call here was remote, then the table reference would be cached afterwards and the invalidate call would use that cache. If the exists call wasn't remote, that means the reference was already cached and invalidate would use the cache as well. So the number of remote lookups should be at most one, both with this patch and without it.

This change also now always calls the underlying session catalog's invalidate method, and I'm not sure what the consequences of that are, so I would probably recommend leaving this as is unless we have some evidence it is causing an issue.

I think if we want to remove the remote calls, we first have to go to the invalidateTable method and make some modifications there.

First, check whether the catalog is caching or not. For non-caching catalogs, or caching catalogs with no cache timeout, we can make it a no-op. For tables backed by a catalog with a cache, we would probably need to add a dropCache function (or something like that) to the Iceberg CachingCatalog API so we don't have to rely on calling load.
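
For concreteness, a minimal sketch of that dropCache idea, assuming a Caffeine-backed table cache like the one Iceberg's CachingCatalog uses. The class and method names here are illustrative, not existing API (#3837 later added a similar hook):

import com.github.benmanes.caffeine.cache.Cache;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

public class CachingCatalogSketch {
  // populated by loadTable(); invalidation should only evict, never load
  private final Cache<TableIdentifier, Table> tableCache;

  CachingCatalogSketch(Cache<TableIdentifier, Table> tableCache) {
    this.tableCache = tableCache;
  }

  // Hypothetical dropCache: evicts any cached entry without a remote call.
  public void dropCache(TableIdentifier ident) {
    tableCache.invalidate(ident);
  }
}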

@smallx (Contributor, Author) commented:

@RussellSpitzer Sorry, I forgot to link to #3837. In #3837, we added an invalidateTable method to the Catalog interface and removed the load calls in SparkCatalog.invalidateTable.

@smallx (Contributor, Author) commented Jan 8, 2022:

This change also now always calls the underlying session catalog's invalidate method and I'm not sure what the consequences ...

For this, I don't have a good idea at present.
Can we rely on the contract documented by Spark for the invalidateTable interface?
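
For reference, the Spark-side contract in question is TableCatalog.invalidateTable, a void default method. This excerpt paraphrases the Spark 3.x javadoc from memory, so check the Spark source for the authoritative wording:

import org.apache.spark.sql.connector.catalog.Identifier;

interface TableCatalogExcerpt {
  // Paraphrased contract: invalidate cached metadata for the identifier;
  // if the table is loaded or cached, drop the cached data, otherwise do
  // nothing. The default is a no-op, and the void return type means
  // callers cannot tell whether anything was actually evicted.
  default void invalidateTable(Identifier ident) {
  }
}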

@rdblue (Contributor) commented:

What about updating invalidateTable to return true if a table was removed from the cache? Then we could use that to avoid calling getSessionCatalog().invalidateTable(ident), at least when there was a table loaded.
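
A minimal sketch of that suggestion, assuming SparkSessionCatalog's existing icebergCatalog field and getSessionCatalog() helper, plus a hypothetical boolean-returning invalidate on the Iceberg side:

@Override
public void invalidateTable(Identifier ident) {
  // invalidateTableIfCached is hypothetical here: it reports whether an
  // Iceberg entry was actually evicted from the catalog cache
  boolean wasCached = icebergCatalog.invalidateTableIfCached(ident);
  if (!wasCached) {
    // fall through to the session catalog only for non-Iceberg tables
    getSessionCatalog().invalidateTable(ident);
  }
}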

@RussellSpitzer (Member) commented:

I think I'm just being paranoid, so I'm OK with calling both ... I think it should be fine as long as no one breaks the contracts.

@smallx (Contributor, Author) commented:

What about updating invalidateTable to return true if a table was removed from the cache?

I have updated Catalog.invalidateTable to return whether the table was cached. However, since Spark's TableCatalog.invalidateTable has no return value, I had to add an invalidateTableIfCached method to BaseCatalog to work around it.

@smallx force-pushed the optimize-invalidateTable-for-SparkSessionCatalog branch from e3db375 to e62c1fa on January 10, 2022
* not cached, do nothing.
*
* @param identifier a table identifier
* @return true if the table is cached, false otherwise
@rdblue (Contributor) commented:

I think this should say whether the table "was in the cache", because "is" is misleading: the table is no longer cached after this call.

* @param ident a table identifier
* @return true if the table is cached, false otherwise
*/
public boolean invalidateTableIfCached(Identifier ident) {
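
A reconstructed sketch of how the body might have looked in this (since-reverted) revision. The icebergCatalog handle and the Spark3Util identifier conversion are assumptions here, as is the interim boolean-returning Catalog.invalidateTable:

public boolean invalidateTableIfCached(Identifier ident) {
  // Delegates to the interim boolean-returning Catalog.invalidateTable
  // (later reverted) and reports whether a cached entry was dropped.
  return icebergCatalog.invalidateTable(Spark3Util.identifierToTableIdentifier(ident));
}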
@rdblue (Contributor) commented:

I don't think that we need this method. We only need to check whether the Iceberg cache had the table. No need to change the Spark catalog methods.

@rdblue (Contributor) commented:

Oh, I see why you added this. I don't think it's worth the change, though. Let's just call both invalidate methods instead.

@rdblue (Contributor) commented Jan 10, 2022

@smallx, I think my suggestion to check whether the table was cached to control whether to call the session catalog invalidate was a bad one. It introduces a lot of code for not much benefit, like needing to look up whether a table was cached before evicting it. Let's just go with the original approach of invalidating everywhere. If @RussellSpitzer is okay with that, then so am I.

@smallx (Contributor, Author) commented Jan 10, 2022

@rdblue I think so, too. I have reverted the code.
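
With the revert, the merged approach invalidates on both sides unconditionally; combined with the diff snippet above, SparkSessionCatalog.invalidateTable presumably ends up along these lines (a reconstruction, not a quote from commit bb79080):

@Override
public void invalidateTable(Identifier ident) {
  // Skip the existence / Iceberg-table checks to avoid remote service
  // requests, and simply invalidate in both catalogs.
  icebergCatalog.invalidateTable(ident);
  getSessionCatalog().invalidateTable(ident);
}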

@RussellSpitzer (Member) left a review comment:

LGTM

@rdblue merged commit bb79080 into apache:master on Jan 13, 2022
@smallx deleted the optimize-invalidateTable-for-SparkSessionCatalog branch on January 13, 2022