
Spark: Reduce unnecessary remote service requests in SparkSessionCatalog.invalidateTable #3861

Merged
rdblue merged 3 commits into apache:master from smallx:optimize-invalidateTable-for-SparkSessionCatalog on Jan 13, 2022

Conversation

@smallx (Contributor) commented Jan 8, 2022

@github-actions bot added the spark label on Jan 8, 2022
@smallx (Contributor, Author) commented Jan 8, 2022

Could you help take a look at the code? cc @RussellSpitzer @rdblue

}
// To reduce remote service requests, we do not need to check whether
// the table exists or whether it is an Iceberg table.
icebergCatalog.invalidateTable(ident);
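
For context, the check being removed presumably looked something like the sketch below. The exact prior code is not shown in this excerpt, so the guard and the fallback here are reconstructions, not quotes from the diff.

// Reconstructed sketch of the pre-change behavior (assumed): the existence
// check is what could cost an extra remote catalog request on every
// invalidation.
@Override
public void invalidateTable(Identifier ident) {
  if (icebergCatalog.tableExists(ident)) {  // may hit the remote service
    icebergCatalog.invalidateTable(ident);
  } else {
    getSessionCatalog().invalidateTable(ident);
  }
}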
@RussellSpitzer (Member) commented:

So we always have to do at least one catalog load here. Since we have a caching catalog, this is at most a single remote lookup, and in most circumstances it should be zero because the reference is already cached.

The SparkCatalog invalidateTable method starts by doing a load operation, so that call will either use a cached reference or do a remote lookup. If the exists call here was remote, then the table reference would be cached afterwards and the invalidate call would use that cache. If the exists call wasn't remote, that means the reference was already cached and invalidate would use the cache as well. So the number of remote lookups should be at most one, both with this patch and without it.

This change also now always calls the underlying session catalog's invalidate method, and I'm not sure what the consequences of that are, so I would probably recommend leaving this as is unless we have some evidence it is causing an issue.

I think if we want to remove the remote calls, we first have to go to the invalidateTable method and make some modifications there.

First, check whether the catalog is caching or not. For non-caching catalogs, or caching catalogs with no cache timeout, we can make it a no-op. For tables backed by a catalog with a cache, we would probably need to add a dropCache function (or something like that) to the Iceberg CachingCatalog API so we don't have to rely on calling load.
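
For concreteness, a minimal sketch of that dropCache idea, assuming a Caffeine-backed table cache like the one Iceberg's CachingCatalog uses. The class and method names here are illustrative, not existing API (#3837 later added a similar hook):

import com.github.benmanes.caffeine.cache.Cache;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

public class CachingCatalogSketch {
  // populated by loadTable(); invalidation should only evict, never load
  private final Cache<TableIdentifier, Table> tableCache;

  CachingCatalogSketch(Cache<TableIdentifier, Table> tableCache) {
    this.tableCache = tableCache;
  }

  // Hypothetical dropCache: evicts any cached entry without a remote call.
  public void dropCache(TableIdentifier ident) {
    tableCache.invalidate(ident);
  }
}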

@smallx (Contributor, Author) commented:

@RussellSpitzer Sorry, I forgot to link to #3837. In #3837, we added an invalidateTable method to the Catalog interface and removed the load calls in SparkCatalog.invalidateTable.

@smallx (Contributor, Author) commented Jan 8, 2022:

This change also now always calls the underlying session catalog's invalidate method and I'm not sure what the consequences ...

For this, I don't have a good idea at present.
Can we rely on the contract documented by Spark for the invalidateTable interface?
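
For reference, the Spark-side contract in question is TableCatalog.invalidateTable, a void default method. This excerpt paraphrases the Spark 3.x javadoc from memory, so check the Spark source for the authoritative wording:

import org.apache.spark.sql.connector.catalog.Identifier;

interface TableCatalogExcerpt {
  // Paraphrased contract: invalidate cached metadata for the identifier;
  // if the table is loaded or cached, drop the cached data, otherwise do
  // nothing. The default is a no-op, and the void return type means
  // callers cannot tell whether anything was actually evicted.
  default void invalidateTable(Identifier ident) {
  }
}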

@rdblue (Contributor) commented:

What about updating invalidateTable to return true if a table was removed from the cache? Then we could use that to avoid calling getSessionCatalog().invalidateTable(ident), at least when there was a table loaded.
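
A minimal sketch of that suggestion, assuming SparkSessionCatalog's existing icebergCatalog field and getSessionCatalog() helper, plus a hypothetical boolean-returning invalidate on the Iceberg side:

@Override
public void invalidateTable(Identifier ident) {
  // invalidateTableIfCached is hypothetical here: it reports whether an
  // Iceberg entry was actually evicted from the catalog cache
  boolean wasCached = icebergCatalog.invalidateTableIfCached(ident);
  if (!wasCached) {
    // fall through to the session catalog only for non-Iceberg tables
    getSessionCatalog().invalidateTable(ident);
  }
}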

@RussellSpitzer (Member) commented:

I think I'm just being paranoid, so I'm OK with calling both ... I think it should be fine as long as no one breaks the contracts.

@smallx (Contributor, Author) commented:

What about updating invalidateTable to return true if a table was removed from the cache?

I have updated Catalog.invalidateTable to return whether the table was cached. However, since Spark's TableCatalog.invalidateTable has no return value, I had to add an invalidateTableIfCached method to BaseCatalog to work around it.

@smallx force-pushed the optimize-invalidateTable-for-SparkSessionCatalog branch from e3db375 to e62c1fa on January 10, 2022
* not cached, do nothing.
*
* @param identifier a table identifier
* @return true if the table is cached, false otherwise
@rdblue (Contributor) commented:

I think this should say whether the table "was in the cache", because "is" is misleading: the table is no longer cached after this call.

* @param ident a table identifier
* @return true if the table is cached, false otherwise
*/
public boolean invalidateTableIfCached(Identifier ident) {
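
A reconstructed sketch of how the body might have looked in this (since-reverted) revision. The icebergCatalog handle and the Spark3Util identifier conversion are assumptions here, as is the interim boolean-returning Catalog.invalidateTable:

public boolean invalidateTableIfCached(Identifier ident) {
  // Delegates to the interim boolean-returning Catalog.invalidateTable
  // (later reverted) and reports whether a cached entry was dropped.
  return icebergCatalog.invalidateTable(Spark3Util.identifierToTableIdentifier(ident));
}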
@rdblue (Contributor) commented:

I don't think that we need this method. We only need to check whether the Iceberg cache had the table. No need to change the Spark catalog methods.

@rdblue (Contributor) commented:

Oh, I see why you added this. I don't think it's worth the change, though. Let's just call both invalidate methods instead.

@rdblue (Contributor) commented Jan 10, 2022

@smallx, I think my suggestion to check whether the table was cached to control whether to call the session catalog invalidate was a bad one. It introduces a lot of code for not much benefit, like needing to look up whether a table was cached before evicting it. Let's just go with the original approach of invalidating everywhere. If @RussellSpitzer is okay with that, then so am I.

@smallx (Contributor, Author) commented Jan 10, 2022

@rdblue I think so, too. I have reverted the code.
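
With the revert, the merged approach invalidates on both sides unconditionally; combined with the diff snippet above, SparkSessionCatalog.invalidateTable presumably ends up along these lines (a reconstruction, not a quote from commit bb79080):

@Override
public void invalidateTable(Identifier ident) {
  // Skip the existence / Iceberg-table checks to avoid remote service
  // requests, and simply invalidate in both catalogs.
  icebergCatalog.invalidateTable(ident);
  getSessionCatalog().invalidateTable(ident);
}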

@RussellSpitzer (Member) left a review comment:

LGTM

@rdblue merged commit bb79080 into apache:master on Jan 13, 2022
@smallx deleted the optimize-invalidateTable-for-SparkSessionCatalog branch on January 13, 2022