Core: fix deadlock in CachingCatalog #3801

Merged: rdblue merged 1 commit into apache:master from racevedoo:fix-caching-catalog-deadlock on Dec 29, 2021

Core: fix deadlock in CachingCatalog#3801
rdblue merged 1 commit intoapache:masterfrom
racevedoo:fix-caching-catalog-deadlock

Conversation

@racevedoo (Contributor)

Uses Caffeine's RemovalListener to expire metadata tables asynchronously, so cache entries are no longer modified from inside the underlying HashMap's compute functions (which causes deadlocks).
For more details, check #3791

Fixes #3791
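For context on the failure mode (an illustrative aside, assuming Caffeine 2.x; the keys and names below are placeholders, not the actual CachingCatalog code): Caffeine's cache is backed by a ConcurrentHashMap, and a loader passed to cache.get(key, loader) runs inside the map's compute for that key, so touching other entries of the same cache from the loader can deadlock. A RemovalListener, by contrast, is invoked outside the map computation, which is why the listener is a safe place to drop the metadata-table entries.

  import com.github.benmanes.caffeine.cache.Cache;
  import com.github.benmanes.caffeine.cache.Caffeine;
  import com.github.benmanes.caffeine.cache.RemovalCause;

  public class RemovalListenerSketch {
    public static void main(String[] args) {
      Cache<String, String> cache = Caffeine.newBuilder()
          // Safe place to react to removals: the listener runs outside the backing
          // map's compute, so invalidating related entries here cannot deadlock.
          .removalListener((String key, String value, RemovalCause cause) -> {
            if (cause == RemovalCause.EXPIRED) {
              // invalidate derived entries for `key` here (e.g. its metadata tables)
            }
          })
          .build();

      // The pattern this PR removes, shown commented out so the sketch stays runnable:
      // cache.get("db.table", key -> {
      //   cache.invalidate("db.table#snapshots"); // re-entrant access to the same map -> can deadlock
      //   return "table";
      // });

      cache.put("db.table", "table");
      cache.invalidate("db.table");
    }
  }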

@racevedoo racevedoo changed the title from "Core: fixes deadlock in CachingCatalog" to "Core: fix deadlock in CachingCatalog" Dec 23, 2021
@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch from f1b04a0 to 59f355e Compare December 23, 2021 16:38
@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch from 59f355e to 76f9667 Compare December 23, 2021 21:32
@racevedoo racevedoo requested a review from rdblue December 23, 2021 21:33
catalog.cache().asMap().containsKey(metadataTable)));
// Removal of metadata tables from cache is async, use awaitility
await().untilAsserted(() ->
Assertions.assertThat(catalog.cache().asMap()).doesNotContainKeys(metadataTables(tableIdent)));
@kbendick (Contributor), Dec 24, 2021

Side note from my own investigation into the need for these changes:

If we're introducing Awaitility (which in general I think is a good idea), we should eventually update the rest of these tests to use Awaitility and remove the call to cleanUp in the TestableCachingCatalog, which was added to handle the async expiration.

  public Cache<TableIdentifier, Table> cache() {
    // cleanUp must be called as tests apply assertions directly on the underlying map, but metadata table
    // map entries are cleaned up asynchronously.
    tableCache.cleanUp();
    return tableCache;
  }

This larger refactor to use Awaitility throughout should be done in a separate PR though, as this fix is important and I'd like to get this PR in as soon as possible, so people can use the updated snapshot and we can keep checking whether the race condition is fixed.

I was curious why this change was needed, so I went looking through the Caffeine issues and update notes and found the likely cause. Updating Caffeine to 2.8.5 would make this change unnecessary. The patch notes reference an issue described as "Fixed expiration delay for scheduled cleanup", which is likely relevant to the new / modified write path.

I also have another PR open to upgrade the Caffeine library version, as there are some important bug fixes for us, and since we've been mucking around in here, we might as well upgrade instead of staying behind. #3803

All that said, I think the use of Awaitility here is fine. =)

Others might have different opinions based on a "smallest possible diff" principle, where we'd introduce Awaitility in a PR by itself, to help people who backport and to keep the assertion messages, etc. But I'm cool either way.

Thanks again for all your work on this!
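As a rough, self-contained sketch of the Awaitility-style assertion being discussed (assuming Caffeine, AssertJ, and Awaitility on the test classpath; the key names are placeholders, not the actual TestableCachingCatalog fields):

  import static org.assertj.core.api.Assertions.assertThat;
  import static org.awaitility.Awaitility.await;

  import java.util.concurrent.TimeUnit;

  import com.github.benmanes.caffeine.cache.Cache;
  import com.github.benmanes.caffeine.cache.Caffeine;

  public class AwaitEvictionSketch {
    public static void main(String[] args) {
      Cache<String, String> tableCache = Caffeine.newBuilder().build();
      tableCache.put("db.table#snapshots", "metadata table");
      tableCache.invalidate("db.table#snapshots");

      // Instead of forcing tableCache.cleanUp() before asserting, poll until the
      // asynchronous removal is visible in the underlying map.
      await().atMost(5, TimeUnit.SECONDS)
          .untilAsserted(() -> assertThat(tableCache.asMap()).doesNotContainKey("db.table#snapshots"));
    }
  }

The diff above applies the same idea with doesNotContainKeys over all metadata table identifiers.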

@kbendick (Contributor), Dec 24, 2021

Thinking on it further: can we upgrade Caffeine to the next minor version (2.8.5) in this PR, which would make this Awaitility introduction unnecessary?

My reasoning is that the bug that forces you to use Awaitility here is fixed by upgrading to Caffeine 2.8.5. This indicates that the write path we're going through also receives the fix, so we should provide that fix to our actual code too (even if we never explicitly call cleanUp, the bug could still affect us).

It's the only patch mentioned in the release notes for 2.8.4 -> 2.8.5.

We can and should still convert this to not call cleanUp in the TestableCachingCatalog, but I think it's best that the fix from 2.8.5 is also applied to the final code, as it's a scheduling bug that could affect our new write path for this cache as well. That way, we can also introduce Awaitility in a follow-up PR and focus only on this bug here.

Also, when we upgrade Caffeine to 2.9.x or 3.0 (more or less the same release), we should look into changing this to use the newly added .evictionListener, which would make the metadata table drops fully atomic. That might make the Awaitility changes unnecessary. But my main thinking is to upgrade to Caffeine 2.8.5 for now to get the scheduling fix that otherwise likely affects us.

@ben-manes

The evictionListener replaces CacheWriter’s delete for an atomic action under the map’s computation. Since your previous code wrote back into the cache I think you would suffer the same problem if switching.

Since you don’t use a scheduler, I don’t see how that bug fix affects you.

You can use Caffeine.executor(Runnable::run) to disable async, which simplifies tests.
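For illustration, a minimal sketch of the same-thread executor idea (placeholder keys, not the actual CachingCatalog wiring): with Caffeine.executor(Runnable::run), maintenance work and removal notifications run on the calling thread, so their effects are visible immediately after the triggering call.

  import com.github.benmanes.caffeine.cache.Cache;
  import com.github.benmanes.caffeine.cache.Caffeine;
  import com.github.benmanes.caffeine.cache.RemovalCause;

  public class SameThreadExecutorSketch {
    public static void main(String[] args) {
      Cache<String, String> cache = Caffeine.newBuilder()
          // Run cache maintenance and removal-listener callbacks on the calling thread
          // instead of ForkJoinPool.commonPool().
          .executor(Runnable::run)
          .removalListener((String key, String value, RemovalCause cause) ->
              System.out.println("removed " + key + " (" + cause + ")"))
          .build();

      cache.put("db.table", "table");
      cache.invalidate("db.table");
      // With Runnable::run, the listener has already fired by this point,
      // so a test can assert on its side effects without waiting.
    }
  }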

Contributor

Oh awesome, thank you! You're right: after running the tests several more times on 2.8.5, it didn't fix the issue. I just somehow got lucky and got several successful runs in a row.

Appreciate the input regarding disabling async!

Contributor

We typically don't expect to see a huge amount of cache usage. It's for caching tables that are accessed, and a job usually accesses a table via the catalog one time at the start of the job when query planning happens.

There are cases where they can be accessed several times (as mentioned in the issue), but it's usually a pretty limited number of values and accesses.

Would it be a terrible idea to use Caffeine.executor(Runnable::run) to disable async in the actual production code? Ideally the tables we're removing in the RemovalListener would expire synchronously when the main table they reference expires.

Thanks again for your input and the great library @ben-manes!

@ben-manes, Dec 24, 2021

It would be perfectly fine to disable async in production code (Redhat's Infinispan does that, for example). The async is primarily because we don't know the cost of running user-supplied callbacks like the removal listener, which might impact user-facing latencies if run on the calling threads. The cache's own logic is very fast and strictly uses amortized O(1) algorithms.

@kbendick (Contributor), Dec 25, 2021

Great. Our work in the callback is relatively negligible (building a list of maybe at most 10 or so possible keys and calling invalidateAll with that list) and disabling async would be the desired behavior to reduce the possibility of users getting a stale associated value (the ones we’re invalidating in the callback). But it wouldn’t be the end of the world if they did. Really appreciate the input and knowledge sharing! 😊

@racevedoo (Contributor, Author)

just updated the code to use Runnable::run 😄

@kbendick (Contributor) left a comment

Thank you for quickly reporting this issue and then also finding a fix.

I've been out of office due to the holidays. In addition to the unit test, have you verified that this change actually fixes the issue that you reported (e.g. compile it and use the jar from your local build)? That would be the best form of verification, but given the reproduction test case in #3798, that's not required.

Aside from a few small nits, I'm +1. Thank you @racevedoo for reporting the issue and for the quick repro and patch!

   @Override
-  public void delete(TableIdentifier tableIdentifier, Table table, RemovalCause cause) {
+  public void onRemoval(TableIdentifier tableIdentifier, Table table, RemovalCause cause) {
     identLoggingRemovalListener.onRemoval(tableIdentifier, table, cause);
@kbendick (Contributor), Dec 24, 2021

Nit: Instead of relying on identLoggingRemovalListener#onRemoval, as we're already inside of an onRemoval function, would it make sense to cut out the mental overhead and just add the log statement here directly? The double onRemoval was odd to me at first glance, and it is added overhead for the reader.

EDIT: As mentioned elsewhere, identLoggingRemovalListener is no longer needed (we only added it to log about cache expiration). Realistically, logging has additional overhead and Caffeine logs on its own as well. How helpful was this specific log when you were debugging? If the log message doesn't seem critical, I think the class should look as follows and avoid the extra logging that was added when the MetadataTableInvalidatingCacheWriter was introduced:

  /**
   * RemovalListener class for removing metadata tables when their associated data table is expired
   * via cache expiration.
   */
  class MetadataTableInvalidatingRemovalListener implements RemovalListener<TableIdentifier, Table> {
    @Override
    public void onRemoval(TableIdentifier tableIdentifier, Table table, RemovalCause cause) {
      if (RemovalCause.EXPIRED.equals(cause) && !MetadataTableUtils.hasMetadataTableName(tableIdentifier)) {
        tableCache.invalidateAll(metadataTableIdentifiers(tableIdentifier));
      }
    }
  }

@racevedoo (Contributor, Author)

removed identLoggingRemovalListener

Arrays.stream(metadataTables(tableIdent)).forEach(metadataTable ->
Assert.assertFalse("When a data table expires, its metadata tables should expire regardless of age",
catalog.cache().asMap().containsKey(metadataTable)));
// Removal of metadata tables from cache is async, use awaitility
Contributor

Nit: This comment seems unnecessary, as we're using awaitility immediately below.

@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch from 76f9667 to 867650c Compare December 24, 2021 09:39
@racevedoo (Contributor, Author) commented Dec 24, 2021

Oops, I missed some of the comments/edits.

@kbendick from my perspective, the proper fix is indeed upgrading to Caffeine 2.9.x and using evictionListener. That would get rid of Awaitility and the cleanUp calls. I still have to check whether the test in #3798 passes, though.

I'll try to work on this today.

@racevedoo (Contributor, Author)

Actually, evictionListener seems to be called only within the cleanUp task. I'm not sure if I understood correctly, but I guess we should move forward with the upgrade to Caffeine 2.8.5 and keep the cleanUp in cache() until we investigate this in more detail.

@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch from 867650c to 149b5b2 Compare December 24, 2021 13:41
@racevedoo (Contributor, Author)

Caffeine upgraded to 2.8.5

I've been out of office due to the holidays. In addition to the unit test, have you verified that this change actually fixes the issue that you reported (e.g. compile it and use the jar from your local build)? That would be the best form of verification, but given the reproduction test case in #3798, that's not required.

I'm out of office too, so I haven't verified if this change fixes the issue, but I guess the unit test is good enough

@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch from 149b5b2 to 2bda14f Compare December 24, 2021 13:52
@racevedoo racevedoo requested a review from kbendick December 24, 2021 13:53
@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch 2 times, most recently from d6a74f3 to de72d8d Compare December 24, 2021 14:07
@racevedoo (Contributor, Author)

Sorry for all the force-pushes. The main changes/discoveries since the previous state are:

  • Upgraded to Caffeine 2.9.x
  • Added the test from Core: add test for deadlock in CachingCatalog #3798 (I guess we can close that PR if the test is fine)
  • Kept using removalListener, since evictionListener is synchronous and also causes the deadlock (the test fails)
  • Kept Awaitility, since removalListener is async

@kbendick (Contributor)

Thanks for the follow up @racevedoo! No worries on the force pushes, just be sure to rebase off of master when needed.

Also, @racevedoo and I are syncing up offline about some of the changes in the tests. Will comment here with more detail later. 🙂

@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch 2 times, most recently from f8aa34d to 181022c Compare December 25, 2021 12:29
@kbendick (Contributor) left a comment

This looks great, thank you so much @racevedoo!

Just a few minor nits.

@kbendick kbendick self-requested a review December 25, 2021 22:02
@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch from 181022c to cca7f51 Compare December 26, 2021 00:38
@kbendick (Contributor) left a comment

+1 this looks great. Thank you @racevedoo for the report and the quick fix. Hugely appreciated, especially during the holidays.

@RussellSpitzer (Member) left a comment

Had a few questions, but in general this looks good to me. I'll need some time after the break to really dig into the Caffeine library, but invalidating the cache in the removal listener seems to make more sense to me.

}

@Test
@Ignore("reproduces https://github.com/apache/iceberg/issues/3791")
Member

This test should pass now, right? So should we turn it on?

@racevedoo (Contributor, Author)

That's an open question. I think we should enable it, as it runs relatively fast (~2 seconds) and we would get the safety of catching the deadlock issue if it ever comes back.

I guess @kbendick thinks differently though

Contributor

I figured we'd turn it off normally since it uses a lot of threads. I was hoping that we might come up with a system of tagging tests as resource intensive, etc., that we run nightly or something.

I've seen a handful of tests lately that start many threads and that seems not great for CI.

But if we're ok with running this on every push, we can remove the Ignore. We could either open an issue for flagging expensive tests, or hold off on that since we might not be there yet.

But tests that spawn 20+ threads make me nervous about increasing oddities in CI like HMS timeouts etc.
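One possible shape for such tagging, sketched with JUnit 4 categories (the ResourceIntensiveTests marker interface and test names here are hypothetical, not something this PR adds):

  import org.junit.Test;
  import org.junit.experimental.categories.Category;

  // Hypothetical marker interface for tests that should only run in a nightly / extended CI job.
  interface ResourceIntensiveTests {
  }

  public class CachingCatalogStressTest {
    @Test
    @Category(ResourceIntensiveTests.class)
    public void testConcurrentExpirationDoesNotDeadlock() {
      // spawn the worker threads and hammer the cache here
    }
  }

The default test task could then exclude that category and a nightly job include it.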

@kbendick (Contributor), Dec 28, 2021

If it's just 2 seconds, I'd say enable it. We share infra with all of ASF, but arguably we have a lot of Spark tests in this repo that should be refactored that are more resource intensive than this. The risk of thread thrashing is likely minimal and we can deal with it if it comes up.

@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch 2 times, most recently from 0d94efd to 1f54cd8 Compare December 27, 2021 17:22
@kbendick (Contributor) commented Dec 28, 2021

Had a few questions, but in general this looks good to me. I'll need some time after the break to really dig into the Caffeine library, but invalidating the cache in the removal listener seems to make more sense to me.

Agreed on digging into Caffeine. I've begun doing that as well and plan to upgrade our version after this PR. It is ubiquitous in the rest of our stack too so it's worth knowing well.

But given that the snapshot currently can deadlock under common usage patterns, I'd prefer to ship this and then revisit it if need be. If the deadlock doesn't happen anymore with the snapshot, that will be great.

Then we'll just need to revisit the choice of executor with more knowledge (possibly making it optionally asynchronous or something). But the deadlock seems important to fix as soon as possible.

   @Override
-  public void delete(TableIdentifier tableIdentifier, Table table, RemovalCause cause) {
+  public void onRemoval(TableIdentifier tableIdentifier, Table table, RemovalCause cause) {
     LOG.debug("Evicted {} from the table cache ({})", tableIdentifier, cause);
Contributor

Given that this inner class is the only thing that uses the Logger, should we make the call to LoggerFactory.getLogger here instead?

@racevedoo (Contributor, Author)

static declarations are not supported in inner classes :(

HadoopCatalog underlyingCatalog = hadoopCatalog();
TestableCachingCatalog catalog = TestableCachingCatalog.wrap(underlyingCatalog, Duration.ofSeconds(1), ticker);
Namespace namespace = Namespace.of("db", "ns1", "ns2");
int numThreads = 20;
@kbendick (Contributor), Dec 28, 2021

Would it make sense to use fewer threads, so the key space is smaller and collisions (and hence the deadlock) are more likely? And then iterate a few extra times like you already do (so possibly switching to a different fixed thread pool)? This way the Random.nextInt calls are more likely to collide and we don't have to spawn so many threads.

Or even just create two tables and then call Random.nextInt(2) 20 times, which is highly likely to use the same value twice in a row.

@ben-manes

Can you hack the hashCode (e.g. make it constant) to coerce collisions?

Contributor

I don't mean to take up too much of your time, but you mean for the cache keys (the TableIdentifiers)?

Would it be equivalent to just make one cache entry and only operate on that? That would make dropping the table (cache entry) free.

@ben-manes

Yes, the key. The map locks on a hash bin, so locking multiple bins in an unpredictable order is a classic deadlock case. I don't know if a single entry would suffice or if you need threads performing ABA vs BAB to get the desired test case.
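A sketch of the constant-hashCode idea (a hypothetical test-only key wrapper, not part of this PR): every instance hashes to the same bin of the backing ConcurrentHashMap, so concurrent operations on different keys contend on the same bin lock and the deadlock becomes much easier to reproduce.

  // Hypothetical test-only key: a constant hashCode forces all keys into one hash bin.
  final class CollidingKey {
    private final String name;

    CollidingKey(String name) {
      this.name = name;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof CollidingKey && ((CollidingKey) other).name.equals(name);
    }

    @Override
    public int hashCode() {
      return 0; // constant on purpose
    }
  }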

@racevedoo (Contributor, Author)

The point here is to force cache misses (based on my tests). I changed this a little to get rid of the randomness, making cache misses more certain.

@RussellSpitzer (Member) left a comment

The fix looks good to me. I think we should aim for a more deterministic test; faking the hash value seems pretty doable, but if that's not possible I am fine with the current state. Let's get that test cleaned up a bit and make sure it doesn't leave too much junk behind when it's done.

Uses Caffeine's `RemovalListener` to expire metadata tables, avoiding modification of cache entries inside the HashMap's `compute` functions (which causes deadlocks).

Also changes Caffeine's executor so the `RemovalListener` runs synchronously.
For more details, check apache#3791

Fixes apache#3791

Co-authored-by: Kyle Bendickson <kjbendickson@gmail.com>
@racevedoo racevedoo force-pushed the fix-caching-catalog-deadlock branch from 1f54cd8 to 4b22c97 Compare December 28, 2021 18:06
@racevedoo (Contributor, Author)

The fix looks good to me. I think we should aim for a more deterministic test; faking the hash value seems pretty doable, but if that's not possible I am fine with the current state. Let's get that test cleaned up a bit and make sure it doesn't leave too much junk behind when it's done.

I changed the test a little to remove the randomness and clean up the created tables 😄

@rdblue rdblue merged commit 63aa349 into apache:master Dec 29, 2021
@rdblue (Contributor) commented Dec 29, 2021

Looks good. Thanks for getting this done, @racevedoo!

Successfully merging this pull request may close these issues: Deadlock when using CachingCatalog with Iceberg 0.13.0-SNAPSHOT