Clearing LFU cache doesn't actually clear it #399

Closed
provegard opened this issue Sep 19, 2023 · 9 comments · Fixed by #401
Labels
bug Something isn't working

Comments

@provegard

I have a situation where clearing the cache doesn't work. I can reproduce it with the following test case:

    [Test]
    public async Task TestClear()
    {
        var cache = new ConcurrentLfuBuilder<string, string>()
            .WithScheduler(new BackgroundThreadScheduler())
            .WithCapacity(40)
            .WithAtomicGetOrAdd()
            .AsAsyncCache()
            .Build();

        var emptyCount = 0;
        for (var a = 0; a < 200; a++)
        {
            for (var i = 0; i < 10; i++)
            {
                var str = "a" + i;
                await getOrAdd(str);
            }

            cache.Clear();
            cache.Clear(); // attempt to trigger maintenance after previous clear operation

            emptyCount += cache.Count;
        }

        Assert.That(emptyCount, Is.EqualTo(0));

        async Task getOrAdd(string key)
        {
            await cache.GetOrAddAsync(key, Task.FromResult);
        }
    }

When I run this test, it fails most of the time but occasionally succeeds. When it fails, emptyCount varies from 2 to 7.

When I debugged the real situation, I ended up in ConcurrentLfu.Trim. Here, I noticed that:

  • this.Count was 3.
  • this.probationLru had something like 26 nodes in it - but it looked like multiple nodes had the same key.
  • The repeated calls to TakeCandidatesInLruOrder produced 3 candidates.
  • TryRemove failed to remove the candidates.

Debugging from the test has the same pattern - TryRemove fails since this.dictionary doesn't contain the candidate, so it appears to be out of sync.

I know that there is some sort of async maintenance going on (that's why I try to force maintenance in the test with the second Clear), but in the application where the cache is used, the entries don't go away even after a while - the count is stuck at non-zero after clearing.

@provegard
Author

Update: I tested with ForegroundScheduler, and that makes the test green, of course. That means the test doesn't reflect the real-world situation, where BackgroundThreadScheduler apparently never gets around to cleaning things up.

@provegard
Author

I would have thought that calling Clear multiple times would work, since Clear calls Maintenance internally, which is also what the background scheduler does.

@bitfaster
Owner

You beat me to it - I was going to try with ForegroundScheduler.

I think the reason ForegroundScheduler works and BackgroundThreadScheduler fails (even when Clear is called twice) is a bug: Clear immediately purges the write buffer, which contains the removal records just added by TryRemove, so those records are dropped before they have been processed in the background. ForegroundScheduler works because TryRemove is fully completed before the write buffer is cleared.

What isn't clear to me is how the unit test for this can pass. I will debug this in detail.
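
To make the ordering concrete, here is a minimal sketch of the mechanism described above. This is not the actual ConcurrentLfu code and all type and member names are illustrative: removals are recorded in a write buffer so that maintenance can unlink the matching LRU nodes later, and if Clear drops that buffer before maintenance has drained it, those nodes become orphans.

    // Conceptual model only - not the ConcurrentLfu source; names are illustrative.
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    class WriteBufferedCacheModel<K, V> where K : notnull
    {
        private readonly ConcurrentDictionary<K, V> dictionary = new();
        private readonly LinkedList<K> probationLru = new();     // nodes owned by maintenance
        private readonly ConcurrentQueue<K> writeBuffer = new(); // pending removal records

        public bool TryRemove(K key)
        {
            if (!dictionary.TryRemove(key, out _)) return false;
            writeBuffer.Enqueue(key); // the LRU node is unlinked later, during maintenance
            return true;
        }

        public void Maintenance()
        {
            while (writeBuffer.TryDequeue(out var key))
                probationLru.Remove(key); // unlink the node belonging to the removed key
        }

        // Buggy ordering: dropping the write buffer before maintenance has drained it
        // leaves LRU nodes for keys that are already gone from the dictionary.
        public void BuggyClear()
        {
            foreach (var key in dictionary.Keys) TryRemove(key);
            writeBuffer.Clear(); // pending removal records lost -> orphan LRU nodes
        }
    }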

@bitfaster
Owner

I looked at this some more yesterday - the code is wrong and leaves orphan nodes in the window/probation/protected LRUs. These are the bogus items you observed in the probation LRU.

The unit test I have passes because it only validates the count and whether the removed keys exist, and both of those are based on the internal dictionary. Clear internally calls TryRemove, which immediately removes the key from the dictionary (so the test criteria are met and the unit test passes). TryRemove also stores a record of the removal in the write buffer, so the LRUs can be cleaned up by maintenance later. But since Clear immediately drops the buffers before maintenance runs, orphans are left behind in the window/probation/protected LRUs. These are not observable by the test when performing only a single iteration of Clear.

In your test, it is possible for background maintenance to run before Clear drops the write buffer, so it can pass sometimes. If maintenance doesn't run, orphans are left behind and subsequent iterations build up more of them. Since Clear attempts to Trim n cache entries, orphans with duplicate keys are taken as candidates ahead of the valid items that sit beyond the first n positions in the list. For example, say the cache has size 3 and contains keys [a, b, c], but the probation LRU has accumulated orphans of key a and so contains [a, a, a, b, c]. Trim will take n=3 candidates from probation and then attempt to remove key a three times. The next time around, the orphans may all have been cleared and Trim produces a valid result, so it works intermittently.
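
A tiny simulation of that Trim scenario, with illustrative names and an assumed take-n-candidates behaviour based on the description above:

    // Simulates Trim taking n = Count candidates from a probation list whose head
    // is filled with orphan nodes for key "a". Names and behaviour are illustrative.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    class TrimExample
    {
        static void Main()
        {
            var dictionary = new HashSet<string> { "a", "b", "c" };        // live entries, Count = 3
            var probation = new List<string> { "a", "a", "a", "b", "c" };  // orphans of "a" at the head

            // Trim takes n = Count candidates in LRU order: ["a", "a", "a"]
            var candidates = probation.Take(dictionary.Count).ToList();

            // It then tries to remove each candidate: "a" is removed once, the other
            // two attempts fail, and "b" and "c" are never considered.
            foreach (var key in candidates)
                Console.WriteLine($"TryRemove({key}) => {dictionary.Remove(key)}");

            Console.WriteLine($"Count after Trim: {dictionary.Count}"); // 2, not 0
        }
    }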

I will fix this so that the write buffer items are not dropped. I will also implement an integrity check in the tests to validate that the internal data structures are in a consistent state, to prevent re-introducing this bug. Last month I added some soak tests for ConcurrentLru (multi-threaded tests that interleave different read/write method pairs); I will also get this done for LFU and check for any other similar bugs.
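
For reference, the invariant such an integrity check would assert could look roughly like the sketch below. The real ConcurrentLfu internals are not public, so the key collections passed in here are assumed test-only accessors for the window/probation/protected LRUs.

    // Hypothetical integrity check - assumes test-only access to the LRU contents.
    using System.Collections.Generic;
    using System.Linq;
    using NUnit.Framework;

    static class LfuIntegrity
    {
        public static void AssertConsistent<K, V>(
            IDictionary<K, V> dictionary,
            IEnumerable<K> windowKeys,
            IEnumerable<K> probationKeys,
            IEnumerable<K> protectedKeys)
        {
            var lruKeys = windowKeys.Concat(probationKeys).Concat(protectedKeys).ToList();

            // every LRU node must map to a live dictionary entry (no orphans)
            Assert.That(lruKeys.All(dictionary.ContainsKey), Is.True);

            // no key may appear in more than one LRU node (no duplicates)
            Assert.That(lruKeys.Count, Is.EqualTo(lruKeys.Distinct().Count()));

            // total LRU node count must equal the dictionary count
            Assert.That(lruKeys.Count, Is.EqualTo(dictionary.Count));
        }
    }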

@bitfaster added the bug (Something isn't working) label Sep 25, 2023
@khellang

khellang commented Sep 29, 2023

Hey @bitfaster! 👋🏻

I think we're bumping up against the same bug. Can you confirm this also affects ConcurrentLru<K, V>?

Is there a workaround? Looping through entries and calling TryRemove?

We're using v2.2.1.

UPDATE: Looping with TryRemove seems to leave the cache in a broken state as well 😅
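
Presumably the workaround being attempted looks something like the sketch below. It assumes the cache is enumerable as KeyValuePair<K, V> entries and exposes TryRemove(key); per the update above, it still left the cache in a broken state.

    // Sketch of the "loop and TryRemove" workaround - interface details assumed.
    using System.Collections.Generic;
    using System.Linq;
    using BitFaster.Caching;

    static class ClearWorkaround
    {
        public static void ClearByRemoving<K, V>(ICache<K, V> cache)
        {
            // snapshot the keys first so the cache isn't mutated while enumerating
            var keys = cache.Select(kvp => kvp.Key).ToList();

            foreach (var key in keys)
            {
                cache.TryRemove(key);
            }
        }
    }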

@khellang

khellang commented Sep 29, 2023

Moved repro to #402


@bitfaster
Owner

bitfaster commented Sep 29, 2023

@khellang I created a new issue for ConcurrentLru here #402, and copied over all your notes.
