Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock in auth cache (solves Tobira freezing problem) #1141

Merged
merged 2 commits into from
Mar 12, 2024

Conversation

LukasKalbertodt
Copy link
Member

@LukasKalbertodt LukasKalbertodt commented Mar 12, 2024

Fixes #1129

See first commit description for the technical details on how this was caused. But the gist is: I incorrectly used DashMap, holding locks across await points. This causes a deadlock if the timing is right and two specific tasks are scheduled to run in the same thread. I could have fixed the code around the await point, but since this is a very easy and subtle mistake to make, I decided to use a different concurrent hashmap that can better deal with these scenarios.

The second commit also fixes the cache dealing with a 0 cache duration (which is supposed to disable the cache). See commits for more details.

On the ETH test system I deployed v2.6 plus this patch and verified that the freeze is not happening anymore in the only situation where I could reliably reproduce it before: starting Tobira and immediately making an authenticated request. Since the cache_duration was set to 0, the timing somehow worked out most of the time. Doing that does not freeze Tobira anymore with this patch (I tried several times).


Some additional tests/details

The scc hashmap has no problem when a lock is held on the same thread that retain is called. I tested it like this:

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let start = Instant::now();

    let map = HashMap::new();
    let _ = map.insert_async("foo", 4).await;
    let _ = map.insert_async("bar", 27).await;
    let map = Arc::new(map);

    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            println!("--- {:.2?} calling entry", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry", start.elapsed());
        });
    }

    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_millis(1500)).await;
            println!("--- {:.2?} calling entry 2", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry 2", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry 2", start.elapsed());
        });
    }

    tokio::time::sleep(Duration::from_millis(1000)).await;
    println!("--- {:.2?} calling retain", start.elapsed());
    map.retain_async(|_, v| *v % 2 != 0).await;
    println!("--- {:.2?} done retain", start.elapsed());
}

This outputs:

--- 31.88µs calling entry
--- 38.91µs acquired entry
--- 1.00s calling retain
--- 1.50s calling entry 2
--- 3.00s done with entry
--- 3.00s acquired entry 2
--- 6.00s done with entry 2
--- 6.00s done retain

Of course a single test doesn't guarantee that this is supported by the library, but all docs indicate as well that the library can deal with these situations. "async" is used everywhere and these kinds of situations regularly occur in async code.

Edit: thinking about it more, I suppose the key feature here is that retain_async can yield, whereas the retain from dashmap could only block when it couldn't make any progress.

Fixes elan-ev#1129

The "frozen Tobira" bug was caused by the following: we had two
in-memory hashmaps that cached different things. Each entry in that
cache has a time stamp and is evicted/ignored once it is too old. The
eviction is done by a long running task, that calls `retain` on the
cache. Of course, the cache is also read. In the case of the user cache
these steps are regularly performed:
1. Does an entry for this username exists? If yes:
2. Is this entry too old or differs from the new data? If yes:
3. Write new data to database.
4. Update the timestamp in the cache entry to mark it as "brand new".

This tricky part is this: step 3 contains an `.await` point (waiting for
the DB to reply). At that point, the Tokio runtime will pause that task
and run a different task on that same thread. If that different task
happens to be the maintenance task that calls `retain` on the hashmap,
then we are in trouble: the documentation of `retain` says:

> May deadlock if called when holding any sort of reference into the map

And in [this issue](xacrimon/dashmap#233) it
is specified, that this means:

> May deadlock if called when **the current thread is** holding any sort
> of reference into the map.

That is usually not a problem in non-async multi-threaded code, but in
async code when a lock/reference is held across an await point, then it
can happen that the same thread holds a lock and tries to call `retain`.

The solution is to not do that: avoid holding locks across await points
(a good advice generally anyway). However, that isn't trivial in our
case, but much more importantly: doing that is a subtle mistake! I don't
want bugs like this cropping up accidentally again. I'm sure there is
a clippy lint for that, but still, it feels wrong to me. So I decided
to use a different library that does not have this problem. It's not yet
as widely used but it seems very promising.

The auth callback cache could also use `HashIndex`, which might be
faster. But for simplicity I kept both `HashMaps` now. Should be plenty
fast.
At some point I said "set duration to 0 to disable cache", but only
added appropriate `if` conditions in one place. I forgot to update this
maintenance task. To avoid running into this problem again, I added a
new type that explicitly encodes "disabled" instead of using the magic
value 0 for it, which one can easily forget checking.
@LukasKalbertodt LukasKalbertodt added the changelog:admin Changes primarily for admins label Mar 12, 2024
@github-actions github-actions bot temporarily deployed to test-deployment-pr1141 March 12, 2024 10:31 Destroyed
Copy link
Member

@owi92 owi92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extensive comment on the first commit! Without it I wouldn't understand too much of this tbh. But with that, I'd say this looks good to me™.

@owi92 owi92 merged commit 4a8e304 into elan-ev:master Mar 12, 2024
4 checks passed
@LukasKalbertodt LukasKalbertodt deleted the fix-deadlock branch March 12, 2024 11:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog:admin Changes primarily for admins
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tobira seems to freeze randomly, not responding to HTTP requests at all anymore
2 participants