Fix deadlock in auth cache (solves Tobira freezing problem) #1141

LukasKalbertodt · 2024-03-12T10:26:26Z

See first commit description for the technical details on how this was caused. But the gist is: I incorrectly used DashMap, holding locks across await points. This causes a deadlock if the timing is right and two specific tasks are scheduled to run in the same thread. I could have fixed the code around the await point, but since this is a very easy and subtle mistake to make, I decided to use a different concurrent hashmap that can better deal with these scenarios.

The second commit also fixes the cache dealing with a 0 cache duration (which is supposed to disable the cache). See commits for more details.

On the ETH test system I deployed v2.6 plus this patch and verified that the freeze is not happening anymore in the only situation where I could reliably reproduce it before: starting Tobira and immediately making an authenticated request. Since the cache_duration was set to 0, the timing somehow worked out most of the time. Doing that does not freeze Tobira anymore with this patch (I tried several times).

Some additional tests/details

The scc hashmap has no problem when a lock is held on the same thread that retain is called. I tested it like this:

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let start = Instant::now();

    let map = HashMap::new();
    let _ = map.insert_async("foo", 4).await;
    let _ = map.insert_async("bar", 27).await;
    let map = Arc::new(map);

    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            println!("--- {:.2?} calling entry", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry", start.elapsed());
        });
    }

    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_millis(1500)).await;
            println!("--- {:.2?} calling entry 2", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry 2", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry 2", start.elapsed());
        });
    }

    tokio::time::sleep(Duration::from_millis(1000)).await;
    println!("--- {:.2?} calling retain", start.elapsed());
    map.retain_async(|_, v| *v % 2 != 0).await;
    println!("--- {:.2?} done retain", start.elapsed());
}

This outputs:

--- 31.88µs calling entry
--- 38.91µs acquired entry
--- 1.00s calling retain
--- 1.50s calling entry 2
--- 3.00s done with entry
--- 3.00s acquired entry 2
--- 6.00s done with entry 2
--- 6.00s done retain

Of course a single test doesn't guarantee that this is supported by the library, but all docs indicate as well that the library can deal with these situations. "async" is used everywhere and these kinds of situations regularly occur in async code.

Edit: thinking about it more, I suppose the key feature here is that retain_async can yield, whereas the retain from dashmap could only block when it couldn't make any progress.

Fixes elan-ev#1129 The "frozen Tobira" bug was caused by the following: we had two in-memory hashmaps that cached different things. Each entry in that cache has a time stamp and is evicted/ignored once it is too old. The eviction is done by a long running task, that calls `retain` on the cache. Of course, the cache is also read. In the case of the user cache these steps are regularly performed: 1. Does an entry for this username exists? If yes: 2. Is this entry too old or differs from the new data? If yes: 3. Write new data to database. 4. Update the timestamp in the cache entry to mark it as "brand new". This tricky part is this: step 3 contains an `.await` point (waiting for the DB to reply). At that point, the Tokio runtime will pause that task and run a different task on that same thread. If that different task happens to be the maintenance task that calls `retain` on the hashmap, then we are in trouble: the documentation of `retain` says: > May deadlock if called when holding any sort of reference into the map And in [this issue](xacrimon/dashmap#233) it is specified, that this means: > May deadlock if called when **the current thread is** holding any sort > of reference into the map. That is usually not a problem in non-async multi-threaded code, but in async code when a lock/reference is held across an await point, then it can happen that the same thread holds a lock and tries to call `retain`. The solution is to not do that: avoid holding locks across await points (a good advice generally anyway). However, that isn't trivial in our case, but much more importantly: doing that is a subtle mistake! I don't want bugs like this cropping up accidentally again. I'm sure there is a clippy lint for that, but still, it feels wrong to me. So I decided to use a different library that does not have this problem. It's not yet as widely used but it seems very promising. The auth callback cache could also use `HashIndex`, which might be faster. But for simplicity I kept both `HashMaps` now. Should be plenty fast.

At some point I said "set duration to 0 to disable cache", but only added appropriate `if` conditions in one place. I forgot to update this maintenance task. To avoid running into this problem again, I added a new type that explicitly encodes "disabled" instead of using the magic value 0 for it, which one can easily forget checking.

owi92

Thanks for the extensive comment on the first commit! Without it I wouldn't understand too much of this tbh. But with that, I'd say this looks good to me™.

LukasKalbertodt added 2 commits March 12, 2024 10:13

LukasKalbertodt added the changelog:admin Changes primarily for admins label Mar 12, 2024

github-actions bot temporarily deployed to test-deployment-pr1141 March 12, 2024 10:31 Destroyed

owi92 approved these changes Mar 12, 2024

View reviewed changes

owi92 merged commit 4a8e304 into elan-ev:master Mar 12, 2024
4 checks passed

LukasKalbertodt deleted the fix-deadlock branch March 12, 2024 11:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deadlock in auth cache (solves Tobira freezing problem) #1141

Fix deadlock in auth cache (solves Tobira freezing problem) #1141

LukasKalbertodt commented Mar 12, 2024 •

edited

Loading

owi92 left a comment

Fix deadlock in auth cache (solves Tobira freezing problem) #1141

Fix deadlock in auth cache (solves Tobira freezing problem) #1141

Conversation

LukasKalbertodt commented Mar 12, 2024 • edited Loading

owi92 left a comment

Choose a reason for hiding this comment

LukasKalbertodt commented Mar 12, 2024 •

edited

Loading