Tobira seems to freeze randomly, not responding to HTTP requests at all anymore #1129
Comments
I skimmed the changelogs for tokio and hyper (starting from the version used in Tobira v2.5). The following issues are potentially relevant. But I haven't looked into any of those in detail:
FWIW: I saw that problem once or twice before we upgraded, at least if it is the problem we saw at our meeting last week and discussed with Piri and Waldemar.
Fixes elan-ev#1129

The "frozen Tobira" bug was caused by the following: we had two in-memory hashmaps that cached different things. Each entry in these caches has a timestamp and is evicted/ignored once it is too old. The eviction is done by a long-running maintenance task that calls `retain` on the cache. Of course, the cache is also read. In the case of the user cache, these steps are regularly performed:

1. Does an entry for this username exist? If yes:
2. Is this entry too old, or does it differ from the new data? If yes:
3. Write the new data to the database.
4. Update the timestamp in the cache entry to mark it as "brand new".

The tricky part is this: step 3 contains an `.await` point (waiting for the DB to reply). At that point, the Tokio runtime will pause that task and run a different task on the same thread. If that different task happens to be the maintenance task that calls `retain` on the hashmap, then we are in trouble: the documentation of `retain` says:

> May deadlock if called when holding any sort of reference into the map

And in [this issue](xacrimon/dashmap#233) it is clarified that this means:

> May deadlock if called when **the current thread is** holding any sort of reference into the map.

That is usually not a problem in non-async multi-threaded code, but in async code, where a lock/reference can be held across an await point, it can happen that the same thread holds a lock and tries to call `retain`. The solution is to not do that: avoid holding locks across await points (good advice in general anyway). However, that isn't trivial in our case, and much more importantly: it is a subtle mistake that is very easy to make again. I don't want bugs like this cropping up accidentally. I'm sure there is a clippy lint for that, but still, it feels wrong to me. So I decided to use a different library that does not have this problem. It's not yet as widely used, but it seems very promising.

The auth callback cache could also use `HashIndex`, which might be faster. But for simplicity I kept both as `HashMap` for now. Should be plenty fast.
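To make the failure mode concrete, here is a minimal, self-contained sketch of the problematic pattern. It is not the actual Tobira code: `CacheEntry`, `write_to_db`, and the key are made up for illustration. Run on a single-threaded Tokio runtime, it hangs forever:

```rust
use std::time::{Duration, Instant};

use dashmap::DashMap;

// Hypothetical cache entry, just for illustration; not the actual Tobira type.
struct CacheEntry {
    last_written: Instant,
}

// Stand-in for the DB write in step 3; the only relevant part is the `.await`.
async fn write_to_db() {
    tokio::time::sleep(Duration::from_millis(50)).await;
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let cache: DashMap<String, CacheEntry> = DashMap::new();
    cache.insert("some-user".into(), CacheEntry { last_written: Instant::now() });

    // Simulates the user-cache update: a reference into the map is held
    // across the `.await` of the DB write.
    let update = async {
        if let Some(mut entry) = cache.get_mut("some-user") {
            write_to_db().await; // suspended here, guard still held
            entry.last_written = Instant::now();
        }
    };

    // Simulates the maintenance task: evict entries that are too old. If it
    // runs on the same thread while the guard above is alive, `retain`
    // blocks the thread forever.
    let maintenance = async {
        cache.retain(|_, entry| entry.last_written.elapsed() < Duration::from_secs(3600));
    };

    tokio::join!(update, maintenance);
    println!("never reached");
}
```

The first future is suspended at the `.await` while its `RefMut` guard is still alive; when the second future then calls `retain` on the same thread, `retain` blocks that thread waiting for the guard, which can now never be dropped.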
Fixes #1129

See the first commit description for the technical details on how this was caused. But the gist is: I incorrectly used `DashMap`, holding locks across await points. This causes a deadlock if the timing is right and two specific tasks are scheduled to run on the same thread. I could have fixed the code around the await point, but since this is a very easy and subtle mistake to make, I decided to use a different concurrent hashmap that can better deal with these scenarios. The second commit also fixes the cache dealing with a cache duration of 0 (which is supposed to disable the cache). See commits for more details.

On the ETH test system I deployed v2.6 plus this patch and verified that the freeze is not happening anymore in the only situation where I could reliably reproduce it before: starting Tobira and immediately making an authenticated request. Since the cache_duration was set to 0, the timing somehow worked out most of the time. Doing that does not freeze Tobira anymore with this patch (I tried several times).

---

<details>
<summary>Some additional tests/details</summary>

The `scc` hashmap has no problem when a lock is held on the same thread that `retain` is called on. I tested it like this:

```rust
use std::{sync::Arc, time::{Duration, Instant}};

use scc::HashMap;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let start = Instant::now();
    let map = HashMap::new();
    let _ = map.insert_async("foo", 4).await;
    let _ = map.insert_async("bar", 27).await;
    let map = Arc::new(map);

    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            println!("--- {:.2?} calling entry", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry", start.elapsed());
        });
    }

    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_millis(1500)).await;
            println!("--- {:.2?} calling entry 2", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry 2", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry 2", start.elapsed());
        });
    }

    tokio::time::sleep(Duration::from_millis(1000)).await;
    println!("--- {:.2?} calling retain", start.elapsed());
    map.retain_async(|_, v| *v % 2 != 0).await;
    println!("--- {:.2?} done retain", start.elapsed());
}
```

This outputs:

```
--- 31.88µs calling entry
--- 38.91µs acquired entry
--- 1.00s calling retain
--- 1.50s calling entry 2
--- 3.00s done with entry
--- 3.00s acquired entry 2
--- 6.00s done with entry 2
--- 6.00s done retain
```

Of course a single test doesn't guarantee that this is supported by the library, but the docs also indicate that the library can deal with these situations: "async" is used everywhere, and these kinds of situations regularly occur in async code.

Edit: thinking about it more, I suppose the key feature here is that `retain_async` can yield, whereas `retain` from dashmap could only block when it couldn't make any progress.

</details>
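For completeness, here is a rough sketch of how the user-cache steps from the commit message look on top of `scc::HashMap`. The types, field names, and durations are made up, and it only uses the methods exercised in the test above plus `retain_async` for the eviction task; it is not the actual Tobira code:

```rust
use std::time::{Duration, Instant};

use scc::HashMap;

const CACHE_DURATION: Duration = Duration::from_secs(10 * 60);

// Hypothetical cached user data; the real entry stores more fields.
#[derive(Clone, PartialEq)]
struct UserData {
    display_name: String,
}

struct CacheEntry {
    data: UserData,
    // `None` means "inserted, but never written to the DB yet".
    last_written: Option<Instant>,
}

// Stand-in for the actual DB write; only the `.await` matters here.
async fn write_user_to_db(_data: &UserData) {
    tokio::time::sleep(Duration::from_millis(10)).await;
}

// Steps 1-4 from the commit message. The entry guard is held across the
// `.await` of the DB write, which was the dangerous part with `DashMap`.
async fn upsert_user(cache: &HashMap<String, CacheEntry>, username: &str, new: UserData) {
    let mut entry = cache.entry_async(username.to_owned()).await.or_insert(CacheEntry {
        data: new.clone(),
        last_written: None,
    });
    let e = entry.get_mut();
    let needs_write = match e.last_written {
        None => true,
        Some(t) => t.elapsed() > CACHE_DURATION || e.data != new,
    };
    if needs_write {
        write_user_to_db(&new).await; // 3. write the new data to the database
        e.data = new;
        e.last_written = Some(Instant::now()); // 4. mark the entry as "brand new"
    }
}

// The long-running maintenance task evicts old entries; `retain_async` can
// yield instead of blocking the thread.
async fn evict_old(cache: &HashMap<String, CacheEntry>) {
    cache
        .retain_async(|_, e| e.last_written.map_or(true, |t| t.elapsed() < CACHE_DURATION))
        .await;
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let cache = HashMap::new();
    upsert_user(&cache, "some-user", UserData { display_name: "Some User".into() }).await;
    evict_old(&cache).await;
}
```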
See the linked PR for an in-depth explanation of the cause. To add some more useful detail:
From ETH and Bern we got reports of Tobira randomly stopping to work: HTTP requests are not responded to at all anymore, and no more log output is produced. The process still runs, but it seems like it's not doing anything anymore. The log does not contain any relevant entries.
Bern is running v2.5 (where it first occurred AFAIK), while ETH is running v2.6 (and they apparently did not see that problem on v2.5). Both are running it behind an nginx. From ETH I heard that it could happen immediately, or after hours or days. Bern reported it failing on roughly 26.02. 17:27, 22.02. 21:10 and 10.02. 16:53.
Playing around on the ETH system right now, I could somewhat reliably cause the problem to occur by immediately refreshing the page after restarting Tobira. I couldn't yet identify a relevant HTTP request, but assume that it's a GraphQL request. The `index.html` was almost (I think?) delivered correctly. The problem also occurred with the -musl build of v2.6. ETH observed that the memory usage as reported by `systemctl status` was at only 784KB for the frozen Tobira, but 50MB for a normally running one. I did not see those numbers in my experiments, but it might still be useful information. The frozen process does not consume any noticeable CPU time (i.e. not a full core, for example).

I now deployed a MUSL build of 97d28b2, which is just v2.6 + #1120. There we update Tokio and hyper, two prime candidates for causing things like that. So far I could not reproduce the problem, I think? I will see if it continues to work for a few days.
I have not yet gone through the changelogs of Tokio and hyper, but will do so soon. Other things I have planned to debug this:
- Use `strace` and friends to see what's happening.