
Reproduce deadlock #3995

Closed
wants to merge 1 commit into from

Conversation

@douglaz (Contributor) commented Dec 29, 2023

This is an attempt to reproduce the deadlock issue that arose in #3989.

We can see it's possible to reproduce the issue with two small changes from master:

  • Changing the test DB from MemDatabase to RocksDb
  • Adding a task::yield_now() to commit_tx()

I also increased the timeout of the connection requests just to avoid noise in the logs due to the low 60s default timeout.

One way to see a test deadlocking for debug purposes is to run:

RUST_LOG=trace,hyper=info,soketto=info,fedimint_server::net::framed=debug cargo test --package fedimint-mint-tests --test fedimint_mint_tests -- sends_ecash_out_of_band --exact --nocapture

The goal of this PR is just to show that there is some issue in the Fedimint db layer that isn't exclusive to MemDatabase and that can be reproduced without adding new tests. Created an issue to track it: #3995

@douglaz changed the title from "chore: try to reproduce bug" to "Reproduce deadlock" on Dec 29, 2023
@@ -804,6 +804,7 @@ impl JsonRpcClient for WsClient {
        return WsClientBuilder::default()
            .use_webpki_rustls()
            .max_concurrent_requests(u16::MAX as usize)
            .request_timeout(Duration::from_secs(3600))
Member

Maybe this is just making it really slow?

Comment on lines +113 to +116
let db = Database::new(
    fedimint_rocksdb::RocksDb::open(tempfile::tempdir().unwrap()).unwrap(),
    decoders,
);
Member

I can't repro with only the client db changed (server still on MemDatabase).
I thought this was a client-side deadlock; maybe it is server-side?

@maan2003 (Member)

This could be a separate deadlock, not the same one as #3989.

@dpc (Contributor) commented Jan 16, 2024

This looks like MintClientModule::await_output_finalized subscribing to notifications, filtering out some Created and never getting any updates. Possibly a timing/race condition issue somewhere, not a db problem.

Will keep debugging.

@dpc (Contributor) commented Jan 17, 2024

I've pinpointed the problem to AlephBFT hanging in UnitSaver::flush:

impl std::io::Write for UnitSaver {
    fn write(&mut self, buffer: &[u8]) -> std::io::Result<usize> {
        self.buffer.extend_from_slice(buffer);
        Ok(buffer.len())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        info!(units_index = %self.units_index, "##### UNIT FLUSH START");
        block_on(async {
            let mut dbtx = self.db.begin_transaction().await;

            dbtx.insert_new_entry(&AlephUnitsKey(self.units_index), &self.buffer)
                .await;

            dbtx.commit_tx_result()
                .await
                .expect("This is the only place where we write to this key");
        });

        info!(units_index = %self.units_index, "##### UNIT FLUSH END");

        self.buffer.clear();
        self.units_index += 1;

        Ok(())
    }
}

It is using futures::executor::block_on, which in my understanding corresponds to some kind of loop polling the future repeatedly on the current thread, and I guess this is somehow incompatible with tokio::task::yield_now, which I guess expects to be running on the tokio runtime directly(?).

I'm pretty sure this block_on is incorrect, but it happens to work. I made a couple of quick attempts to fix it using different block_ons and block_in_place, but I only made things even worse, so I guess this requires some more debugging and careful analysis. It did remind me that async Rust is a waste of everyone's time, ruined an otherwise perfect Rust programming language, and should be avoided.

Anyway, I think it makes sense: the only thing this PR does is add yield_now, which blocks AlephBFT, so the submitted transaction is never processed and the test hangs.

The original purpose of this PR was to show that there are some problems with the database that are not about a wrong implementation of MVCC in MemDatabase, but I think the above goes to show that this is not the case (at least not here).

@douglaz (Contributor, Author) commented Jan 18, 2024

tokio::task::yield_now which I guess expects to be running on the tokio runtime directly(?).

Negative. yield_now is a very simple future implementation that will be pending once, then always ready after that. There is nothing tokio-specific there. This issue isn't an incompatibility between async runtimes, because there is only one runtime in this project: tokio.
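As an illustration of the point above (a hypothetical stand-in, not tokio's actual source), a future that is pending exactly once and ready forever after can be written in plain std Rust with no runtime dependency at all:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A future that is Pending exactly once, then Ready on every later poll,
/// mirroring the description of yield_now above.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask the executor to poll us again; without this wake, a
            // park-based block_on would sleep forever waiting for a wakeup.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// A do-nothing waker, just so we can poll by hand below.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn main() {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = YieldNow { yielded: false };
    let mut fut = Pin::new(&mut fut);
    assert!(fut.as_mut().poll(&mut cx).is_pending()); // first poll: yields
    assert!(fut.as_mut().poll(&mut cx).is_ready()); // every later poll: done
    println!("pending once, then always ready");
}
```

Whether yielding like this behaves well depends entirely on the executor polling it: the future itself is runtime-agnostic.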

@dpc (Contributor) commented Jan 19, 2024

Negative. yield_now is a very simple future implementation that will be pending once, then always ready after that. There is nothing tokio-specific there. This issue isn't an incompatibility between async runtimes, because there is only one runtime in this project: tokio.

Edit: the below is wrong, I think.

The tokio::task::yield_now is OK, but I don't think block_on is, because it explicitly "Runs a future to completion on the current thread." and the "current thread" is already one of tokio's executor threads, which most likely does not like being blocked like that. So that yield_now is getting scheduled on the task queue of an executor thread that is no longer scheduling (because it is blocked on a blocking block_on). There's a reason the tokio runtime has its own block_on(s), so probably that should be used instead in some form, but when I tried it, it wasn't working, and I think @maan2003 is hitting the same issues in #4054. Let's continue the discussion there.

The log line before that block_on is printed, the one after is not, the AlephBFT executor gets stuck, transactions are being submitted but not processed, and the test gets stuck. Whatever it is, we should fix it, and then we can continue reasoning about this PR.
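To make the hazard concrete, a park-based block_on in the spirit of futures::executor::block_on can be sketched in std-only Rust (mini_block_on and ThreadWaker are illustrative names, not the real crate's internals). The key property: between polls, the loop parks the calling thread, so if that thread is a runtime worker, nothing else scheduled on it can run until the future completes.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the thread blocked inside mini_block_on.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Poll the future to completion on the *current* thread, parking between
/// polls. Nothing else can run on this thread while we are in here.
fn mini_block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // The thread sleeps here until the waker unparks it. If the
            // pending future can only make progress on this same thread,
            // this is a deadlock.
            Poll::Pending => thread::park(),
        }
    }
}

fn main() {
    assert_eq!(mini_block_on(async { 40 + 2 }), 42);
    println!("mini_block_on completed a trivially ready future");
}
```

While the loop is parked, a runtime worker running this code cannot poll any other task, which is why blocking calls on executor threads are normally wrapped in something like tokio::task::block_in_place, which first hands the worker's other pending tasks off to another thread.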

@elsirion (Contributor)

I made a couple of quick attempts to fix it using different block_ons and block_in_place, but I only made things even worse

@dpc so you tried this approach I assume?

block_in_place(move || {
    block_on(async move {
        if let Err(e) = self.terminate().await {
            warn!(target: LOG_DEVIMINT, "failed to terminate lightningd: {e:?}");
        }
    })
});

@elsirion (Contributor)

I see, @maan2003 did and it helped? 28b0361 (or at least for that one PR)

@justinmoon (Contributor)

Still relevant?

@elsirion (Contributor)

No, this was related to the AlephBFT sync/async bug.

@elsirion closed this Feb 22, 2024

5 participants