
Reproduce deadlock #3995

Closed
wants to merge 1 commit into from

Conversation

@douglaz (Contributor) commented Dec 29, 2023

This is an attempt to reproduce the deadlock issue that arose in #3989.

We can see it's possible to reproduce the issue with two small changes from master:

  • Changing the test DB from MemDatabase to RocksDb
  • Adding a task::yield_now() to commit_tx()

I also increased the timeout of the connection requests just to avoid noise in the logs due to the low 60s default timeout.

One way to see a test deadlocking for debug purposes is to run:

RUST_LOG=trace,hyper=info,soketto=info,fedimint_server::net::framed=debug cargo test --package fedimint-mint-tests --test fedimint_mint_tests -- sends_ecash_out_of_band --exact --nocapture

The goal of this PR is just to show that there is some issue in the Fedimint db layer that isn't exclusive to MemDatabase and that can be reproduced without adding new tests. Created an issue to track it: #3995

@douglaz changed the title from "chore: try to reproduce bug" to "Reproduce deadlock" on Dec 29, 2023
@@ -804,6 +804,7 @@ impl JsonRpcClient for WsClient {
        return WsClientBuilder::default()
            .use_webpki_rustls()
            .max_concurrent_requests(u16::MAX as usize)
            .request_timeout(Duration::from_secs(3600))
Member

Maybe this is just making it really slow?

Comment on lines +113 to +116
let db = Database::new(
    fedimint_rocksdb::RocksDb::open(tempfile::tempdir().unwrap()).unwrap(),
    decoders,
);
Member

I can't repro with only the client db changed (server still on MemDatabase).
I thought this was a client-side deadlock; maybe it is server-side?

@maan2003 (Member)

This could be a separate deadlock, not the same one as #3989.

@dpc (Contributor) commented Jan 16, 2024

This looks like MintClientModule::await_output_finalized subscribing to notifications, filtering out some Created and never getting any updates. Possibly a timing/race condition issue somewhere, not a db problem.

Will keep debugging.

@dpc (Contributor) commented Jan 17, 2024

I've pinpointed the problem to AlephBFT hanging in UnitSaver::flush:

impl std::io::Write for UnitSaver {
    fn write(&mut self, buffer: &[u8]) -> std::io::Result<usize> {
        self.buffer.extend_from_slice(buffer);
        Ok(buffer.len())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        info!(units_index = %self.units_index, "##### UNIT FLUSH START");
        block_on(async {
            let mut dbtx = self.db.begin_transaction().await;

            dbtx.insert_new_entry(&AlephUnitsKey(self.units_index), &self.buffer)
                .await;

            dbtx.commit_tx_result()
                .await
                .expect("This is the only place where we write to this key");
        });

        info!(units_index = %self.units_index, "##### UNIT FLUSH END");

        self.buffer.clear();
        self.units_index += 1;

        Ok(())
    }
}

It is using futures::executor::block_on, which in my understanding corresponds to some kind of loop polling the future repeatedly on the current thread, and I guess this is somehow incompatible with tokio::task::yield_now, which I guess expects to be running on the tokio runtime directly(?).

I'm pretty sure this block_on is incorrect, but it happens to work. I made a couple of quick attempts to fix it using different block_ons and block_in_place, but I only made things even worse, so I guess this requires some more debugging and careful analysis. It did remind me that async Rust is a waste of everyone's time, ruined an otherwise perfect Rust programming language, and should be avoided.

Anyway, I think it makes sense: the only thing this PR does is add yield_now, which blocks AlephBFT, so the submitted transaction is never processed and the test hangs.

The original purpose of this PR was to show that there are some problems with the database that are not about a wrong implementation of MVCC in MemDatabase, but I think the above goes to show that this is not the case (at least not here).

@douglaz (Contributor, Author) commented Jan 18, 2024

tokio::task::yield_now which I guess expects to be running on the tokio runtime directly(?).

Negative. yield_now is a very simple future implementation that will be pending once, then always ready after that. There is nothing tokio-specific there. This issue isn't an incompatibility between async runtimes, because there is only one runtime in this project: tokio.
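As an illustration of the point above (a hypothetical stand-in, not tokio's actual source), a future that is pending exactly once and ready forever after can be written in plain std Rust with no runtime dependency at all:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A future that is Pending exactly once, then Ready on every later poll,
/// mirroring the description of yield_now above.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask the executor to poll us again; without this wake, a
            // park-based block_on would sleep forever waiting for a wakeup.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// A do-nothing waker, just so we can poll by hand below.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn main() {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = YieldNow { yielded: false };
    let mut fut = Pin::new(&mut fut);
    assert!(fut.as_mut().poll(&mut cx).is_pending()); // first poll: yields
    assert!(fut.as_mut().poll(&mut cx).is_ready()); // every later poll: done
    println!("pending once, then always ready");
}
```

Whether yielding like this behaves well depends entirely on the executor polling it: the future itself is runtime-agnostic.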

@dpc (Contributor) commented Jan 19, 2024

Negative. yield_now is a very simple future implementation that will be pending once, then always ready after that. There is nothing tokio-specific there. This issue isn't an incompatibility between async runtimes, because there is only one runtime in this project: tokio.

Edit: the below is wrong, I think.

The tokio::task::yield_now is OK, but I don't think block_on is, because it explicitly "Runs a future to completion on the current thread." and the "current thread" is already one of tokio's executor threads, which most likely does not like being blocked like that. So that yield_now is getting scheduled on the task queue of an executor thread that is no longer scheduling (because it is blocked on a blocking block_on). There's a reason the tokio runtime has its own block_on(s), so probably that should be used instead in some form, but when I tried it, it wasn't working, and I think @maan2003 is hitting the same issues in #4054. Let's continue the discussion there.

The log line before that block_on is printed, the one after is not, the AlephBFT executor gets stuck, transactions are being submitted but not processed, and the test gets stuck. Whatever it is, we should fix it, and then we can continue reasoning about this PR.
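To make the hazard concrete, a park-based block_on in the spirit of futures::executor::block_on can be sketched in std-only Rust (mini_block_on and ThreadWaker are illustrative names, not the real crate's internals). The key property: between polls, the loop parks the calling thread, so if that thread is a runtime worker, nothing else scheduled on it can run until the future completes.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the thread blocked inside mini_block_on.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Poll the future to completion on the *current* thread, parking between
/// polls. Nothing else can run on this thread while we are in here.
fn mini_block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // The thread sleeps here until the waker unparks it. If the
            // pending future can only make progress on this same thread,
            // this is a deadlock.
            Poll::Pending => thread::park(),
        }
    }
}

fn main() {
    assert_eq!(mini_block_on(async { 40 + 2 }), 42);
    println!("mini_block_on completed a trivially ready future");
}
```

While the loop is parked, a runtime worker running this code cannot poll any other task, which is why blocking calls on executor threads are normally wrapped in something like tokio::task::block_in_place, which first hands the worker's other pending tasks off to another thread.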

@elsirion (Contributor)

I made a couple of quick attempts to fix it using different block_ons and block_in_place, but I only made things even worse

@dpc so you tried this approach I assume?

block_in_place(move || {
    block_on(async move {
        if let Err(e) = self.terminate().await {
            warn!(target: LOG_DEVIMINT, "failed to terminate lightningd: {e:?}");
        }
    })
});

@elsirion (Contributor)

I see, @maan2003 did and it helped? 28b0361 (or at least for that one PR)

@justinmoon (Contributor)

Still relevant?

@elsirion (Contributor)

No, this was related to the AlephBFT sync/async bug.

@elsirion closed this Feb 22, 2024

5 participants