mint_client_sanity flake: losing 1msat balance #5195
Comments
Ping @elsirion for awareness. |
reproduces on my system after a long while |
Interesting observation: if you restore the client from passphrase it has correct balance. So it's definitely the client that lost something.
|
Do you have any logs from the run that failed? EDIT: was just running it myself to get some logs, is there a way to get the log output of |
there is |
oh there is also |
Yeah, I run with Edit: I spoke too soon and reminting 1msat has yet to repro the issue. |
Argh. Yesterday I was able to repro 2 times, and today I'm running it for 2 hours already and nothing. |
Ok, I think I've found something. So one of the transactions was rejected:
and later the refund transactions seem to also get rejected:
The reason the transaction rejection might be missed at a higher level is using However, it's unclear why it would get rejected in the first place (possibly we're trying to spend the output too fast?), and then why the refund would fail (something retried?). |
With more logging #5206:
and then:
OK, makes sense. Input was already spent, trying to refund it fails the same way. Now ... why is it spent? |
How am I going to figure out at which stage the input became spent... It could be that Or |
dev call: dpc and eric both can reproduce (but it takes hours to hit) |
OK, so I think I have a clue: We submit a tx
The output state machine says it succeeded:
But the input state machine fails:
Output state machines wait only for Which suggests that the idempotency of our tx submission is borked. Or I messed up something and confused myself. |
🤔

```diff
diff --git a/fedimint-server/src/net/api.rs b/fedimint-server/src/net/api.rs
index 8fc49260f1..4a1cccfcc4 100644
--- a/fedimint-server/src/net/api.rs
+++ b/fedimint-server/src/net/api.rs
@@ -114,11 +114,10 @@ impl ConsensusApi {
         debug!(%txid, "Received mint transaction");
 
+        // Create read-only DB tx so that the read state is consistent
+        let mut dbtx = self.db.begin_transaction_nc().await;
 
         // we already processed the transaction before
-        if self
-            .db
-            .begin_transaction()
-            .await
+        if dbtx
             .get_value(&AcceptedTransactionKey(txid))
             .await
             .is_some()
@@ -126,9 +125,6 @@ impl ConsensusApi {
             return Ok(txid);
         }
 
-        // Create read-only DB tx so that the read state is consistent
-        let mut dbtx = self.db.begin_transaction_nc().await;
-
         // We ignore any writes, as we only verify if the transaction is valid here
         dbtx.ignore_uncommitted();
```

Will leave it overnight and see how it went. |
Great catch! That could very well be the culprit, starting the test script on my server too. |
#5226 but I'm still able to repro, so either red herring or didn't help. Will keep debugging. Bummer, looked like a perfect candidate. |
Taking a step back, #5195 (comment) doesn't explain losing money imo, since the output was processed successfully. There's probably another anomaly that causes the loss; what I'd expect is that output processing failed, since you noticed earlier that money reappears when recovering from seed. |
The good part is that I'm improving the debuggability of all the things here, so it should be easier to figure out issues like this in the future. |
Why does everything always look OK? The note
Then it got spent as OOB as
Then the receiver reissues it in tx
but it fails:
Adding more server-side logging to track when exactly this note got spent. |
It seems the issue is server-side: First the server does the
Then the consensus processes the tx and everything commits:
But on the next submit_tx with the exact same transaction:
which suggests the code failed to detect that the transaction was already processed, yet somehow (in the same dbtx) it sees the note marked as spent, which happened in the same write tx when the tx was processed by consensus. Either we're experiencing dirty reads, or we somehow failed to write the accepted tx while still marking its inputs as spent, or something very weird is going on.

Edit: Oh, actually I'm not sure if the tx is exactly the same, because the debug output is truncated. I'm wondering if maybe the client is not recalculating the nonce each time or something.

Edit2: Looking at the client code, I don't see a way the tx would change. Everything suggests that on re-submitting we sometimes fail to see the record marking it as accepted already. |
OMG. There's nothing guaranteeing that the db view does not change from under us, since we are not running with "serializable" isolation level. Just read through https://stackoverflow.com/a/4036063/134409 . By default we are only getting "read committed". We could update to "repeatable reads", but it doesn't really matter here. The sequence is as follows:
For some reason I never really internalized that the db can't guarantee a consistent read-view of the database within a transaction. Only after seeing it step by step in the log, with all the needed data, did it click. Please confirm this makes sense, or maybe it's just too late and I've lost the plot. The simple solution here would be to change the code to check for |
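The sequence is easier to see in a toy model. Below is a minimal, self-contained sketch of how two reads in one logical transaction can straddle a concurrent commit when each read only sees "read committed" state. Everything here is made up for illustration (the `Db` struct, `submit_tx_read_committed`, the in-memory sets standing in for `AcceptedTransactionKey` and spent-note records); it is not fedimint code:

```rust
use std::collections::HashSet;

// Toy stand-in for the server DB. All names here are illustrative.
#[derive(Default)]
struct Db {
    accepted_txs: HashSet<u64>, // analogue of accepted-transaction records
    spent_notes: HashSet<u64>,  // analogue of spent-note markers
}

// Under "read committed", every read sees the latest committed state, so two
// reads inside one logical transaction can observe different points in time.
// `commit_between_reads` simulates a concurrent consensus commit landing
// between the idempotency check and input validation.
fn submit_tx_read_committed(
    db: &mut Db,
    txid: u64,
    note: u64,
    commit_between_reads: impl FnOnce(&mut Db),
) -> Result<u64, &'static str> {
    // read 1: was this tx already processed? (idempotent early return)
    if db.accepted_txs.contains(&txid) {
        return Ok(txid);
    }
    // concurrent commit: marks the tx accepted AND its input spent
    commit_between_reads(db);
    // read 2: validate inputs, now against a *newer* committed state
    if db.spent_notes.contains(&note) {
        return Err("input already spent");
    }
    Ok(txid)
}

fn main() {
    let mut db = Db::default();
    let res = submit_tx_read_committed(&mut db, 1, 42, |db| {
        db.accepted_txs.insert(1);
        db.spent_notes.insert(42);
    });
    // The resubmission is wrongly rejected even though tx 1 was accepted.
    assert_eq!(res, Err("input already spent"));
}
```

With a snapshot taken at transaction start, or with both reads done against the same dbtx, read 2 could not observe the effects of a commit that read 1 did not see.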
In one of the recent commits I added an assertion to make sure a spendable note was actually removed when we thought it was, as I was investigating a bug causing notes to be lost (fedimint#5195), and it seemed like a good idea. Too bad it's causing flakes now: fedimint#5271. So it looks like our reliance on retrying db conflicts (`autocommit`) leads to an inability to assert basic expectations about database state, which seems terrible. In this PR I'm adding a little lock to prevent write-write conflicts in this area of the code, and I think it's worthwhile just to be able to ensure database integrity. Moreover, I realized that `autocommit` is probably a bad idea altogether. Applications typically require transaction conflict handling because they are using **shared** databases. But with an embedded database there is no sharing with external entities, so there *always* should be, or even *must* be, a way to structure the application and its locking to avoid any conflicts, preventing db isolation-level issues, the inability to assert basic stuff, or just wasting time retrying things.
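A hedged sketch of the "little lock" idea: a process-wide mutex that serializes the conflicting section, so the database layer never sees two concurrent writers there. The types and names below are illustrative, not the actual fedimint API, and real async fedimint code would presumably use an async-aware lock such as `tokio::sync::Mutex` rather than `std::sync`:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative guard: all note consolidations go through one process-wide
// lock, so two tasks can never race on the same spendable notes and the
// optimistic db layer never has to retry a write-write conflict here.
#[derive(Default)]
struct ConsolidationLock(Mutex<()>);

impl ConsolidationLock {
    // Run `f` while holding the lock; only one caller at a time.
    fn with<R>(&self, f: impl FnOnce() -> R) -> R {
        let _guard = self.0.lock().expect("lock poisoned");
        f()
    }
}

fn main() {
    let lock = Arc::new(ConsolidationLock::default());
    let deleted = Arc::new(Mutex::new(0u32));

    // Four concurrent "consolidation" tasks; critical sections never overlap.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let (lock, deleted) = (Arc::clone(&lock), Arc::clone(&deleted));
            thread::spawn(move || {
                lock.with(|| {
                    // selecting + deleting notes would happen here, atomically
                    *deleted.lock().unwrap() += 1;
                });
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*deleted.lock().unwrap(), 4);
}
```

This trades a little concurrency for the ability to make hard assertions about database state inside the guarded section.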
@dpc @m1sterc001guy I think we (at least me) always operated under the assumption that RocksDB provides snapshot isolation 😬 Didn't we even have a whole test suite for that? Imo we should just get as close to that as we can; dealing with lower isolation levels manually isn't worth it at all. EDIT: @dpc I don't think this is the problem. We do set the snapshot when creating the DBTX and even have a test for read repeatability: fedimint/fedimint-rocksdb/src/lib.rs Lines 452 to 458 in 757c5ff
EDIT2: The one thing that we don't prevent on the DB layer is:
Fixing that would probably require serializability? |
Yeah, I believe we have snapshot isolation. When creating the rocksdb transaction, we set the snapshot like this. Here was the commit.

```rust
async fn begin_transaction<'a>(&'a self) -> RocksDbTransaction {
    let mut optimistic_options = OptimisticTransactionOptions::default();
    optimistic_options.set_snapshot(true);
    let mut write_options = WriteOptions::default();
    // Make sure we never lose data on unclean shutdown
    write_options.set_sync(true);
    let mut rocksdb_tx =
        RocksDbTransaction(self.0.transaction_opt(&write_options, &optimistic_options));
    rocksdb_tx
        .set_tx_savepoint()
        .await
        .expect("setting tx savepoint failed");
    rocksdb_tx
}
```

As Eric pointed out, we have a set of tests for detecting the other anomalies present at lower isolation levels.
Yes, I believe so. Snapshot isolation provides a read-consistent view of the database at the time of transaction creation (here's a good example post about the differences between SI and Serializable). Since in your example above the data that tx1 is operating over changes during the transaction, it will not have an up-to-date version of key1 at step 4. I've been moving this week so I haven't been able to keep up with this whole thread, but do we think our isolation expectations have changed because of this bug? Perhaps we can discuss on the dev call. |
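For concreteness, the textbook anomaly that SI permits and serializability forbids is write skew: two transactions read overlapping data from the same snapshot, write disjoint keys, and both commit, breaking an invariant that held in each snapshot individually. A minimal in-memory sketch (plain maps standing in for snapshots; nothing here is fedimint or RocksDB code):

```rust
use std::collections::HashMap;

// Each transaction gets a private snapshot at start (SI) and its writes are
// applied at commit. SI only aborts on write-write conflicts; here the two
// txs write *different* keys, so both commit and the joint invariant
// "x || y" (at least one flag stays true) is violated.
fn write_skew() -> (bool, bool) {
    let mut committed: HashMap<&str, bool> = HashMap::from([("x", true), ("y", true)]);

    // tx1 and tx2 both snapshot the same initial state
    let snap1 = committed.clone();
    let snap2 = committed.clone();

    // tx1: "if y is still on, it's safe to turn x off"
    let tx1_write = if snap1["y"] { Some(("x", false)) } else { None };
    // tx2: "if x is still on, it's safe to turn y off"
    let tx2_write = if snap2["x"] { Some(("y", false)) } else { None };

    // disjoint write sets -> no write-write conflict -> both commit under SI
    if let Some((k, v)) = tx1_write {
        committed.insert(k, v);
    }
    if let Some((k, v)) = tx2_write {
        committed.insert(k, v);
    }

    (committed["x"], committed["y"])
}

fn main() {
    let (x, y) = write_skew();
    // Serializable would have aborted or serialized one of the two txs.
    assert!(!x && !y);
}
```

Under a serializable schedule, one transaction would observe the other's write and keep its flag on, so `(false, false)` could never be committed.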
BTW, #5271 is another place confirming something is off. Otherwise, how could you, in the same db tx, select a note and then fail to return it when deleting it? Check https://github.com/fedimint/fedimint/actions/runs/9041274494/job/24860479614?pr=5259
So consolidation code selected notes
This repeats 8 times for each note being deleted.
And it committed. But then, right after the commit, there's another task that selected the same notes:
which I believe is because that dbtx was opened before the previous one committed, as these functions are parts of bigger and longer dbtx-es. And it inevitably fails on trying to delete a missing note:
As you can see, one of the consolidations was part of a But if everything was working, the second transaction should either not see these notes anymore, or still see them and successfully delete them, right? |
Ah, I see. So if I understand correctly, dbtx2 starts before dbtx1 commits, so it still sees the notes as available, tries to delete them, and fails since they're already deleted. This is possible with snapshot isolation, since SI only guarantees a consistent read as of the time of transaction creation. SI is mainly useful in systems that don't have long-running transactions that conflict with each other (it's also why we use optimistic transactions, since they avoid locking at the db layer). Serializable would prevent this: it would essentially block the second transaction from making any progress until dbtx1 is committed, since the first transaction would acquire a lock over the page where the note is held. Last time I looked at this, I think we had determined SI would be a sufficient isolation level, but maybe with this note consolidation we should reconsider? Are we doing these consolidations in parallel with other spends? |
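The interleaving described in this comment can be sketched deterministically. Below is an in-memory simulation of two long-running snapshot-isolated dbtx-es (`try_consume` and all names are hypothetical, not fedimint code). Note that a real optimistic transaction would detect the write-write conflict at commit time and retry, but as discussed earlier in the thread, an assertion inside the transaction fires before any retry can happen:

```rust
use std::collections::HashSet;

// Simulated snapshot-isolated delete: each dbtx takes a snapshot when it
// opens, selects notes from that snapshot, then applies deletes against the
// committed state. Returning Err models the "note already deleted" assertion.
fn try_consume(
    committed: &mut HashSet<u64>,
    snapshot: &HashSet<u64>,
    note: u64,
) -> Result<(), &'static str> {
    if !snapshot.contains(&note) {
        return Err("note not visible in snapshot");
    }
    // the delete itself: under SI this can target a note another tx removed
    if committed.remove(&note) {
        Ok(())
    } else {
        Err("note already deleted")
    }
}

fn main() {
    let mut committed: HashSet<u64> = (0..8).collect();

    // dbtx1 and dbtx2 are long-running; both open before either commits,
    // so both snapshots still show all 8 notes as spendable.
    let snap1 = committed.clone();
    let snap2 = committed.clone();

    // dbtx1 consolidates and commits: all 8 notes deleted
    for n in 0..8 {
        try_consume(&mut committed, &snap1, n).unwrap();
    }

    // dbtx2 selected the same notes from its (now stale) snapshot...
    let res = try_consume(&mut committed, &snap2, 0);
    // ...and the delete fails, just like the assertion in the CI flake
    assert_eq!(res, Err("note already deleted"));
}
```

Serializing the two consolidations (or re-reading the notes after acquiring exclusive access) removes the window between dbtx2's snapshot and dbtx1's commit.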
https://docs.rs/rocksdb/latest/rocksdb/struct.TransactionOptions.html#method.set_snapshot This looks sus to me. |
We've found the issue with @m1sterc001guy ! 🥳 🎉 🌮 |
https://github.com/fedimint/fedimint/actions/runs/8929917822/job/24528829288
It could be related to another flake we've seen where we are sometimes losing a note.