-
Notifications
You must be signed in to change notification settings - Fork 112
swap: fix instabilities in simulation tests #1930
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix seems to work. I let the sim tests run a few thousand times and it didn't happen anymore.
It still fails sometimes but due to other issues (see #1899 (comment)). waitForChequeProcessed
needs to be adapted to also wait for the cheque to be confirmed. Currently a new cheque is attempted to be sent while the old one is not confirmed yet, leading to the pending cheque to be resent, but waitForChequeProcessed
is waiting for a new cheque causing the test to hang and timeout. This can also be a separate PR if you only want to focus on the other issue here.
Regarding the underlying process vulnerability the message sequence being wrong (e.g. the message arriving too late) is not really a problem, after all the other node also sends an extra cheque in response to compensate. (From what I recall if both nodes cashed the cheques at the end the effective transferred value would still be 0).
The bigger issue uncovered here is that we can bring another node to send us a cheque simply by sending a cheque to it first (even if no traffic even occurred) and putting it into high enough debt. Not sure what the solution here is, maybe we should just reject a cheque (as in not confirm) if it would put us too much into debt? I opened an issue for this here.
I was testing this PR on ubuntu in virtualbox where I could reproduce the problem most frequently. While tests usually passes, sometimes TestMultiChequeSimulation gets blocked at swap/simulations_test.go:454 I am running
|
@janos this issue is known and @holisticode wants to fix it in a separate PR. I created an issue for this here: #1933 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @holisticode @ralph-pichler LGTM.
This PR fixes #1899 .
It does this by introducing an additional check when waiting for the cheque to be processed, which is not enough (as documented by @ralph-pichler at #1899 (comment)), due to the high parallelization of the test.
We now also wait and check that the message has been received on the other end by checking the appropriate metrics counter.
Note that this should (but may not) fix the test, but does not address an underlying process vulnerability which has been exposed by #1899 , which is that if the message sequence is expressed in an unexpected way, there may be bugs in the workflow