You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It looks like our connection handlers consider themselves disconnected before the tcp socket is really disconnected.
This severely breaks some core assumptions on lightning message delivery...
Here are logs showing that Phoenix thinks it's disconnected before receiving a commit_sig, but it has actually been received in the tcp socket. When we reconnect we receive that commit_sigbefore going through the init / channel_reestablish flow, which breaks the channel restore logic.
INFO: 2021-01-21T11:07:47.686302 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 sending UpdateFulfillHtlc(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, id=1, paymentPreimage=b95e2fc2563d1272d7b291b86b3ca7489238029614af28b77f7c2c1a520954ba)
INFO: 2021-01-21T11:07:47.984533 | f.a.e.channel.ChannelState: c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 built remote commit number=9 toLocalMsat=95468058 toRemoteMsat=30551942 htlc_in= htlc_out= feeratePerKw=FeeratePerKw(feerate=5000 sat) txid=a550cb48571dc3e2e188f2bd7233f4e05cefaa2f057ba0bd65b82e429798ab2f tx=0200000001ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e620444970100000000fca23080044a01000000000000220020c219eb7f4540c76362943a9f36a248734a879373488e677a2ef39a69e73542814a01000000000000220020fbb4b6620be5da83753256006febd0db92b2939dfb6476dbe2482c2e2791290a5777000000000000220020e894e2a963db6f16f6278115828281fe9005a5f3a3fbfb5f5e450b734fb7c1cc645c0100000000002200207b145c141421178db6a97e2d0f4767e87596062e87268e9dd8edae53a23885583e37ec20
INFO: 2021-01-21T11:07:48.716675 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 storing state=class fr.acinq.eclair.channel.Normal
INFO: 2021-01-21T11:07:49.400001 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 sending CommitSig(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, signature=a9956caf1e53043934dc740b490ccd8ef77a74d47eab1a3df7dfbed4b86d3be242ba8c95f79a426101185eb5e6cda6414702c09488d2898e26a73eb4ab28ea51, htlcSignatures=[], channelData=...)
INFO: 2021-01-21T11:07:49.755166 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received RevokeAndAck(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, perCommitmentSecret=PrivateKey(value=5d27c84e58aa3fedaa9679994e493f9895a9d383981bf7ef3a7b2416b9b4f16d), nextPerCommitmentPoint=03771d75d514ac3150629ac4a2eb1aa3f289b7cdb09e993fdde032e8326f7a06ca, channelData=)
INFO: 2021-01-21T11:07:49.756575 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received class fr.acinq.eclair.wire.RevokeAndAck for channel ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496
INFO: 2021-01-21T11:07:49.758858 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 new state: class fr.acinq.eclair.channel.Normal
INFO: 2021-01-21T11:07:49.759551 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 storing state=class fr.acinq.eclair.channel.Normal
DEBUG: 2021-01-21T11:07:50.348445 | f.a.p.app.AppConnectionsDaemon: peerControlFlow = TrafficControl(networkIsAvailable=true, disconnectCount=1)
DEBUG: 2021-01-21T11:07:50.349124 | f.a.p.app.AppConnectionsDaemon: electrumControlFlow = TrafficControl(networkIsAvailable=true, disconnectCount=1)
More than 3 minutes between these log lines, when nothing happened (app put into the background?)
WARNING: 2021-01-21T11:11:03.651435 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 Software caused connection abort
WARNING: 2021-01-21T11:11:03.654493 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 Socket is not connected
DEBUG: 2021-01-21T11:11:03.657661 | f.a.p.app.AppConnectionsDaemon: peerControlFlow = TrafficControl(networkIsAvailable=true, disconnectCount=0)
DEBUG: 2021-01-21T11:11:03.657897 | f.a.p.app.AppConnectionsDaemon: electrumControlFlow = TrafficControl(networkIsAvailable=true, disconnectCount=0)
DEBUG: 2021-01-21T11:11:03.658724 | f.a.p.app.AppConnectionsDaemon: New Peer status CLOSED
DEBUG: 2021-01-21T11:11:03.658775 | f.a.p.app.AppConnectionsDaemon: Wait for 500ms before retrying connection on Peer
DEBUG: 2021-01-21T11:11:03.658952 | f.a.p.app.AppConnectionsDaemon: New Electrum status ESTABLISHED
Notice the CommitSig that we receive now from the tcp socket -> it ends up being half-received which completely breaks our state.
INFO: 2021-01-21T11:11:10.667554 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received CommitSig(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, signature=4823e7c1735303c4f6cbcdd19b1b3b1a853a8a7e07df06aea38a8b47f9b4aa4e1eac0b25f1fad65e6a770eb647e5c838fa59f320cff055ad2b0e798eee7fb794, htlcSignatures=[], channelData=)
INFO: 2021-01-21T11:11:10.669285 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received class fr.acinq.eclair.wire.CommitSig for channel ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496
INFO: 2021-01-21T11:11:10.967573 | f.a.e.channel.ChannelState: c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 built local commit number=9 toLocalMsat=30551942 toRemoteMsat=95468058 htlc_in= htlc_out= feeratePerKw=FeeratePerKw(feerate=5000 sat) txid=733d6dafcabfb68e2fa93fb9593b5ce7a4f6b288184595b52df3323b950dfd8f tx=0200000001ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e620444970100000000fca23080044a01000000000000220020c219eb7f4540c76362943a9f36a248734a879373488e677a2ef39a69e73542814a01000000000000220020fbb4b6620be5da83753256006febd0db92b2939dfb6476dbe2482c2e2791290a577700000000000022002008d4aedcd2d0a94f5c1070450e30bd411b660d01f81a5b997615212c8feb5674645c010000000000220020345fda6ab07a492d86598542a1592f26a93124f6a5bfe5f6c1754c85ba2171933e37ec20
INFO: 2021-01-21T11:11:11.677044 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 new state: class fr.acinq.eclair.channel.Normal
And because of this, we fail to send our revocation but still believe we sent it...
WARNING: 2021-01-21T11:11:11.678158 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 unhandled action: class fr.acinq.eclair.channel.ChannelAction.Message.Send
INFO: 2021-01-21T11:11:11.679148 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 storing state=class fr.acinq.eclair.channel.Normal
WARNING: 2021-01-21T11:11:12.533617 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 disconnect peer from TCP socket
WARNING: 2021-01-21T11:11:12.536374 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 disconnecting channels
INFO: 2021-01-21T11:11:12.540368 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 connecting to 13.248.222.197
DEBUG: 2021-01-21T11:11:12.798762 | f.a.p.app.AppConnectionsDaemon: New Peer status ESTABLISHING
INFO: 2021-01-21T11:11:12.885989 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 sending init Init(features=080000002aa201, tlvs=TlvStream(records=[], unknown=[]))
INFO: 2021-01-21T11:11:12.941995 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received Init(features=0aaa8a, tlvs=TlvStream(records=[Networks(chainHashes=[43497fd7f826957108f4a30fd9cec3aeba79972084e90ead01ea330900000000])], unknown=[]))
INFO: 2021-01-21T11:11:12.944982 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 peer is using features Features(activated=[ActivatedFeature(feature=option_data_loss_protect, support=optional), ActivatedFeature(feature=initial_routing_sync, support=optional), ActivatedFeature(feature=gossip_queries, support=optional), ActivatedFeature(feature=var_onion_optin, support=optional), ActivatedFeature(feature=gossip_queries_ex, support=optional), ActivatedFeature(feature=option_static_remotekey, support=optional), ActivatedFeature(feature=payment_secret, support=optional), ActivatedFeature(feature=basic_mpp, support=optional), ActivatedFeature(feature=option_support_large_channel, support=optional)], unknown=[])
WARNING: 2021-01-21T11:11:12.949575 | f.a.e.channel.ChannelState: c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 offline processing class fr.acinq.eclair.channel.ChannelEvent.Connected
INFO: 2021-01-21T11:11:12.950236 | f.a.e.channel.ChannelState: c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 syncing class fr.acinq.eclair.channel.Normal, waiting fo their channelReestablish message
DEBUG: 2021-01-21T11:11:12.958208 | f.a.p.app.AppConnectionsDaemon: New Peer status ESTABLISHED
INFO: 2021-01-21T11:11:13.553430 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received ChannelReestablish(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, nextLocalCommitmentNumber=10, nextRemoteRevocationNumber=8, yourLastCommitmentSecret=PrivateKey(value=74fbdca67997bb2fae623ecd073c6988b39f060fd5717dc31756952abec1b951), myCurrentPerCommitmentPoint=0248bc63821d0e3edc10a0ec20ff8fb1b67511d44f7ce76cf4595eabce7358dec1, channelData=)
INFO: 2021-01-21T11:11:13.778981 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received class fr.acinq.eclair.wire.ChannelReestablish for channel ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496
The text was updated successfully, but these errors were encountered:
@romainbsl@sstone it looks like an issue in connection management, can you investigate? I'm hoping we can fix the connection layer to avoid being in that case, otherwise we'll have to do some nasty work-arounds to ensure our state stays consistent...
Note that a simple solution could be to discard the contents of the tcp socket once the peer considers the connection closed?
In this specific example it means we would discard the last commit_sig, so when doing our channel_reestablish on the new connection we would tell our peer to retransmit it and everything would work.
I really thought that we weren't processing old LN messages when (re)starting the Peer with a fresh connection.
I think we all agree that we need the reproduce this in a test case, unit and/or integration, to be sure it will not happen again. Maybe this is where the Peer unit test mechanism will be useful.
I not sure that it that easy to discard some TCP socket message, because we still can receive / process TCP messages before the state machine has processed the Disconnect event. My first thought was to work on the coroutine that handles the TCP workflow, and by cancelling it the TCP should close itself and trigger the Disconnect event.
It looks like our connection handlers consider themselves disconnected before the tcp socket is really disconnected.
This severely breaks some core assumptions on lightning message delivery...
Here are logs showing that Phoenix thinks it's disconnected before receiving a
commit_sig
, but it has actually been received in the tcp socket. When we reconnect we receive thatcommit_sig
before going through theinit
/channel_reestablish
flow, which breaks the channel restore logic.The text was updated successfully, but these errors were encountered: