Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection handling issue #175

Closed
t-bast opened this issue Jan 21, 2021 · 3 comments · Fixed by #182
Closed

Connection handling issue #175

t-bast opened this issue Jan 21, 2021 · 3 comments · Fixed by #182
Assignees
Labels
bug Something isn't working
Milestone

Comments

@t-bast
Copy link
Member

t-bast commented Jan 21, 2021

It looks like our connection handlers consider themselves disconnected before the tcp socket is really disconnected.
This severely breaks some core assumptions on lightning message delivery...

Here are logs showing that Phoenix thinks it's disconnected before receiving a commit_sig, but it has actually been received in the tcp socket. When we reconnect we receive that commit_sig before going through the init / channel_reestablish flow, which breaks the channel restore logic.

INFO: 2021-01-21T11:07:47.686302 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 sending UpdateFulfillHtlc(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, id=1, paymentPreimage=b95e2fc2563d1272d7b291b86b3ca7489238029614af28b77f7c2c1a520954ba)
INFO: 2021-01-21T11:07:47.984533 | f.a.e.channel.ChannelState: c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 built remote commit number=9 toLocalMsat=95468058 toRemoteMsat=30551942 htlc_in= htlc_out= feeratePerKw=FeeratePerKw(feerate=5000 sat) txid=a550cb48571dc3e2e188f2bd7233f4e05cefaa2f057ba0bd65b82e429798ab2f tx=0200000001ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e620444970100000000fca23080044a01000000000000220020c219eb7f4540c76362943a9f36a248734a879373488e677a2ef39a69e73542814a01000000000000220020fbb4b6620be5da83753256006febd0db92b2939dfb6476dbe2482c2e2791290a5777000000000000220020e894e2a963db6f16f6278115828281fe9005a5f3a3fbfb5f5e450b734fb7c1cc645c0100000000002200207b145c141421178db6a97e2d0f4767e87596062e87268e9dd8edae53a23885583e37ec20
INFO: 2021-01-21T11:07:48.716675 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 storing state=class fr.acinq.eclair.channel.Normal
INFO: 2021-01-21T11:07:49.400001 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 sending CommitSig(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, signature=a9956caf1e53043934dc740b490ccd8ef77a74d47eab1a3df7dfbed4b86d3be242ba8c95f79a426101185eb5e6cda6414702c09488d2898e26a73eb4ab28ea51, htlcSignatures=[], channelData=...)
INFO: 2021-01-21T11:07:49.755166 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received RevokeAndAck(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, perCommitmentSecret=PrivateKey(value=5d27c84e58aa3fedaa9679994e493f9895a9d383981bf7ef3a7b2416b9b4f16d), nextPerCommitmentPoint=03771d75d514ac3150629ac4a2eb1aa3f289b7cdb09e993fdde032e8326f7a06ca, channelData=)
INFO: 2021-01-21T11:07:49.756575 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received class fr.acinq.eclair.wire.RevokeAndAck for channel ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496
INFO: 2021-01-21T11:07:49.758858 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 new state: class fr.acinq.eclair.channel.Normal
INFO: 2021-01-21T11:07:49.759551 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 storing state=class fr.acinq.eclair.channel.Normal
DEBUG: 2021-01-21T11:07:50.348445 | f.a.p.app.AppConnectionsDaemon: peerControlFlow = TrafficControl(networkIsAvailable=true, disconnectCount=1)
DEBUG: 2021-01-21T11:07:50.349124 | f.a.p.app.AppConnectionsDaemon: electrumControlFlow = TrafficControl(networkIsAvailable=true, disconnectCount=1)

More than 3 minutes between these log lines, when nothing happened (app put into the background?)

WARNING: 2021-01-21T11:11:03.651435 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 Software caused connection abort
WARNING: 2021-01-21T11:11:03.654493 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 Socket is not connected
DEBUG: 2021-01-21T11:11:03.657661 | f.a.p.app.AppConnectionsDaemon: peerControlFlow = TrafficControl(networkIsAvailable=true, disconnectCount=0)
DEBUG: 2021-01-21T11:11:03.657897 | f.a.p.app.AppConnectionsDaemon: electrumControlFlow = TrafficControl(networkIsAvailable=true, disconnectCount=0)
DEBUG: 2021-01-21T11:11:03.658724 | f.a.p.app.AppConnectionsDaemon: New Peer status CLOSED
DEBUG: 2021-01-21T11:11:03.658775 | f.a.p.app.AppConnectionsDaemon: Wait for 500ms before retrying connection on Peer
DEBUG: 2021-01-21T11:11:03.658952 | f.a.p.app.AppConnectionsDaemon: New Electrum status ESTABLISHED

Notice the CommitSig that we receive now from the tcp socket -> it ends up being half-received which completely breaks our state.

INFO: 2021-01-21T11:11:10.667554 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received CommitSig(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, signature=4823e7c1735303c4f6cbcdd19b1b3b1a853a8a7e07df06aea38a8b47f9b4aa4e1eac0b25f1fad65e6a770eb647e5c838fa59f320cff055ad2b0e798eee7fb794, htlcSignatures=[], channelData=)
INFO: 2021-01-21T11:11:10.669285 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received class fr.acinq.eclair.wire.CommitSig for channel ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496
INFO: 2021-01-21T11:11:10.967573 | f.a.e.channel.ChannelState: c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 built local commit number=9 toLocalMsat=30551942 toRemoteMsat=95468058 htlc_in= htlc_out= feeratePerKw=FeeratePerKw(feerate=5000 sat) txid=733d6dafcabfb68e2fa93fb9593b5ce7a4f6b288184595b52df3323b950dfd8f tx=0200000001ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e620444970100000000fca23080044a01000000000000220020c219eb7f4540c76362943a9f36a248734a879373488e677a2ef39a69e73542814a01000000000000220020fbb4b6620be5da83753256006febd0db92b2939dfb6476dbe2482c2e2791290a577700000000000022002008d4aedcd2d0a94f5c1070450e30bd411b660d01f81a5b997615212c8feb5674645c010000000000220020345fda6ab07a492d86598542a1592f26a93124f6a5bfe5f6c1754c85ba2171933e37ec20
INFO: 2021-01-21T11:11:11.677044 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 new state: class fr.acinq.eclair.channel.Normal

And because of this, we fail to send our revocation but still believe we sent it...

WARNING: 2021-01-21T11:11:11.678158 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 unhandled action: class fr.acinq.eclair.channel.ChannelAction.Message.Send
INFO: 2021-01-21T11:11:11.679148 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 storing state=class fr.acinq.eclair.channel.Normal
WARNING: 2021-01-21T11:11:12.533617 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 disconnect peer from TCP socket
WARNING: 2021-01-21T11:11:12.536374 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 disconnecting channels
INFO: 2021-01-21T11:11:12.540368 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 connecting to 13.248.222.197
DEBUG: 2021-01-21T11:11:12.798762 | f.a.p.app.AppConnectionsDaemon: New Peer status ESTABLISHING
INFO: 2021-01-21T11:11:12.885989 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 sending init Init(features=080000002aa201, tlvs=TlvStream(records=[], unknown=[]))
INFO: 2021-01-21T11:11:12.941995 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received Init(features=0aaa8a, tlvs=TlvStream(records=[Networks(chainHashes=[43497fd7f826957108f4a30fd9cec3aeba79972084e90ead01ea330900000000])], unknown=[]))
INFO: 2021-01-21T11:11:12.944982 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 peer is using features Features(activated=[ActivatedFeature(feature=option_data_loss_protect, support=optional), ActivatedFeature(feature=initial_routing_sync, support=optional), ActivatedFeature(feature=gossip_queries, support=optional), ActivatedFeature(feature=var_onion_optin, support=optional), ActivatedFeature(feature=gossip_queries_ex, support=optional), ActivatedFeature(feature=option_static_remotekey, support=optional), ActivatedFeature(feature=payment_secret, support=optional), ActivatedFeature(feature=basic_mpp, support=optional), ActivatedFeature(feature=option_support_large_channel, support=optional)], unknown=[])
WARNING: 2021-01-21T11:11:12.949575 | f.a.e.channel.ChannelState: c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 offline processing class fr.acinq.eclair.channel.ChannelEvent.Connected
INFO: 2021-01-21T11:11:12.950236 | f.a.e.channel.ChannelState: c:ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496 syncing class fr.acinq.eclair.channel.Normal, waiting fo their channelReestablish message
DEBUG: 2021-01-21T11:11:12.958208 | f.a.p.app.AppConnectionsDaemon: New Peer status ESTABLISHED
INFO: 2021-01-21T11:11:13.553430 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received ChannelReestablish(channelId=ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496, nextLocalCommitmentNumber=10, nextRemoteRevocationNumber=8, yourLastCommitmentSecret=PrivateKey(value=74fbdca67997bb2fae623ecd073c6988b39f060fd5717dc31756952abec1b951), myCurrentPerCommitmentPoint=0248bc63821d0e3edc10a0ec20ff8fb1b67511d44f7ce76cf4595eabce7358dec1, channelData=)
INFO: 2021-01-21T11:11:13.778981 | f.a.e.io.Peer: n:03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134 received class fr.acinq.eclair.wire.ChannelReestablish for channel ff7c567773605fbb24660fc2eb0bda73439c8a1796a1a45d69102a5e62044496
@t-bast t-bast self-assigned this Jan 21, 2021
@t-bast t-bast changed the title Channels stuck syncing Connection handling issue Jan 21, 2021
@t-bast t-bast removed their assignment Jan 21, 2021
@t-bast
Copy link
Member Author

t-bast commented Jan 21, 2021

@romainbsl @sstone it looks like an issue in connection management, can you investigate? I'm hoping we can fix the connection layer to avoid being in that case, otherwise we'll have to do some nasty work-arounds to ensure our state stays consistent...

@t-bast
Copy link
Member Author

t-bast commented Jan 21, 2021

Note that a simple solution could be to discard the contents of the tcp socket once the peer considers the connection closed?
In this specific example it means we would discard the last commit_sig, so when doing our channel_reestablish on the new connection we would tell our peer to retransmit it and everything would work.

@romainbsl
Copy link
Contributor

I really thought that we weren't processing old LN messages when (re)starting the Peer with a fresh connection.
I think we all agree that we need the reproduce this in a test case, unit and/or integration, to be sure it will not happen again. Maybe this is where the Peer unit test mechanism will be useful.

I not sure that it that easy to discard some TCP socket message, because we still can receive / process TCP messages before the state machine has processed the Disconnect event. My first thought was to work on the coroutine that handles the TCP workflow, and by cancelling it the TCP should close itself and trigger the Disconnect event.

@sstone sstone added the bug Something isn't working label Jan 22, 2021
@sstone sstone added this to the 1.0 milestone Jan 22, 2021
@t-bast t-bast moved this from To do to Done in v1.0 - ready for mainnet Mar 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

3 participants