Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitor: DAO State Head available is faulty #4166

Closed
freimair opened this issue Apr 13, 2020 · 9 comments
Closed

Monitor: DAO State Head available is faulty #4166

freimair opened this issue Apr 13, 2020 · 9 comments

Comments

@freimair
Copy link
Member

Description

The DAO State Head available metric shows a lot of red. We should investigate if these are legit error states or false positives.

Screenshots

Screenshot from 2020-04-13 12-00-40

@ripcurlx
Copy link
Contributor

I do agree that it doesn't make sense to have a monitoring system if we think there are so many false positives in there.

@ripcurlx
Copy link
Contributor

@wiz maybe you could work with @freimair on one of your nodes what causes this ´no Data´ states in the monitoring.

@stale
Copy link

stale bot commented Jul 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the was:dropped label Jul 19, 2020
@stale
Copy link

stale bot commented Jul 26, 2020

This issue has been automatically closed because of inactivity. Feel free to reopen it if you think it is still relevant.

@stale stale bot closed this as completed Jul 26, 2020
@ripcurlx
Copy link
Contributor

Still relevant.

@ripcurlx ripcurlx reopened this Jul 27, 2020
@freimair
Copy link
Member Author

freimair commented Sep 10, 2020

just had a look into the issue. Seems to me like the issue is caused by a familiar face: huge amounts of data take their time on the wire.

Here is a graph of a test run of the affected metric:
Screenshot from 2020-09-10 13-18-56

There are some things to notice:

  • it takes more than 60 seconds to send the request data
  • it takes more than 25 seconds to receive all the response data
  • in total, the whole request/response loop takes 100 seconds
  • the graph above only queries 3 seednodes - as it is a test run!

Here are the things to consider:

  • socket timeouts for connections are set to 120 seconds (in word: two minutes!) with a nice TODO attached to it
    Screenshot from 2020-09-10 13-38-10
  • currently, the monitor times out after 60 seconds (in words: one minute) when not getting anything back
  • hence, even though the DaoStateRequest takes no time at all, it is not sent because previous steps fail.

What happens if I add all seednodes:

  • well, sending PreliminaryDataRequest followed by Read timed out on connection exactly 120 seconds later. Sometimes, some request goes through - although only 1 out of 14.
13:41:53.381 [SendMessage-to-devinv3rhon24gqf5v6ondoqgyrbzyqihzyouzv7ptltsewhfmox2zqd.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:53.557 [SendMessage-to-devinsn2teu33efff62bnvwbxmfgbfjlgqsu3ad4b4fudx3a725eqnyd.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:53.698 [SendMessage-to-723ljisnynbtdohi.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:53.734 [SendMessage-to-s67qglwhkgkyvr74.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:54.117 [SendMessage-to-sn3emzy56u3mxzsr4geysc52feoq5qt7ja56km6gygwnszkshunn2sid.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:54.546 [SendMessage-to-wizseed7ab2gi3x267xahrp2pkndyrovczezzb46jk6quvguciuyqrid.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:54.657 [SendMessage-to-5quyxpxheyvzmb2d.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:54.750 [SendMessage-to-wizseedscybbttk4bmb2lzvbuk2jtect37lcpva4l3twktmkzemwbead.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:55.348 [SendMessage-to-wizseed3d376esppbmbjxk2fhk2jg5fpucddrzj2kxtbxbx4vrnwclad.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:55.348 [SendMessage-to-devinsn3xuzxhj6pmammrxpydhwwmwp75qkksedo5dn2tlmu7jggo7id.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:57.179 [SendMessage-to-3f3cu2yw7u457ztq.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:59.501 [SendMessage-to-fl3mmribyxgrv63c.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:41:59.729 [SendMessage-to-rm7b56wbrcczpjvl.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:42:03.370 [SendMessage-to-sn2bisqad7ncazupgbd3dcedqh5ptirgwofw63djwpdtftwhddo75oid.onion:8000] Sending a PreliminaryGetDataRequest with 1813.352 kB
13:43:08.733 [InputHandler-devinsn2teu33efff62bnvwbxmfgbfjlgqsu3ad4b4fudx3a725eqnyd.onion:8000] SocketException (expected if connection lost). closeConnectionReason=CloseConnectionReason{sendCloseMessage=false, isIntended=false} RESET; connection=Connection{peerAddress=Optional[devinsn2teu33efff62bnvwbxmfgbfjlgqsu3ad4b4fudx3a725eqnyd.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='4635c9a2-2543-47c4-afd9-5f20f808bbaa'}
13:43:53.439 [InputHandler-devinv3rhon24gqf5v6ondoqgyrbzyqihzyouzv7ptltsewhfmox2zqd.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[devinv3rhon24gqf5v6ondoqgyrbzyqihzyouzv7ptltsewhfmox2zqd.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='314fece9-9820-4b7b-bf7a-ab1f06a985de'}
13:43:53.763 [InputHandler-723ljisnynbtdohi.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[723ljisnynbtdohi.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='7a538b36-4062-41cc-b70a-57ce0b96bf0f'}
13:43:53.808 [InputHandler-s67qglwhkgkyvr74.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[s67qglwhkgkyvr74.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='e2f95575-34cb-4bb9-82b6-9de644c53dc0'}
13:43:54.201 [InputHandler-sn3emzy56u3mxzsr4geysc52feoq5qt7ja56km6gygwnszkshunn2sid.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[sn3emzy56u3mxzsr4geysc52feoq5qt7ja56km6gygwnszkshunn2sid.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='69328052-f1db-4b4c-b362-82d1aaaf8f52'}
13:43:54.605 [InputHandler-wizseed7ab2gi3x267xahrp2pkndyrovczezzb46jk6quvguciuyqrid.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[wizseed7ab2gi3x267xahrp2pkndyrovczezzb46jk6quvguciuyqrid.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='2fbcf53f-9b9d-4da7-8ffe-92d6f82581ea'}
13:43:54.654 [Metric: P2PSeedNodeSnapshot] WARN  bisq.monitor.ThreadGate - timeout occured!
13:43:54.845 [InputHandler-wizseedscybbttk4bmb2lzvbuk2jtect37lcpva4l3twktmkzemwbead.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[wizseedscybbttk4bmb2lzvbuk2jtect37lcpva4l3twktmkzemwbead.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='20b6a326-5ac6-4084-a621-83187f961f0f'}
13:43:55.352 [InputHandler-devinsn3xuzxhj6pmammrxpydhwwmwp75qkksedo5dn2tlmu7jggo7id.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[devinsn3xuzxhj6pmammrxpydhwwmwp75qkksedo5dn2tlmu7jggo7id.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='284f27a2-d570-4492-ac9d-40dbf3fd45d3'}
13:43:55.439 [InputHandler-wizseed3d376esppbmbjxk2fhk2jg5fpucddrzj2kxtbxbx4vrnwclad.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[wizseed3d376esppbmbjxk2fhk2jg5fpucddrzj2kxtbxbx4vrnwclad.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='d5db6b0d-86ba-40eb-87c6-e9680e76ab50'}
13:43:57.275 [InputHandler-3f3cu2yw7u457ztq.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[3f3cu2yw7u457ztq.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='ca23608d-09a6-4cf5-820c-aedc5ea8d07e'}
13:43:59.525 [InputHandler-fl3mmribyxgrv63c.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[fl3mmribyxgrv63c.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='b3affb64-fa17-4ef4-be90-1bb37bdb4e02'}
13:43:59.792 [InputHandler-rm7b56wbrcczpjvl.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[rm7b56wbrcczpjvl.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='ef7d23bd-a1d6-4739-8739-810d27480e31'}
13:44:03.458 [InputHandler-sn2bisqad7ncazupgbd3dcedqh5ptirgwofw63djwpdtftwhddo75oid.onion:8000] Shut down caused by exception java.net.SocketTimeoutException: Read timed out on connection=Connection{peerAddress=Optional[sn2bisqad7ncazupgbd3dcedqh5ptirgwofw63djwpdtftwhddo75oid.onion:8000], peerType=PEER, connectionType=OutboundConnection, uid='8c80e8c3-2932-4206-b58c-8509703461b0'}

How can we fix that?

  • the metric needs to do a PreliminaryDataRequest before anything else, because otherwise, connections are denied
  • increase monitor timeout to 125 seconds - already done, allows for 3 seednodes to be queried
  • schedule requests differently, thereby loosing consistency of measurements (as they are not taken at roughly the same time anymore) and ignoring the fact, that the network is on the brink of its capabilities
  • Reduce bandwidth-requirement on Bisq app startup projects#25

Disclaimer:

  • yes, I am using my very private internet uplink, which is only 4MBit/s
  • in the graph you can see, that Tor seems to be the limiting factor (200kb/s upstream)
  • but we are so far away from anything reasonable that we can draw valid conclusions

@chimp1984
Copy link
Contributor

http://104.248.88.175/ does request all 12 nodes with PreliminaryDataRequest and it works. Yes its also slow (see rrt) but very few errors. I guess its mainly a hosting issue.

@stale
Copy link

stale bot commented Dec 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the was:dropped label Dec 12, 2020
@stale
Copy link

stale bot commented Dec 20, 2020

This issue has been automatically closed because of inactivity. Feel free to reopen it if you think it is still relevant.

@stale stale bot closed this as completed Dec 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants