I've found the "scheduler_queue_total_count" metric to be a more useful one, tbh. It shows the current total backlog of these kinds of events waiting to be processed. We aim for it to always be very low - ideally single or double digits. When we saw the recent OOM issues, before the node crashed, that value would invariably skyrocket to hundreds of thousands - meaning we had a huge backlog of events.
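If you want to keep an eye on this backlog yourself, here is a minimal sketch of a watcher that polls the node's Prometheus-style metrics endpoint and warns when `scheduler_queue_total_count` climbs. The URL (port 8888, `/metrics`) and the alert threshold are assumptions based on a default casper-node setup; adjust them for your deployment.

```python
# Minimal backlog watcher (sketch). Polls the node's Prometheus-style text
# metrics and warns when scheduler_queue_total_count exceeds a threshold.
# The endpoint URL and threshold are assumptions; adjust for your node.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8888/metrics"  # assumed default REST port
METRIC = "scheduler_queue_total_count"
THRESHOLD = 1000  # healthy is single/double digits; 100k+ preceded the OOM

def read_metric(url, name):
    """Return the metric's value from the text exposition, or None."""
    body = urllib.request.urlopen(url, timeout=5).read().decode("utf-8")
    match = re.search(rf"^{re.escape(name)}\s+(\S+)\s*$", body, re.MULTILINE)
    return float(match.group(1)) if match else None

while True:
    value = read_metric(METRICS_URL, METRIC)
    if value is None:
        print("metric not found - is the node running?")
    elif value > THRESHOLD:
        print(f"WARNING: backlog at {value:.0f} events - possible OOM precursor")
    else:
        print(f"ok: backlog at {value:.0f} events")
    time.sleep(30)
```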
Here is the moment of the actual crash (2021-01-10 14:11:52 UTC):
Since we are fairly sure this is most likely an OOM:
There is no doubt that the node losing its ability to serve the HTTP and RPC endpoints comes from the same source. We can observe it on almost every graph; as an example, I post graphs below.
From here we can see the issue started accumulating well before the actual catastrophe (2021-01-10 06:27:00 UTC).
Update 1
I have been tracking network activity, and here is some of it in case it can be useful.
From here we can see that, since my crawler was still crawling after the node collapsed, new IP addresses were still joining the chain.
At the moment of the crash my node had 69 peers connected, the auction order book had 65 bids, and the network counted 132 unique IP addresses. My validator crashed exactly when the 133rd address was counted - at the moment a new IP joined the network - and I am not sure whether this could be related to some specific misconfigured node joining.
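For reference, here is a sketch of how unique peer IPs can be counted from a running node. It assumes the node exposes a `/status` endpoint returning JSON with a `peers` list whose entries carry an `address` field of the form "ip:port"; both the URL and the JSON shape are assumptions based on a default casper-node setup.

```python
# Count unique peer IPs from the node's /status endpoint (sketch).
# The URL and JSON shape ("peers": [{"node_id": ..., "address": "ip:port"}])
# are assumptions; adjust for your deployment.
import json
import urllib.request

STATUS_URL = "http://localhost:8888/status"  # assumed default REST port

def unique_peer_ips(url):
    status = json.load(urllib.request.urlopen(url, timeout=5))
    # Addresses look like "95.217.154.113:35000"; keep only the IP part.
    return {peer["address"].rsplit(":", 1)[0] for peer in status.get("peers", [])}

ips = unique_peer_ips(STATUS_URL)
print(f"{len(ips)} unique peer IPs currently connected")
```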
Last IPs to join the network:
130: 95.216.224.235
131: 95.217.0.233
132: 95.217.154.113
133: 95.217.154.200
134: 95.217.213.84
135: 95.217.84.84
136: 95.217.94.83
List of currently still-alive peers:
Dead peers:
Last delta-8 geomap:
Geodata CSV attached
delta-8_geo_data.txt
Based on the findings above, many updates were made for Delta-10, including network changes and optimizations to eliminate OOM crashes. Under heavy load in Delta-10, with up to 100 wasm deploys, we are no longer seeing this occur.