added prometheus metric for failing to send partials #1223

CluEleSsUK · 2023-04-26T09:45:07Z

No description provided.

chain/beacon/node.go

metrics/metrics.go

chain/beacon/node.go

CluEleSsUK · 2023-05-02T08:07:26Z

Interestingly, the postgres tests are failing consistently here, but only because they wait for a round which gets emitted moments after the test fails. Dunno how prometheus metrics would have affected the test run time that much!?

- upped local sleep slightly

* added prometheus metric for failing to send partials
* moved import around
* fixed tests
* made sleeps in the publicrandstream variable
  - upped local sleep slightly

* threshold monitoring for beacon processes (#1220)
  * threshold monitoring for beacon processes

    beacon processes now log some error and warning messages whenever we get close to or cross a threshold number of failures per node
  * return on finished threshold monitor

  Co-authored-by: Yolan Romailler <AnomalRoil@users.noreply.github.com>
* added prometheus metric for failing to send partials (#1223)
  * added prometheus metric for failing to send partials
  * moved import around
  * fixed tests
  * made sleeps in the publicrandstream variable
    - upped local sleep slightly
* aggregate prometheus partial errors by beaconID (#1232)

  The additional fields in the partial error counter meant that a new counter was created for _every_ combination. Given that they could only be emitted for each round, a new entry was created for every partial... which is definitely wrong
* use a gauge for counting nodes that have invalid partials (#1233)
  * use a gauge for counting nodes that have invalid partials

    strictly speaking this will lose sight of nodes that go down and back up between prometheus polls; however it's much easier to handle in grafana
  * combined threshold monitor and prometheus call
* return 404 when no beacon hash exists instead of 500 (#1234)
  * return 404 when no beacon hash exists instead of 500
  * add test for 404 on nonexistent chain
* cherry-picked master commits and fixed a few bits

---------

Co-authored-by: Yolan Romailler <AnomalRoil@users.noreply.github.com>
Co-authored-by: Houlton McGuinn <hoult.mcguinn@gmail.com>
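
The cardinality problem described in #1232 can be illustrated with a small stdlib-only Go sketch. It does not use the Prometheus client: `labelledCounter` is a hypothetical toy stand-in that, like a labelled counter, keeps one series per distinct combination of label values. The `round` and `peer` labels and the beacon ID `default` are assumptions for illustration, not necessarily the fields the original metric carried.

```go
package main

import (
	"fmt"
	"strings"
)

// labelledCounter is a toy stand-in for a labelled counter (NOT the
// Prometheus client): each distinct combination of label values gets
// its own series.
type labelledCounter struct {
	series map[string]float64
}

func newLabelledCounter() *labelledCounter {
	return &labelledCounter{series: make(map[string]float64)}
}

// Inc increments the series identified by the given label values.
func (c *labelledCounter) Inc(labelValues ...string) {
	c.series[strings.Join(labelValues, "\x00")]++
}

// Cardinality reports how many distinct series exist.
func (c *labelledCounter) Cardinality() int {
	return len(c.series)
}

// simulate records one failed partial per (round, peer) pair and returns
// the number of series created with and without the per-round labels.
func simulate(rounds int, peers []string) (wide, narrow int) {
	wideCounter := newLabelledCounter()   // hypothetical labels: beaconID, round, peer
	narrowCounter := newLabelledCounter() // label: beaconID only
	for round := 1; round <= rounds; round++ {
		for _, peer := range peers {
			wideCounter.Inc("default", fmt.Sprint(round), peer)
			narrowCounter.Inc("default")
		}
	}
	return wideCounter.Cardinality(), narrowCounter.Cardinality()
}

func main() {
	wide, narrow := simulate(100, []string{"peer-a", "peer-b", "peer-c"})
	fmt.Printf("series with round and peer labels: %d\n", wide)
	fmt.Printf("series with beaconID label only:   %d\n", narrow)
}
```

With 100 rounds and 3 peers the wide label set produces 300 series and keeps growing with every round, while the beaconID-only version stays at a single series whose value is the total number of failures.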

* added prometheus metric for failing to send partials * moved import around * fixed tests * made sleeps in the publicrandstream variable - upped local sleep slightly aggregate prometheus partial errors by beaconID (#1232) The additional fields in the partial error counter meant that a new counter was created for _every_ combination. Given that they could only be emitted for each round, a new entry was created for every partial... which is definitely wrong use a gauge for counting nodes that have invalid partials (#1233) * use a gauge for counting nodes that have invalid partials strictly speaking this will lose sight of nodes that go down and back up between prometheus polls; however it's much easier to handle in grafana * combined threshold monitor and prometheus call return 404 when no beacon hash exists instead of 500 (#1234) * return 404 when no beacon hash exists instead of 500 * add test for 404 on nonexistent chain bumped patch version for testnet release (#1239) fixed a bug where docker image is pushed with invalid tag (#1240) fixed docker publish action to publish to GHCR (#1241) * fixed docker publish action to publish to GHCR added reference to GCHR in docker publish (#1242) updated docker entrypoint permissions so the container starts (#1243) Update go-libp2p to v0.27.3 (#1244) Updating our dependencies (#1246) * Updating our dependencies for mainnet release * Bumping version to v1.5.6 upgrade libp2p to v0.27.5 (#1247) Fixing wrong logging levels when checking past beacons with sync and shorter tests (#1248) * Fixing wrong logging levels when checking past beacons with sync * Skipping slow tests in short mode Adding a G1 scheme that's RFC conformant (#1249) * Adding a G1 scheme that's RFC conformant * Using Go 1.20 in GHA * patching GHA to support new RFC conformant scheme added new flag for backup out with better text (#1250) updated dockerfiles to use go 1.20 (#1255) * updated dockerfiles to use go 1.20 - added a github action step for building the 
docker image on branches to ensure compat * renamed the docker build task * added branches for push * run on all pull requests patching updating deps and patching last details More slow test remove from short improving test logging to debug improving TestBeaconSync fixing TestDrandPublicRand fixing test trying to avoid unfinished sync avoiding race in DKG fixing weird build issue adding TestBeaconSync to slow test improve logging tag slow test removing noisy log Adding scheme to self-signature of key Also adding keys signature validation to DKG proposal validation adding missing imports using proper scheme in tests properly close control client in cli Hopefully avoiding the control tcp issues avoid closing conn always cancelling the stream ctx correctly do not close conn increased sleep time when waiting for clock changed DKG runner status check to not optimistically return an error

* Updating libp2p again * Make sure version is compat with v2 DKG refactor -> develop to unblock Florin (#1200) - moved dkg package from core to root - re-added demo project using the new DKG - fixed first round of PR comments - refactored test objects for easier readability in tests using alice, bob, carol - extracted some action utilities and add helpful comments - PrivateGateway now implements a DKG client for TLS - added transition time flag to the DKG CLI - fixed a lot of small bugs - dkg kickoff grace period is now a daemon param and is reduced in the test code - added stop check to broadcaster - moved stopped into echoBroadcast itself - DKG requests are now sent in parallel to speed up execution - use the internal clock correctly in the beacon handler Final round of tests improvements (#1187) * Test improvements * Fix connectivity test * Fix test after rebase * Revert to computed round in test * Do not use a package level variable for command output * Allow node status to reply regardless of DKG state or params * We expect at least the previous round to be caught up by now * The sync mechanism will balance out and resolve the discrepancy, eventually. 
* Speed up the orchestration tests shutdown * A bit more sleep in test test with retrive old beacon from new node (#761) * test with retrive old beacon from new node * Update to latest code changes --------- Co-authored-by: Florin Pățan <florinpatan@gmail.com> Handle error in test (#1212) * Handle error in test Fix typo in vault (#1213) * Fix typo in vault Allow logs to propagate to the test name (#1114) * Make logger propagate across whole codebase * Remove sleep in tests * Remove deprecated sleep function * Fix missing cancel function missing call DKG migration tool from v1.* to v2.0.0 (#1215) * DKG migration tool from v1.* to v2.0.0 * implemented the migration * added a CLI command for running it automagically for each beaconID * reload beacon upon migration * added integration test for restoring node state from migration run tests on develop (#1216) improved output for DKG status command (#1218) - fixed a bug where the transition time was 0 for initial DKG - increased genesis time for tests due to the memdb test getting some weird sync deadlock Add OpenTelemetry tracing instrumentation (#1199) * OT tracing using a customizable endpoint --------- Co-authored-by: Patrick McClurg <patrick.mcclurg@protocol.ai> Co-authored-by: PM <3749956+CluEleSsUK@users.noreply.github.com> show DKG status string instead of iota (#1221) Removing redundant Expires header from server (#1191) * Removing redundant Expires header from server If there is a Cache-Control header with the max-age or s-maxage directive in the response, the Expires header is ignored. 
Also reworking a few Cache-control max-age directives Move to internal (#1214) * Add OpenTelemetry tracing instrumentation Initial commit for adding OT tracing using a customizable endpoint Instrument the DKG code Move everything to internal Upgrade golangci-lint to 1.50.2 and correctly configure the imports linter Allow Docker to build with the new paths Align Go versions to 1.19.5 in both CI and Docker Expose types to allow for Gossip client demo. Drop WithLogger usage More code cleanup. Stop using panic in tests Allow clients to be built based on Lotus example and demo code moved tracing docker-compose * added details of tracing to the readme * fixed import error * extracted `crypto` module to top level * ran GCI --------- Co-authored-by: Patrick McClurg <patrick.mcclurg@protocol.ai> cherry-picked commits from master (#1235) * threshold monitoring for beacon processes (#1220) * threshold monitoring for beacon processes beacon processes now log some error and warning messages whenever we get close to or cross a threshold number of failures per node * return on finished threshold monitor Co-authored-by: Yolan Romailler <AnomalRoil@users.noreply.github.com> * added prometheus metric for failing to send partials (#1223) * added prometheus metric for failing to send partials * moved import around * fixed tests * made sleeps in the publicrandstream variable - upped local sleep slightly * aggregate prometheus partial errors by beaconID (#1232) The additional fields in the partial error counter meant that a new counter was created for _every_ combination. Given that they could only be emitted for each round, a new entry was created for every partial... 
which is definitely wrong * use a gauge for counting nodes that have invalid partials (#1233) * use a gauge for counting nodes that have invalid partials strictly speaking this will lose sight of nodes that go down and back up between prometheus polls; however it's much easier to handle in grafana * combined threshold monitor and prometheus call * return 404 when no beacon hash exists instead of 500 (#1234) * return 404 when no beacon hash exists instead of 500 * add test for 404 on nonexistent chain * cherry-picked master commits and fixed a few bits --------- Co-authored-by: Yolan Romailler <AnomalRoil@users.noreply.github.com> Co-authored-by: Houlton McGuinn <hoult.mcguinn@gmail.com> Addressing comments for #1199 (#1236) * fixing belated code review comments about 1199 * callback now has access to bp.log * replacing otel.SpanContext with plain Context * using new context for DKG update listeners * using new context for broadcast * using new context for sender run Updating our dependencies to support Go 1.20 (#1238) * Updating our dependencies to support Go 1.20 This also allows us to finally update to the latest grpc since our blocking dependencies got updated too. 
* Updating to protoc 3.20.3 * updating to golangci-lint 1.52.2 * updating to latest golangci * Not on my watch Automate DKG migration on startup (#1225) * automate DKG migration on node startup handle no such file or directory error as expected fixed the workflow a little to ensure no nil references check no such group file error specifically DKG messages between participants are now signed and verified (#1229) Miscellaneous fixes from merged reviews (#1237) * added lots of comments to the state machine * don't consume genesis beacon from others * make transition time mandatory to CLI * added a test around pretty printing * added missing flag to demo project * added missing tests all DKG packets are now gossiped throughout the network (#1230) * DKG messages between participants are now signed and verified * all DKG packets are now gossiped throughout the network * DKG RPCs are split into `Command`s and `GossipPacket`s * the flow of execution is now a lot simpler and with fewer higher-order-function shenanigans * remove the unnecessary (racy) brokenbroadcaster * updated with the rebased signatures branch * additional nil checks * added DKG failed state in for DKGs that don't hit threshold * use beaconID from metadata rather than passing it around * added a timeout for the large DKG * removed unnecessary grpc call options * Update internal/dkg/actions.go Co-authored-by: Yolan Romailler <AnomalRoil@users.noreply.github.com> * use mock clock for determining all DKG timings * fixed some references to clocks * fixed an old test where multiple clocks were being created --------- Co-authored-by: Yolan Romailler <AnomalRoil@users.noreply.github.com> DKG refactor: use existing keys to generate proposal (#1253) * use existing keys when generating proposal files * fixed an encoding test * copy instead of pointer to defeat the dreaded race detector * rebase + errors.Is Don't use more than 1 handler (#1260) * Don't use more than 1 handler * short tests must be quick Cherry-pick 
master ontop of develop (#1256) * added prometheus metric for failing to send partials * moved import around * fixed tests * made sleeps in the publicrandstream variable - upped local sleep slightly aggregate prometheus partial errors by beaconID (#1232) The additional fields in the partial error counter meant that a new counter was created for _every_ combination. Given that they could only be emitted for each round, a new entry was created for every partial... which is definitely wrong use a gauge for counting nodes that have invalid partials (#1233) * use a gauge for counting nodes that have invalid partials strictly speaking this will lose sight of nodes that go down and back up between prometheus polls; however it's much easier to handle in grafana * combined threshold monitor and prometheus call return 404 when no beacon hash exists instead of 500 (#1234) * return 404 when no beacon hash exists instead of 500 * add test for 404 on nonexistent chain bumped patch version for testnet release (#1239) fixed a bug where docker image is pushed with invalid tag (#1240) fixed docker publish action to publish to GHCR (#1241) * fixed docker publish action to publish to GHCR added reference to GCHR in docker publish (#1242) updated docker entrypoint permissions so the container starts (#1243) Update go-libp2p to v0.27.3 (#1244) Updating our dependencies (#1246) * Updating our dependencies for mainnet release * Bumping version to v1.5.6 upgrade libp2p to v0.27.5 (#1247) Fixing wrong logging levels when checking past beacons with sync and shorter tests (#1248) * Fixing wrong logging levels when checking past beacons with sync * Skipping slow tests in short mode Adding a G1 scheme that's RFC conformant (#1249) * Adding a G1 scheme that's RFC conformant * Using Go 1.20 in GHA * patching GHA to support new RFC conformant scheme added new flag for backup out with better text (#1250) updated dockerfiles to use go 1.20 (#1255) * updated dockerfiles to use go 1.20 - added a 
github action step for building the docker image on branches to ensure compat * renamed the docker build task * added branches for push * run on all pull requests patching updating deps and patching last details More slow test remove from short improving test logging to debug improving TestBeaconSync fixing TestDrandPublicRand fixing test trying to avoid unfinished sync avoiding race in DKG fixing weird build issue adding TestBeaconSync to slow test improve logging tag slow test removing noisy log Adding scheme to self-signature of key Also adding keys signature validation to DKG proposal validation adding missing imports using proper scheme in tests properly close control client in cli Hopefully avoiding the control tcp issues avoid closing conn always cancelling the stream ctx correctly do not close conn increased sleep time when waiting for clock changed DKG runner status check to not optimistically return an error Co-authored-by: PM <3749956+CluEleSsUK@users.noreply.github.com> demo bug fixes; execution kickoff alignment (#1263) * fixed error generating first proposal and panic when viewing DKG state * all nodes now share a kickoff time set by the leader, rather than simply waiting 5mins parallelise state machine tests (#1269) separate initial DKG and resharing commands (#1264) * separate initial DKG and resharing commands * time -> delay Tear-out of client and relay code. (#1265) * Initial tear-out. 
Working daemon, failing tests * Removed builtin TLS * removed unused initial stores from daemon * exposing the time of round APIs * Improving version handling * migration path self-sign * simplify version handling in v2, we keep retro-compat with 1.5.7+ and 2.x-1.y * Fixing metrics * Leaner logs, fixing wait in test Cherry picking typos corrections (#1274) * cherry picking typo corr * Correcting comment --------- Co-authored-by: Alejandro Criado-Pérez <alejandro@criadoperez.com> Properly process all beacons when self-signing (#1275) Merging version changes from master to v2 (#1276) migration path for pub keys (#1277) * added a migration path for new public key format * removed some TLS mentions * added test for migration path and fixed some errors * PR comments and fixed generated protobuf * fixed some linter rules and turned off depguard * modified catchup period (and renamed in places) re-add tls for keys (#1281) * re-added TLS to identities/participants * fixed some linter complaints * fixed some tests * remove TLS from check and follow chains * add TLS/insecure to remote-status stuff * Bringing back TLS keys * added missing TLS to TomlParticipant * use insecure keys for more testing instances --------- Co-authored-by: Yolan Romailler <yolan@protocol.ai> Exposing the test/mock package (#1278) * Exposing the test/mock package * Making sure we are using the expected output format More specifically chain.Beacon doesn't have the same tags for JSON formatting as client.RandomData did and so it wasn't being marshaled as it used to when we used the asRD and the RandomData struct in our client code for the HTTP server * Updating to latest x/net dependencies * fixing orchestrator as per review chore(docs): update TOC addressed merge comments from PR readded some insecures in docker updated some docker image versions seal of approval

CluEleSsUK requested review from AnomalRoil, nikkolasg and willscott as code owners April 26, 2023 09:45

AnomalRoil requested changes Apr 27, 2023

chain/beacon/node.go (outdated, resolved)

metrics/metrics.go (resolved)

chain/beacon/node.go (outdated, resolved)

CluEleSsUK added 2 commits May 2, 2023 10:10

added prometheus metric for failing to send partials

135826d

moved import around

20016b1

CluEleSsUK force-pushed the feature/count-failed-sending-partials branch from a2739ef to 20016b1 May 2, 2023 08:12

CluEleSsUK added 2 commits May 2, 2023 10:55

fixed tests

1bfcc26

made sleeps in the publicrandstream variable

e5893c8

- upped local sleep slightly

AnomalRoil approved these changes May 8, 2023

AnomalRoil merged commit 7a99115 into master May 8, 2023

AnomalRoil deleted the feature/count-failed-sending-partials branch May 8, 2023 09:13

AnomalRoil restored the feature/count-failed-sending-partials branch May 8, 2023 09:13

AnomalRoil deleted the feature/count-failed-sending-partials branch May 23, 2023 02:28
