Skip to content

Conversation

@masih
Copy link
Member

@masih masih commented May 22, 2024

The intermittent failures always occur for async tests, in form of going beyond the maximum number of rounds allowed in tests (mostly 10). The number of rounds is directly impacted by the order of message delivery to each node, dictated by the latency model.

The Async test options use a latency model that is instantiated when the simulation options are instantiated, not when the simulation itself is instantiated and run. This results in the reuse of the same latency model object across multiple tests.

In go tests may be run in parallel, but the same test never runs in parallel with itself when repeated multiple times via -count=n flag. When test are run in parallel using the same latency object across tests becomes the root cause of indeterministic behaviour, where at times a test case can require more rounds than allowed to complete.

The changes here introduce a latency model instantiator, called latency.Modeler which encapsulates the responsibility of constructing a latency model object used by a simulation. The simulation is then adopted to take modelers as option, not already instantiated models. This change dramatically reduces the scope of error that was previously occurring where latency models with unclean RNG state were being used by multiple test and assures that the order of message delivery in tests is deterministic regardless of the test execution parallelism.

Additionally, maximum test parallelism is enabled across all table tests and fuzz tests to reduce the total execution runtime.

Fixes #262

@codecov-commenter
Copy link

codecov-commenter commented May 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.00%. Comparing base (a3670f6) to head (6af1dc2).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #269      +/-   ##
==========================================
+ Coverage   71.99%   72.00%   +0.01%     
==========================================
  Files          32       32              
  Lines        2389     2390       +1     
==========================================
+ Hits         1720     1721       +1     
  Misses        537      537              
  Partials      132      132              
Files Coverage Δ
sim/options.go 75.67% <100.00%> (+0.33%) ⬆️

@masih masih requested review from Kubuxu and anorth May 22, 2024 17:27
@github-actions
Copy link

Fuzz test failed on commit 5fc1170. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9195742636 -n testdata

Aleternatively, download directly from here.

@masih masih force-pushed the masih/fix-intermittent-test-failures branch from 5fc1170 to 6af1dc2 Compare May 22, 2024 17:32
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM although I'd consider breaking this into two PRs:

  1. One to change from a latency model to latency modeler.
  2. One to change the tests (parallel, naming, etc.).

@masih masih force-pushed the masih/fix-intermittent-test-failures branch from 6af1dc2 to c683815 Compare May 23, 2024 10:14
@masih
Copy link
Member Author

masih commented May 23, 2024

I'd consider breaking this into two PRs

Sure. Test refactors and extra parallelism increase is now removed from this one. Next one coming up shortly.

The intermittent failures always occur for async tests, in form of going
beyond the maximum number of rounds allowed in tests (mostly 10). The
number of rounds is directly impacted by the order of message delivery
to each node, dictated by the latency model.

The Async test options use a latency model that is instantiated when
the simulation _options_ are instantiated, not when the simulation
itself is instantiated and run. This results in the reuse of the same
latency model object across multiple tests.

In go tests may be run in parallel, but the same test never runs in
parallel with itself when repeated multiple times via `-count=n` flag.
When test are run in parallel using the same latency object across tests
becomes the root cause of indeterministic behaviour, where at times a
test case can require more rounds than allowed to complete.

The changes here introduce a latency model instantiator, called
`latency.Modeler` which encapsulates the responsibility of constructing
a latency model object used by a simulation. The simulation is then
adopted to take modelers as option, not already instantiated models.
This change dramatically reduces the scope of error that was previously
occurring where latency models with unclean RNG state were being used by
multiple test and assures that the order of message delivery in tests
is deterministic regardless of the test execution parallelism.

Fixes #262
@masih masih force-pushed the masih/fix-intermittent-test-failures branch from c683815 to a18a596 Compare May 23, 2024 10:15
@masih masih added this pull request to the merge queue May 23, 2024
Merged via the queue into main with commit 4415a55 May 23, 2024
@masih masih deleted the masih/fix-intermittent-test-failures branch May 23, 2024 10:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Interference in parallel tests with TestHonestMultiInstance_Agreement

4 participants