
Investigate fuzzing process hung or terminated unexpectedly #251

Closed
masih opened this issue May 20, 2024 · 10 comments · Fixed by #277
Assignees: masih
Labels: testing (Related to testing and validation)

Comments

masih (Member) commented May 20, 2024

Specific fuzz tests seem to sporadically fail with error: fuzzing process hung or terminated unexpectedly: exit status 2. This could be a bug of sorts in the fuzzer itself as outlined in golang/go#56238. Or possibly a panic during simulation?

See example CI fuzz failure here.

masih added the 'testing (Related to testing and validation)' label May 20, 2024
masih (Member, Author) commented May 20, 2024

More context:

  • this failure is intermittent. For example, rerunning this build passed.
  • it seems to be sensitive to the fuzzer worker count. I can reproduce this locally with -parallel=4, i.e. what the CI fuzzer picks; see the example command below.
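
A minimal local reproduction might look like the following; only -parallel=4 is taken from the comment above, the fuzz target name comes from later in this thread, and the package path and -fuzztime value are assumptions to adjust for the actual repository layout:

    go test -run='^$' -fuzz=FuzzHonestMultiInstance_AsyncAgreement -fuzztime=60s -parallel=4 ./sim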

masih self-assigned this May 22, 2024
masih added this to F3 May 22, 2024
masih moved this to In progress in F3 May 22, 2024
masih added a commit that referenced this issue May 22, 2024
The intermittent failures always occur for async tests, in the form of
exceeding the maximum number of rounds allowed in tests (mostly 10). The
number of rounds is directly affected by the order of message delivery to
each node, which is dictated by the latency model.

The Async test options use a latency model that is instantiated when
the simulation _options_ are instantiated, not when the simulation
itself is instantiated and run. This results in the reuse of the same
latency model object across multiple tests.

In Go, tests may run in parallel, but the same test never runs in
parallel with itself when repeated multiple times via the `-count=n` flag.
When tests run in parallel, sharing the same latency model object across
them becomes the root cause of non-deterministic behaviour, where at times
a test case requires more rounds than allowed to complete.

The changes here introduce a latency model instantiator, called
`latency.Modeler`, which encapsulates the responsibility of constructing
the latency model used by a simulation. The simulation is then adapted to
take modelers as options, rather than already-instantiated models. This
change dramatically reduces the scope of the error that previously
occurred, where latency models with unclean RNG state were shared by
multiple tests, and ensures that the order of message delivery in tests
is deterministic regardless of test execution parallelism.

Additionally, maximum test parallelism is enabled across all table tests
and fuzz tests to reduce the total execution runtime.

Fixes #262, #251
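
A minimal sketch of the modeler pattern described in that commit message; the type names and the toy uniform model below are assumptions for illustration, not the actual go-f3 latency API:

    package latency

    import (
        "math/rand"
        "time"
    )

    // Model computes the delivery latency of a message between two
    // participants. Implementations may carry mutable RNG state.
    type Model interface {
        Sample(from, to int) time.Duration
    }

    // Modeler constructs a fresh Model for a single simulation run, so that
    // no RNG state is ever shared between simulations executing in parallel.
    type Modeler func(seed int64) Model

    // uniform is a toy model with its own RNG, used purely for illustration.
    type uniform struct {
        rng *rand.Rand
        max time.Duration
    }

    // Sample draws a latency uniformly in [0, max); max must be positive.
    func (u *uniform) Sample(from, to int) time.Duration {
        return time.Duration(u.rng.Int63n(int64(u.max)))
    }

    // Uniform returns a Modeler; every call to the returned function yields
    // a brand-new model with independent RNG state.
    func Uniform(max time.Duration) Modeler {
        return func(seed int64) Model {
            return &uniform{rng: rand.New(rand.NewSource(seed)), max: max}
        }
    }

The simulation would then invoke the modeler once per run, so every run gets a model with fresh RNG state even when tests execute in parallel or are repeated via -count=n.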
masih moved this from In progress to Todo in F3 May 23, 2024
masih (Member, Author) commented May 23, 2024

More context:

  • it seems to always fail for FuzzHonestMultiInstance_AsyncAgreement

Kubuxu (Contributor) commented May 23, 2024

Each fuzz input run cannot take longer than 10s: https://github.com/golang/go/blob/019353d5323fcbffde939f4e85a68bd0093c6e14/src/internal/fuzz/worker.go#L492-L494

masih (Member, Author) commented May 24, 2024

Yep, that's pretty much the conclusion in the golang issue linked in the description: the per-input timeout is not configurable in fuzz tests.
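
Since that per-input deadline cannot be raised, the workaround in the commits below is to bound the amount of work each fuzz input performs. A rough, self-contained sketch of the idea; FuzzBoundedSimulation and runShortSimulation are hypothetical placeholders rather than the real test code:

    package sim

    import "testing"

    // FuzzBoundedSimulation caps its own work so that every input finishes
    // well within the fuzz worker's roughly 10-second deadline.
    func FuzzBoundedSimulation(f *testing.F) {
        f.Add(int64(1)) // static corpus entry
        f.Fuzz(func(t *testing.T, seed int64) {
            const maxRounds = 10 // fail fast instead of running unbounded
            if rounds := runShortSimulation(seed, maxRounds); rounds > maxRounds {
                t.Fatalf("simulation took %d rounds, want at most %d", rounds, maxRounds)
            }
        })
    }

    // runShortSimulation is a hypothetical placeholder: the real code would
    // run a small simulated network and report the number of rounds used.
    func runShortSimulation(seed int64, maxRounds int) int {
        _ = seed
        return maxRounds
    }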

masih added a commit that referenced this issue May 24, 2024
`FuzzHonestMultiInstance_AsyncAgreement` intermittently fails on CI,
most likely due to taking too long to complete a test.

To avoid intermittent failures:
* Decrease the instance count to 3K
* change the honest multi instance tests to cover incremental network
  sizes, up to 5 for sync and up to 4 for async
* add additional static corpus to the async test for better coverage.

Fixes #251
masih added a commit that referenced this issue May 24, 2024
`FuzzHonestMultiInstance_AsyncAgreement` intermittently fails on CI,
most likely due to taking too long to complete a test.

To avoid intermittent failures:
* Decrease the instance count to 3K
* change the honest multi instance tests to cover incremental network
  sizes, up to 4
* add additional static corpus that failed the fuzz test

Fixes #251
masih added a commit that referenced this issue May 24, 2024
`FuzzHonestMultiInstance_AsyncAgreement` intermittently fails on CI,
most likely due to taking too long to complete a test.

To avoid intermittent failures:
* change the honest multi instance tests to cover incremental network
  sizes, up to 4
* reduce the length of randomly generated EC chains at each instance; this
  should not affect the quality of what the test covers, but it should make
  it run faster due to less GC.

Additionally, add static corpus that failed locally after extended fuzz
time to the fuzz test.

Fixes #251
masih moved this from Todo to In progress in F3 May 24, 2024
github-merge-queue bot pushed a commit that referenced this issue May 27, 2024
github-project-automation bot moved this from In progress to Done in F3 May 27, 2024
Kubuxu (Contributor) commented May 28, 2024

Might still be a bit too slow, at least on my machine:

--- FAIL: FuzzHonest_AsyncMajorityCommonPrefix (20.17s)
    fuzzing process hung or terminated unexpectedly: exit status 2
    Failing input written to testdata/fuzz/FuzzHonest_AsyncMajorityCommonPrefix/6bac56228674bce6
    To re-run:
    go test -run=FuzzHonest_AsyncMajorityCommonPrefix/6bac56228674bce6

Kubuxu (Contributor) commented May 28, 2024

Or not; it is a different test.

masih (Member, Author) commented May 28, 2024

I think I know why FuzzHonest_AsyncMajorityCommonPrefix fails. I'll do a re-scan of fuzz tests.

Did that happen on CI or locally @Kubuxu ?

Kubuxu (Contributor) commented May 28, 2024

Locally

masih (Member, Author) commented May 28, 2024

OK, thanks. Since this issue was closed, CI has been passing consistently, unless I have missed a failure. Having said that, it could be a fluke.

I'll take a closer look at the fuzz tests.

Kubuxu (Contributor) commented May 28, 2024

It is more that we are so close to the 10s limit that fuzzing fails from time to time on my laptop.
