Conversation

@tyd
Contributor

@tyd tyd commented Nov 3, 2025

Per discussion in #526, here's a PR to introduce maxRecreateAttempts and to panic when the limit is reached. I set the default to 10, which seemed reasonable.

Notes:

  • maxRecreateAttempts is a var so we can override it in tests with a smaller value. Outside of the test context its value comes from the const maxRecreateAttemptsDefault.
  • Testing panics isn't easy, so I had to create a couple of tests that accomplish it via a sub-process. Please check those closely to make sure that approach is OK.
  • Added some code to fix a potential bug introduced in "make tests parallel" (#514). After a failed streamer recreation, we now set c.streamer = nil so the next loop iteration detects the failure immediately. This forces a recreation attempt on each iteration until we hit the max attempts limit, rather than waiting for 5 more GetEvent() failures or hitting a nil pointer crash.

New Tests:

  • TestMaxRecreateAttempts
  • TestMaxRecreateAttemptsPanic
  • testMaxRecreateAttemptsPanicSubprocess

fix: Possible bug introduced with PR block#514, need to set c.streamer to nil to make sure it's fully recreated.

Signed-off-by: Tyler Davis <tyler@tylerdavis.com>
@tyd tyd marked this pull request as draft November 3, 2025 21:32
Signed-off-by: Tyler Davis <tyler@tylerdavis.com>
@morgo
Collaborator

morgo commented Nov 3, 2025

Sorry, you are getting a CI failure due to #522 - it is unrelated.

If this is ready for review, I can take a look tomorrow.

@tyd tyd marked this pull request as ready for review November 3, 2025 22:52
@tyd
Contributor Author

tyd commented Nov 3, 2025

> Sorry, you are getting a CI failure due to #522 - it is unrelated.
>
> If this is ready for review, I can take a look tomorrow.

Ah ok, I was trying to reproduce it locally earlier but ran out of time. It's ready now.

Signed-off-by: Tyler Davis <tyler@tylerdavis.com>
@morgo morgo self-requested a review November 4, 2025 14:21
Comment on lines +432 to +435
if recreateAttempts >= maxRecreateAttempts {
	panic(fmt.Sprintf("failed to recreate binlog streamer after %d attempts, current position: %v, giving up",
		recreateAttempts, c.getBufferedPos()))
}
Collaborator

I realize this is in a goroutine, so returning errors is challenging, but the repl client runs in a context, and the cancel function is available as c.cancelFunc(), so you may be able to simplify and remove the panic?

Comment on lines +784 to +814
if os.Getenv("TEST_PANIC_SUBPROCESS") == "1" {
	// This is the subprocess that should panic
	testMaxRecreateAttemptsPanicSubprocess(t)
	return
}

// Run the test in a subprocess
cmd := exec.Command(os.Args[0], "-test.run=TestMaxRecreateAttemptsPanic")
cmd.Env = append(os.Environ(), "TEST_PANIC_SUBPROCESS=1")
output, err := cmd.CombinedOutput()

outputStr := string(output)
t.Logf("Subprocess output:\n%s", outputStr)

// We expect the subprocess to exit with non-zero (panic or crash)
if err == nil {
	t.Fatal("Expected subprocess to panic or crash, but it exited successfully")
}

if !strings.Contains(outputStr, "consecutive errors") {
	t.Errorf("Expected to see consecutive errors. Output:\n%s", outputStr)
}
if !strings.Contains(outputStr, "Failed to recreate streamer") {
	t.Errorf("Expected to see recreation attempt. Output:\n%s", outputStr)
}

// If we DID get the max attempts panic message, verify it's correct
if strings.Contains(outputStr, "failed to recreate binlog streamer after") {
	if !strings.Contains(outputStr, "giving up") {
		t.Errorf("Panic message should contain 'giving up'. Output:\n%s", outputStr)
	}
Collaborator

The technique here to run a sub-process is very clever :-) But I think it's better to either add a recover for panic, or see if you can get this to work correctly with context canceling. If you pursue context canceling, we need to make sure it works end-to-end because the migration caller should also detect that the replClient has failed and bail.

tyd and others added 2 commits November 4, 2025 09:33
refactor: move maxRecreateAttempts test override to TestMain and set to 3
fix: remove TestMaxRecreateAttempts, may interfere with parallel tests

Signed-off-by: Tyler Davis <tyler@tylerdavis.com>
@morgo morgo self-requested a review November 4, 2025 17:47
Collaborator

@morgo morgo left a comment

LGTM

We may refactor and add some tests in migration to make sure the repl client can safely fail partway through (I'm not sure that's something we've got good coverage for). But this is an improvement as is, and it has test coverage, so there is no reason to hold it up.

@morgo morgo enabled auto-merge November 4, 2025 17:49
@morgo morgo merged commit fb6d701 into block:main Nov 4, 2025
7 checks passed
@tyd tyd deleted the retry-panic branch November 4, 2025 17:55

2 participants