-
Notifications
You must be signed in to change notification settings - Fork 26
feat: Add maxRecreateAttempts logic #527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fix: Possible bug introduced with PR block#514, need to set c.streamer to nil to make sure its fully recreated. Signed-off-by: Tyler Davis <tyler@tylerdavis.com>
Signed-off-by: Tyler Davis <tyler@tylerdavis.com>
|
Sorry, you are getting a CI failure due to #522 - it is unrelated. If this is ready for review, I can take a look tomorrow. |
Ah ok, was trying to reproduce locally earlier but ran out of time. Its ready now. |
Signed-off-by: Tyler Davis <tyler@tylerdavis.com>
| if recreateAttempts >= maxRecreateAttempts { | ||
| panic(fmt.Sprintf("failed to recreate binlog streamer after %d attempts, current position: %v, giving up", | ||
| recreateAttempts, c.getBufferedPos())) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this is in a go routine, so returning errors is challenging, but the repl client runs in a context, and the cancelFunc is embedded as c.cancelFunc(), so you may be able to simplify and remove the panic?
| if os.Getenv("TEST_PANIC_SUBPROCESS") == "1" { | ||
| // This is the subprocess that should panic | ||
| testMaxRecreateAttemptsPanicSubprocess(t) | ||
| return | ||
| } | ||
|
|
||
| // Run the test in a subprocess | ||
| cmd := exec.Command(os.Args[0], "-test.run=TestMaxRecreateAttemptsPanic") | ||
| cmd.Env = append(os.Environ(), "TEST_PANIC_SUBPROCESS=1") | ||
| output, err := cmd.CombinedOutput() | ||
|
|
||
| outputStr := string(output) | ||
| t.Logf("Subprocess output:\n%s", outputStr) | ||
|
|
||
| // We expect the subprocess to exit with non-zero (panic or crash) | ||
| if err == nil { | ||
| t.Fatal("Expected subprocess to panic or crash, but it exited successfully") | ||
| } | ||
|
|
||
| if !strings.Contains(outputStr, "consecutive errors") { | ||
| t.Errorf("Expected to see consecutive errors. Output:\n%s", outputStr) | ||
| } | ||
| if !strings.Contains(outputStr, "Failed to recreate streamer") { | ||
| t.Errorf("Expected to see recreation attempt. Output:\n%s", outputStr) | ||
| } | ||
|
|
||
| // If we DID get the max attempts panic message, verify it's correct | ||
| if strings.Contains(outputStr, "failed to recreate binlog streamer after") { | ||
| if !strings.Contains(outputStr, "giving up") { | ||
| t.Errorf("Panic message should contain 'giving up'. Output:\n%s", outputStr) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The technique here to run a sub-process is very clever :-) But I think it's better to either add a recover for panic, or see if you can get this to work correctly with context canceling. If you pursue context canceling, we need to make sure it works end-to-end because the migration caller should also detect that the replClient has failed and bail.
refactor: move maxRecreateAttempts test override to TestMain and set to 3 fix: remove TestMaxRecreateAttempts, may interfere with parralel tests Signed-off-by: Tyler Davis <tyler@tylerdavis.com>
morgo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
We may refactor + some tests in migration to make sure the repl client can safely fail part way through (I'm not sure that's something we've got good coverage for). But this is an improvement as is, and it has test coverage, so there is no reason to hold it up.
Per discussion in #526, here's a PR to introduce
maxRecreateAttemptsand to panic when the limit is reached. I set default to 10 which seemed reasonable.Notes:
maxRecreateAttemptsis a var so we can override it in test to a smaller value. Outside of the test context it's value comes from theconst maxRecreateAttemptsDefaultc.streamer = nilso the next loop iteration detects the failure immediately. This forces recreation on each iteration until we hit the max attempts limit, rather than waiting for 5 moreGetEvent()failures or hitting a nil pointer crash.New Tests:
TestMaxRecreateAttemptsTestMaxRecreateAttemptsPanictestMaxRecreateAttemptsPanicSubprocess