HttpContactPointBootstrap always dead after probing timeout #209

Open
Roiocam opened this issue Apr 25, 2024 · 4 comments

Comments

@Roiocam
Member

Roiocam commented Apr 25, 2024

Explain

During cluster bootstrapping, we create a child actor to handle HTTP probing. This actor uses the probingFailureTimeout setting as its deadline:

/**
* If probing keeps failing until the deadline triggers, we notify the parent,
* such that it rediscover again.
*/
private var probingKeepFailingDeadline: Deadline = settings.contactPoint.probingFailureTimeout.fromNow

At the same time, the same probingFailureTimeout setting is also used as the timeout of the probing future:

log.debug("Probing [{}] for seed nodes...", probeRequest.uri)
val reply = http.singleRequest(probeRequest, settings = connectionPoolWithoutRetries).flatMap(handleResponse)
val afterTimeout = after(settings.contactPoint.probingFailureTimeout, context.system.scheduler)(replyTimeout)
Future.firstCompletedOf(List(reply, afterTimeout)).pipeTo(self)

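Both the timeout and the deadline are handled in a single place. As you can see, because the request timeout has the same duration as the deadline, the deadline is already overdue by the time a timeout failure arrives, so the else branch is never executed after a probing timeout.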

case Status.Failure(cause) =>
  log.warning("Probing [{}] failed due to: {}", probeRequest.uri, cause.getMessage)
  if (probingKeepFailingDeadline.isOverdue()) {
    log.error("Overdue of probing-failure-timeout, stop probing, signaling that it's failed")
    context.parent ! BootstrapCoordinator.Protocol.ProbingFailed(contactPoint, cause)
    context.stop(self)
  } else {
    // keep probing, hoping the request will eventually succeed
    scheduleNextContactPointProbing()
  }
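
To make the timing argument concrete, here is a minimal sketch that models it with plain durations instead of real actors. The object and method names are hypothetical and not part of the project; the only assumption is that the probe starts some small, positive delay after the deadline was set.

```scala
import scala.concurrent.duration._

// Hypothetical model of the timeline; not code from the project.
object ProbingTimeline {
  // Time elapsed since the deadline was set, at the moment the probe's timeout fires.
  def elapsedAtTimeout(probeStartDelay: FiniteDuration, requestTimeout: FiniteDuration): FiniteDuration =
    probeStartDelay + requestTimeout

  // True when the deadline has already passed at the moment the timeout failure arrives.
  def deadlineOverdueAtTimeout(
      deadline: FiniteDuration,
      probeStartDelay: FiniteDuration,
      requestTimeout: FiniteDuration): Boolean =
    elapsedAtTimeout(probeStartDelay, requestTimeout) > deadline

  // Same duration for deadline and request timeout, as in the current code.
  val sameSetting: Boolean = deadlineOverdueAtTimeout(5.seconds, 10.millis, 5.seconds)

  // A deadline that is longer than the request timeout leaves room for retries.
  val separateSettings: Boolean = deadlineOverdueAtTimeout(30.seconds, 10.millis, 5.seconds)
}
```

With the same value for both, any positive start delay makes the deadline overdue when the timeout fires, so the first timeout always takes the stop branch.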

Discuss

I think we may need two separate settings, one for the deadline and one for the timeout. That way, when the contact point node suffers from network latency, the HttpContactPointBootstrap actor does not need to be destroyed and recreated so frequently. At least we would have some buffer time.
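
A rough sketch of what two separate settings could look like, purely as an illustration. The name probeRequestTimeout is hypothetical and does not exist in the project today; the calculation ignores the interval between probes.

```scala
import scala.concurrent.duration._

// Hypothetical settings holder; `probeRequestTimeout` is an assumed new setting.
final case class ProbeSettings(
    probingFailureTimeout: FiniteDuration, // overall deadline before giving up
    probeRequestTimeout: FiniteDuration    // timeout of a single probe request
) {
  require(probeRequestTimeout > Duration.Zero, "probeRequestTimeout must be positive")
  require(
    probeRequestTimeout < probingFailureTimeout,
    "probeRequestTimeout should be shorter than probingFailureTimeout")

  // Rough upper bound on how many timed-out probes fit before the deadline.
  def timedOutProbesBeforeDeadline: Long =
    probingFailureTimeout.toMillis / probeRequestTimeout.toMillis
}

object ProbeSettingsExample {
  val settings: ProbeSettings = ProbeSettings(10.seconds, 3.seconds)
  val attempts: Long = settings.timedOutProbesBeforeDeadline
}
```

With a 10 second deadline and a 3 second request timeout, about three probes can time out before the actor gives up, which is the buffer described above.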

wdyt @pjfanning @He-Pin @mdedetrich @samueleresca

@He-Pin
Member

He-Pin commented Apr 25, 2024

Maybe we can add some kind of time factor for it, e.g. like what we are using in the testkit?
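
One possible reading of the factor idea, as a sketch only: derive the deadline by scaling the request timeout. The helper name scaled and the factor value are assumptions for illustration, not existing project or testkit API.

```scala
import scala.concurrent.duration._

// Hypothetical helper; it only illustrates scaling one duration by a factor.
object TimeFactorSketch {
  def scaled(base: FiniteDuration, factor: Double): FiniteDuration = {
    require(factor > 0 && !factor.isNaN && !factor.isInfinite, "factor must be a positive finite number")
    // FiniteDuration * Double yields a Duration, so narrow it back explicitly.
    (base * factor) match {
      case finite: FiniteDuration => finite
      case other                  => throw new IllegalArgumentException(s"Scaled duration is not finite: $other")
    }
  }

  val requestTimeout: FiniteDuration = 5.seconds
  // Assumed factor of 3.0: the deadline allows roughly three request timeouts.
  val deadline: FiniteDuration = scaled(requestTimeout, 3.0)
}
```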

@He-Pin
Member

He-Pin commented Apr 25, 2024

Or we can introduce a connect timeout and start the tick once the connection is established.

@Roiocam
Member Author

Roiocam commented Apr 25, 2024

eg what we are using in the testkit?

Which testkit do you mean?

HttpContactPointBootstrap has very few test cases. Is that why this problem went undiscovered for so long?

class HttpContactPointBootstrapSpec extends AnyWordSpec with Matchers {
  "HttpContactPointBootstrap" should {
    "use a safe name when connecting over IPv6" in {
      val name = HttpContactPointBootstrap.name(Host("[fe80::1013:2070:258a:c662]"), 443)
      ActorPath.isValidPathElement(name) should be(true)
    }
  }
}

@He-Pin
Member

He-Pin commented Apr 25, 2024

I mean some kind of factor. Anyway, send a PR when you have time and I will review it.
