Skip to content

Comments

Fixing overlord issued too many redirects#12908

Merged
abhishekagarwal87 merged 2 commits intoapache:masterfrom
cryptoe:overlord_race_redirect
Aug 17, 2022
Merged

Fixing overlord issued too many redirects#12908
abhishekagarwal87 merged 2 commits intoapache:masterfrom
cryptoe:overlord_race_redirect

Conversation

@cryptoe
Copy link
Contributor

@cryptoe cryptoe commented Aug 16, 2022

Description

While debugging a cluster, I found that during an overlord switch, the tasks were failing.

The task log showed

Caused by: org.apache.druid.rpc.RpcException: Service [overlord] issued too many redirects
	at org.apache.druid.rpc.ServiceClientImpl$1.onSuccess(ServiceClientImpl.java:203)
	at org.apache.druid.rpc.ServiceClientImpl$1.onSuccess(ServiceClientImpl.java:168)
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1181)

Which lead me to debug the overlord redirect code.

The RedirectFilter on the overlord checks if the current node is the elected leader and is initialized using TaskMaster.isLeader() and if no, redirect it to the elected leader using overlordLeaderSelector.getCurrentLeader().

Now the redirectFilter was redirecting requests to itself resulting in a never-ending loop until the initialized flag is set hence causing issued too many redirects for the clients.

Note that the new leader may take some time to set the initialized flag as it has to fetch tasks/locks from the metadatastore.

Adjusted the code in such a way that the redirectFilter will not redirect the requests to itself resulting in a SC_SERVICE_UNAVAILABLE = 503 which will cause the overlord clients to back off exponentially which IMHO is the correct behaviour.


Key changed/added classes in this PR
  • TaskMaster

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87 abhishekagarwal87 merged commit a3a9c5f into apache:master Aug 17, 2022
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
if (redirectCount >= MAX_REDIRECTS) {
retVal.setException(new RpcException("Service [%s] issued too many redirects", serviceName));
retVal.setException(new RpcException(
"Service [%s] redirected too many times [%d] to invalid url %s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message here isn't right: the url isn't known to be invalid at this point. It may very well be valid. It just exceeded the limit on number of redirects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants