Affected Version: Master branch
Description
Lookup instantiation mainly works by broadcasting two notices, AddNotice and DropNotice. When we create a fresh lookup, we issue an AddNotice; when we update an existing lookup, we issue both an AddNotice and a DropNotice.
We have been seeing inconsistent lookup state in the cluster, especially with JDBC lookups, caused by the following scenario. I was able to reproduce this locally as well.
- Step 1: After a create request for a JDBC lookup is posted, and assuming the JDBC server is healthy, the lookup loads and is ready to serve.
- Step 2: The next poll fails because the JDBC server is unavailable or there is some issue with the lookup's JDBC connection, but the old lookup keeps serving correctly.
- Step 3: A Druid user edits the lookup JSON and posts an update request. The old, good lookup is killed as well, and queries fail with cache state CACHE_NOT_INITIALIZED.
The bug is the behavior in Step 3: we currently issue the AddNotice for the new lookup and the DropNotice for the old lookup blindly, without confirming that the new lookup's cache was populated successfully.
There is another bug at Step 2: we have no resiliency when dealing with the JDBC handle for lookups, and we do not retry the handle on transient errors. If a transient error occurs, we have to wait for the next polling period to populate the lookup. During that time the lookup's state remains CACHE_NOT_INITIALIZED if there has been no previous successful load.
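To illustrate what retrying on a transient error could look like, here is a minimal sketch. It is not Druid code: `TransientRetry`, `callWithRetry`, and its parameters are hypothetical names, and it assumes the JDBC layer surfaces transient failures as `java.sql.SQLTransientException`, which may not hold for every driver.

```java
import java.sql.SQLTransientException;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Druid): retries a JDBC fetch on transient failures. */
final class TransientRetry
{
  private TransientRetry() {}

  /**
   * Runs {@code task} up to {@code maxAttempts} times, doubling the sleep between attempts.
   * Only SQLTransientException is retried (an assumption about how the driver reports
   * transient errors); any other exception propagates immediately.
   */
  static <T> T callWithRetry(Callable<T> task, int maxAttempts, long initialBackoffMillis)
      throws Exception
  {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long backoff = initialBackoffMillis;
    for (int attempt = 1; ; attempt++) {
      try {
        return task.call();
      }
      catch (SQLTransientException e) {
        if (attempt >= maxAttempts) {
          throw e; // out of attempts: surface the last transient failure
        }
        Thread.sleep(backoff);
        backoff *= 2;
      }
    }
  }
}
```

With something like this wrapped around the cache population call, a short JDBC outage would not necessarily leave the lookup in CACHE_NOT_INITIALIZED until the next polling period.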
Proposals for addressing the Step 2 and Step 3 bugs:
- Step 2: Create a resilient handle that retries on transient errors.
- Step 3: Delay execution of the DropNotice until the AddNotice has loaded the lookup on the current node, so that the latest lookup is loaded successfully before the previous one is dropped. This could be done by starting a scheduled executor thread here https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/query/lookup/LookupReferencesManager.java#L675 that runs after a delay D, up to N times. Each run would get the latest stateRef from LookupReferencesManager and confirm that the cache of the latest reference loaded successfully (and also that the LookupExtractorFactoryContainer we are trying to remove is not the same as the current stateRef container).
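The Step 3 proposal can be sketched roughly as follows. This is an illustration under assumptions, not the actual change: `DeferredDrop`, `replacementReady`, and `dropOld` are hypothetical names, and the readiness check stands in for "the current stateRef holds a different container whose cache loaded successfully".

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch (not Druid code): defers dropping the old lookup until the new one is loaded. */
final class DeferredDrop
{
  private final ScheduledExecutorService scheduler;
  private final long delayMillis; // the "delay D" from the proposal
  private final int maxChecks;    // the "N times" from the proposal

  DeferredDrop(ScheduledExecutorService scheduler, long delayMillis, int maxChecks)
  {
    this.scheduler = scheduler;
    this.delayMillis = delayMillis;
    this.maxChecks = maxChecks;
  }

  /**
   * @param replacementReady returns true once the latest container differs from the one
   *                         being removed and its cache has loaded successfully
   * @param dropOld          closes the previous container
   */
  void schedule(BooleanSupplier replacementReady, Runnable dropOld)
  {
    check(replacementReady, dropOld, 1);
  }

  private void check(BooleanSupplier replacementReady, Runnable dropOld, int attempt)
  {
    scheduler.schedule(() -> {
      if (replacementReady.getAsBoolean()) {
        dropOld.run();
      } else if (attempt < maxChecks) {
        check(replacementReady, dropOld, attempt + 1);
      }
      // After maxChecks failed checks the old lookup is left in place (an assumption).
    }, delayMillis, TimeUnit.MILLISECONDS);
  }
}
```

In this sketch the old lookup keeps serving if the replacement never becomes ready within N checks. Whether that is the desired fallback, or whether the drop should eventually be forced, is a design choice the proposal leaves open.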