Lookup updates that lead to inconsistencies #14796

@pranavbhole

Affected Version: Master branch

Description

Lookup instantiation mainly works by broadcasting two notices, AddNotice and DropNotice. When we create a fresh lookup, we issue an AddNotice; when we update an existing lookup, we issue both an AddNotice and a DropNotice.
We have been seeing inconsistent lookup state in the cluster, especially with JDBC lookups, caused by the following scenario. I was able to reproduce it locally as well.

  1. Post a create request for a JDBC lookup. Assuming the JDBC server is available, the lookup loads and is ready to serve.
  2. The next poll fails because the JDBC server is unavailable or there is some issue with the lookup's JDBC connection, but the old lookup is still serving correctly.
  3. A Druid user updates the lookup JSON and posts an update request. The old, good lookup is killed as well, and queries fail with cache state CACHE_NOT_INITIALIZED.

The bug is the behavior in Step 3: currently we blindly issue the AddNotice for the new lookup and the DropNotice for the old lookup without making sure that the new lookup's cache population succeeded.
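To make the failure mode concrete, here is a minimal toy model of it, not Druid code. `ToyLookup`, `ToyLookupRegistry`, and the `READY` state name are hypothetical stand-ins invented for illustration; only `CACHE_NOT_INITIALIZED` comes from the behavior described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a lookup whose cache may or may not be populated. */
final class ToyLookup {
    final String version;
    final boolean cacheLoaded;

    ToyLookup(String version, boolean cacheLoaded) {
        this.version = version;
        this.cacheLoaded = cacheLoaded;
    }

    String state() {
        // "READY" is an illustrative name, not a Druid state.
        return cacheLoaded ? "READY" : "CACHE_NOT_INITIALIZED";
    }
}

/** Hypothetical registry holding one active lookup per name. */
final class ToyLookupRegistry {
    private final Map<String, ToyLookup> active = new ConcurrentHashMap<>();

    /** Models the reported behavior: the new lookup replaces the old one unconditionally. */
    void blindUpdate(String name, ToyLookup replacement) {
        active.put(name, replacement);
    }

    /** Models the desired behavior: keep the old lookup unless the replacement has loaded. */
    boolean guardedUpdate(String name, ToyLookup replacement) {
        if (!replacement.cacheLoaded) {
            return false; // old lookup keeps serving
        }
        active.put(name, replacement);
        return true;
    }

    String stateOf(String name) {
        ToyLookup lookup = active.get(name);
        return lookup == null ? "MISSING" : lookup.state();
    }
}
```

With `blindUpdate`, replacing a loaded lookup with one whose cache failed to populate leaves the name in `CACHE_NOT_INITIALIZED`; with `guardedUpdate`, the previously loaded lookup stays in place.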

There is another bug at Step 2: we do not have resiliency when dealing with the JDBC handle for lookups, and we do not retry the handle on transient errors. If a transient error occurs, we have to wait for the next polling period to populate the lookup. During this time, the lookup's state remains CACHE_NOT_INITIALIZED if there was no previous successful load.
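As a rough sketch of what retrying on a transient error could look like, the helper below retries a task a bounded number of times with exponential backoff. `TransientRetry` is a hypothetical name, and treating `java.sql.SQLTransientException` as the only "transient" signal is an assumption; the real JDBC lookup code may surface failures through a different exception type.

```java
import java.sql.SQLTransientException;
import java.util.concurrent.Callable;

/** Hypothetical helper: bounded retry with exponential backoff on transient SQL errors. */
final class TransientRetry {
    private TransientRetry() {}

    static <T> T call(Callable<T> task, int maxAttempts, long initialBackoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (SQLTransientException e) {
                // Assumption: only this exception type is considered retryable.
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, surface the last failure
                }
                Thread.sleep(backoff);
                backoff *= 2; // double the wait before the next attempt
            }
        }
    }
}
```

Wrapping the cache population call in something like `TransientRetry.call(...)` would let a short outage recover within the same polling period instead of waiting for the next one. Non-transient exceptions propagate immediately.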

Proposals for addressing the Step 2 and Step 3 bugs:

  1. Step 2: Create a resilient handle and retry on transient errors.
  2. Step 3: Delay execution of the DropNotice until the AddNotice has loaded the lookup on the current node, so that the latest lookup is confirmed to be loaded successfully before the previous one is dropped. This can be done by starting a scheduled executor thread here https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/query/lookup/LookupReferencesManager.java#L675 that runs the check after a delay D, up to N times. Each run gets the latest stateRef from LookupReferencesManager and verifies that the cache for the latest reference has loaded successfully (and also that the LookupExtractorFactoryContainer we are trying to remove is not the same as the current stateRef container).
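The delayed drop in proposal 2 could be sketched roughly as follows. This is not the actual `LookupReferencesManager` code: `DelayedDrop` and its parameters are hypothetical, `delayMillis` stands for the delay D, and `maxChecks` stands for N. The caller supplies the "is it safe to drop" check and the drop action.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: drop the old lookup only after the new one is confirmed loaded. */
final class DelayedDrop {
    private final ScheduledExecutorService scheduler;
    private final long delayMillis; // "Delay D"
    private final int maxChecks;    // "N times"

    DelayedDrop(ScheduledExecutorService scheduler, long delayMillis, int maxChecks) {
        this.scheduler = scheduler;
        this.delayMillis = delayMillis;
        this.maxChecks = maxChecks;
    }

    /** Runs dropOld once safeToDrop returns true; gives up after maxChecks checks. */
    void schedule(BooleanSupplier safeToDrop, Runnable dropOld) {
        attempt(safeToDrop, dropOld, maxChecks);
    }

    private void attempt(BooleanSupplier safeToDrop, Runnable dropOld, int checksLeft) {
        if (checksLeft <= 0) {
            return; // never confirmed: the old lookup is kept
        }
        scheduler.schedule(() -> {
            if (safeToDrop.getAsBoolean()) {
                dropOld.run();
            } else {
                attempt(safeToDrop, dropOld, checksLeft - 1);
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
    }
}
```

In this sketch, `safeToDrop` would read the latest stateRef, confirm that its container is not the one being removed, and confirm that its cache has loaded. If the check never passes within N attempts, the old lookup stays in place, which trades a stale lookup for availability.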
