Skip to content

SOLR-15705: A deleteById request without _route_ param for compositeid router could be sent to all shard leaders#288

Merged
epugh merged 13 commits intoapache:mainfrom
makosten:solr-6910
Nov 8, 2021
Merged

SOLR-15705: A deleteById request without _route_ param for compositeid router could be sent to all shard leaders#288
epugh merged 13 commits intoapache:mainfrom
makosten:solr-6910

Conversation

@makosten
Copy link
Contributor

@makosten makosten commented Sep 8, 2021

https://issues.apache.org/jira/browse/SOLR-15705

Description

This is an implementation of distributed delete-by id-when using the CompositeId router with a router field defined and the field value is missing. The current behavior in this circumstance is to ignore delete by id request.

Solution

This first approach detects this condition and forwards the request to the doDeleteByQuery method. This works, but it would be better to extract the shared code from doDeleteByQuery. I looked at this but need some advice on how to proceed.

Tests

I've extended an existing test to verify the new behavior.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@makosten
Copy link
Contributor Author

makosten commented Oct 6, 2021

I revised the logic to call DeleteByQuery earlier. Now instead of calling doDeleteByQuery frrom doDistribDeleteById, it is called from doDeleteById. I tried revising doDeleteById to forward the request to all shard leaders if the required route is missing from the request, but this did not work. At one time I believe a deleteById would work even if the route was missing if it happened to land on the right shard, but it seems that this behavior is changed.

if (router instanceof CompositeIdRouter && routeField != null) {
DistribPhase phase = DistribPhase.parseParam(req.getParams().get(DISTRIB_UPDATE_PARAM));
if (phase == DistribPhase.NONE) {
log.debug("Using compositeId router and deleteById command is with missing route value, distributing to all shard leaders");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makosten small nit, should this debug line be "command is missing" instead of "is with" ;-)

@epugh epugh requested a review from dsmiley October 6, 2021 20:08
@dsmiley
Copy link
Contributor

dsmiley commented Oct 15, 2021

Sorry, I find having doDeleteById call doDeleteByQuery to be way too kludgy (and you admitted as much in JIRA). It appears that this works almost accidentally, and is thus fragile and perhaps internally slower than it could be. It seems the logic should go into doDistribDeleteById.

But firstly, I'm confused about how this PR relates to the JIRA issue. The JIRA issue clearly states this issue occurs when the implicit router is being used. Yet the test here and the fix use/check-for the CompositeIdRouter. Am I missing something?

@makosten
Copy link
Contributor Author

As far as implicit vs. composite router, you aren't missing anything. I was trying to avoid creating a duplicate JIRA and messed up by picking 6910. The patch only applies to the composite id router. Maybe the best way to unravel this is to create a new JIRA and reference 6910?

@dsmiley
Copy link
Contributor

dsmiley commented Oct 19, 2021

Yes, please create a separate JIRA as it's only vaguely related to SOLR-6910.

@makosten makosten changed the title SOLR-6910: A deleteById request without _route_ param for implicit router could be sent to all shard leaders SOLR-15705: A deleteById request without _route_ param for implicit router could be sent to all shard leaders Oct 20, 2021
@makosten
Copy link
Contributor Author

Here is a process flow of a DeleteByQuery. The doDistribDeleteByQuery is where a leader forwards the command to its replicas. The doDeleteByQuery in DistributedZkUpdateProcessor is where the command is forwarded to the shard leaders, so I do think that my logic of forwarding the request to doDeleteByQuery is in the right location, although the whole approach is not optimal. I did try modeling how DeleteByQuery works by forwarding the request to shard leaders, but found that the local deleteById, actually deleting the document from the index, does not happen if the route field value is missing.

image

@epugh
Copy link
Contributor

epugh commented Oct 23, 2021

@cpoerschke any chance you'd be interested in reviewing this fix as well? Looking to get some more experienced folks to confirm this patch!

@makosten
Copy link
Contributor Author

Here's a diagram if DeleteById with a compositeId router and a route value:
image

I've added some logic in the doDistribDeleteById for Phase=NONE to also distribute to request to the leaders of the other shards if the route is missing, which is more along the lines of David Smiley's suggestion. I can now see the local deletion happening on all the nodes, all the way down to DirectUpdateHandler2.delete, but the deletion doesn't happen. This isn't completely surprising, because even now if the receiving node is the shard leader of the correct shard and the route is missing, the local delete still appears to happen on the shard leader and follower but ultimately the deletion fails without error, at least in my test it does. I think if I could discern why it still fails even though the local deletion appears to be happening, that this would be the correct approach.

@dsmiley
Copy link
Contributor

dsmiley commented Oct 23, 2021

I've added some logic in the doDistribDeleteById for Phase=NONE to also distribute to request to the leaders of the other shards if the route is missing, which is more along the lines of David Smiley's suggestion.

If you do that and the test fails (presumably), I could help debug to see why.

In terms of reviewers, I think @hossman would be a good choice based on him touching the same test very recently in SOLR-8889 to fix similar-ish bugs.

@cpoerschke
Copy link
Contributor

Thanks @makosten for the diagrams to illustrate process flow, and @epugh for tagging me, this is an intriguing area of the code!

... distribute to request to the leaders of the other shards if the route is missing ...

That seems intuitively right. From what I've learnt so far from code reading, the issue is that https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.10.1/solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java#L609 picks a shard even if it does not have sufficient information to do so i.e. in the absence of the route it picks based on id which statistically will only sometimes pick the right shard and distribution to all shard leaders is necessary to be sure it reaches the right shard.

... If you do that and the test fails (presumably), I could help debug to see why. ...

I'd be interested too in looking more into this.

@makosten makosten changed the title SOLR-15705: A deleteById request without _route_ param for implicit router could be sent to all shard leaders SOLR-15705: A deleteById request without _route_ param for compositeid or implicit router could be sent to all shard leaders Oct 31, 2021
@makosten
Copy link
Contributor Author

makosten commented Nov 1, 2021

@cpoerschke You nailed it! The CompositeId Router has some inconsistent behavior. It's strict on the router field when adding but if the route is missing when deleting by id, it returns the slice based on the unique id hash. I updated the CompositeIdRouter to return null for getTargetSlice for this condition, in which case it is run on the current slice, and then flag the request to be broadcast to the other shard leaders.

This also affected the Implicit Router as it will now broadcast the delete-by-id request if the route is missing, so I updated a failing test. I'd like some feedback on this, as maybe the old behavior is more desirable for the Implicit router.

@@ -316,6 +317,41 @@ public void processDelete(DeleteUpdateCommand cmd) throws IOException {
protected void doDeleteById(DeleteUpdateCommand cmd) throws IOException {
setupRequest(cmd);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THREAD_SAFETY_VIOLATION: Unprotected write. Non-private method DistributedZkUpdateProcessor.doDeleteById(...) indirectly writes to field noggit.JSONParser.devNull.buf outside of synchronization.
Reporting because another access to the same memory occurs on a background thread, although this access may not.
(at-me in a reply with help or ignore)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sonatype-lift help

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m LiftBot and I automatically analyze new code in code review, and comment when I find potential bugs. I also accept comments as commands. Just @sonatype-lift followed by the command: ignore to mark as false positive, unignore to undo, and help to see this message. Click here to add LiftBot to another repo.

@cpoerschke
Copy link
Contributor

This also affected the Implicit Router as it will now broadcast the delete-by-id request if the route is missing, so I updated a failing test. I'd like some feedback on this, as maybe the old behavior is more desirable for the Implicit router.

From the "... does not automatically route ..." mention at https://solr.apache.org/guide/8_10/collection-management.html#create-parameters documentation I think the existing behaviour for the implicit router makes sense i.e. routing of both indexing and delete-by-id requests is the caller's responsibility.

(If we wanted to change that behaviour, hypothetically, then doing so separately (e.g. as a follow-up issue and pull request) might be clearest and non-backwards compatibility would need to be considered and communicated e.g. users currently doing round-robins on delete-by-id external to Solr can cease doing that and users which use the implicit routing to (temporarily?) have the same document id on multiple shards would benefit from a way to opt-out of the delete-by-id broadcasting.)

Copy link
Contributor

@cpoerschke cpoerschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @makosten! I haven't yet looked at the test changes but the code changes conceptually make sense to me, comments inline are implementation detail related. What do you think?


if (broadcastDeleteById && DistribPhase.parseParam(req.getParams().get(DISTRIB_UPDATE_PARAM)) == DistribPhase.NONE ) {

log.debug("The deleteById command is missing the required route, broadcasting to leaders of other shards");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/4 - minor: maybe include cmd.getId() in the debug logging here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I'll make this change.

DocCollection coll = clusterState.getCollection(collection);
Collection<Slice> slices = coll.getRouter().getSearchSlices(route, params, coll);

// if just one slice, we can skip this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2/4 - if the "just one slice" logic happened earlier so that broadcastDeleteById is only true if there's more than one slice this:

  • would remove the need for this if here
  • could facilitate code sharing with doDeleteByQuery

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another good idea. I moved the check to setupRequest(). I believe the most efficient check that return the same value is coll.getActiveSlicesMap().size().

outParams.remove("commit"); // this will be distributed from the local commit

cmdDistrib.distribDelete(cmd, leaders, outParams, false, rollupReplicationTracker, null);
boolean leaderForAnyShard = forwardDelete(coll, cmd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3/4 - added a commit to the pull request branch -- feel free to revert or amend -- that factors out forwardDelete method from doDeleteByQuery, for potential code sharing with doDeleteById

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I'm going to see if forwardDelete can be shared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That worked quite nicely!


// check if client has requested minimum replication factor information. will set replicationTracker to null if
// we aren't the leader or subShardLeader
checkReplicationTracker(cmd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4/4 - doDeleteByQuery does rollup replication tracker logic before the forward-delete logic, doDeleteById here does check replication tracker logic after the forward-logic -- i've not yet considered this difference in detail, just noticed it whilst factoring out the forward-delete logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added creating the rollup replication tracker before forwarding the delete-by-id to the other shard leaders and saw no behavior difference. The handleReplicationFactor method didn't receive responses from other shard leaders.

@makosten
Copy link
Contributor Author

makosten commented Nov 4, 2021

On the unintentional impact to ImplicitIdRouter, I was thinking adding a property to the router as whether to do the broadcast to avoid isinstanceof CompositeIdRouter. But, there are other places in DistributedZkUpdateProcessor that do exactly that.

@epugh
Copy link
Contributor

epugh commented Nov 4, 2021

Thanks @cpoerschke for digging into this PR ;-)

@makosten makosten changed the title SOLR-15705: A deleteById request without _route_ param for compositeid or implicit router could be sent to all shard leaders SOLR-15705: A deleteById request without _route_ param for compositeid router could be sent to all shard leaders Nov 4, 2021
@cpoerschke
Copy link
Contributor

Added two commits with small tweaks, yet to complete running full test suite locally but assuming it passes LGTM here.

@makosten - would you like to go add a solr/CHANGES.txt entry also?

@epugh epugh merged commit f589607 into apache:main Nov 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants