SOLR-15705: A deleteById request without _route_ param for compositeid router could be sent to all shard leaders by makosten · Pull Request #288 · apache/solr

makosten · 2021-09-08T15:56:32Z

https://issues.apache.org/jira/browse/SOLR-15705

Description

This is an implementation of distributed delete-by-id when using the CompositeId router with a router field defined and the field value missing from the request. The current behavior in this circumstance is to ignore the delete-by-id request.

Solution

This first approach detects this condition and forwards the request to the doDeleteByQuery method. This works, but it would be better to extract the shared code from doDeleteByQuery. I looked at this but need some advice on how to proceed.

Tests

I've extended an existing test to verify the new behavior.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

solr/core/src/test/org/apache/solr/cloud/FullSolrCloudDistribCmdsTest.java

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java

makosten · 2021-10-06T00:47:51Z

I revised the logic to call DeleteByQuery earlier. Now, instead of calling doDeleteByQuery from doDistribDeleteById, it is called from doDeleteById. I tried revising doDeleteById to forward the request to all shard leaders if the required route is missing from the request, but this did not work. At one time I believe a deleteById would work even if the route was missing if it happened to land on the right shard, but it seems that this behavior has changed.

epugh · 2021-10-06T20:06:21Z

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java

+      if (router instanceof CompositeIdRouter && routeField != null)  {
+        DistribPhase phase = DistribPhase.parseParam(req.getParams().get(DISTRIB_UPDATE_PARAM));
+        if (phase == DistribPhase.NONE) {
+          log.debug("Using compositeId router and deleteById command is with missing route value, distributing to all shard leaders");


@makosten small nit, should this debug line be "command is missing" instead of "is with" ;-)

dsmiley · 2021-10-15T20:57:11Z

Sorry, I find having doDeleteById call doDeleteByQuery to be way too kludgy (and you admitted as much in JIRA). It appears that this works almost accidentally, and is thus fragile and perhaps internally slower than it could be. It seems the logic should go into doDistribDeleteById.

But firstly, I'm confused about how this PR relates to the JIRA issue. The JIRA issue clearly states this issue occurs when the implicit router is being used. Yet the test here and the fix use/check-for the CompositeIdRouter. Am I missing something?

makosten · 2021-10-19T20:04:54Z

As far as implicit vs. composite router, you aren't missing anything. I was trying to avoid creating a duplicate JIRA and messed up by picking 6910. The patch only applies to the composite id router. Maybe the best way to unravel this is to create a new JIRA and reference 6910?

dsmiley · 2021-10-19T22:22:10Z

Yes, please create a separate JIRA as it's only vaguely related to SOLR-6910.

makosten · 2021-10-20T17:57:53Z

Here is a process flow of a DeleteByQuery. The doDistribDeleteByQuery is where a leader forwards the command to its replicas. The doDeleteByQuery in DistributedZkUpdateProcessor is where the command is forwarded to the shard leaders, so I do think that my logic of forwarding the request to doDeleteByQuery is in the right location, although the whole approach is not optimal. I did try modeling how DeleteByQuery works by forwarding the request to shard leaders, but found that the local deleteById, actually deleting the document from the index, does not happen if the route field value is missing.

epugh · 2021-10-23T11:54:47Z

@cpoerschke any chance you'd be interested in reviewing this fix as well? Looking to get some more experienced folks to confirm this patch!

makosten · 2021-10-23T14:24:14Z

Here's a diagram of DeleteById with a compositeId router and a route value:

I've added some logic in doDistribDeleteById for Phase=NONE to also distribute the request to the leaders of the other shards if the route is missing, which is more along the lines of David Smiley's suggestion. I can now see the local deletion happening on all the nodes, all the way down to DirectUpdateHandler2.delete, but the deletion doesn't happen. This isn't completely surprising, because even now, if the receiving node is the shard leader of the correct shard and the route is missing, the local delete still appears to happen on the shard leader and follower but ultimately the deletion fails without error, at least in my test. I think that if I could discern why it still fails even though the local deletion appears to be happening, this would be the correct approach.

dsmiley · 2021-10-23T20:39:37Z

I've added some logic in doDistribDeleteById for Phase=NONE to also distribute the request to the leaders of the other shards if the route is missing, which is more along the lines of David Smiley's suggestion.

If you do that and the test fails (presumably), I could help debug to see why.

In terms of reviewers, I think @hossman would be a good choice based on him touching the same test very recently in SOLR-8889 to fix similar-ish bugs.

cpoerschke · 2021-10-25T16:17:56Z

Thanks @makosten for the diagrams to illustrate process flow, and @epugh for tagging me, this is an intriguing area of the code!

... distribute the request to the leaders of the other shards if the route is missing ...

That seems intuitively right. From what I've learnt so far from code reading, the issue is that https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.10.1/solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java#L609 picks a shard even if it does not have sufficient information to do so. In the absence of the route it picks based on the id, which statistically will only sometimes pick the right shard, so distribution to all shard leaders is necessary to be sure the request reaches the right shard.

... If you do that and the test fails (presumably), I could help debug to see why. ...

I'd be interested too in looking more into this.
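
To illustrate the point above about picking a shard from the id alone, here is a toy sketch, not Solr code. It assumes a hypothetical `shardFor` helper that uses `String.hashCode()` in place of the router's real hash, plus made-up ids and a made-up route value `tenantA`. It counts how many delete-by-id requests would land on the shard where the document was indexed if only the id were hashed.

```java
import java.util.ArrayList;
import java.util.List;

public class MissingRouteDemo {
    // Hypothetical stand-in for a router hash; the real router does not use String.hashCode().
    static int shardFor(String key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    /** Counts ids whose id-hash shard equals the shard chosen from the route value at index time. */
    static int countIdHashHits(List<String> ids, String routeValue, int numShards) {
        int indexedShard = shardFor(routeValue, numShards);
        int hits = 0;
        for (String id : ids) {
            if (shardFor(id, numShards) == indexedShard) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add("doc-" + i);
        }
        int hits = countIdHashHits(ids, "tenantA", 4);
        System.out.println(hits + " of " + ids.size()
            + " deletes would reach the right shard by id hash alone");
    }
}
```

Some ids happen to hash to the indexed shard and most do not, which is why a delete without the route needs to reach every shard leader to be reliable.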

makosten · 2021-11-01T00:36:43Z

@cpoerschke You nailed it! The CompositeId Router has some inconsistent behavior. It's strict on the router field when adding but if the route is missing when deleting by id, it returns the slice based on the unique id hash. I updated the CompositeIdRouter to return null for getTargetSlice for this condition, in which case it is run on the current slice, and then flag the request to be broadcast to the other shard leaders.

This also affected the Implicit Router as it will now broadcast the delete-by-id request if the route is missing, so I updated a failing test. I'd like some feedback on this, as maybe the old behavior is more desirable for the Implicit router.
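
The change described above can be pictured with a small model, a sketch under assumptions rather than the actual DistributedZkUpdateProcessor logic. `DeleteByIdPlanner`, `targetShard`, and `plan` are hypothetical names, and `String.hashCode()` stands in for the router's hash. When the route is missing the target is `null`, so the delete runs on the local shard and is also sent to the leaders of every other shard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DeleteByIdPlanner {
    /** Hypothetical router lookup: returns null when a route is required but missing. */
    static String targetShard(String route, List<String> shards) {
        if (route == null) {
            return null; // not enough information to choose a shard
        }
        return shards.get(Math.floorMod(route.hashCode(), shards.size()));
    }

    /** Returns the shards that should see the delete, local shard first when broadcasting. */
    static List<String> plan(String route, List<String> shards, String localShard) {
        String target = targetShard(route, shards);
        List<String> result = new ArrayList<>();
        if (target != null) {
            result.add(target); // route known: a single shard is enough
            return result;
        }
        result.add(localShard); // route missing: process locally ...
        for (String shard : shards) {
            if (!shard.equals(localShard)) {
                result.add(shard); // ... and broadcast to the other shard leaders
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> shards = Arrays.asList("shard1", "shard2", "shard3");
        System.out.println(plan(null, shards, "shard2"));
        System.out.println(plan("tenantA", shards, "shard2"));
    }
}
```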

sonatype-lift · 2021-11-01T02:05:47Z

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java

@@ -316,6 +317,41 @@ public void processDelete(DeleteUpdateCommand cmd) throws IOException {
  protected void doDeleteById(DeleteUpdateCommand cmd) throws IOException {
    setupRequest(cmd);


THREAD_SAFETY_VIOLATION: Unprotected write. Non-private method DistributedZkUpdateProcessor.doDeleteById(...) indirectly writes to field noggit.JSONParser.devNull.buf outside of synchronization.
Reporting because another access to the same memory occurs on a background thread, although this access may not.

cpoerschke · 2021-11-03T16:29:06Z

This also affected the Implicit Router as it will now broadcast the delete-by-id request if the route is missing, so I updated a failing test. I'd like some feedback on this, as maybe the old behavior is more desirable for the Implicit router.

From the "... does not automatically route ..." mention at https://solr.apache.org/guide/8_10/collection-management.html#create-parameters documentation I think the existing behaviour for the implicit router makes sense i.e. routing of both indexing and delete-by-id requests is the caller's responsibility.

(If we wanted to change that behaviour, hypothetically, then doing so separately (e.g. as a follow-up issue and pull request) might be clearest, and the backwards incompatibility would need to be considered and communicated: users currently doing round-robins on delete-by-id external to Solr could cease doing that, and users who use implicit routing to (temporarily?) have the same document id on multiple shards would benefit from a way to opt out of the delete-by-id broadcasting.)

cpoerschke

Hi @makosten! I haven't yet looked at the test changes but the code changes conceptually make sense to me, comments inline are implementation detail related. What do you think?

cpoerschke · 2021-11-03T16:31:55Z

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java


+    if (broadcastDeleteById && DistribPhase.parseParam(req.getParams().get(DISTRIB_UPDATE_PARAM)) == DistribPhase.NONE ) {
+
+      log.debug("The deleteById command is missing the required route, broadcasting to leaders of other shards");


1/4 - minor: maybe include cmd.getId() in the debug logging here

Good idea, I'll make this change.

cpoerschke · 2021-11-03T16:32:43Z

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java

+      DocCollection coll = clusterState.getCollection(collection);
+      Collection<Slice> slices = coll.getRouter().getSearchSlices(route, params, coll);
+
+      // if just one slice, we can skip this


2/4 - if the "just one slice" logic happened earlier so that broadcastDeleteById is only true if there's more than one slice this:

would remove the need for this if here

could facilitate code sharing with doDeleteByQuery

Another good idea. I moved the check to setupRequest(). I believe the most efficient check that returns the same value is coll.getActiveSlicesMap().size().
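
A minimal sketch of the combined condition discussed in this thread, assuming the check ends up as a single boolean: broadcast only when the route is missing, the collection has more than one active slice, and the request has not already been distributed. `BroadcastCheck`, `shouldBroadcast`, and the local `Phase` enum are illustrative stand-ins, not the names or types used in the patch.

```java
public class BroadcastCheck {
    /** Toy stand-in for the distribution phase; only NONE matters for this check. */
    enum Phase { NONE, TOLEADER, FROMLEADER }

    /**
     * True only for a first-hop request (phase NONE) that lacks a route
     * in a collection with more than one active slice.
     */
    static boolean shouldBroadcast(boolean routeMissing, int activeSliceCount, Phase phase) {
        return routeMissing && activeSliceCount > 1 && phase == Phase.NONE;
    }

    public static void main(String[] args) {
        System.out.println(shouldBroadcast(true, 3, Phase.NONE));
        System.out.println(shouldBroadcast(true, 1, Phase.NONE));
        System.out.println(shouldBroadcast(true, 3, Phase.FROMLEADER));
    }
}
```

Folding the slice count into the flag means the later code path does not need its own "just one slice" test.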

cpoerschke · 2021-11-03T16:33:44Z

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java

-      outParams.remove("commit"); // this will be distributed from the local commit
-
-      cmdDistrib.distribDelete(cmd, leaders, outParams, false, rollupReplicationTracker, null);
+      boolean leaderForAnyShard = forwardDelete(coll, cmd);


3/4 - added a commit to the pull request branch -- feel free to revert or amend -- that factors out forwardDelete method from doDeleteByQuery, for potential code sharing with doDeleteById

This is great! I'm going to see if forwardDelete can be shared.

That worked quite nicely!

cpoerschke · 2021-11-03T16:34:24Z

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java

+
    // check if client has requested minimum replication factor information. will set replicationTracker to null if
    // we aren't the leader or subShardLeader
    checkReplicationTracker(cmd);


4/4 - doDeleteByQuery does the rollup replication tracker logic before the forward-delete logic, while doDeleteById here does the check replication tracker logic after the forward-delete logic -- I've not yet considered this difference in detail, just noticed it whilst factoring out the forward-delete logic

I added creation of the rollup replication tracker before forwarding the delete-by-id to the other shard leaders and saw no difference in behavior. The handleReplicationFactor method didn't receive responses from other shard leaders.

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java

makosten · 2021-11-04T16:03:21Z

On the unintentional impact on the implicit router, I was thinking of adding a property to the router indicating whether to do the broadcast, to avoid an instanceof CompositeIdRouter check. But there are other places in DistributedZkUpdateProcessor that do exactly that.

epugh · 2021-11-04T16:16:35Z

Thanks @cpoerschke for digging into this PR ;-)

cpoerschke · 2021-11-05T17:53:34Z

Added two commits with small tweaks, yet to complete running full test suite locally but assuming it passes LGTM here.

@makosten - would you like to go add a solr/CHANGES.txt entry also?

Add tests and initial implementation of distributed delete by id

66195f8

epugh reviewed Sep 8, 2021

solr/core/src/test/org/apache/solr/cloud/FullSolrCloudDistribCmdsTest.java

epugh reviewed Sep 8, 2021

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java (outdated)

mkosten added 2 commits October 5, 2021 16:23

Move call to deleteByQuery to earlier in deleteById processing.

a8451c5

Remove accidentally added benchmark test

7265fac

epugh reviewed Oct 6, 2021

epugh requested a review from dsmiley October 6, 2021 20:08

Fix grammar of debug log entry

4580a53

makosten changed the title ~~SOLR-6910: A deleteById request without _route_ param for implicit router could be sent to all shard leaders~~ SOLR-15705: A deleteById request without _route_ param for implicit router could be sent to all shard leaders Oct 20, 2021

makosten changed the title ~~SOLR-15705: A deleteById request without _route_ param for implicit router could be sent to all shard leaders~~ SOLR-15705: A deleteById request without _route_ param for compositeid or implicit router could be sent to all shard leaders Oct 31, 2021

Update CompositeIdRouter to not hash on id if missing route for delete-by-id request. Refactoring.

831c702

sonatype-lift bot reviewed Nov 1, 2021

factor out DistributedZkUpdateProcessor.forwardDelete() method

a0475aa

cpoerschke reviewed Nov 3, 2021

sonatype-lift bot reviewed Nov 3, 2021

solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java

mkosten added 2 commits November 4, 2021 09:38

Revert behavior change for ImplicitRouter. Small code improvements.

14ae1f9

Use code shared with DeleteByQuery to broadcast requests to other shard leaders

a1b3d32

Wrap new log.debug call with isEnabled check

784e298

makosten changed the title ~~SOLR-15705: A deleteById request without _route_ param for compositeid or implicit router could be sent to all shard leaders~~ SOLR-15705: A deleteById request without _route_ param for compositeid router could be sent to all shard leaders Nov 4, 2021

cpoerschke added 2 commits November 5, 2021 17:17

tweak: just-in-time distrib-phase parsing

0dd6f0c

tweak: remove distracting whitespace changes in test

b682cc3

mkosten and others added 2 commits November 8, 2021 06:24

Add entry to changes.txt

0d53467

Merge branch 'main' into solr-6910

05b727d

epugh merged commit f589607 into apache:main Nov 8, 2021
