Skip to content

Conversation

thelabdude
Copy link
Contributor

@thelabdude thelabdude commented May 18, 2021

Description

When doing the work for ae1ac22, I discovered that the TestPullReplica class was disabled via an AwaitsFix. I enabled it and it passed fairly consistently, but then saw flaky failures on Jenkins. This PR fixes those flaky tests ...

IndexFetcher was triggering a local commit for a PULL replica core (see snippet below) which caused a race condition where the PULL replica would think its index generation / version were the same as the leader (see log I posted in JIRA).

From IndexFetcher, line 479:

          if (skipCommitOnLeaderVersionZero) {
            openNewSearcherAndUpdateCommitPoint();
          } else {
            SolrQueryRequest req = new LocalSolrQueryRequest(solrCore, new ModifiableSolrParams());
            solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
          }

(the else block hitting is what causes the tests to fail intermittently)

This would cause the PULL replica to not sync index updates with the leader in the tests (since they do it doc at a time).

Solution

I don't think PULL replicas should ever commit locally and should always get the index from the leader. So this PR ensures that skipCommitOnLeaderVersionZero is true for PULL replicas, which avoids this race condition.

Tests

Removing @BadApple from several flaky tests that pass consistently now: 200/200 beasts (tests would fail within about 40 beast runs w/o this fix)

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@thelabdude thelabdude changed the title SOLR-15399: Pull replica shouldn't commit locally in IndexFetcher SOLR-15399: IndexFetcher should not issue a local commit for PULL replicas May 19, 2021
Copy link
Member

@tflobbe tflobbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this, Tim. LGTM. I'm guessing the side effects are that the PULL replica won't really clean up it's index (i.e. disk will still contain all index files) until a real-non-gen-zero replication happen from the leader, right? That index won't be searchable though. Even if there is a restart, since the PULL replica will go through the recovery process, the index on disk won't be searchable again IIUC. This should also solve SOLR-10751 and SOLR-12100

@thelabdude
Copy link
Contributor Author

thanks for the review @tflobbe ... let me add a test case for the restart while the PULL replica is in this side-effect state, just so we can be sure it's ok and recovers properly

@dsmiley
Copy link
Contributor

dsmiley commented May 20, 2021

Impressive work to track this down!

@thelabdude thelabdude merged commit c731873 into apache:main May 20, 2021
epugh pushed a commit to epugh/solr that referenced this pull request Oct 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants