storage: re-enable the merge queue by default #29583

Merged

merged 6 commits into master from merge-default on Oct 15, 2018

benesch (Member) commented Sep 5, 2018

This reverts commit a1cc4c5.

The results from last night's stress run were extremely promising. All the failures were existing flaky tests. You can see for yourself here:

https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_Nightlies_Stress&tab=buildTypeStatusDiv&branch_Cockroach_Nightlies=refs%2Fheads%2Fmerge-default&state=failed

@benesch benesch requested a review from tschottdorf Sep 5, 2018

@benesch benesch requested a review from cockroachdb/core-prs as a code owner Sep 5, 2018

cockroach-teamcity (Member) commented Sep 5, 2018

This change is Reviewable

benesch (Member) commented Sep 5, 2018

> The results from last night's stress run were extremely promising. All the failures were existing flaky tests. You can see for yourself here:

Scratch that. There is a longstanding bug in our stress test configuration which results in them always getting run against master. Thanks, TeamCity.

@benesch benesch requested a review from cockroachdb/sql-rest-prs as a code owner Sep 6, 2018

@benesch benesch requested review from cockroachdb/distsql-prs as code owners Oct 10, 2018

benesch (Member) commented Oct 10, 2018

Alright. There are only two stress failures that are not addressed in this PR.

The first failure is #31062, which is a scary consistency failure. I've run that test 100+ times with roachprod-stress (it's really slow; 100 runs took 45m) and haven't been able to reproduce. It's entirely possible that the bug was fixed by #30986, which wasn't included in the build that I triggered the stress run on.

The second stress failure is #31059. This is an ambiguous result caused by a replica removal. I don't think this is related to merges, actually. I'm willing to bet that merges exacerbate this issue—they cause a lot of replica removals—but whatever the fix is here, it's going to be tricky. I'm inclined to leave this one alone for now. Like #31062, I haven't been able to reproduce it, even with roachprod-stress.

Any objections to merging as-is? We're coming down to the wire. Introducing two rarely-flaky tests in exchange for getting merges on by default seems worth it to me. /cc @bdarnell @tschottdorf
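
For reference, a roachprod-stress run like the ones mentioned above looks roughly like the following. This is only a sketch: the make target and the CLUSTER, PKG, TESTS, and STRESSFLAGS variables are assumptions about the build tooling of this era, and the test name is a placeholder, none of it taken from this PR.

```sh
# Rough sketch of a roachprod-stress run; names and flags below are assumptions.
# Provision a throwaway GCE cluster (same machine shape tschottdorf mentions below).
roachprod create $USER-stress -n 20 --gce-machine-type=n1-standard-8 --local-ssd=false

# Loop a single storage test on that cluster, capping the number of runs.
make roachprod-stress CLUSTER=$USER-stress PKG=./pkg/storage \
    TESTS='TestSomeFlakyTest' STRESSFLAGS='-maxruns 100'
```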

@benesch benesch requested a review from bdarnell Oct 10, 2018

tschottdorf (Member) commented Oct 10, 2018

> The first failure is #31062, which is a scary consistency failure

Have you tried to repro on the same sha? If you manage that and don't manage right after #30986 lands, this looks fine.

> Any objections to merging as-is? We're coming down to the wire. Introducing two rarely-flaky tests in exchange for getting merges on by default seems worth it to me. /cc @bdarnell @tschottdorf

I'd like to figure the consistency failure out, but the other one seems OK to deal with in isolation.

benesch (Member) commented Oct 10, 2018

> Have you tried to repro on the same sha? If you manage that and don't manage right after #30986 lands, this looks fine.

Welp, yep, the consistency failure repro'd straight away without #30986. I'll try to repro once more for good measure, but feeling pretty good that the bug fixed in #30986 was actually to blame.

benesch (Member) commented Oct 11, 2018

Yep, I ran this twice more without #30986 and twice more with #30986. I repro'd the consistency failure both times without it and never with it. I'm going to tentatively call this case closed and merge!

bors r=tschottdorf

craig bot pushed a commit that referenced this pull request Oct 11, 2018

Merge #29583 #31141
29583: storage: re-enable the merge queue by default r=tschottdorf a=benesch

This reverts commit a1cc4c5.

The results from last night's stress run were extremely promising. All the failures were existing flaky tests. You can see for yourself here:

https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_Nightlies_Stress&tab=buildTypeStatusDiv&branch_Cockroach_Nightlies=refs%2Fheads%2Fmerge-default&state=failed

31141: roachtest: add a cluster.Start() flavor r=andreimatei a=andreimatei

that returns an error.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot commented Oct 11, 2018

Build failed (retrying...)

benesch (Member) commented Oct 11, 2018

And somehow TC manages to be much better than I am at reproducing flakes. Gah!

bors r-

craig bot commented Oct 11, 2018

Canceled

tschottdorf (Member) commented Oct 11, 2018

When you roachprod-stress{,race}, are you using identical machines?

roachprod create peter-stress -n 20 --gce-machine-type=n1-standard-8 --local-ssd=false

Of course the difference is that CI runs the whole test suite at once (so multiple packages are running concurrently).

benesch (Member) commented Oct 11, 2018

Yep, using identical machines. I think the problem is that this failure was a timeout, and since I didn't run roachprod-stress for 25+ minutes I never hit the timeout. Maybe I can repro this locally.
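
Since that failure was a timeout, a local repro attempt would need to let the stress loop run past the 25-minute mark. A minimal sketch, again assuming the plain make stress target and the stress tool's -maxtime and -p flags, with a placeholder test name; none of this is spelled out in the PR:

```sh
# Hypothetical local invocation; target and flag names are assumptions.
# Run the suspect test repeatedly for longer than the 25+ minutes CI needed to hit the timeout.
make stress PKG=./pkg/storage TESTS='TestSomeFlakyTest' \
    STRESSFLAGS='-maxtime 30m -p 4'
```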

benesch added some commits Sep 5, 2018

testcluster: fully disable the merge queue in manual replication mode
The merge queue needs to be disabled via cluster setting (and not
testing knob) in order for ALTER TABLE ... SPLIT AT to work.

Release note: None
storage: deflake TestLogMerges
This test needs the merge queue disabled.

Fix #31061.

Release note: None
storage: deflake TestStoreSetRangesMaxBytes
TestStoreSetRangesMaxBytes creates manual splits (using a horrendously
outdated function, but that's a problem for another day), so it can't
tolerate the merge queue.

Release note: None
distsqlplan: deflake TestMixedDirections and TestSpanResolver
These tests both need the merge queue to be disabled.

Fix #29321.
Fix #31060.

Release note: None
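
The first commit above notes that the merge queue has to be disabled via a cluster setting, not a testing knob, for ALTER TABLE ... SPLIT AT to stick. As a rough illustration of what that means in practice (the setting name kv.range_merge.queue_enabled and the table name t are assumptions for illustration, not taken from this PR):

```sh
# Assumed setting name; turn the merge queue off so manual splits are not merged away.
cockroach sql --insecure -e "SET CLUSTER SETTING kv.range_merge.queue_enabled = false;"

# With the queue disabled, a manual split on a hypothetical table t persists.
cockroach sql --insecure -e "ALTER TABLE t SPLIT AT VALUES (100), (200);"
```
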
@tschottdorf

:lgtm:

Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 2 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

benesch added some commits Oct 15, 2018

roachtest: force splits in acceptance/bank's split monkey
When merges are enabled by default, splits need to be forced because
they will be quickly discarded by the merge queue. In this test the goal
is maximum chaos, so having the ranges get merged away soon after they
split is desirable.

Release note: None
benesch (Member) commented Oct 15, 2018

Ok, 18th time is the charm. 🍿

@tschottdorf

Reviewed 3 of 3 files at r7, 2 of 2 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

benesch (Member) commented Oct 15, 2018

whew!

bors r=tschottdorf

craig bot pushed a commit that referenced this pull request Oct 15, 2018

Merge #29583
29583: storage: re-enable the merge queue by default r=tschottdorf a=benesch

This reverts commit a1cc4c5.

The results from last night's stress run were extremely promising. All the failures were existing flaky tests. You can see for yourself here:

https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_Nightlies_Stress&tab=buildTypeStatusDiv&branch_Cockroach_Nightlies=refs%2Fheads%2Fmerge-default&state=failed

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
craig bot commented Oct 15, 2018

Build failed

tschottdorf (Member) commented Oct 15, 2018

bors r+

flaked on #31287

craig bot pushed a commit that referenced this pull request Oct 15, 2018

Merge #29583
29583: storage: re-enable the merge queue by default r=tschottdorf a=benesch

This reverts commit a1cc4c5.

The results from last night's stress run were extremely promising. All the failures were existing flaky tests. You can see for yourself here:

https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_Nightlies_Stress&tab=buildTypeStatusDiv&branch_Cockroach_Nightlies=refs%2Fheads%2Fmerge-default&state=failed

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
craig bot commented Oct 15, 2018

Build succeeded

@craig craig bot merged commit 9fee577 into master Oct 15, 2018

3 checks passed:
- GitHub CI (Cockroach): TeamCity build finished
- bors: Build succeeded
- license/cla: Contributor License Agreement is signed

@benesch benesch deleted the merge-default branch Oct 15, 2018

benesch (Member) commented Oct 15, 2018

whoa, thanks, I'd just assumed that was my fault!
