storage: re-enable the merge queue by default #29583
benesch requested a review from tschottdorf Sep 5, 2018
benesch requested a review from cockroachdb/core-prs as a code owner Sep 5, 2018
benesch
Sep 5, 2018
Member
The results from last night's stress run on were extremely promising. All the failures were existing flaky tests. You can see for yourself here:
Scratch that. There is a longstanding bug in our stress test configuration which results in them always getting run against master. Thanks, TeamCity.
benesch requested a review from cockroachdb/sql-rest-prs as a code owner Sep 6, 2018
benesch requested review from cockroachdb/distsql-prs as code owners Oct 10, 2018
benesch
Oct 10, 2018
Member
Alright. There are only two stress failures that are not addressed in this PR.
The first failure is #31062, which is a scary consistency failure. I've run that test 100+ times with roachprod-stress (it's really slow; 100 runs took 45m) and haven't been able to reproduce. It's entirely possible that the bug was fixed by #30986, which wasn't included in the build that I triggered the stress run on.
The second stress failure is #31059. This is an ambiguous result caused by a replica removal. I don't think this is related to merges, actually. I'm willing to bet that merges exacerbate this issue—they cause a lot of replica removals—but whatever the fix is here, it's going to be tricky. I'm inclined to leave this one alone for now. Like #31062, I haven't been able to reproduce it, even with roachprod-stress.
Any objections to merging as-is? We're coming down to the wire. Introducing two rarely-flaky tests in exchange for getting merges on by default seems worth it to me. /cc @bdarnell @tschottdorf
benesch requested a review from bdarnell Oct 10, 2018
tschottdorf
Oct 10, 2018
Member
> The first failure is #31062, which is a scary consistency failure
Have you tried to repro on the same sha? If you manage that and don't manage right after #30986 lands, this looks fine.
> Any objections to merging as-is? We're coming down to the wire. Introducing two rarely-flaky tests in exchange for getting merges on by default seems worth it to me. /cc @bdarnell @tschottdorf
I'd like to figure the consistency failure out, but the other one seems OK to deal with in isolation.
benesch
Oct 10, 2018
Member
> Have you tried to repro on the same sha? If you manage that and don't manage right after #30986 lands, this looks fine.
Welp, yep, the consistency failure repro'd straight away without #30986. I'll try to repro once more for good measure, but feeling pretty good that the bug fixed in #30986 was actually to blame.
bot pushed a commit that referenced this pull request Oct 11, 2018
craig bot commented Oct 11, 2018
Build failed (retrying...)
benesch
Oct 11, 2018
Member
And somehow TC manages to be much better than I am about reproducing flakes. Gah!
bors r-
craig bot commented Oct 11, 2018
Canceled
tschottdorf
Oct 11, 2018
Member
When you roachprod-stress{,race}, are you using identical machines?
roachprod create peter-stress -n 20 --gce-machine-type=n1-standard-8 --local-ssd=false
Of course the difference is that CI runs the whole test suite at once (so multiple packages are running concurrently).
benesch
Oct 11, 2018
Member
Yep, using identical machines. I think the problem is that this failure was a timeout, and since I didn't run roachprod-stress for 25+ minutes I never hit the timeout. Maybe I can repro this locally.
benesch added some commits Sep 5, 2018
tschottdorf approved these changes Oct 15, 2018
Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 2 of 2 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
benesch added some commits Oct 15, 2018
Ok, 18th time is the charm.
tschottdorf approved these changes Oct 15, 2018
Reviewed 3 of 3 files at r7, 2 of 2 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
whew!
bors r=tschottdorf
bot pushed a commit that referenced this pull request Oct 15, 2018
craig bot commented Oct 15, 2018
Build failed
bors r+
flaked on #31287
bot pushed a commit that referenced this pull request Oct 15, 2018
craig bot commented Oct 15, 2018
Build succeeded
craig bot merged commit 9fee577 into master Oct 15, 2018
benesch deleted the merge-default branch Oct 15, 2018
whoa, thanks, I'd just assumed that was my fault!
benesch commented Sep 5, 2018
This reverts commit a1cc4c5.
The results from last night's stress run on were extremely promising. All the failures were existing flaky tests. You can see for yourself here:
https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_Nightlies_Stress&tab=buildTypeStatusDiv&branch_Cockroach_Nightlies=refs%2Fheads%2Fmerge-default&state=failed