
release: v2.2.0-alpha.20190211 #34288

Closed · 12 of 17 tasks
knz opened this issue Jan 28, 2019 · 15 comments
knz (Contributor) commented Jan 28, 2019

Candidate SHA: 4bdf0ad.
Deployment status: Docs & External comms.
Nightly Suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1129733&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_NightlySuite. No suspicious failures.

Release process checklist

Prep date: 2/5/2019

Release date: 2/11/2019

asubiotto self-assigned this Feb 5, 2019
asubiotto added commits to cockroachdb/bincheck that referenced this issue Feb 5, 2019
asubiotto (Contributor) commented:

A new SHA will need to be picked to include #34659.

asubiotto (Contributor) commented:

Updating candidate SHA to 66f13c1.

asubiotto added a commit to cockroachdb/bincheck that referenced this issue Feb 6, 2019
asubiotto (Contributor) commented:

Note to self: re-run interleavedpartitioned; it failed due to a quota issue.

asubiotto (Contributor) commented:

interleavedpartitioned passed.

asubiotto (Contributor) commented:

An issue was uncovered when creating automatic statistics on a view. The fix is ready (#34726), but a new SHA will need to be picked.

asubiotto added a commit to cockroachdb/bincheck that referenced this issue Feb 8, 2019
asubiotto (Contributor) commented:

Updated candidate SHA to 4bdf0ad

tbg (Member) commented Feb 8, 2019

Unfortunately, #34712 is another fix that deserves to be included.

tbg (Member) commented Feb 8, 2019

(and #34502; both fix bugs that can cause the same symptom: an inability to upreplicate)

asubiotto (Contributor) commented:

These fixes don't justify cutting a new alpha, since not including them won't introduce regressions (as far as I understand, these issues are also present in previous alphas).

asubiotto (Contributor) commented Feb 11, 2019

Around 24h after 4bdf0ad was deployed to adriatic (02/08 00:00:00), we started seeing extremely variable p99 service latency and an uptick in replicas per node due to splits.

[screenshot taken 2019-02-11 at 8:10 AM]

It's unclear to me whether these splits are expected, but the cluster seems to have stabilized.

Additionally, I observed a 20-minute period of 0 QPS. Double-checking the supervisor status shows that there were no restarts of the workload. Checking the logs during this period shows a lot of:

W190210 15:29:29.544486 292684971 storage/engine/mvcc.go:2215  [n1,s1,r3540/5:/Table/57/2/6/{8/NUL…-10/10…}] unable to find value for /Table/57/2/6/8/1/265962/0 @ 1549812565.802243815,1 (txn:<id:8fa4175f-b14c-4432-9745-5a6d9ced57c4 key:"\301\211\216\211\205\373\360\004\210" timestamp:<wall_time:1549812565381965658 logical:2 > priority:545203 sequence:24 > timestamp:<wall_time:1549812565381965658 logical:2 > deleted:false key_bytes:12 val_bytes:5  vs {ID:a59e27b2-dfdd-48a7-ab0e-46bfde18eca5 Key:[193 137 142 137 133 251 240 5 136] Epoch:3 Timestamp:1549812565.802243815,1 Priority:1306037 Sequence:155 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})

Graphs from this period:

[image]
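As a minimal sketch of the supervisor check mentioned above — assuming the workload runs under supervisord on the load generator, with an illustrative program name — an uptime that spans the stall window is what rules out a restart:

# Sketch only; the program name "tpcc" and the output line are assumptions for illustration.
supervisorctl status
# tpcc    RUNNING   pid 12345, uptime 2 days, 3:04:05
# An uptime covering the 20-minute stall means the workload was not restarted.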

tbg (Member) commented Feb 11, 2019

It's weird that the service latency also dropped to zero.

asubiotto (Contributor) commented Feb 11, 2019

Some more discussion regarding the stall:

Alfonso Subiotto Marqués [Today at 8:39 AM]
We also just had a 20-minute period of 0 QPS


Alfonso Subiotto Marqués [42 minutes ago]
Doesn’t seem like tpcc was stopped according to supervisor

Alfonso Subiotto Marqués [38 minutes ago]
Seeing a fair amount of
unable to find value for /Table/57/2/0/6/3/266420/0 @ 1549812478.372101895,1 (txn:<id:a436cac5-2b9e-4cbe-b920-41019199adef key:"\301\211\210\211\205\373\354\326\210" timestamp:<wall_time:1549812474876199165 logical:2 > priority:989256 sequence:18 > timestamp:<wall_time:1549812474876199165 logical:2 > deleted:false key_bytes:12 val_bytes:5 vs {ID:4d6e9578-0f3c-4977-9e6a-71e97b72561d Key:[193 137 136 137 133 251 236 216 136] Epoch:4 Timestamp:1549812478.372101895,1 Priority:843210 Sequence:152 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
In the logs during this period cc @tbg

Tobias Grieger [35 minutes ago]
Hmm that doesn’t look good. @nathan do you remember if this warning can fire randomly during TPCC?

Tobias Grieger [35 minutes ago]
also is the 20 minute zero qps period the one we’ve been trying to hunt down?

Tobias Grieger [35 minutes ago]
Unfortunately I’m about to head out. cc @bdarnell (edited)

Nathan VanBenschoten [31 minutes ago]
If things recovered then that's not the specific contention issue we've been looking for

Alfonso Subiotto Marqués [31 minutes ago]
Added this information to the issue #34288

Nathan VanBenschoten [30 minutes ago]
@asubiotto mind posting the whole log (w/ line number) here and there? We have a few of those logs that check for subtly different issues (edited)

Alfonso Subiotto Marqués [30 minutes ago]
W190210 15:29:29.544486 292684971 storage/engine/mvcc.go:2215 [n1,s1,r3540/5:/Table/57/2/6/{8/NUL…-10/10…}] unable to find value for /Table/57/2/6/8/1/265962/0 @ 1549812565.802243815,1 (txn:<id:8fa4175f-b14c-4432-9745-5a6d9ced57c4 key:"\301\211\216\211\205\373\360\004\210" timestamp:<wall_time:1549812565381965658 logical:2 > priority:545203 sequence:24 > timestamp:<wall_time:1549812565381965658 logical:2 > deleted:false key_bytes:12 val_bytes:5 vs {ID:a59e27b2-dfdd-48a7-ab0e-46bfde18eca5 Key:[193 137 142 137 133 251 240 5 136] Epoch:3 Timestamp:1549812565.802243815,1 Priority:1306037 Sequence:155 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})

Nathan VanBenschoten [19 minutes ago]
// This should never happen as ok is true above. hmm

Tobias Grieger [19 minutes ago]
did this run with any of the bugs we’ve had last week?

Nathan VanBenschoten [18 minutes ago]
that’s what I was trying to verify. It looks like it to me

Nathan VanBenschoten [18 minutes ago]
at least, it contains #34659. (edited)

Tobias Grieger [15 minutes ago]
can we run the checks against this cluster?

Nathan VanBenschoten [14 minutes ago]
yes and it’s likely that they will fail because there was a period of time where we ran without #34659 (right @asubiotto?)

Tobias Grieger [14 minutes ago]
let me just run 3.3.2.1

Alfonso Subiotto Marqués [13 minutes ago]
Yes, the first sha didn’t contain #34659

Tobias Grieger [11 minutes ago]
it passed 3.3.2.1 🤷

Tobias Grieger [11 minutes ago]
let me try the rest

Tobias Grieger [9 minutes ago]
wait. how do I actually run the checks using workload?

Tobias Grieger [8 minutes ago]
/root/workload run tpcc --warehouses=10 --wait=false --tolerate-errors 'postgres://root@localhost:26257/?sslmode=verify-ca&sslrootcert=certs/ca.crt&sslcert=certs/client.root.crt&sslkey=certs/client.root.key' --expensive-checks --duration 1s but that doesn’t do it

Nathan VanBenschoten [8 minutes ago]
workload check tpcc

Tobias Grieger [7 minutes ago]
ah

Tobias Grieger [6 minutes ago]
do you have any other ideas for what to look at during the stall?

Nathan VanBenschoten [5 minutes ago]
not after the fact. I am a little confused how txn 8fa4175f has epoch 0

Nathan VanBenschoten [5 minutes ago]
oh never mind, I thought it was 1 indexed

Tobias Grieger [5 minutes ago]
what are you looking at?

Nathan VanBenschoten [4 minutes ago]
the log posted above

Tobias Grieger [3 minutes ago]
I190211 16:16:57.314877 1 workload/tpcc/tpcc.go:258 check 3.3.2.1 took 4.348727723s
I190211 16:19:44.502456 1 workload/tpcc/tpcc.go:258 check 3.3.2.2 took 2m47.187528271s
Error: check failed: 3.3.2.2: inequality at idx 0: order: 282810.000000, newOrder: 282807.000000, district-1: 282806.000000

Tobias Grieger [1 minute ago]
@nathan not seeing it. Kind of hard to search in a thread. Anyway, as long as you’re happy.

Nathan VanBenschoten [< 1 minute ago]
I’m optimistic that 3.3.2.2 is failing because of #34659. We saw that in a few roachtests

Nathan VanBenschoten [< 1 minute ago]
In terms of the stall, it could be something new or not, but since it isn’t getting stuck indefinitely I’m less immediately concerned. FWIW we just got in a new roachtest to stress this kind of stuff.

Tobias Grieger [< 1 minute ago]
what’s the roachtest?

Nathan VanBenschoten [< 1 minute ago]
kv/contention/nodes=4
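
For reference, a rough sketch of the workload check tpcc invocation mentioned in the thread above, modeled on the run command quoted earlier; the binary path, warehouse count, and connection URL are assumptions rather than the exact command used:

# Sketch only: runs the TPC-C consistency checks (3.3.2.x) against an existing
# cluster, instead of passing --expensive-checks to `workload run tpcc`.
/root/workload check tpcc --warehouses=10 \
  'postgres://root@localhost:26257/?sslmode=verify-ca&sslrootcert=certs/ca.crt&sslcert=certs/client.root.crt&sslkey=certs/client.root.key'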

asubiotto assigned jseldess and unassigned asubiotto Feb 11, 2019
tbg (Member) commented Feb 11, 2019

@nvanbenschoten @jordanlewis what would we need to diagnose such stalls in the future?

nvanbenschoten (Member) commented:

If the stall was persistent (like we saw before) then we could have used /debug/requests to trace what was stuck. For instance, it would be pretty clear if things were stuck in the txnwait.Queue or the contentionQueue and what they were stuck on. We could also get a lot of mileage out of goroutine dumps. If we're looking at a stall in the past then we probably won't have the same luxury.

Right now we have no way to tell after the fact whether things were stuck in these components, which is really bad. One of the goals for the new concurrency package I'd like to add is to introduce more meaningful metrics (see #31106) around concurrency control handling. We could slap some metrics in various places now, but without thinking them through carefully, I fear they'd be useless for any actual debugging effort. They might tell us where things are stuck, but not why they are stuck.
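
As a concrete illustration of the kind of evidence described above (only useful while a stall is ongoing, not after the fact), one could capture the debug endpoints on an affected node. The port and scheme below assume a locally reachable node with the default HTTP port and are illustrative only:

# Sketch, assuming the admin/HTTP port is the default 8080 on an insecure or
# locally proxied node; a secure cluster needs the appropriate certs.
# /debug/requests shows in-flight request traces (e.g. work stuck in
# txnwait.Queue or the contentionQueue); the pprof endpoint dumps every
# goroutine stack for offline inspection.
curl -s 'http://localhost:8080/debug/requests' -o debug-requests.html
curl -s 'http://localhost:8080/debug/pprof/goroutine?debug=2' -o goroutines.txt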

awoods187 (Contributor) commented:

This looks like it can be closed.
