
release: v2.2.0-alpha.20190211 #34288

Closed · 12 of 17 tasks
knz opened this issue Jan 28, 2019 · 15 comments
knz (Contributor) commented Jan 28, 2019

Candidate SHA: 4bdf0ad.
Deployment status: Docs & External comms.
Nightly Suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1129733&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_NightlySuite. No suspicious failures.

Release process checklist

Prep date: 2/5/2019

Release date: 2/11/2019

asubiotto self-assigned this Feb 5, 2019
asubiotto added commits to cockroachdb/bincheck that referenced this issue Feb 5, 2019
asubiotto (Contributor) commented:

A new SHA will need to be picked to include #34659.

asubiotto (Contributor) commented:

Updating candidate SHA to 66f13c1.

asubiotto added a commit to cockroachdb/bincheck that referenced this issue Feb 6, 2019
asubiotto (Contributor) commented:

Note to self: re-run interleavedpartitioned; it failed due to a quota issue.

asubiotto (Contributor) commented:

interleavedpartitioned passed.

asubiotto (Contributor) commented:

An issue was uncovered when creating automatic statistics on a view. The fix is ready (#34726), but a new SHA will need to be picked.

asubiotto added a commit to cockroachdb/bincheck that referenced this issue Feb 8, 2019
asubiotto (Contributor) commented:

Updated candidate SHA to 4bdf0ad

tbg (Member) commented Feb 8, 2019

Unfortunately, #34712 is another fix that deserves to be included.

tbg (Member) commented Feb 8, 2019

(and #34502; both fix bugs that can cause the same symptom: an inability to upreplicate)

asubiotto (Contributor) commented:

These fixes don't justify cutting a new alpha, since not including them won't introduce regressions (as far as I understand, these issues are also present in previous alphas).

asubiotto (Contributor) commented Feb 11, 2019

Around 24h after 4bdf0ad was deployed to adriatic (02/08 00:00:00), we started seeing extremely variable p99 service latency and an uptick in replicas per node due to splits.

[screenshot taken 2019-02-11 at 8:10 AM]

It's unclear to me whether these splits are expected, but the cluster seems to have stabilized.

Additionally, I observed a 20-minute period of 0 QPS. Double-checking the supervisor status shows that there were no restarts of the workload. Checking the logs during this period shows a lot of:

W190210 15:29:29.544486 292684971 storage/engine/mvcc.go:2215  [n1,s1,r3540/5:/Table/57/2/6/{8/NUL…-10/10…}] unable to find value for /Table/57/2/6/8/1/265962/0 @ 1549812565.802243815,1 (txn:<id:8fa4175f-b14c-4432-9745-5a6d9ced57c4 key:"\301\211\216\211\205\373\360\004\210" timestamp:<wall_time:1549812565381965658 logical:2 > priority:545203 sequence:24 > timestamp:<wall_time:1549812565381965658 logical:2 > deleted:false key_bytes:12 val_bytes:5  vs {ID:a59e27b2-dfdd-48a7-ab0e-46bfde18eca5 Key:[193 137 142 137 133 251 240 5 136] Epoch:3 Timestamp:1549812565.802243815,1 Priority:1306037 Sequence:155 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})

Graphs from this period:

[image]
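As a minimal sketch of the supervisor check mentioned above — assuming the workload runs under supervisord on the load generator, with an illustrative program name — an uptime that spans the stall window is what rules out a restart:

# Sketch only; the program name "tpcc" and the output line are assumptions for illustration.
supervisorctl status
# tpcc    RUNNING   pid 12345, uptime 2 days, 3:04:05
# An uptime covering the 20-minute stall means the workload was not restarted.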

tbg (Member) commented Feb 11, 2019

It's weird that the service latency also dropped to zero.

asubiotto (Contributor) commented Feb 11, 2019

Some more discussion regarding the stall:

Alfonso Subiotto Marqués [Today at 8:39 AM]
We also just had a 20-minute period of 0 QPS


Alfonso Subiotto Marqués [42 minutes ago]
Doesn’t seem like tpcc was stopped according to supervisor

Alfonso Subiotto Marqués [38 minutes ago]
Seeing a fair amount of
unable to find value for /Table/57/2/0/6/3/266420/0 @ 1549812478.372101895,1 (txn:<id:a436cac5-2b9e-4cbe-b920-41019199adef key:"\301\211\210\211\205\373\354\326\210" timestamp:<wall_time:1549812474876199165 logical:2 > priority:989256 sequence:18 > timestamp:<wall_time:1549812474876199165 logical:2 > deleted:false key_bytes:12 val_bytes:5 vs {ID:4d6e9578-0f3c-4977-9e6a-71e97b72561d Key:[193 137 136 137 133 251 236 216 136] Epoch:4 Timestamp:1549812478.372101895,1 Priority:843210 Sequence:152 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
In the logs during this period cc @tbg

Tobias Grieger [35 minutes ago]
Hmm that doesn’t look good. @nathan do you remember if this warning can fire randomly during TPCC?

Tobias Grieger [35 minutes ago]
also is the 20 minute zero qps period the one we’ve been trying to hunt down?

Tobias Grieger [35 minutes ago]
Unfortunately I’m about to head out. cc @bdarnell (edited)

Nathan VanBenschoten [31 minutes ago]
If things recovered then that's not the specific contention issue we've been looking for

Alfonso Subiotto Marqués [31 minutes ago]
Added this information to the issue #34288

Nathan VanBenschoten [30 minutes ago]
@asubiotto mind posting the whole log (w/ line number) here and there? We have a few of those logs that check for subtly different issues (edited)

Alfonso Subiotto Marqués [30 minutes ago]
W190210 15:29:29.544486 292684971 storage/engine/mvcc.go:2215 [n1,s1,r3540/5:/Table/57/2/6/{8/NUL…-10/10…}] unable to find value for /Table/57/2/6/8/1/265962/0 @ 1549812565.802243815,1 (txn:<id:8fa4175f-b14c-4432-9745-5a6d9ced57c4 key:"\301\211\216\211\205\373\360\004\210" timestamp:<wall_time:1549812565381965658 logical:2 > priority:545203 sequence:24 > timestamp:<wall_time:1549812565381965658 logical:2 > deleted:false key_bytes:12 val_bytes:5 vs {ID:a59e27b2-dfdd-48a7-ab0e-46bfde18eca5 Key:[193 137 142 137 133 251 240 5 136] Epoch:3 Timestamp:1549812565.802243815,1 Priority:1306037 Sequence:155 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})

Nathan VanBenschoten [19 minutes ago]
// This should never happen as ok is true above. hmm

Tobias Grieger [19 minutes ago]
did this run with any of the bugs we’ve had last week?

Nathan VanBenschoten [18 minutes ago]
that’s what I was trying to verify. It looks like it to me

Nathan VanBenschoten [18 minutes ago]
at least, it contains #34659. (edited)

Tobias Grieger [15 minutes ago]
can we run the checks against this cluster?

Nathan VanBenschoten [14 minutes ago]
yes and it’s likely that they will fail because there was a period of time where we ran without #34659 (right @asubiotto?)

Tobias Grieger [14 minutes ago]
let me just run 3.3.2.1

Alfonso Subiotto Marqués [13 minutes ago]
Yes, the first sha didn’t contain #34659

Tobias Grieger [11 minutes ago]
it passed 3.3.2.1 🤷

Tobias Grieger [11 minutes ago]
let me try the rest

Tobias Grieger [9 minutes ago]
wait. how do I actually run the checks using workload?

Tobias Grieger [8 minutes ago]
/root/workload run tpcc --warehouses=10 --wait=false --tolerate-errors 'postgres://root@localhost:26257/?sslmode=verify-ca&sslrootcert=certs/ca.crt&sslcert=certs/client.root.crt&sslkey=certs/client.root.key' --expensive-checks --duration 1s but that doesn’t do it

Nathan VanBenschoten [8 minutes ago]
workload check tpcc

Tobias Grieger [7 minutes ago]
ah

Tobias Grieger [6 minutes ago]
do you have any other ideas for what to look at during the stall?

Nathan VanBenschoten [5 minutes ago]
not after the fact. I am a little confused how txn 8fa4175f has epoch 0

Nathan VanBenschoten [5 minutes ago]
oh never mind, I thought it was 1 indexed

Tobias Grieger [5 minutes ago]
what are you looking at?

Nathan VanBenschoten [4 minutes ago]
the log posted above

Tobias Grieger [3 minutes ago]
I190211 16:16:57.314877 1 workload/tpcc/tpcc.go:258 check 3.3.2.1 took 4.348727723s
I190211 16:19:44.502456 1 workload/tpcc/tpcc.go:258 check 3.3.2.2 took 2m47.187528271s
Error: check failed: 3.3.2.2: inequality at idx 0: order: 282810.000000, newOrder: 282807.000000, district-1: 282806.000000

Tobias Grieger [1 minute ago]
@nathan not seeing it. Kind of hard to search in a thread. Anyway, as long as you’re happy.

Nathan VanBenschoten [< 1 minute ago]
I’m optimistic that 3.3.2.2 is failing because of #34659. We saw that in a few roachtests

Nathan VanBenschoten [< 1 minute ago]
In terms of the stall, it could be something new or not, but since it isn’t getting stuck indefinitely I’m less immediately concerned. FWIW we just got in a new roachtest to stress this kind of stuff.

Tobias Grieger [< 1 minute ago]
what’s the roachtest?

Nathan VanBenschoten [< 1 minute ago]
kv/contention/nodes=4
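
For reference, a rough sketch of the workload check tpcc invocation mentioned in the thread above, modeled on the run command quoted earlier; the binary path, warehouse count, and connection URL are assumptions rather than the exact command used:

# Sketch only: runs the TPC-C consistency checks (3.3.2.x) against an existing
# cluster, instead of passing --expensive-checks to `workload run tpcc`.
/root/workload check tpcc --warehouses=10 \
  'postgres://root@localhost:26257/?sslmode=verify-ca&sslrootcert=certs/ca.crt&sslcert=certs/client.root.crt&sslkey=certs/client.root.key'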

asubiotto assigned jseldess and unassigned asubiotto Feb 11, 2019
tbg (Member) commented Feb 11, 2019

@nvanbenschoten @jordanlewis what would we need to diagnose such stalls in the future?

nvanbenschoten (Member) commented:

If the stall was persistent (like we saw before) then we could have used /debug/requests to trace what was stuck. For instance, it would be pretty clear if things were stuck in the txnwait.Queue or the contentionQueue and what they were stuck on. We could also get a lot of mileage out of goroutine dumps. If we're looking at a stall in the past then we probably won't have the same luxury.

Right now we have no way to tell after the fact whether things were stuck in these components, which is really bad. One of the goals for the new concurrency package I'd like to add is to introduce more meaningful metrics (see #31106) around concurrency control handling. We could slap some metrics in various places now, but without thinking them through carefully, I fear they'd be useless for any actual debugging effort. They might tell us where things are stuck, but not why they are stuck.
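
As a concrete illustration of the kind of evidence described above (only useful while a stall is ongoing, not after the fact), one could capture the debug endpoints on an affected node. The port and scheme below assume a locally reachable node with the default HTTP port and are illustrative only:

# Sketch, assuming the admin/HTTP port is the default 8080 on an insecure or
# locally proxied node; a secure cluster needs the appropriate certs.
# /debug/requests shows in-flight request traces (e.g. work stuck in
# txnwait.Queue or the contentionQueue); the pprof endpoint dumps every
# goroutine stack for offline inspection.
curl -s 'http://localhost:8080/debug/requests' -o debug-requests.html
curl -s 'http://localhost:8080/debug/pprof/goroutine?debug=2' -o goroutines.txt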

awoods187 (Contributor) commented:

This looks like it can be closed.
