release: v20.1.0-alpha.20200123 #43754
Comments
Picking the SHA is blocked on #43616
#43616 is now fixed. Please go ahead with picking the SHA.
Thanks! Picked a SHA: 6bfe4a1
Potentially blocked on #43827. A bunch of the release teamcity builds are failing because of this. I've tried to work around it, but we may have to cherry-pick a fix.
Peter fixed the makefile issue. I'll cherry-pick a new SHA and get the roachtests running tomorrow. I started up the roachprod tests today, so assuming nothing else goes wrong, we should have enough data to keep the original release date.
I've been waiting to triage the roachtest failures until I can get someone to make a call on whether "SQL Logic Test High VModule Nightly" failing to build is an alpha release blocker or not. At this point, instead of restarting everything, I think I'll try pinging that one build with the candidate SHA plus a makefile change that fixes the build, and then triage the roachtest failures we got. This is unlikely to finish on a Friday, so my guess is this alpha is slipping until Tuesday or Wednesday.
Obsoleted by new SHA:
- CDC
- KV
- SQL Exec
Something definitely regressed in changefeeds recently. I need to look into exactly what, but I think these failures are okay for an alpha. The initial scan one was making progress, though slowly (the most behind span is logged and it kept changing over the course of the test). The other two got further behind than they should, but eventually caught up.
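To make the "making progress, though slowly" observation concrete, here is a minimal, self-contained sketch, not CockroachDB's actual changefeed code, of tracking a per-span resolved timestamp and reporting the laggiest span. The `Span` type and timestamps are illustrative; the idea is that if the identity of the most-behind span keeps changing, the scan is advancing, just slowly.

```go
package main

import "fmt"

// Span is an illustrative key range; CockroachDB's real spans use roachpb keys.
type Span struct{ Key, EndKey string }

type frontier struct {
	resolved map[Span]int64 // span -> resolved (wall) timestamp
}

// Forward records progress on a span, never moving its timestamp backward.
func (f *frontier) Forward(s Span, ts int64) {
	if cur, ok := f.resolved[s]; !ok || ts > cur {
		f.resolved[s] = ts
	}
}

// laggiest returns the span with the minimum resolved timestamp; this is
// the "most behind span" one would log during an initial scan.
func (f *frontier) laggiest() (Span, int64) {
	var worst Span
	min := int64(-1)
	for s, ts := range f.resolved {
		if min == -1 || ts < min {
			worst, min = s, ts
		}
	}
	return worst, min
}

func main() {
	f := &frontier{resolved: map[Span]int64{}}
	f.Forward(Span{"a", "b"}, 100)
	f.Forward(Span{"b", "c"}, 90)
	f.Forward(Span{"b", "c"}, 120) // progress: the laggiest span changes
	s, ts := f.laggiest()
	fmt.Printf("most behind: %v@%d\n", s, ts)
}
```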
Checked off all sqlsmith failures, since those are never blockers (as Matt always says).
Checked off hotspotsplits/nodes=4. This error has been happening periodically for a while now (#33660). Should not be a release blocker.
The scaledata failures are interesting. They're hitting issues because the creation of the initial empty schema is taking a very long time (> 1 minute). This is allowing chaos to start before we expect it to. Here's some logging from the first failure, filtered down to just the schema changes triggered here:
One interesting thing we see here is multi-second pauses when waiting for leases to expire. @ajwerner said he'll take a look because this could be related to dropping the closed timestamp down to 3 seconds.
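For context on those "waiting for leases to expire" pauses: a schema change publishes a new descriptor version and then must wait until no node holds a lease on the older version before proceeding. The sketch below is a hypothetical illustration of that wait loop, not the actual leasing code; `countLeases` and the backoff constants are stand-ins.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const retryStart = 100 * time.Millisecond

// countLeases would consult the lease table for leases on the given
// descriptor version; here it is a stub that pretends they are all gone.
func countLeases(ctx context.Context, descID, version int64) (int, error) {
	return 0, nil
}

// waitForOneVersion polls until no node leases the old descriptor version.
// With lease durations of a few seconds, each stale lease can stall the
// schema change for multi-second stretches, matching the pauses in the logs.
func waitForOneVersion(ctx context.Context, descID, oldVersion int64) error {
	for r := retryStart; ; r = r * 2 { // simple exponential backoff
		n, err := countLeases(ctx, descID, oldVersion)
		if err != nil {
			return err
		}
		if n == 0 {
			return nil // old version fully drained; safe to proceed
		}
		fmt.Printf("waiting for %d leases on version %d to expire\n", n, oldVersion)
		select {
		case <-time.After(r):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	if err := waitForOneVersion(context.Background(), 52, 1); err != nil {
		fmt.Println("error:", err)
	}
}
```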
Okay, sounds good! We'll hold up the alpha until those get investigated, thanks! cc @lnhsingh (Btw, also want to make sure you noticed that the name of the alpha was wrong on the release calendar; it had 2019 in the date part of the version tag. I've fixed all occurrences of this error on the release calendar.)
I'm trying to reproduce anything even remotely close to this sort of latency, but I'm having a hard time. I dug into some more details. One thing to note is that it took almost 20s at the beginning of the test to up-replicate all of the ranges (I see this from the rangelog). Another thing worth noting is that all of the jobs took 5s to start. Also worth noting is the initial index creation. I've modified the test to fail if the schema portion of the test takes more than 30 seconds (the diff is in a comment below), and I'm running it at scale with roachtest; so far, after 400 iterations, it hasn't failed. The question is why these schema changes are taking so long. I don't have a great answer, but there is some evidence of waiting for leases to expire. Additionally, there's a stats job that seems to have run in the middle here. I'm trying to understand what could be going on. I don't have any evidence that this is in any way related to closed timestamps.
After more digging, it seems clear that somehow we're seeing an outstanding lease on every attempt to publish a new version. The other confounding factor is slow up-replication and some lease transfers due to it. All of the other node metrics seem fine. The storage latency numbers are good. The CPU is near zero. The question is whether it's possible for a slow initial lease acquisition to somehow always leave a node with a lease one version behind or something like that.
The question at hand is: how is this lease getting leaked? I've modified the test like so and observed 1000 successful runs. Perhaps the vmodule logging has changed the behavior, so I'm going to remove that and try again.

--- a/pkg/cmd/roachtest/scaledata.go
+++ b/pkg/cmd/roachtest/scaledata.go
@@ -38,11 +38,11 @@ func registerScaleData(r *testRegistry) {
 	for app, flags := range apps {
 		app, flags := app, flags // copy loop iterator vars
-		const duration = 10 * time.Minute
+		const duration = 30 * time.Second
 		for _, n := range []int{3, 6} {
 			r.Add(testSpec{
 				Name:    fmt.Sprintf("scaledata/%s/nodes=%d", app, n),
-				Timeout: 2 * duration,
+				Timeout: 20 * duration,
 				Cluster: makeClusterSpec(n + 1),
 				Run: func(ctx context.Context, t *test, c *cluster) {
 					runSqlapp(ctx, t, c, app, flags, duration)
@@ -73,7 +73,7 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
 	c.Put(ctx, b, app, appNode)
 	c.Put(ctx, cockroach, "./cockroach", roachNodes)
-	c.Start(ctx, t, roachNodes)
+	c.Start(ctx, t, roachNodes, startArgs("--args='--vmodule=schema*=2,lease=2,jobs=2,registry=2'"))

 	// TODO(nvanbenschoten): We are currently running these consistency checks with
 	// basic chaos. We should also run them in more chaotic environments which
@@ -106,17 +106,21 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
 		defer sqlappL.close()

 		t.Status("installing schema")
+		start := time.Now()
 		err = c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s --install_schema "+
 			"--cockroach_ip_addresses_csv='%s' %s", app, addrStr, flags))
 		if err != nil {
 			return err
 		}
+		if took := time.Since(start); took > 20*time.Second {
+			return fmt.Errorf("it took %v to set up the schema", took)
+		}

 		t.Status("running consistency checker")
 		const workers = 16
 		return c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s --duration_secs=%d "+
 			"--num_workers=%d --cockroach_ip_addresses_csv='%s' %s",
 			app, int(dur.Seconds()), workers, addrStr, flags))
 	})
 	m.Wait()
}
Okay I've discovered something interesting w.r.t. the scaledata failure. The nodes are geo-distributed:
@nvanbenschoten confirms that this is the case for the other roachprod failure. Checking off the failures. It seems that this is all fallout from #43898 (comment)
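For the record, here is a quick way to confirm from SQL that a cluster's nodes really are geo-distributed. This is a sketch under assumptions: that the `crdb_internal.gossip_nodes` virtual table exposes `node_id`, `address`, and `locality` columns (treat the exact schema as an assumption for this era of CockroachDB), and that an insecure node is reachable locally.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; works against CockroachDB
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("SELECT node_id, address, locality FROM crdb_internal.gossip_nodes")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int
		var addr, locality string
		if err := rows.Scan(&id, &addr, &locality); err != nil {
			log.Fatal(err)
		}
		// Differing region/zone tiers across nodes => geo-distributed cluster.
		fmt.Printf("n%d %s locality=%s\n", id, addr, locality)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```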
@jlinder FYI
Thanks @ajwerner!
Okay, I'm going to let this bake on the long-running clusters one more night. If nothing goes wrong, I'll bless the release tomorrow morning.
@lnhsingh Binaries have been blessed; turning this over to docs and marketing.
We have a late-breaking release blocker, found just before marketing comms had gone out. The docs PR has been reverted. We'll restart this alpha when there is a fix: #43979
Going to restart this one, cc @jseldess.
Thanks for the heads-up, @otan. Can we also be sure to use a new date in the binary name? Also, can you share the new candidate SHA? I'll have to generate new notes since the last candidate.
Haven't picked yet because I'm waiting for a build to finish; will let you know when it's changed. It should just be an additional cherry-pick of 3c0671c
Can you clarify what this means?
I think he means that the
The failure mode for
You can click the "edited by" drop-down on the checklist and see who marked what.
@andreimatei you're the one who's missing a note about why you signed off.
@knz I had no idea! Thanks, it's really comforting that this exists.
I signed off on
FWIW, I just ran mixed-headroom 5x on GCE and it passed. The AWS build did time out (execution timeout), so we probably missed a fair share of tests there. I'm going to run the offending test (the last one to fail before the stall), tpcc/headroom/n4cpu16, 5x on AWS.
Signing off on the schema change tests. They look like failures due to context cancellation, which shouldn't block the release.
@andreimatei and I figured out through arcane magic that the AWS build was really failed by tpccbench, but that we didn't manage to get any logs. I'm going to own fixing that infrastructure issue.
Headroom passed 5x on AWS as well. Since the failure mode is so similar, I'm checking off the version-mixed ones too.
I believe the cancel failure is owned by sql: #42103. It looks like the same failure we've been seeing for close to two months now. @asubiotto I'll take the liberty of checking the box.
tpccbench (cpu=16) has a panic on n2. @andreimatei or @nvanbenschoten do you have context on this?
I'm signing off on the other tpccbench failure since it's the same mystery 255-killed thing.
@ajwerner could you pass judgement on the follower reads test failure:
Signing off on the gossip chaos test. Looks like an infra failure; we weren't even able to connect to the machines via SSH to fetch logs in the end.
The chaos tpccbench failure is unclear to me. It looks like it's fine for a while, but then we start a new run and never get the workload running (presumably it's never getting past the phase in which it's sensitive to the queries getting canceled), and we spend hours in a loop there but eventually hit the 10h test timeout:
@nvanbenschoten have you seen this before/can you make sense of it?
This failed early on in the test while it seems that the data was still being up-replicated. The test failed at
I'll take an action item to wait for the data to be fully replicated before starting the test.
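A minimal sketch of what such a wait can look like (this is the shape of the fix that later landed as #44444): poll until every range reports at least three replicas before kicking off the test. The exact `crdb_internal.ranges` column layout is an assumption based on the CockroachDB version in question.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

// waitForFullReplication polls until every range has at least three replicas.
func waitForFullReplication(db *sql.DB) error {
	for {
		var minReplicas int
		row := db.QueryRow(
			"SELECT min(array_length(replicas, 1)) FROM crdb_internal.ranges")
		if err := row.Scan(&minReplicas); err != nil {
			return err
		}
		if minReplicas >= 3 {
			return nil // every range is fully replicated; safe to start the test
		}
		fmt.Printf("waiting for 3x replication (min=%d)\n", minReplicas)
		time.Sleep(time.Second)
	}
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := waitForFullReplication(db); err != nil {
		log.Fatal(err)
	}
}
```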
We probably shouldn't release with that crash. I'll see what to do.
44393: issues: fix issue creation r=andreimatei a=tbg

We were specifying `C-test-failure` twice. Github actually doesn't care, but let's fix it anyway.

Release note: None

44444: roachtest: wait for full replication in follower_reads test r=tbg a=ajwerner

Before this commit it was possible for a follower reads test to commence before the data had been up-replicated to all of the nodes. This change ensures that the data is fully replicated before beginning the test.

See #43754 (comment). We saw that [here](https://teamcity.cockroachdb.com/viewLog.html?buildId=1708156&buildTypeId=Cockroach_Nightlies_WorkloadNightly&tab=artifacts#%2Ffollower-reads).

Release note: None.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
I have #44451 which I hope fixed
Yes, I've seen this failure mode in the past occasionally. The workload gets stuck trying to prepare statements on its 2100 connections and doesn't seem to be able to complete this before it gets killed again (~1m later), so it loops indefinitely. Signing off on this failure while we look more into the issue, because it's not an issue with the alpha.
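To illustrate that failure mode: with ~2100 connections, just re-preparing statements can eat the whole window between chaos kills, so the workload never reaches steady state. Below is a hedged sketch, with an illustrative DSN, statement, and connection count, of bounding that prepare phase with a context deadline so the run fails loudly instead of looping forever.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	const numConns = 2100 // per the workload in the failing run
	ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
	defer cancel()

	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(numConns)

	conns := make([]*sql.Conn, 0, numConns)
	defer func() {
		for _, c := range conns {
			c.Close()
		}
	}()
	for i := 0; i < numConns; i++ {
		conn, err := db.Conn(ctx) // pin a dedicated connection from the pool
		if err != nil {
			log.Fatalf("conn %d: %v (deadline exceeded?)", i, err)
		}
		conns = append(conns, conn)
		// Prepare on this specific connection; repeated across thousands of
		// connections, this is the phase that can outlast a chaos cycle.
		stmt, err := conn.PrepareContext(ctx, "SELECT w_tax FROM warehouse WHERE w_id = $1")
		if err != nil {
			log.Fatalf("prepare on conn %d: %v", i, err)
		}
		stmt.Close()
	}
	fmt.Println("all statements prepared before the deadline")
}
```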
#44460 is now in bors. Once that merges, I'll check off
@andreimatei If you feel that this PR is critical to be included in this alpha, please follow the instructions here: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/160235536/Changing+a+Scheduled+Release
Current signoff comment
Candidate SHA: d818f1ff64aaa725a8aea2d45c7df9dd4a42d4d3
Deployment status: otan-release-v2010-alpha20191216-e9e2a803 / otan-release-v2010-alpha20200123-d818f1ff
Nightly Suite: https://teamcity.cockroachdb.com/viewQueued.html?itemId=1708159&tab=queuedBuildOverviewTab
Release Qualification: https://teamcity.cockroachdb.com/viewLog.html?buildId=1708120&tab=buildResultsDiv&buildTypeId=Cockroach_ReleaseQualification
Deployment Dashboards:
Release process checklist

Prep date: 23rd January, 2020
- Fill in "Candidate SHA" above, notify #release-announce of the SHA.
- Fill in "Deployment status" above with the clusters and dashboards.
- Fill in "Nightly Suite" with the link to the Nightly TeamCity job.

One day after prep date:
- Get signoff on roachtest failures.
- Keep an eye on clusters until the release date. Do not proceed below until the release date.

Release date: 30th January, 2020
- Check cluster status.
- Tag release.
- Bless provisional binaries.
- For production or stable releases in the latest major release series:
  - Update docs.
  - External communications for release.
- Clean up provisional tag from repository.
old info - obsoleted by new SHA -- formerly release: v20.1.0-alpha.20200113
Candidate SHA: 617cb39.
Deployment status: dan-release-v2010-alpha20191216-6bfe4a1/dan-release-v2010-alpha20190113-e9e2a80.
Nightly Suite: ReleaseQual NightlySuite.