
release: v20.1.0-alpha.20200123 #43754

Closed
14 of 19 tasks
danhhz opened this issue Jan 6, 2020 · 55 comments

@danhhz
Contributor

danhhz commented Jan 6, 2020

Current signoff comment:


Candidate SHA: d818f1ff64aaa725a8aea2d45c7df9dd4a42d4d3
Deployment status: otan-release-v2010-alpha20191216-e9e2a803 / otan-release-v2010-alpha20200123-d818f1ff
Nightly Suite: https://teamcity.cockroachdb.com/viewQueued.html?itemId=1708159&tab=queuedBuildOverviewTab
Release Qualification: https://teamcity.cockroachdb.com/viewLog.html?buildId=1708120&tab=buildResultsDiv&buildTypeId=Cockroach_ReleaseQualification

Deployment Dashboards:

Release process checklist

Prep date: 23rd January, 2020

One day after prep date:

Release date: 30th January, 2020


old info - obsoleted by new SHA -- formerly release: v20.1.0-alpha.20200113

Candidate SHA: 617cb39.
Deployment status: dan-release-v2010-alpha20191216-6bfe4a1/dan-release-v2010-alpha20190113-e9e2a80.
Nightly Suite: ReleaseQual NightlySuite.

@danhhz danhhz self-assigned this Jan 6, 2020
@danhhz
Contributor Author

danhhz commented Jan 6, 2020

Picking the SHA is blocked on #43616

@solongordon
Contributor

#43616 is now fixed. Please go ahead with picking the SHA.

@danhhz
Contributor Author

danhhz commented Jan 8, 2020

Thanks! Picked a SHA: 6bfe4a1

@danhhz
Contributor Author

danhhz commented Jan 8, 2020

Potentially blocked on #43827. A bunch of the release TeamCity builds are failing because of this. I've tried to work around it, but we may have to cherry-pick a fix.

@danhhz
Contributor Author

danhhz commented Jan 9, 2020

Peter fixed the makefile issue. I'll cherry-pick a new SHA and get the roachtests running tomorrow. I started up the roachprod tests today, so assuming nothing else goes wrong, we should have enough data to keep the original release date.

@danhhz danhhz changed the title release: v20.1.0-alpha.20190113 release: v20.1.0-alpha.20200113 Jan 9, 2020
@danhhz
Contributor Author

danhhz commented Jan 10, 2020

I've been waiting to triage the roachtest failures until I can get someone to make a call on whether "SQL Logic Test High VModule Nightly" failing to build is an alpha release blocker or not. At this point, instead of restarting everything, I think I'll try pinging that one build with the candidate SHA plus a makefile change that fixes the build, and then triage the roachtest failures we got. This is unlikely to finish on a Friday, so my guess is this alpha is slipping until Tuesday or Wednesday.

@danhhz
Contributor Author

danhhz commented Jan 13, 2020

obsoleted by new SHA


Roachtest failures

CDC

  • cdc/initial-scan/rangefeed=true
  • cdc/cloud-sink-gcs/rangefeed=true
  • cdc/ledger/rangefeed=true

KV

  • hotspotsplits/nodes=4
  • network/tpcc/nodes=4
  • scaledata/distributed_semaphore/nodes=3
  • scaledata/jobcoordinator/nodes=3

SQL Exec

  • sqlsmith/setup=empty/setting=default
  • sqlsmith/setup=rand-tables/setting=default
  • sqlsmith/setup=seed/setting=default
  • sqlsmith/setup=seed/setting=no-ddl
  • sqlsmith/setup=tpcc/setting=ddl-nodrop
  • sqlsmith/setup=tpcc/setting=default

@danhhz
Contributor Author

danhhz commented Jan 13, 2020

Something definitely regressed in changefeeds recently. I need to look into exactly what, but I think these failures are okay for an alpha. The initial scan one was making progress, though slowly (the furthest-behind span is logged, and it kept changing over the course of the test). The other two got further behind than they should have, but eventually caught up.

@yuzefovich
Member

Checked off all sqlsmith failures since those are never blockers (as Matt always says).

@rytaft
Collaborator

rytaft commented Jan 13, 2020

Checked off hotspotsplits/nodes=4, which failed due to fatal error: runtime: out of memory.

This error has been happening periodically for a while now (#33660). Should not be a release blocker.

@nvanbenschoten
Member

The scaledata failures are interesting. They're hitting issues because the creation of the initial empty schema is taking a very long time (> 1 minute). This is allowing chaos to start before we expect it to. Here's some logging from the first failure, filtered down to just the schema changes triggered here:

I200109 05:06:45.811349 279 sql/event_log.go:132  [n3,client=10.128.0.109:59278,user=root] Event: "create_table", target: 53, info: {TableName:ds.public.resources Statement:CREATE TABLE resources (id INT8 NOT NULL DEFAULT unique_rowid(), created_at TIMESTAMPTZ, updated_at TIMESTAMPTZ, deleted_at TIMESTAMPTZ, resource_id INT8, free INT8, total INT8, PRIMARY KEY (id)) User:root}
I200109 05:06:47.330294 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root] publish: descID=53 (resources) version=2 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:06:48.795658 279 sql/event_log.go:132  [n3,client=10.128.0.109:59278,user=root] Event: "create_index", target: 53, info: {TableName:ds.public.resources IndexName:idx_resources_deleted_at Statement:CREATE INDEX idx_resources_deleted_at ON resources (deleted_at) User:root MutationID:1}
I200109 05:06:53.756401 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:06:57.729798 279 sql/backfill.go:143  [n3,client=10.128.0.109:59278,user=root,scExec] Running backfill for "resources", v=3, m=1
I200109 05:07:04.577244 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=4 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:08.280914 279 sql/event_log.go:132  [n3,client=10.128.0.109:59278,user=root,scExec] Event: "finish_schema_change", target: 53, info: {MutationID:1}
I200109 05:07:08.853128 279 sql/lease.go:326  [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 3}]
I200109 05:07:12.633817 279 sql/event_log.go:132  [n3,client=10.128.0.109:59278,user=root] Event: "create_table", target: 54, info: {TableName:ds.public.allocations Statement:CREATE TABLE allocations (id INT8 NOT NULL DEFAULT unique_rowid(), created_at TIMESTAMPTZ, updated_at TIMESTAMPTZ, deleted_at TIMESTAMPTZ, resource_id INT8, job_id INT8, units INT8, PRIMARY KEY (id)) User:root}
I200109 05:07:14.801515 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root] publish: descID=54 (allocations) version=2 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:16.644276 279 sql/event_log.go:132  [n3,client=10.128.0.109:59278,user=root] Event: "create_index", target: 54, info: {TableName:ds.public.allocations IndexName:idx_allocations_deleted_at Statement:CREATE INDEX idx_allocations_deleted_at ON allocations (deleted_at) User:root MutationID:1}
I200109 05:07:21.844560 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=3 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:26.125956 279 sql/backfill.go:143  [n3,client=10.128.0.109:59278,user=root,scExec] Running backfill for "allocations", v=3, m=1
I200109 05:07:33.747879 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=4 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:36.699121 279 sql/event_log.go:132  [n3,client=10.128.0.109:59278,user=root,scExec] Event: "finish_schema_change", target: 54, info: {MutationID:1}
I200109 05:07:37.428304 279 sql/lease.go:326  [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{allocations 54 3}]
I200109 05:07:40.662192 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root] publish: descID=53 (resources) version=5 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:07:42.313880 279 sql/event_log.go:132  [n3,client=10.128.0.109:59278,user=root] Event: "create_index", target: 53, info: {TableName:ds.public.resources IndexName:idx_pk Statement:CREATE UNIQUE INDEX idx_pk ON resources (resource_id) User:root MutationID:2}
I200109 05:07:46.689986 279 sql/lease.go:326  [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 4}]
I200109 05:08:03.614645 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=6 mtime=1970-01-01 00:00:00 +0000 UTC
I200109 05:08:05.680050 279 sql/lease.go:326  [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 5}]
I200109 05:08:08.248063 279 sql/backfill.go:143  [n3,client=10.128.0.109:59278,user=root,scExec] Running backfill for "resources", v=6, m=2

One interesting thing we see here is multi-second pauses when waiting for leases to expire. @ajwerner said he'll take a look because this could be related to dropping the closed timestamp down to 3 seconds.
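For reference, a minimal sketch (illustrative only, assuming an open database/sql connection to one of the test cluster's nodes; the helper and package names are made up) of reading the closed timestamp target duration mentioned above; kv.closed_timestamp.target_duration is the relevant cluster setting:

package closedts

import (
	gosql "database/sql"
)

// closedTimestampTarget reads the cluster's closed timestamp target duration,
// the setting referenced above (recently dropped to 3 seconds). Illustrative
// helper, not part of the roachtest code.
func closedTimestampTarget(db *gosql.DB) (string, error) {
	var target string
	// SHOW CLUSTER SETTING returns a single row with the current value.
	err := db.QueryRow(
		"SHOW CLUSTER SETTING kv.closed_timestamp.target_duration",
	).Scan(&target)
	return target, err
}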

@danhhz
Contributor Author

danhhz commented Jan 13, 2020

Okay, sounds good! We'll hold up the alpha until those get investigated, thanks!

cc @lnhsingh (Btw, also want to make sure you noticed that the name of the alpha was wrong on the release calendar; it had 2019 in the date part of the version tag. I've fixed all occurrences of this error on the release calendar.)

@ajwerner
Contributor

One interesting thing we see here is multi-second pauses when waiting for leases to expire. @ajwerner said he'll take a look because this could be related to dropping the closed timestamp down to 3 seconds.

I'm trying to reproduce anything even remotely close to this sort of latency, but I'm having a hard time. I dug into some more details. One thing to note is that it took almost 20s at the beginning of the test to up-replicate all of the ranges (I see this from the rangelog).

Other things worth noting: all of the jobs took 5s to start, and the initial index creation for idx_resources_deleted_at took 15s after it started. Then all of the other jobs each took a while.

I've modified the test to fail if the schema portion of the test takes more than 30 seconds, and I'm running it at scale with roachtest; after 400 iterations so far it hasn't failed.

The question is why these schema changes are taking so long. I don't have a great answer, but there's some evidence of waiting for leases to expire. Additionally, there's a stats job that seems to have happened in the middle here. I'm trying to understand what could be going on. I don't have any evidence that this is in any way related to closed timestamps.
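As a note for anyone reproducing this, here is a minimal sketch (illustrative only, assuming an open database/sql connection and that the crdb_internal.leases virtual table is available; its column set varies by version, hence the generic scan) of dumping the outstanding descriptor leases that the "waiting for N leases to expire" messages are reporting on:

package leasedump

import (
	gosql "database/sql"
	"fmt"
)

// dumpLeases prints every row of crdb_internal.leases without assuming a
// particular column layout, so it works across versions of the table.
// Illustrative helper, not part of the codebase.
func dumpLeases(db *gosql.DB) error {
	rows, err := db.Query("SELECT * FROM crdb_internal.leases")
	if err != nil {
		return err
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		return err
	}
	vals := make([]interface{}, len(cols))
	ptrs := make([]interface{}, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	for rows.Next() {
		if err := rows.Scan(ptrs...); err != nil {
			return err
		}
		for i, c := range cols {
			fmt.Printf("%s=%v ", c, vals[i])
		}
		fmt.Println()
	}
	return rows.Err()
}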

@ajwerner
Contributor

ajwerner commented Jan 13, 2020

After more digging, it seems clear that somehow we're seeing an outstanding lease on every attempt to publish a new version. The other confounding factor is slow up-replication and some lease transfers due to it. All of the other node metrics seem fine. The storage latency numbers are good. The CPU is near zero. The question is whether it's possible for a slow initial lease acquisition to somehow always leave a node with a lease one version behind or something like that.

teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:08.853128 279 sql/lease.go:326  [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 3}]                                                                                                                           
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:14.801515 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root] publish: descID=54 (allocations) version=2 mtime=1970-01-01 00:00:00 +0000 UTC                                                                                             
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:21.844560 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=3 mtime=1970-01-01 00:00:00 +0000 UTC                                                                                      
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:33.747879 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=54 (allocations) version=4 mtime=1970-01-01 00:00:00 +0000 UTC                                                                                      
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:37.428304 279 sql/lease.go:326  [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{allocations 54 3}]                                                                                                                         
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:40.662192 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root] publish: descID=53 (resources) version=5 mtime=1970-01-01 00:00:00 +0000 UTC                                                                                               
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:07:46.689986 279 sql/lease.go:326  [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 4}]                                                                                                                           
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:08:03.614645 279 sql/sqlbase/structured.go:1462  [n3,client=10.128.0.109:59278,user=root,scExec] publish: descID=53 (resources) version=6 mtime=1970-01-01 00:00:00 +0000 UTC                                                                                        
teamcity-1680508-1578546114-10-n4cpu4-0003> I200109 05:08:05.680050 279 sql/lease.go:326  [n3,client=10.128.0.109:59278,user=root,scExec] waiting for 1 leases to expire: desc=[{resources 53 5}]  

The question at hand is: how is this lease getting leaked? I've modified the test like so and observed 1000 successful runs. Perhaps the vmodule logging has changed the behavior, so I'm going to remove that and try again.

--- a/pkg/cmd/roachtest/scaledata.go
+++ b/pkg/cmd/roachtest/scaledata.go
@@ -38,11 +38,11 @@ func registerScaleData(r *testRegistry) {
 
        for app, flags := range apps {
                app, flags := app, flags // copy loop iterator vars
-               const duration = 10 * time.Minute
+               const duration = 30 * time.Second
                for _, n := range []int{3, 6} {
                        r.Add(testSpec{
                                Name:    fmt.Sprintf("scaledata/%s/nodes=%d", app, n),
-                               Timeout: 2 * duration,
+                               Timeout: 20 * duration,
                                Cluster: makeClusterSpec(n + 1),
                                Run: func(ctx context.Context, t *test, c *cluster) {
                                        runSqlapp(ctx, t, c, app, flags, duration)
@@ -73,7 +73,7 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
 
        c.Put(ctx, b, app, appNode)
        c.Put(ctx, cockroach, "./cockroach", roachNodes)
-       c.Start(ctx, t, roachNodes)
+       c.Start(ctx, t, roachNodes, startArgs("--args='--vmodule=schema*=2,lease=2,jobs=2,registry=2'"))
 
        // TODO(nvanbenschoten): We are currently running these consistency checks with
        // basic chaos. We should also run them in more chaotic environments which
@@ -106,17 +106,22 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
                defer sqlappL.close()
 
                t.Status("installing schema")
+               start := time.Now()
                err = c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s --install_schema "+
                        "--cockroach_ip_addresses_csv='%s' %s", app, addrStr, flags))
                if err != nil {
                        return err
                }
+               if took := time.Since(start); took > 20*time.Second {
+                       return fmt.Errorf("it took %v to set up the schema", took)
+               }
 
                t.Status("running consistency checker")
                const workers = 16
                return c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s  --duration_secs=%d "+
                        "--num_workers=%d --cockroach_ip_addresses_csv='%s' %s",
                        app, int(dur.Seconds()), workers, addrStr, flags))
+               return nil
        })
        m.Wait()
 }

@ajwerner
Contributor

Okay I've discovered something interesting w.r.t. the scaledata failure. The nodes are geo-distributed:

https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/1680508:id/scaledata/distributed_semaphore/nodes%3D3/roachprod_state/hosts/teamcity-1680508-1578546114-10-n4cpu4

# user@host            locality                                           vpcId
ubuntu@35.222.229.231  cloud=gce,region=us-central1,zone=us-central1-b    default
ubuntu@34.83.167.61    cloud=gce,region=us-west1,zone=us-west1-b          default
ubuntu@34.89.88.226    cloud=gce,region=europe-west2,zone=europe-west2-b  default
ubuntu@35.226.213.255  cloud=gce,region=us-central1,zone=us-central1-b    default

@nvanbenschoten confirms that this is the case for the other roachprod failure. Checking off the failures.

It seems that this is all fallout from #43898 (comment)
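For future triage, a minimal sketch (illustrative only, assuming an open database/sql connection and that crdb_internal.gossip_nodes exposes a locality column, as recent versions do) of spotting a geo-distributed cluster from SQL rather than from the roachprod hosts file:

package localitycheck

import (
	gosql "database/sql"
)

// distinctLocalities returns the distinct locality strings reported by the
// nodes. Multiple regions showing up here means the cluster is
// geo-distributed, which these roachtests do not expect. Illustrative helper,
// not part of the roachtest code.
func distinctLocalities(db *gosql.DB) ([]string, error) {
	rows, err := db.Query(
		"SELECT DISTINCT locality FROM crdb_internal.gossip_nodes ORDER BY locality",
	)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var localities []string
	for rows.Next() {
		var loc string
		if err := rows.Scan(&loc); err != nil {
			return nil, err
		}
		localities = append(localities, loc)
	}
	return localities, rows.Err()
}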

@ajwerner
Contributor

@jlinder FYI

@danhhz
Contributor Author

danhhz commented Jan 13, 2020

Thanks @ajwerner!

@danhhz
Contributor Author

danhhz commented Jan 13, 2020

Okay, I'm going to let this bake on the long-running clusters one more night. If nothing goes wrong, I'll bless the release tomorrow morning.

@danhhz
Contributor Author

danhhz commented Jan 14, 2020

@lnhsingh Binaries have been blessed, turning this over to docs and marketing

@danhhz
Contributor Author

danhhz commented Jan 14, 2020

We have a late-breaking release blocker, found just before marketing comms went out. The docs PR has been reverted. We'll restart this alpha when there is a fix: #43979

@otan
Contributor

otan commented Jan 23, 2020

Going to restart this one, cc @jseldess.

@jseldess
Contributor

Thanks for the heads-up, @otan. Can we also be sure to use a new date in the binary name? Also, can you share the new candidate SHA? I'll have to generate new notes since the last candidate.

@otan
Contributor

otan commented Jan 23, 2020

haven't picked yet because I'm waiting for a build to finish - will let you know when it's changed.

it should just be an additional cherry-pick of 3c0671c

Can we also be sure to use a new date in the binary name?

can you clarify what this means?

@danhhz
Contributor Author

danhhz commented Jan 23, 2020

I think he means that the alpha-<date> in the tag should be updated to the new target release date

@otan otan changed the title release: v20.1.0-alpha.20200113 release: v20.1.0-alpha.20200123 Jan 23, 2020
@tbg
Member

tbg commented Jan 27, 2020

The failure mode for headroom, mixed-headroom, and version/mixed all seems the same. It makes me more curious who signed off on headroom - is there something I'm missing here? If we can't run TPCC, I'd be hesitant to release.

@knz
Contributor

knz commented Jan 27, 2020

You can click the "edited by" drop-down on the checklist and see who marked what.

@knz
Contributor

knz commented Jan 27, 2020

@andreimatei you're the one who's missing a note about why you signed off.

@tbg
Member

tbg commented Jan 27, 2020

@knz I had no idea! Thanks, it's really comforting that this exists.

@miretskiy miretskiy self-assigned this Jan 27, 2020
@andreimatei
Contributor

I signed off on tpcc/headroom/n4cpu16 because I didn't see any particular reason not to. I looked for a while trying to understand what timed out, to no avail. It also looked to me like all the AWS tests did run, but perhaps I looked at it wrong.

@tbg
Member

tbg commented Jan 27, 2020

FWIW, I just ran mixed-headroom 5x on GCE and it passed.

The AWS build did time out (execution timeout), so we probably missed a fair share of tests there. I'm going to run the offending test (the last one to fail before the stall), tpcc/headroom/n4cpu16, 5x on AWS.

@thoszhang
Contributor

Signing off on the schema change tests. They look like failures due to context cancellation, which shouldn't block the release.

@tbg
Member

tbg commented Jan 27, 2020

@andreimatei and I figured out through arcane magic that the AWS build was really failed by tpccbench, but that we didn't manage to get any logs. I'm going to own fixing that infrastructure issue.

@tbg
Member

tbg commented Jan 28, 2020

Headroom passed 5x on AWS as well. Since the failure mode is so similar, I'm checking off the version-mixed ones too.

@tbg
Member

tbg commented Jan 28, 2020

I believe the cancel failure is owned by sql: #42103

It looks like the same failure we've been seeing for close to two months now. @asubiotto I'll take the liberty of checking the box.

@tbg
Member

tbg commented Jan 28, 2020

tpccbench (cpu=16) has a panic on n2. @andreimatei or @nvanbenschoten do you have context on this?

F200124 03:39:17.909332 8180460 kv/txn_interceptor_heartbeater.go:279  [n2,txn-hb=c715b681] txn committed or aborted but heartbeat loop hasn't been signaled to stop. txn: "sql txn" meta={id=c715b681 key=/Table/54/1/1228/2/0 pri=0.05066344 epo=1 ts=1579837154.462135149,2 min=1579837145.904027058,0 seq=49} rw=true stat=ABORTED rts=1579837146.238462972,2 wto=false max=1579837146.404027058,0
goroutine 8180460 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x7056901, 0xed5bc5de5, 0x0, 0xc018a3a000)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x7053740, 0xc000000004, 0x66fc8be, 0x21, 0x117, 0xc02ff43b80, 0x13f)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:214 +0xb92
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x492ef00, 0xc0393d97c0, 0x4, 0x2, 0x41d4346, 0x51, 0xc049227c98, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x492ef00, 0xc0393d97c0, 0x1, 0x4, 0x41d4346, 0x51, 0xc04dba7c98, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:155
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).heartbeat(0xc044f37830, 0x492ef00, 0xc0393d97c0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:279 +0x18b
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).heartbeatLoop(0xc044f37830, 0x492ef00, 0xc0393d97c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:249 +0x1d1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0001c0240, 0x492ef00, 0xc0393d97c0, 0xc038e68270, 0x29, 0x4995640, 0xc000231b60, 0xc03a15b930)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:322 +0x152
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:317 +0x131

I'm signing off on the other tpccbench failure since it's the same mystery "255 killed" thing.

@tbg
Member

tbg commented Jan 28, 2020

@ajwerner could you pass judgement on the follower reads test failure:

follower_reads.go:209,test_runner.go:719: fewer than 2 follower reads occurred: saw [0 0 0] before and [0 1 0] after

@tbg
Member

tbg commented Jan 28, 2020

Signing off on the gossip chaos test. Looks like an infra failure; we weren't even able to connect to the machines via SSH to fetch logs in the end.

@tbg
Member

tbg commented Jan 28, 2020

The chaos tpccbench failure is unclear to me. It looks like it's fine for a while, but then we start a new run and never get the workload running (presumably it never gets past the phase in which it's sensitive to the queries getting canceled), and we spend hours in a loop there before eventually hitting the 10h test timeout:

03:18:34 test.go:182: test status: 
03:18:34 tpcc.go:839: --- PASS: tpcc 970 resulted in 11017.9 tpmC (90.1% of max tpmC)

03:18:34 test.go:182: test status: stopping cluster
03:18:34 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:1-9
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting..
03:18:36 test.go:182: test status: starting cluster
03:18:36 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false --racks=3 teamcity-1708156-1579821026-27-n10cpu4:1-9
teamcity-1708156-1579821026-27-n10cpu4: starting......................
teamcity-1708156-1579821026-27-n10cpu4: initializing cluster settings
03:19:14 test.go:182: test status: running benchmark, warehouses=1050
03:19:14 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1708156-1579821026-27-n10cpu4:10 -- ./workload run tpcc --warehouses=2000 --active-warehouses=1050 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s --partitions=3 {pgurl:10} --histograms=perf/warehouses=1050/stats.json
3: 10339
8: 9926
5: 9361
4: 11011
2: 9030
7: 9499
1: 11356
6: 9499
9: 10026
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200124 03:20:20.768517 1 workload/cli/run.go:361  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
CHAOS: 03:20:44 chaos.go:90: killing :4
03:20:44 test.go:182: test status: stopping cluster
03:20:44 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:4
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting.
4: kill exited nonzero
4: dead
CHAOS: 03:20:51 chaos.go:119: restarting :4 after 5s of downtime
03:20:51 test.go:182: test status: starting cluster
03:20:51 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1708156-1579821026-27-n10cpu4:4
teamcity-1708156-1579821026-27-n10cpu4: starting.
4: 11429
Initializing 10500 workers and preparing statements...
I200124 03:21:24.783933 1 workload/cli/run.go:361  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
CHAOS: 03:22:21 chaos.go:90: killing :1
03:22:21 test.go:182: test status: stopping cluster
03:22:21 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting.1: kill exited nonzero
1: dead

CHAOS: 03:22:28 chaos.go:119: restarting :1 after 5s of downtime
03:22:28 test.go:182: test status: starting cluster
03:22:28 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: startingI200124 03:22:29.250432 1 workload/cli/run.go:361  retrying after error while creating load: EOF
..
teamcity-1708156-1579821026-27-n10cpu4: initializing cluster settings1: 11960

Initializing 2100 connections...
CHAOS: 03:23:58 chaos.go:90: killing :1
03:23:58 test.go:182: test status: stopping cluster
03:23:58 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: stopping and waiting.1: kill exited nonzero
1: dead

CHAOS: 03:24:05 chaos.go:119: restarting :1 after 5s of downtime
03:24:05 test.go:182: test status: starting cluster
03:24:05 cluster.go:335: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1708156-1579821026-27-n10cpu4:1
teamcity-1708156-1579821026-27-n10cpu4: starting.
teamcity-1708156-1579821026-27-n10cpu4: initializing cluster settings1: 12348

Initializing 10500 workers and preparing statements...
I200124 03:24:24.361240 1 workload/cli/run.go:361  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200124 03:25:28.126433 1 workload/cli/run.go:361  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
CHAOS: 03:25:35 chaos.go:90: killing :3
[repeats similarly for hours]

@nvanbenschoten have you seen this before/can you make sense of it?

@ajwerner
Contributor

@ajwerner could you pass judgement on the follower reads test failure:

follower_reads.go:209,test_runner.go:719: fewer than 2 follower reads occurred: saw [0 0 0] before and [0 1 0] after

This failed early on in the test, while the data was seemingly still being up-replicated. The test failed at 00:05:03, which is before the user table had been fully up-replicated:

I200124 00:05:06.775279 469 storage/replica_raftstorage.go:806  [n3,s3,r22/3:/Table/2{6-7}] applied LEARNER snapshot [total=17ms ingestion=4@15ms id=c3aa29b3 index=24]

I'll take an action item to wait for the data to be fully replicated before starting the test.
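The idea behind the eventual roachtest fix (#44444, referenced below) can be sketched roughly as follows, assuming a *gosql.DB for one node and a three-way replication target; this is an illustration of the approach, not the actual helper that landed:

package followerreads

import (
	"context"
	gosql "database/sql"
	"time"
)

// waitForFullReplication polls until every range reports at least three
// replicas, so the test only starts once up-replication has finished.
// Illustrative sketch of the fix's idea, not the real roachtest helper.
func waitForFullReplication(ctx context.Context, db *gosql.DB) error {
	for {
		var minReplicas int
		if err := db.QueryRowContext(ctx,
			"SELECT min(array_length(replicas, 1)) FROM crdb_internal.ranges",
		).Scan(&minReplicas); err != nil {
			return err
		}
		if minReplicas >= 3 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}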

@andreimatei
Contributor

The tpccbench (cpu=16) crash is one I was just looking at yesterday in another context.
I suspect it was broken by the following commit: 6997710

We probably shouldn't release with that crash. I'll see what to do.

craig bot pushed a commit that referenced this issue Jan 28, 2020
44393: issues: fix issue creation r=andreimatei a=tbg

We were specifying `C-test-failure` twice. Github actually doesn't care,
but let's fix it anyway.

Release note: None

44444: roachtest: wait for full replication in follower_reads test r=tbg a=ajwerner

Before this commit it was possible for a follower reads test to commence before
the data had been up-replicated to all of the nodes. This change ensures that
the data is fully replicated before beginning the test.

See #43754 (comment).  We saw that [here](https://teamcity.cockroachdb.com/viewLog.html?buildId=1708156&buildTypeId=Cockroach_Nightlies_WorkloadNightly&tab=artifacts#%2Ffollower-reads).

Release note: None.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@apantel
Contributor

apantel commented Jan 28, 2020

sqlalchemy, typeorm, pgx are broken tests.

hibernate is failing due to #43969 (#44031 is the fix which did not make the release).

@andreimatei
Contributor

I have #44451, which I hope fixes tpccbench (cpu=16), but I'm not sure it's the right thing to do yet.

@nvanbenschoten
Member

the chaos tpccbench failure is unclear to me. It looks like it's fine for a while, but then we start a new run and never get the workload running (presumably it's never getting past the phase in which it's sensitive to the queries getting canceled), and we spend hours in a loop there but eventually hit the 10h test timeout:

Yes, I've seen this failure mode in the past occasionally. The workload gets stuck trying to prepare statements on its 2100 connections and doesn't seem to be able to complete this before it gets killed again (~1m later), so it loops indefinitely. Signing off on this failure while we look more into the issue because it's not an issue with the alpha.

@andreimatei
Contributor

#44460 is now in bors. Once that merges, I'll check off tpccbench (cpu=16) because the cause will hopefully be fixed.
But it'd be good to cherry-pick this PR into the release...

@otan otan removed their assignment Jan 29, 2020
@miretskiy
Contributor

@andreimatei If you feel that this PR is critical to be included in this alpha, please follow instructions: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/160235536/Changing+a+Scheduled+Release

@nathanstilwell
Contributor
