release: v20.1.0-alpha.20200123 #43754
Comments
Picking the SHA is blocked on #43616
#43616 is now fixed. Please go ahead with picking the SHA.
Thanks! Picked a SHA: 6bfe4a1
Potentially blocked on #43827. A bunch of the release teamcity builds are failing because of this. I've tried to work around it, but we may have to cherry-pick a fix.
Peter fixed the makefile issue. I'll cherry-pick a new SHA and get the roachtests running tomorrow. I started up the roachprod tests today, so assuming nothing else goes wrong, we should have enough data to keep the original release date.
I've been waiting to triage the roachtest failures until I can get someone to make a call on whether "SQL Logic Test High VModule Nightly" failing to build is an alpha release blocker or not. At this point, instead of restarting everything, I think I'll try pinging that one build with the candidate SHA plus a makefile change that fixes the build, and then triage the roachtest failures we got. This is unlikely to finish on a Friday, so my guess is this alpha is slipping until Tuesday or Wednesday.
Obsoleted by new SHA:
- CDC
- KV
- SQL Exec
Something definitely regressed in changefeeds recently. I need to look into exactly what, but I think these failures are okay for an alpha. The initial scan one was making progress, though slowly (the most behind span is logged and it kept changing over the course of the test). The other two got further behind than they should, but eventually caught up.
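To make the "making progress, though slowly" observation concrete, here is a minimal, self-contained sketch, not CockroachDB's actual changefeed code, of tracking a per-span resolved timestamp and reporting the laggiest span. The `Span` type and timestamps are illustrative; the idea is that if the identity of the most-behind span keeps changing, the scan is advancing, just slowly.

```go
package main

import "fmt"

// Span is an illustrative key range; CockroachDB's real spans use roachpb keys.
type Span struct{ Key, EndKey string }

type frontier struct {
	resolved map[Span]int64 // span -> resolved (wall) timestamp
}

// Forward records progress on a span, never moving its timestamp backward.
func (f *frontier) Forward(s Span, ts int64) {
	if cur, ok := f.resolved[s]; !ok || ts > cur {
		f.resolved[s] = ts
	}
}

// laggiest returns the span with the minimum resolved timestamp; this is
// the "most behind span" one would log during an initial scan.
func (f *frontier) laggiest() (Span, int64) {
	var worst Span
	min := int64(-1)
	for s, ts := range f.resolved {
		if min == -1 || ts < min {
			worst, min = s, ts
		}
	}
	return worst, min
}

func main() {
	f := &frontier{resolved: map[Span]int64{}}
	f.Forward(Span{"a", "b"}, 100)
	f.Forward(Span{"b", "c"}, 90)
	f.Forward(Span{"b", "c"}, 120) // progress: the laggiest span changes
	s, ts := f.laggiest()
	fmt.Printf("most behind: %v@%d\n", s, ts)
}
```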
Checked off all sqlsmith failures, since those are never blockers (as Matt always says).
Checked off hotspotsplits/nodes=4. This error has been happening periodically for a while now (#33660). Should not be a release blocker.
The scaledata failures are interesting. They're hitting issues because the creation of the initial empty schema is taking a very long time (> 1 minute). This is allowing chaos to start before we expect it to. Here's some logging from the first failure, filtered down to just the schema changes triggered here:
One interesting thing we see here is multi-second pauses when waiting for leases to expire. @ajwerner said he'll take a look because this could be related to dropping the closed timestamp down to 3 seconds.
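For context on those "waiting for leases to expire" pauses: a schema change publishes a new descriptor version and then must wait until no node holds a lease on the older version before proceeding. The sketch below is a hypothetical illustration of that wait loop, not the actual leasing code; `countLeases` and the backoff constants are stand-ins.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const retryStart = 100 * time.Millisecond

// countLeases would consult the lease table for leases on the given
// descriptor version; here it is a stub that pretends they are all gone.
func countLeases(ctx context.Context, descID, version int64) (int, error) {
	return 0, nil
}

// waitForOneVersion polls until no node leases the old descriptor version.
// With lease durations of a few seconds, each stale lease can stall the
// schema change for multi-second stretches, matching the pauses in the logs.
func waitForOneVersion(ctx context.Context, descID, oldVersion int64) error {
	for r := retryStart; ; r = r * 2 { // simple exponential backoff
		n, err := countLeases(ctx, descID, oldVersion)
		if err != nil {
			return err
		}
		if n == 0 {
			return nil // old version fully drained; safe to proceed
		}
		fmt.Printf("waiting for %d leases on version %d to expire\n", n, oldVersion)
		select {
		case <-time.After(r):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	if err := waitForOneVersion(context.Background(), 52, 1); err != nil {
		fmt.Println("error:", err)
	}
}
```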
Okay, sounds good! We'll hold up the alpha until those get investigated, thanks! cc @lnhsingh (Btw, also want to make sure you noticed that the name of the alpha was wrong on the release calendar; it had 2019 in the date part of the version tag. I've fixed all occurrences of this error on the release calendar.)
I'm trying to reproduce anything even remotely close to this sort of latency, but I'm having a hard time. I dug into some more details. One thing to note is that it took almost 20s at the beginning of the test to up-replicate all of the ranges (I see this from the rangelog). Another thing worth noting is that all of the jobs took 5s to start. Also worth noting is the initial index creation. I've modified the test to fail if the schema portion of the test takes more than 30 seconds (the diff is in a comment below), and I'm running it at scale with roachtest; so far, after 400 iterations, it hasn't failed. The question is why these schema changes are taking so long. I don't have a great answer, but there is some evidence of waiting for leases to expire. Additionally, there's a stats job that seems to have run in the middle here. I'm trying to understand what could be going on. I don't have any evidence that this is in any way related to closed timestamps.
After more digging, it seems clear that somehow we're seeing an outstanding lease on every attempt to publish a new version. The other confounding factor is slow up-replication and some lease transfers due to it. All of the other node metrics seem fine. The storage latency numbers are good. The CPU is near zero. The question is whether it's possible for a slow initial lease acquisition to somehow always leave a node with a lease one version behind or something like that.
The question at hand is: how is this lease getting leaked? I've modified the test like so and observed 1000 successful runs. Perhaps the vmodule logging has changed the behavior, so I'm going to remove that and try again.

--- a/pkg/cmd/roachtest/scaledata.go
+++ b/pkg/cmd/roachtest/scaledata.go
@@ -38,11 +38,11 @@ func registerScaleData(r *testRegistry) {
 	for app, flags := range apps {
 		app, flags := app, flags // copy loop iterator vars
-		const duration = 10 * time.Minute
+		const duration = 30 * time.Second
 		for _, n := range []int{3, 6} {
 			r.Add(testSpec{
 				Name:    fmt.Sprintf("scaledata/%s/nodes=%d", app, n),
-				Timeout: 2 * duration,
+				Timeout: 20 * duration,
 				Cluster: makeClusterSpec(n + 1),
 				Run: func(ctx context.Context, t *test, c *cluster) {
 					runSqlapp(ctx, t, c, app, flags, duration)
@@ -73,7 +73,7 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
 	c.Put(ctx, b, app, appNode)
 	c.Put(ctx, cockroach, "./cockroach", roachNodes)
-	c.Start(ctx, t, roachNodes)
+	c.Start(ctx, t, roachNodes, startArgs("--args='--vmodule=schema*=2,lease=2,jobs=2,registry=2'"))

 	// TODO(nvanbenschoten): We are currently running these consistency checks with
 	// basic chaos. We should also run them in more chaotic environments which
@@ -106,17 +106,21 @@ func runSqlapp(ctx context.Context, t *test, c *cluster, app, flags string, dur
 		defer sqlappL.close()

 		t.Status("installing schema")
+		start := time.Now()
 		err = c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s --install_schema "+
 			"--cockroach_ip_addresses_csv='%s' %s", app, addrStr, flags))
 		if err != nil {
 			return err
 		}
+		if took := time.Since(start); took > 20*time.Second {
+			return fmt.Errorf("it took %v to set up the schema", took)
+		}

 		t.Status("running consistency checker")
 		const workers = 16
 		return c.RunL(ctx, sqlappL, appNode, fmt.Sprintf("./%s --duration_secs=%d "+
 			"--num_workers=%d --cockroach_ip_addresses_csv='%s' %s",
 			app, int(dur.Seconds()), workers, addrStr, flags))
 	})
 	m.Wait()
}
Okay I've discovered something interesting w.r.t. the scaledata failure. The nodes are geo-distributed:
@nvanbenschoten confirms that this is the case for the other roachprod failure. Checking off the failures. It seems that this is all fallout from #43898 (comment)
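For the record, here is a quick way to confirm from SQL that a cluster's nodes really are geo-distributed. This is a sketch under assumptions: that the `crdb_internal.gossip_nodes` virtual table exposes `node_id`, `address`, and `locality` columns (treat the exact schema as an assumption for this era of CockroachDB), and that an insecure node is reachable locally.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; works against CockroachDB
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("SELECT node_id, address, locality FROM crdb_internal.gossip_nodes")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int
		var addr, locality string
		if err := rows.Scan(&id, &addr, &locality); err != nil {
			log.Fatal(err)
		}
		// Differing region/zone tiers across nodes => geo-distributed cluster.
		fmt.Printf("n%d %s locality=%s\n", id, addr, locality)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```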
@jlinder FYI
Thanks @ajwerner!
Okay, I'm going to let this bake on the long-running clusters one more night. If nothing goes wrong, I'll bless the release tomorrow morning.
@lnhsingh Binaries have been blessed; turning this over to docs and marketing.
We have a late-breaking release blocker, found just before marketing comms had gone out. The docs PR has been reverted. We'll restart this alpha when there is a fix: #43979
Going to restart this one, cc @jseldess.
Thanks for the heads-up, @otan. Can we also be sure to use a new date in the binary name? Also, can you share the new candidate SHA? I'll have to generate new notes since the last candidate.
Haven't picked yet because I'm waiting for a build to finish; will let you know when it's changed. It should just be an additional cherry-pick of 3c0671c
Can you clarify what this means?
I think he means that the
The failure mode for
You can click the "edited by" drop-down on the checklist and see who marked what.
@andreimatei you're the one who's missing a note about why you signed off.
@knz I had no idea! Thanks, it's really comforting that this exists.
I signed off on
FWIW, I just ran mixed-headroom 5x on GCE and it passed. The AWS build did time out (execution timeout), so we probably missed a fair share of tests there. I'm going to run the offending test (the last one to fail before the stall), tpcc/headroom/n4cpu16, 5x on AWS.
Signing off on the schema change tests. They look like failures due to context cancellation, which shouldn't block the release.
@andreimatei and I figured out through arcane magic that the AWS build was really failed by tpccbench, but that we didn't manage to get any logs. I'm going to own fixing that infrastructure issue.
Headroom passed 5x on AWS as well. Since the failure mode is so similar, I'm checking off the version-mixed ones too.
I believe the cancel failure is owned by sql: #42103. It looks like the same failure we've been seeing for close to two months now. @asubiotto I'll take the liberty of checking the box.
tpccbench (cpu=16) has a panic on n2. @andreimatei or @nvanbenschoten do you have context on this?
I'm signing off on the other tpccbench failure since it's the same mystery 255-killed thing.
@ajwerner could you pass judgement on the follower reads test failure:
Signing off on the gossip chaos test. Looks like an infra failure; we weren't even able to connect to the machines via SSH to fetch logs in the end.
The chaos tpccbench failure is unclear to me. It looks like it's fine for a while, but then we start a new run and never get the workload running (presumably it's never getting past the phase in which it's sensitive to the queries getting canceled), and we spend hours in a loop there but eventually hit the 10h test timeout:
@nvanbenschoten have you seen this before/can you make sense of it?
This failed early on in the test while it seems that the data was still being up-replicated. The test failed at
I'll take an action item to wait for the data to be fully replicated before starting the test.
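A minimal sketch of what such a wait can look like (this is the shape of the fix that later landed as #44444): poll until every range reports at least three replicas before kicking off the test. The exact `crdb_internal.ranges` column layout is an assumption based on the CockroachDB version in question.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

// waitForFullReplication polls until every range has at least three replicas.
func waitForFullReplication(db *sql.DB) error {
	for {
		var minReplicas int
		row := db.QueryRow(
			"SELECT min(array_length(replicas, 1)) FROM crdb_internal.ranges")
		if err := row.Scan(&minReplicas); err != nil {
			return err
		}
		if minReplicas >= 3 {
			return nil // every range is fully replicated; safe to start the test
		}
		fmt.Printf("waiting for 3x replication (min=%d)\n", minReplicas)
		time.Sleep(time.Second)
	}
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := waitForFullReplication(db); err != nil {
		log.Fatal(err)
	}
}
```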
We probably shouldn't release with that crash. I'll see what to do.
44393: issues: fix issue creation r=andreimatei a=tbg

We were specifying `C-test-failure` twice. Github actually doesn't care, but let's fix it anyway.

Release note: None

44444: roachtest: wait for full replication in follower_reads test r=tbg a=ajwerner

Before this commit it was possible for a follower reads test to commence before the data had been up-replicated to all of the nodes. This change ensures that the data is fully replicated before beginning the test.

See #43754 (comment). We saw that [here](https://teamcity.cockroachdb.com/viewLog.html?buildId=1708156&buildTypeId=Cockroach_Nightlies_WorkloadNightly&tab=artifacts#%2Ffollower-reads).

Release note: None.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
I have #44451 which I hope fixed
Yes, I've seen this failure mode in the past occasionally. The workload gets stuck trying to prepare statements on its 2100 connections and doesn't seem to be able to complete this before it gets killed again (~1m later), so it loops indefinitely. Signing off on this failure while we look more into the issue, because it's not an issue with the alpha.
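To illustrate that failure mode: with ~2100 connections, just re-preparing statements can eat the whole window between chaos kills, so the workload never reaches steady state. Below is a hedged sketch, with an illustrative DSN, statement, and connection count, of bounding that prepare phase with a context deadline so the run fails loudly instead of looping forever.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	const numConns = 2100 // per the workload in the failing run
	ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
	defer cancel()

	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(numConns)

	conns := make([]*sql.Conn, 0, numConns)
	defer func() {
		for _, c := range conns {
			c.Close()
		}
	}()
	for i := 0; i < numConns; i++ {
		conn, err := db.Conn(ctx) // pin a dedicated connection from the pool
		if err != nil {
			log.Fatalf("conn %d: %v (deadline exceeded?)", i, err)
		}
		conns = append(conns, conn)
		// Prepare on this specific connection; repeated across thousands of
		// connections, this is the phase that can outlast a chaos cycle.
		stmt, err := conn.PrepareContext(ctx, "SELECT w_tax FROM warehouse WHERE w_id = $1")
		if err != nil {
			log.Fatalf("prepare on conn %d: %v", i, err)
		}
		stmt.Close()
	}
	fmt.Println("all statements prepared before the deadline")
}
```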
#44460 is now in bors. Once that merges, I'll check off
@andreimatei If you feel that this PR is critical to be included in this alpha, please follow the instructions here: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/160235536/Changing+a+Scheduled+Release
Current signoff comment
Candidate SHA: d818f1ff64aaa725a8aea2d45c7df9dd4a42d4d3
Deployment status: otan-release-v2010-alpha20191216-e9e2a803 / otan-release-v2010-alpha20200123-d818f1ff
Nightly Suite: https://teamcity.cockroachdb.com/viewQueued.html?itemId=1708159&tab=queuedBuildOverviewTab
Release Qualification: https://teamcity.cockroachdb.com/viewLog.html?buildId=1708120&tab=buildResultsDiv&buildTypeId=Cockroach_ReleaseQualification
Deployment Dashboards:
Release process checklist

Prep date: 23rd January, 2020
- Fill in "Candidate SHA" above, notify #release-announce of the SHA.
- Fill in "Deployment status" above with the clusters and dashboards.
- Fill in "Nightly Suite" with the link to the Nightly TeamCity job.

One day after prep date:
- Get signoff on roachtest failures.
- Keep an eye on clusters until the release date. Do not proceed below until the release date.

Release date: 30th January, 2020
- Check cluster status.
- Tag release.
- Bless provisional binaries.
- For production or stable releases in the latest major release series:
  - Update docs.
  - External communications for release.
- Clean up provisional tag from repository.
old info - obsoleted by new SHA -- formerly release: v20.1.0-alpha.20200113
Candidate SHA: 617cb39.
Deployment status: dan-release-v2010-alpha20191216-6bfe4a1/dan-release-v2010-alpha20190113-e9e2a80.
Nightly Suite: ReleaseQual NightlySuite.