Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpchvec/perf failed #46664

Closed
cockroach-teamcity opened this issue Mar 27, 2020 · 11 comments · Fixed by #46700
Closed

roachtest: tpchvec/perf failed #46664

cockroach-teamcity opened this issue Mar 27, 2020 · 11 comments · Fixed by #46700
Assignees
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).tpchvec/perf failed on master@c0855e9fcf979cca1094d7edeeacb9882ddc9f96:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200327-1832621/tpchvec/perf/run_1
	tpchvec.go:579,tpchvec.go:660,test_runner.go:753: error with attached stack trace:
		    main.execCmdWithBuffer
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:426
		    main.(*cluster).RunWithBuffer
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2031
		    main.registerTPCHVec.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpchvec.go:574
		    main.registerTPCHVec.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpchvec.go:660
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s
		    -- arg 1: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1832621-1585293773-20-n3cpu4:1 -- ./workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=20 --vectorize=off {pgurl:1-3}:
		  - exit status 30

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 27, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Mar 27, 2020
@asubiotto asubiotto added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Mar 27, 2020
@asubiotto
Copy link
Contributor

Error: [q20]: pq: root: memory budget exceeded: 10240 bytes requested, 3940650240 currently allocated, 3940651008 bytes in budget

@asubiotto
Copy link
Contributor

cc @yuzefovich do you think we might've introduced a memory accounting leak in a recent change?

@yuzefovich
Copy link
Member

09:05:22 cluster.go:421: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1832621-1585293773-20-n3cpu4:1 -- ./workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=20 --vectorize=off {pgurl:1-3}
09:05:30 tpchvec.go:575: 
Error: [q20]: pq: root: memory budget exceeded: 10240 bytes requested, 3940650240 currently allocated, 3940651008 bytes in budget

so it happened with vectorize=off, and, in fact, we fixed the memory accounting problem with rowexec. I'm not sure, though, what should we be doing now with query 20.

@asubiotto
Copy link
Contributor

Interesting, I missed that line. It's possible that this is related to the fact that we're now properly accounting for some more hash aggregator memory? I'm seeing two surprising things:

  1. In the memory profiles, it seems like only colexec is showing up, which doesn't seem to match up with the vecorize=off unless I'm looking at the wrong memory profiles.
  2. In the error message it's saying that we have around 4GB but the memory profile with most memory in use is node 2 with around 500MB of in use memory:
    image. If the memory profiles are correct it seems like we have some memory accounting leak. This seems very hard to have in the vectorized engine, though.

@yuzefovich
Copy link
Member

Yes, I believe the fix to memory accounting in hash aggregator is the reason we hit a memory limit here, and unless we messed up the fixes that went in (I'm a little concerned about the first one, about the "internal" memory of map), memory error is the expected behavior now and should have been always.

Are you looking at memprof.000000000887919168_2020-03-27T09_05_10.864.pprof? Note that this was taken at 09:05:10 which is when query 20 was executed with vectorize=on for the second (out of three) time:

I200327 09:04:57.019337 55 workload/tpch/tpch.go:465  [q20] returned 186 rows after 12.51 seconds
I200327 09:05:09.616685 55 workload/tpch/tpch.go:465  [q20] returned 186 rows after 12.60 seconds
I200327 09:05:22.168569 55 workload/tpch/tpch.go:465  [q20] returned 186 rows after 12.55 seconds

vectorize=off config started executing at 09:05:22 and hit an error at 09:05:30, and I don't think we have memory profiles for that period.

@yuzefovich
Copy link
Member

I guess I don't understand how rowexec might be using 4GB while colexec seems to be using 10 times less, this deserves an investigation.

@yuzefovich
Copy link
Member

Ok, I see what's happening. Here is the corresponding part of the plan when vectorize=on:
Screen Shot 2020-03-27 at 2 17 48 PM
The important information to get from here is that hash aggregator outputs 540k tuples with four columns, three of which are populated using anyNotNull aggregate. We've added memory accounting for that kind of aggregate in the same way as all other aggregates are accounted for - by instantiating a memory account for each struct. Now, this results in "reserving" mon.DefaultPoolAllocation = 10240 bytes although we will only be using a single datum (likely of 8 bytes size), so the math comes out to:
540000 * 3 * 10240 bytes > 15GiB of budget we would need to have, but we only have 4GiB, so we will error out.

I left reusing the same account as a TODO because I thought it would be too invasive of a change at this point (plus would be harder to backport), but it seems like now it could be considered a release blocker since we're likely to error out on OLAP-y queries with rowexec aggregation when, in fact, we do have enough RAM to succeed. cc @jordanlewis for your opinion on this.

@yuzefovich
Copy link
Member

cc @rohany it is possible that alterpk-tpcc will error out with a memory limit error until this issue is resolved.

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on master@0e9f07ae35ad39e8789bc51c49066c123968a666:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200328-1834610/tpchvec/perf/run_1
	tpchvec.go:579,tpchvec.go:660,test_runner.go:753: error with attached stack trace:
		    main.execCmdWithBuffer
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:426
		    main.(*cluster).RunWithBuffer
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2031
		    main.registerTPCHVec.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpchvec.go:574
		    main.registerTPCHVec.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpchvec.go:660
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s
		    -- arg 1: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1834610-1585381212-21-n3cpu4:1 -- ./workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=20 --vectorize=off {pgurl:1-3}:
		  - exit status 30

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on master@92495ffb2182671b3338a8c6cd2734cb6de5d392:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200329-1836230/tpchvec/perf/run_1
	tpchvec.go:579,tpchvec.go:660,test_runner.go:753: error with attached stack trace:
		    main.execCmdWithBuffer
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:426
		    main.(*cluster).RunWithBuffer
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2031
		    main.registerTPCHVec.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpchvec.go:574
		    main.registerTPCHVec.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpchvec.go:660
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s
		    -- arg 1: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1836230-1585466755-20-n3cpu4:1 -- ./workload run tpch --concurrency=1 --db=tpch --max-ops=3 --queries=20 --vectorize=off {pgurl:1-3}:
		  - exit status 30

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@asubiotto
Copy link
Contributor

Good catch. 540k groups is crazy...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants