Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tenantcostclient: figure out how to "unskip" TestEstimateQueryRUConsumption #106350

Closed
yuzefovich opened this issue Jul 6, 2023 · 6 comments · Fixed by #106356 or #106730
Closed

tenantcostclient: figure out how to "unskip" TestEstimateQueryRUConsumption #106350

yuzefovich opened this issue Jul 6, 2023 · 6 comments · Fixed by #106356 or #106730
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). db-cy-23 T-sql-queries SQL Queries Team
Projects

Comments

@yuzefovich
Copy link
Member

yuzefovich commented Jul 6, 2023

Currently, TestEstimateQueryRUConsumption is unconditionally skipped which is no longer allowed under "test vision zero". The test has been skipped since its introduction in #89256 with the following comment:

	// This test becomes flaky when the machine/cluster is under significant
	// background load, so it should only be run manually.

We should figure out how to enable this test at least in some cases or, perhaps, use another skip mechanism that would make this test no longer show up as "skipped" if we truly cannot enable it without it becoming flaky.

cc @DrewKimball in case you have some thoughts

Jira issue: CRDB-29525

@yuzefovich yuzefovich added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test labels Jul 6, 2023
@yuzefovich yuzefovich added this to Triage in SQL Queries via automation Jul 6, 2023
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jul 6, 2023
@DrewKimball
Copy link
Collaborator

I never actually tried enabling it, just didn't want to add a flaky test. We could try enabling with stress and race disabled, but I worry it'd flake eventually. I guess we could just increase the allowed bounds until it no longer flakes (if it flakes).

@yuzefovich yuzefovich self-assigned this Jul 7, 2023
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Jul 7, 2023
@craig craig bot closed this as completed in 1137aee Jul 7, 2023
SQL Queries automation moved this from Active to Done Jul 7, 2023
@aadityasondhi aadityasondhi reopened this Jul 12, 2023
SQL Queries automation moved this from Done to Triage Jul 12, 2023
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Jul 12, 2023
@yuzefovich yuzefovich added C-test-failure Broken test (automatically or manually discovered). and removed skipped-test labels Jul 12, 2023
@yuzefovich
Copy link
Member Author

Hm, looks like the timeout occurred when the test was executing one of the test cases:

github.com/cockroachdb/cockroach/pkg/testutils/sqlutils.RowsToStrMatrix(0xc00d645d70?)
	github.com/cockroachdb/cockroach/pkg/testutils/sqlutils/sql_runner.go:255 +0x107
github.com/cockroachdb/cockroach/pkg/testutils/sqlutils.(*SQLRunner).QueryStr(0x788d850?, {0x7871a80?, 0xc002060ea0}, {0x63c5df7, 0x57}, {0x0, 0x0, 0x0})
	github.com/cockroachdb/cockroach/pkg/testutils/sqlutils/sql_runner.go:234 +0x11c
github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient_test.TestEstimateQueryRUConsumption(0xc002060ea0)
	github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient_test/pkg/ccl/multitenantccl/tenantcostclient/query_ru_estimate_test.go:179 +0x99b

and in the goroutine dump we see that the query is blocked waiting on network:

goroutine 5650 [select]:
google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00bd4e360)
	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:328 +0x7c
google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:343
google.golang.org/grpc.(*csAttempt).recvMsg(0xc00733be10, {0x5fc5680?, 0xc000c0bf80}, 0xc00927fe60?)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:1046 +0xc5
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0x0?)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:900 +0x25
google.golang.org/grpc.(*clientStream).withRetry(0xc00bd4e120, 0xc0066990b8, 0xc006699088)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:751 +0x144
google.golang.org/grpc.(*clientStream).RecvMsg(0xc00bd4e120, {0x5fc5680?, 0xc000c0bf80?})
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:899 +0x12e
google.golang.org/grpc.invoke({0x78b79c0?, 0xc00b7eed50?}, {0x625822f?, 0x2?}, {0x61285c0, 0xc00927fe60}, {0x5fc5680, 0xc000c0bf80}, 0x0?, {0xc004be0b80, ...})
	google.golang.org/grpc/external/org_golang_google_grpc/call.go:73 +0xd7
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ClientInterceptor.func2({0x78b79c0, 0xc00b7eed50}, {0x625822f, 0x21}, {0x61285c0, 0xc00927fe60}, {0x5fc5680, 0xc000c0bf80}, 0x7e235b249769edad?, 0x65b8aa0, ...)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:248 +0x424
google.golang.org/grpc.(*ClientConn).Invoke(0x496219688e106cad?, {0x78b79c0?, 0xc00b7eed50?}, {0x625822f?, 0x0?}, {0x61285c0?, 0xc00927fe60?}, {0x5fc5680?, 0xc000c0bf80?}, {0x0, ...})
	google.golang.org/grpc/external/org_golang_google_grpc/call.go:35 +0x223
github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*internalClient).Batch(0xc0084fa650, {0x78b79c0, 0xc00b7eed50}, 0x0?, {0x0, 0x0, 0x0})
...
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc006e0a700?, {0x78b79c0?, 0xc001201d70?}, {0xc002868009?}, 0x4?, 0xc001201c20?)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:953 +0xe6
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1()
	github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:260 +0x446
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
	github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:171 +0x22a

I don't quite understand why this would occur, perhaps on the overloaded CI machine the test takes long time? I might just increase the timeout since I don't see any signs of deadlock or something like that.

@yuzefovich
Copy link
Member Author

There are some interesting logs related to the tenant though:

I230712 21:52:52.654612 5650 5@util/log/event_log.go:32 ⋮ [T10,nsql1,client=127.0.0.1:53778,hostssl,user=root] 256 ={"Timestamp":1689198772635843137,"EventType":"create_table","Statement":"CREATE TABLE ‹defaultdb›.public.‹abcd› (‹a› INT8, ‹b› INT8, ‹c› INT8, ‹d› INT8, INDEX (‹a›, ‹b›, ‹c›))","Tag":"CREATE TABLE","User":"root","DescriptorID":104,"TableName":"‹defaultdb.public.abcd›"}
I230712 21:52:52.822313 582 kv/kvserver/queue.go:616 ⋮ [T1,n1,s1,r52/1:‹/Table/5{1-2}›,raft] 257  rate limited in MaybeAdd (merge): throttled on async limiting semaphore
W230712 21:52:53.612281 5650 kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:637 ⋮ [T10,nsql1,client=127.0.0.1:53778,hostssl,user=root] 258  a transaction has hit the intent tracking limit (kv.transaction.max_intents_bytes); is it a bulk operation? Intent cleanup will be slower. txn: "sql txn" meta={id=e7519a1f key=/Tenant/10/Table/104/1/‹881879851610341377›/‹0› iso=Serializable pri=0.04449817 epo=0 ts=1689198772.655347726,0 min=1689198772.655347726,0 seq=46520} lock=true stat=PENDING rts=1689198772.655347726,0 wto=false gul=1689198773.155347726,0 ba: ‹5815 CPut, 5815 InitPut›
...
W230712 21:55:18.877732 5650 kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:637 ⋮ [T10,nsql1,client=127.0.0.1:53778,hostssl,user=root] 293  a transaction has hit the intent tracking limit (kv.transaction.max_intents_bytes); is it a bulk operation? Intent cleanup will be slower. txn: "sql txn" meta={id=1b1d7155 key=/Tenant/10/Table/104/1/‹881880325101846529›/‹0› iso=Serializable pri=0.01728139 epo=0 ts=1689198917.181285140,0 min=1689198917.181285140,0 seq=46520} lock=true stat=PENDING rts=1689198917.181285140,0 wto=false gul=1689198917.681285140,0 ba: ‹5815 CPut, 5815 InitPut›

Let's bump the cluster setting while also reducing the number of rows we're inserting in this test.

@yuzefovich
Copy link
Member Author

Nevermind, it's the randomization of default-batch-bytes-limit that makes the queries much slower. The test needs to force some production values.

@msbutler
Copy link
Collaborator

This test just flaked on me on master.

@craig craig bot closed this as completed in 907479d Jul 13, 2023
SQL Queries automation moved this from Active to Done Jul 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). db-cy-23 T-sql-queries SQL Queries Team
Projects
Archived in project
4 participants