opt: make the optimizer "distribution aware" #47226

rytaft · 2020-04-08T20:56:28Z

CockroachDB is a geo-distributed database, but the optimizer currently operates with the assumption that all data is located on the same physical machine. There are a couple of exceptions to this rule, such as the deprecated feature called "duplicated indexes". To use this feature, a user can create two (or more) indexes that have exactly the same columns but are pinned to different localities. For example, a user might choose to pin one index to a data center in the East coast of the US, and another identical index to a data center in the West coast. In this case, with all other things being equal, the optimizer will choose whichever index is closest to the gateway node of the SQL query.

There are lots of other cases, however, where the optimizer does not take data locality into account when it should. For example, the optimizer does not consider locality when deciding between two non-identical indexes (i.e., indexes with different columns). This can lead to suboptimal plans in many cases. The purpose of this issue is to track the work needed to make the optimizer aware of data distribution when planning all queries, not just those that use the duplicated indexes feature.

A few of the features we plan to add are:

Support a "distribution" physical property in the optimizer, which describes the localities (e.g., regions) that are touched by a given relational operator.
Support a "distribute" operator in the optimizer, which enforces a particular distribution by routing data from one set of localities to another set. This is similar to how the sort operator enforces a particular ordering. In the execution engine, "distribute" can either represent DistSQL routers or the lookups performed by the DistSender. If data is spread across localities, a "distribute" operator should be used to enforce that all results are ultimately returned to the gateway node.
Update the optimizer cost model to give a realistic cost to the "distribute" operator. This should take into account the different localities in the input and output distributions and the latency between them.
Since lookup and index joins rely on the DistSender to fetch data from remote nodes, update the cost model for lookup joins and index joins to include the distribution cost. This will require identifying which localities will be visited by the DistSender when fetching data and using a cost model similar to the "distribute" operator.
Since distributed hash and merge joins require data shuffling, update the cost model for those operators to include the distribution cost. To prevent the optimizer from always choosing a non-distributed join, account for the benefits of parallel computation in the cost of distributed joins.
Similarly, update the cost model for distributed hash group by and distinct operators to include the distribution cost. To prevent the optimizer from always choosing a non-distributed operation, account for the benefits of parallel computation.
Add transformation rules in the optimizer to support different partitioning schemes for distributed joins. For example, the optimizer should explore broadcast joins in addition to hash-partitioned joins (see sql, opt: support broadcast join #84731). The optimizer will be able to choose the best partitioning scheme by using the distributed cost model supported by the above steps.
Add more region-aware transformations in the optimizer that can leverage the distributed cost model. We already have transformations to support locality optimized search, but there are others we could add. For example, if a distributed join is spread across multiple regions and the partitioning column (e.g., crdb_region) is one of the join keys, it may be possible to hash-partition data within a region and perform the join phase before transferring any data across regions. In other words, the optimizer could transform a single join across regions into a union of joins, where each child of the union is a join within a single region. Another possible transformation where we turn a scan of a REGIONAL BY ROW table into a pk-fk join with a GLOBAL table is described in opt: infer lookup join equality conditions from unique key and foreign key #69617 (comment).

This is not an exhaustive list, but gives a sense of the scope of work.

gz#9256

Jira issue: CRDB-5025

The text was updated successfully, but these errors were encountered:

This commit adds a new physical property called "Distribution" to expressions in the optimizer. Currently, Distribution is just the list of regions that are touched by the expression. This commit also adds a new enforcer called "Distribute", which enforces a particular Distribution on its input. This is similar to the way Sort enforces a particular Ordering. Currently, Distribute is only used to enforce that the results of a query end up in the same region as the gateway node. The Distribution property and Distribute enforcer will enable us to perform better costing of query plans in future commits. This commit adds a tiny cost for the Distribute enforcer, but otherwise does not yet take advantage of distribution for costing. In the future, we can use Distribution to better cost distributed joins and aggregations, and accurately cost the Distribute enforcer so as to avoid scanning data in remote regions if possible. This commit represents a departure from the approach taken in the earlier prototype in cockroachdb#43831. Instead of using a "neighborhood" abstraction to represent arbitrary collections of nodes, this commit sticks to the existing abstractions of region and locality. If we need a new abstration in the future, we should be able to simply modify the representation of Distribution. Additionally, this commit does not make an attempt to represent exactly how data is partitioned across regions; it only tracks *which* regions contain data. This greatly simplifies the implementation, and I don't think it will significantly reduce the accuracy of costing. Informs cockroachdb#47226 Release note: None

This commit adds a new physical property called "Distribution" to expressions in the optimizer. Currently, Distribution is just the list of regions that are touched by the expression. This commit also adds a new enforcer called "Distribute", which enforces a particular Distribution on its input. This is similar to the way Sort enforces a particular Ordering. Currently, Distribute is only used to enforce that the results of a query end up in the same region as the gateway node. The Distribution property and Distribute enforcer will enable us to perform better costing of query plans in future commits. This commit adds a tiny cost for the Distribute enforcer, but otherwise does not yet take advantage of distribution for costing. In the future, we can use Distribution to better cost distributed joins and aggregations, and accurately cost the Distribute enforcer so as to avoid scanning data in remote regions if possible. This commit represents a departure from the approach taken in the earlier prototype in cockroachdb#43831. Instead of using a "neighborhood" abstraction to represent arbitrary collections of nodes, this commit sticks to the existing abstractions of region and locality. If we need a new abstration in the future, we should be able to simply modify the representation of Distribution. Additionally, this commit does not make an attempt to represent exactly how data is partitioned across regions (e.g., which column is the partitioning key, whether the data is hash or range partitioned, etc.); it only tracks *which* regions contain data. This greatly simplifies the implementation, and I don't think it will significantly reduce the accuracy of costing. Informs cockroachdb#47226 Release note: None

74349: opt: add Distribution physical property and Distribute enforcer r=rytaft a=rytaft This commit adds a new physical property called `Distribution` to expressions in the optimizer. Currently, `Distribution` is just the list of regions that are touched by the expression. This commit also adds a new enforcer called `Distribute`, which enforces a particular `Distribution` on its input. This is similar to the way `Sort` enforces a particular `Ordering`. Currently, `Distribute` is only used to enforce that the results of a query end up in the same region as the gateway node. The `Distribution` property and `Distribute` enforcer will enable us to perform better costing of query plans in future commits. This commit adds a tiny cost for the `Distribute` enforcer, but otherwise does not yet take advantage of distribution for costing. In the future, we can use `Distribution` to better cost distributed joins and aggregations, and accurately cost the `Distribute` enforcer so as to avoid scanning data in remote regions if possible. This commit represents a departure from the approach taken in the earlier prototype in #43831. Instead of using a "neighborhood" abstraction to represent arbitrary collections of nodes, this commit sticks to the existing abstractions of region and locality. If we need a new abstration in the future, we should be able to simply modify the representation of `Distribution`. Additionally, this commit does not make an attempt to represent exactly how data is partitioned across regions (e.g., which column is the partitioning key, whether the data is hash or range partitioned, etc.); it only tracks *which* regions contain data. This greatly simplifies the implementation, and I don't think it will significantly reduce the accuracy of costing. Informs #47226 Release note: None Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>

Updated planProjectionOperators and planSelectionOperators to include case for tree.NotExpr Replaced RunTests with RunTestsWithoutAllNullsInjection Fixes for build & lint issues Removed unused field Addressing PR review comments backupccl: use SpanGroup.Sub instead of CoveringMerge Release note: none. rowenc: reduce reliance on the entire TableDescriptor This change reduces uses of the entire TableDescriptor. Release note: None descpb: add and use NoColumnID constant This change adds a `descpb.NoColumnID` constant so we don't have to use `descpb.ColumnID(encoding.NoColumnID)` everywhere. Release note: None rowenc: remove DatumAlloc.scratch This commit removes the `DatumAlloc.scratch` field which is always nil. Release note: None sql: move `DatumAlloc` to sem/tree This commit moves `DatumAlloc` from `sql/rowenc` to `sql/sem/tree`. This is a fairly low-level construct and is not used exclusively for row encoding/decoding. Release note: None rowenc: move low-level key encoding functions to subpackage The `rowenc` package contains a hefty mix of functions operating at different conceptual levels - some use higher level constructs like table and index descriptors, others are lower level utilities. The naming of these functions doesn't help, for example `EncodeTableKey` sounds like a top-level function but is actually a low level utility that appends a single value to a key This commit moves the lower level utilities for encoding datum values into keys to the `rowenc/keyside` package. Release note: None rowenc: minor cleanup of array decoding code The code around array decoding was confusing; we now have a single `decodeArray` variant which is the inverse of `encodeArray`. Release note: None rowenc: move low-level value encoding functions to subpackage This commit moves the lower level utilities for encoding datum values into values to the `rowenc/valueside` package. Release note: None valueside: introduce ColumnIDDelta type When encoding multiple values, we encode the differences between the ColumnIDs. This difference is passed to `valueside.Encode`, but it is passed as a `descpb.ColumnID`. This commit introduces a new type to make this distinction more evident. Release note: None valueside: add Decoder helper We have a few copies of the same logic of decoding multiple SQL values from a Value, in particular in range feed handling code. This commit adds a helper type that centralizes this logic, reducing duplication. Note that the query executione engines retain their own specialized implementations. Release note: None sql: release BatchFlowCoordinator objects to pool This commit adds `BatchFlowCoordinator` objects to their `vectorizedFlowCreator`'s `releasables` slice, which ensures that their `Release` method is called and that they are properly recycled. Before this change, heap profiles and instrumentation both indicated that these objects were not being recycled as intended. For example, heap profiles looked like: ``` Type: alloc_objects Time: Dec 30, 2021 at 2:10pm (EST) Active filters: focus=Pool\).Get Showing nodes accounting for 92189, 0.57% of 16223048 total ----------------------------------------------------------+------------- flat flat% sum% cum cum% calls calls% + context ----------------------------------------------------------+------------- 71505 79.69% | github.com/cockroachdb/cockroach/pkg/sql/colflow.NewBatchFlowCoordinator /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:218 10923 12.17% | github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.newLockTableGuardImpl /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table.go:452 2048 2.28% | github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree.(*Map).maybeInitialize /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree/map.go:112 1170 1.30% | github.com/cockroachdb/cockroach/pkg/sql/colexec.newMaterializerInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:179 1030 1.15% | github.com/cockroachdb/pebble.(*Iterator).Clone /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:1228 585 0.65% | github.com/cockroachdb/cockroach/pkg/sql.MakeDistSQLReceiver /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:841 585 0.65% | github.com/cockroachdb/cockroach/pkg/sql/colfetcher.NewColBatchScan /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colfetcher/colbatch_scan.go:223 585 0.65% | github.com/cockroachdb/cockroach/pkg/sql/span.MakeBuilder /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/span/span_builder.go:61 585 0.65% | github.com/cockroachdb/cockroach/pkg/storage.mvccGet /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:859 585 0.65% | github.com/cockroachdb/cockroach/pkg/storage.mvccScanToBytes /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2357 128 0.14% | github.com/cockroachdb/pebble.(*DB).newIterInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:765 0 0% 0% 89729 0.55% | sync.(*Pool).Get /Users/nathan/Go/go/src/sync/pool.go:148 ``` Notice that `sync.(*Pool).Get` is responsible for *0.55%* of heap allocations and **79.69%** of these are from `colflow.NewBatchFlowCoordinator`. After this change, that source of allocations goes away and we see the following impact on micro-benchmarks: ``` name old time/op new time/op delta KV/Scan/SQL/rows=1-10 95.1µs ± 7% 95.9µs ± 5% ~ (p=0.579 n=10+10) KV/Scan/SQL/rows=10-10 100µs ± 3% 103µs ±12% ~ (p=0.829 n=8+10) name old alloc/op new alloc/op delta KV/Scan/SQL/rows=10-10 21.7kB ± 0% 21.5kB ± 0% -0.76% (p=0.000 n=10+10) KV/Scan/SQL/rows=1-10 20.1kB ± 0% 19.9kB ± 0% -0.70% (p=0.000 n=10+9) name old allocs/op new allocs/op delta KV/Scan/SQL/rows=1-10 245 ± 0% 244 ± 0% -0.41% (p=0.000 n=10+10) KV/Scan/SQL/rows=10-10 280 ± 0% 279 ± 0% -0.36% (p=0.001 n=8+9) ``` sql/catalog: restore fast-path in FullIndexColumnIDs This commit restores a [fast-path](https://github.com/cockroachdb/cockroach/commit/c9e116e586f24c5f3a831ac653f14fd03f588b93#diff-19625608f4a6e23e6fe0818f3a621e716615765cb338d18fe34b43f0a535f06dL140) in `FullIndexColumnIDs` which was lost in c9e116e. The fast-path avoided the allocation of a `ColumnID` slice and a `IndexDescriptor_Direction` slice in `FullIndexColumnIDs` when given a unique index. In such cases, these slices are already stored on the `IndexDescriptor`. ``` name old time/op new time/op delta KV/Scan/SQL/rows=1-10 94.9µs ±10% 94.9µs ± 8% ~ (p=0.739 n=10+10) name old alloc/op new alloc/op delta KV/Scan/SQL/rows=1-10 20.1kB ± 0% 20.1kB ± 1% ~ (p=0.424 n=10+10) name old allocs/op new allocs/op delta KV/Scan/SQL/rows=1-10 245 ± 0% 241 ± 0% -1.63% (p=0.000 n=10+8) ``` kv: protect Replica's lastToReplica and lastFromReplica fields with raftMu This commit moves the Replica's lastToReplica and lastFromReplica from under the `Replica.mu` mutex to the `Replica.raftMu` mutex. These are strictly Raft-specific pieces of state, so we don't need fine-grained locking around them. As a reward, we don't need to grab the `Replica.mu` exclusively (or at all) when setting the fields in `Store.withReplicaForRequest`. The locking in `setLastReplicaDescriptors` showed up in a mutex profile under a write-heavy workload. It was responsible for **3.44%** of mutex wait time. Grabbing the mutex was probably also slowing down request processing, as the exclusive lock acquisition had to wait for read locks to be dropped. coldata: operate on Nulls value, not reference This commit changes `col.Vec.SetNulls` to accept a `Nulls` struct by value instead of by pointer. This lets us avoid a heap allocation on each call to `Nulls.Or`. Releaes note: None roachtest: fix silly bug in tlp roachtest I thought that we'd get some basic coverage of roachtests in unit tests at low duration or something, but I guess not. Release note: None sql: support readonly default_with_oids var This is just one less error when importing PGDUMP. Release note (sql change): We now support `default_with_oids` which only accepts being false. importccl,jobs: truncate all job system errors In #73303 we truncated the row data in this error message to prevent problems with large row data preventing the job status from being saved. However, truncating the row means that our "experimental_save_rejected" option does not work as expected. This feature previously saved entire rows and now it would have truncated rows in some case. Now, we only truncate the error when producing the error string. Further, we also add truncation to the job system to prevent other parts of the system from saving too much data to the job system. Release note: None importccl: remove unused argument Release note: None sql/logictest: add diff support to expectation output This adds support for generating a unified diff for expectation mismatches. This can make it easier to spot the differences between two results when the result contains many rows. For example: ``` ../../sql/logictest/testdata/logic_test/system_namespace:54: SELECT * FROM system.namespace expected: 0 0 defaultdb 50 0 0 postgres 51 0 0 system 1 0 0 test 52 1 0 public 29 1 29 comments 24 1 29 database_role_settings 44 1 29 descriptor 3 1 29 descriptor_id_seq 7 1 29 eventlog 12 1 29 jobs 15 1 29 join_tokens 41 1 29 lease 11 1 29 locations 21 1 29 migrations 40 1 29 namespace 30 1 29 protected_ts_meta 31 1 29 protected_ts_records 32 1 29 rangelog 13 1 29 replication_constraint_stats 25 1 29 replication_critical_localities 2 1 29 replication_stats 27 1 29 reports_meta 28 1 29 role_members 23 1 29 role_options 33 1 29 scheduled_jobs 37 1 29 settings 6 1 29 sql_instances 46 1 29 sqlliveness 39 1 29 statement_bundle_chunks 34 1 29 statement_diagnostics 36 1 29 statement_diagnostics_requests 35 1 29 statement_statistics 42 1 29 table_statistics 20 1 29 transaction_statistics 43 1 29 ui 14 1 29 users 4 1 29 web_sessions 19 1 29 zones 5 50 0 public 29 51 0 public 29 52 0 public 29 but found (query options: "rowsort" -> ignore the following ordering of rows) : 0 0 defaultdb 50 0 0 postgres 51 0 0 system 1 0 0 test 52 1 0 public 29 1 29 comments 24 1 29 database_role_settings 44 1 29 descriptor 3 1 29 descriptor_id_seq 7 1 29 eventlog 12 1 29 jobs 15 1 29 join_tokens 41 1 29 lease 11 1 29 locations 21 1 29 migrations 40 1 29 namespace 30 1 29 protected_ts_meta 31 1 29 protected_ts_records 32 1 29 rangelog 13 1 29 replication_constraint_stats 25 1 29 replication_critical_localities 26 1 29 replication_stats 27 1 29 reports_meta 28 1 29 role_members 23 1 29 role_options 33 1 29 scheduled_jobs 37 1 29 settings 6 1 29 sql_instances 46 1 29 sqlliveness 39 1 29 statement_bundle_chunks 34 1 29 statement_diagnostics 36 1 29 statement_diagnostics_requests 35 1 29 statement_statistics 42 1 29 table_statistics 20 1 29 transaction_statistics 43 1 29 ui 14 1 29 users 4 1 29 web_sessions 19 1 29 zones 5 50 0 public 29 51 0 public 29 52 0 public 29 Diff: --- Expected +++ Actual @@ -20,3 +20,3 @@ 1 29 replication_constraint_stats 25 - 1 29 replication_critical_localities 2 + 1 29 replication_critical_localities 26 1 29 replication_stats 27 logic.go:3340: ../../sql/logictest/testdata/logic_test/system_namespace:100: error while processing ``` The diff output is only added when the `-show-diff` flag is provided because in some cases it can produce more noise than it is worth. The diff library used here is the same one already used by the testify libraries we depend on. Release note: None teamcity-trigger: give `kvserver` package higher stress timeout Closes https://github.com/cockroachdb/cockroach/issues/69519. Release note: None backupccl: allow columns of type array in EXPORT PARQUET Previously, EXPORT PARQUET only supported CRDB relations whose columns were scalars of type int, string, float, or boolean. This change allows columns of type array whose values can be int, string, float, boolean. Following the CRDB ARRAY documentation, a value in an exported array can be NULL as well. Informs: #67710 Release note (sql change): EXPORT PARQUET can export columns of type array backupccl: remove old makeImportSpans Release note: none. backupccl: expand some comments on covering code Release note: none. security: make the bcrypt cost configurable Release note (security update): For context, when configuring passwords for SQL users, if the client presents the password in cleartext via ALTER/CREATE USER/ROLE WITH PASSWORD, CockroachDB is responsible for hashing this password before storing it. By default, this hashing uses CockroachDB's bespoke `crdb-bcrypt` algorithm, itself based off the standard Bcrypt algorithm. The cost of this hashing function is now configurable via the new cluster setting `server.user_login.password_hashes.default_cost.crdb_bcrypt`. Its default value is 10, which corresponds to an approximate password check latency of 50-100ms on modern hardware. This value should be increased over time to reflect improvements to CPU performance: the latency should not become so small that it becomes feasible to bruteforce passwords via repeated login attempts. Future versions of CockroachDB will likely update the default accordingly. security: make `server.user_login.min_password_length` visible in doc gen This cluster setting was meant to be exported for visibility in auto-generated docs (we've documented it before). This was an oversight. Release note: None authors: add Fenil Patel to authors Release note: None bench: force the index join in BenchmarkIndexJoin Without using the index hint, we now choose to perform the full scan over the primary index and put the filter on top of that rather than performing a limited scan of the secondary followed by an index join. I confirmed that this is the case on 21.1 and 21.2 binaries, so I'll backport to the latter (in order to make the comparison between releases sane). Release note: None dev: a few improvements to the script * Add `set -e` so if the `dev` build fails, it doesn't try to run the binary anyway. * Add a way to force recompilation of `dev` (useful for developers). Release note: None changefeedccl: Shutdown tenant before main server. Shutdown tenant server before stopping main test server. This elliminates some of the error messages we see in the test output when tenant attempts to connect to some ranges which are no longer accessible. Release Notes: None importccl: fix import pgdump target column bug Previously, if a COPY FROM statement had less columns than the CREATE TABLE schema defined in the dump file, we would get a nil pointer exception. This is because we were not filling the non-targeted columns with a NULL datum. This change fixes that and aligns behvaiour with how INSERT handles non-targeted columns. Release note (bug fix): IMPORT TABLE ... PGDUMP with a COPY FROM statement in the dump file that has less target columns than the CREATE TABLE schema definition would result in a nil pointer exception. roachtest: add new passing tests for pgjdbc nightly Release note: None go.mod: fix ordering The first `require` is for direct dependencies. The second one for indirect dependencies. There's no need for a third one. Release note: None Makefile: specify that code needs to be generated for `roach{test,prod}` Release note: None setting: clean up slot indexes The internal setting code passes around slot indexes as bare integers, and there are confusingly two variants: one is 1-indexed, another is 0-indexed. This change makes all slot indexes 0-indexed and adds a `slotIdx` type. Note: the 1-indexed choice was perhaps useful at the time to help find code paths where the slot index is not initialized, but now the slot index is set along with other mandatory fields by `init()`. Release note: None setting: prevent use of SystemOnly settings from tenant code This change implements the `system-only` semantics in the RFC (#73349). All SystemOnly setting values are now invisible from tenant code. If tenant code attempts to get or set a SystemOnly setting by handle, it results in panics in test builds; in non-test builds, these settings always report the default value. Release note (sql change): System-only cluster settings are no longer visible for non-system tenants. build: fix Pebble nightly benchmarks The Pebble nightly benchmarks stopped running due to a build error stemming from the removal of the generated code from the codebase. Release note: None scpb: move Node to screl, introduce scpb.TargetState This refactoring commit moves scpb.Node to screl and groups the targets into a new scpb.TargetState. Separating the quasi-constant target elements and statuses from the elements' intermediate statuses ultimately makes for cleaner code in the declarative schema changer. Release note: None scplan: make child packages internal, including scgraph and scgraphviz This commit moves scgraph and scgraphviz under scplan, and makes all its child packages internal. Release note: None <pkg>: <short description - lowercase, no final period> authors: add gtr to authors Release note: None sql: format statements for the UI Previously, statements sent to the UI were not formatted, making them difficult to read. With this change, statements are now formatted using a builtin function that prettifies statements (using existing pretty-printing logic). Statements are formatted using pre-determined pretty-printing configuration options mentioned in this issue: #71544. Resolves: #71544 Release note (sql change): statements are now formatted prior to being send to the UI, this is done using a new builtin function that formats statements. ui: added formatting to statements on the details pages Previously, statements displayed on the statement/transaction/index details pages were not formatted. Formatting was added to allow for better readability of statements on these detail pages. Requested statements are now formatted on the server using the existing pretty-printing logic. Statements returned from the statement handler are now formatted. Formatting is done via a new builtin function 'prettify_statement', a wrapper function over the existing pretty-printing logic. Resolves: #71544 Release note (ui change): added formatting to statements on the statement, transaction and index details pages. roachpb: InternalServer API for tenant settings This commit introduces the API that the tenants will use to obtain and list for updates to tenant setting overrides. The API was designed to allow for maximum flexibility on the server side so that it can be implemented as a range feed without any extra state (if necessary). Release note: None schemachanger: fully qualify object names inside event log entries Previously, the original SQL was displayed inside the declarative schema changers event logs, which was both missing redactions and full name resolutions. This was inadequate because the existing schema changer always fully resolved missing names within its entries. To address this, this patch adds new metadata that contains the fully resolved and redacted text. Release note: None schemachanger: full resolve names inside the AST for event logs and errors Previously, the declarative schema changer left the AST untouched when resolving names. This was inadequate because both event log entry generation and some error messages generated expect the fully resolved names inside the AST. To address this, this patch adds support for copying and altering the AST including support for adding annotations. Additionally, all resolved names are now propagated back into the AST and extra validation is introduced to make sure that no unresolved names are left. Release note: None schemachanger: add statement tag to event log entries Previously, the tag field inside the event log entries generated by the declarative schema changer were empty. This was inadequate because the existing schema changer always populated these fields. To address this, this patch will now populate the statement tag field inside the element metadata and into event log entries. Release note: None rfc: token-based authentication for SQL session revival Release note: None vendor: upgrade cockroachdb/apd to v3 This commit picks up the following changes to `cockroachdb/apd`: - https://github.com/cockroachdb/apd/pull/103 - https://github.com/cockroachdb/apd/pull/104 - https://github.com/cockroachdb/apd/pull/107 - https://github.com/cockroachdb/apd/pull/108 - https://github.com/cockroachdb/apd/pull/109 - https://github.com/cockroachdb/apd/pull/110 - https://github.com/cockroachdb/apd/pull/111 Release note (performance improvement): The memory representation of DECIMAL datums has been optimized to save space, avoid heap allocations, and eliminate indirection. This increases the speed of DECIMAL arithmetic and aggregation by up to 20% on large data sets. sql/sem: remove EvalContext.getTmpDec `apd.Decimal` can now be entirely stack allocated during arithmetic, so there's no longer any need for this. With https://github.com/cockroachdb/apd/pull/104, this does not introduce any new heap allocations: ``` ➜ (cd pkg/sql/sem/tree && goescape . | grep moved | wc -l) 328 ``` scripts: add sgrep @aliher1911 showed me this handy awk trick to filter goroutines [here]. I find myself reaching for this frequently, but it always takes me a minute or two to get it right. This helper will make it a lot easier and will perhaps enable many others to reap the benefits as well. [here]: https://github.com/cockroachdb/cockroach/pull/66761#pullrequestreview-691529838 Release note: None sqlproxyccl: better test output This moves the logging output to files and enables inspection of the authentication results in logs crdb-side. Release note: None storage/metamorphic: update clearRangeOp to not conflict with in-flight writes This change updates clearRangeOp to only run on key spans that do not have any transactional writes in-progress. This more closely matches behaviour in KV and avoids cases where we'd trample on intents and other in-flight keys. Also makes a similar adjustment to mvccClearTimeRangeOp. Fixes #72476. Speculatively. Reproduction was nearly impossible. Release note: None. Revert "roachpb: change `Lease.String()`/`SafeFormat()` to support refs." This reverts commit 33fe9d15b2660cc40d255a3d448d88bb35c2c39d. storage,roachpb: Add Key field to WriteTooOldError Currently, WriteTooOldErrors that happen as part of a ranged put operation (eg. MVCCClearTimeRange) can be very opaque, as we don't know what key we tried to "write too old" on. This change addresses that by adding a Key field to WriteTooOldError to store one of the keys the error pertains to. Informs #72476. Release note: None. dev: enumerate explicit list of all generated `.go` files when hoisting We should be able to `find _bazel/bin/go_path -type f -name '*.go'` and list the generated files that way, but bazelbuild/rules_go#3041 is in the way and I can't figure out how to resolve it. Instead we have to do things the hard way: run `bazel aquery` and parse out the paths to all the generated files. In this way we avoid accidentally hoisting out stale symlinks from a previous build. If bazelbuild/rules_go#3041 is resolved upstream then this change can be reverted. Release note: None sql: public schema long running migration Release note: None sql: insert missing public schema namespace entry When restoring a database, a namespace entry for the public schema was not created. Release note: None authors: add msirek to authors Release note: None sql: Add hints to CREATE/ALTER table errors for MR Add some hints when trying to create (or alter to) a multi-region table which indicate that if the database is not multi-region enabled, it should be converted to a multi-region enabled database via a "ALTER DATABASE ... SET PRIMARY REGION <region>" statement. Release note: None catalog: add Index*Columns methods to table descriptor These methods return the columns referenced in indexes as []catalog.Column slices. Release note: None catalog: add Index*ColumnDirections methods to table descriptor This complements the previous commit which added Index*Column methods to catalog.TableDescriptor. Release note: None catalog: remove FullIndexColumnIDs Calls to this function can now be replaced with recently-added IndexFullColumns and IndexFullColumnDirections method calls on the table descriptor. Release note: None tabledesc: improve memory-efficiency of Index*Column* methods This commit removes IndexCompositeColumns (which wasn't actually used and is unlikely to ever be) and generates the indexColumnCache with less memory allocations. The current scheme is causing a statistically-significant performance regression. Release note: None sql: allow the 1PC optimization in some cases in the extended protocol A previous commit (7e2cbf51869fc326974a5665db80da8b29422631) fixed our pgwire implementation so that it does not auto-commit a statement executed in the extended protocol until a Sync message is received. That change also had the undesired effect of disabling the 1PC ("insert fast path") optimization that is critical for write-heavy workloads. With this current commit, the 1PC optimization is allowed again, as long as the statement execution is immediately followed by a Sync message. This still has the correct bugfix semantics, but allows the optimization for the common case of how the extended protocol is used. No release note since this only affects unreleased versions. Release note: None execgen: remove temporary decimals from the helper This commit removes two temporary decimals from `execgen.OverloadHelper` since after the upgrade of the decimal library it can use stack-allocated temporary objects without them escaping to the heap. Release note: None colexec: batch allocate datums for projectTupleOp This has a profound impact on the amount of garbage generated by the delivery query in TPC-C. ``` name old time/op new time/op delta TPCC/mix=delivery=1-16 38.0ms ± 2% 35.8ms ± 1% -5.76% (p=0.000 n=9+8) name old alloc/op new alloc/op delta TPCC/mix=delivery=1-16 8.17MB ± 1% 7.97MB ± 1% -2.36% (p=0.000 n=9+10) name old allocs/op new allocs/op delta TPCC/mix=delivery=1-16 80.2k ± 0% 20.3k ± 1% -74.65% (p=0.000 n=10+9) ``` Leveraged https://github.com/cockroachdb/cockroach/pull/74443 to find this. Release note: None kvserver/loqrecovery: check full key coverage in quorum recovery Previously when doing unsafe replica recovery, if some ranges are missing or represented by stale replicas that were split or merged, recovery will change cluster to inconsistent state with gaps or overlaps in keyspace. This change adds checks for range completeness as well as adds a preference for replicas with higher range applied index. Release note: None sql: fix enum hydration in distsql expression evaluation Fixes: #74442 Previously in some circumstances we could fail to hydrate enum types used in join predicate expressions and possibly other situations. Now types used in ExprHelper are always hydrated during Init phase when a distsql type resolver is being used. Also add a test case for the lookup semi join repro case. Release note (bug fix): Fix panic's possible in some distributed queries using enum's in join predicates. backupccl: add libgeos to BUILD.bazel This is required when using randgen so that libgeos can be initialized when a geometric type is generated. Fixes #73895 Release note: None sql: refactor pg_builtin to use actual grant options Refactor builtins.priv -> privilege.Privilege. Replace privilege.Kind with privilege.Privilege in functions that need access to privilege.Privilege.GrantOption. Release note: None sql: refactor pg_builtin to use actual grant options The builtins has_table_privilege, has_column_privilege, has_any_column_privilege now use privileges.Priv.GrantOption instead of privileges.Kind.GRANT. Release note: None dev: introduce common benchmarking flags Specifically: --bench-mem (to print out memory allocations); --bench-time (to control how long each benchmark is run for); --count (how many times to run it, also added to `dev test`); -v and --show-logs (similar to `dev test`) We also add supports for args-after-double-dash-are-for-bazel within `dev bench`. This commit is light on testing (read: there isn't any), so doesn't bump DEV_VERSION to roll it out to everyone just yet. Release note: None execgen: skip encoding/decoding JSON when hashing it Previously, in order to hash a JSON object we would encode and decode it (due to the behavior of `coldata.JSONs.Get` and the setup of the execgen). This commit refactors how the hash functions are handled in the execgen which allows us to peek inside of `coldata.Bytes` (that is under the `coldata.JSONs`) in order to get direct access to the underlying `[]byte`. This should be a minor performance improvement but also allows us to remove the need for the overload helper in the hashing context. It is possible (although I'm not certain) that the hash computation is now different, so this commit bumps the DistSQL versions to be safe. Release note: None colexechash: fix BCEs in some cases Previously, we were not getting (and not asserting) some of the bounds check eliminations in `rehash` function because we were accessing `.Sliceable` property in the wrong context. This is now fixed (by accessing it via `.Global`) as well as assertions are added correctly. Additionally, this commit pulls out the call to the cancel checker out of the templated block to reduce the amount of code duplication. Release note: None execinfra: prefer leased table descriptors This commit makes better use of the table descriptor protos in the DistSQL specs by first checking if they've already been leased. If so, then we use those instead of re-generating catalog.TableDescriptor instances. This has a statistically-significant impact on memory allocations, as illustrated with this microbenchmark which I ran on my development machine: name old time/op new time/op delta KV/Scan/SQL/rows=1-16 190µs ± 7% 184µs ± 4% -3.60% (p=0.016 n=10+8) KV/Scan/SQL/rows=10-16 196µs ± 4% 198µs ±12% ~ (p=0.762 n=8+10) name old alloc/op new alloc/op delta KV/Scan/SQL/rows=1-16 19.5kB ± 1% 17.4kB ± 1% -11.12% (p=0.000 n=9+10) KV/Scan/SQL/rows=10-16 21.0kB ± 1% 18.9kB ± 1% -10.20% (p=0.000 n=10+10) name old allocs/op new allocs/op delta KV/Scan/SQL/rows=1-16 222 ± 0% 210 ± 1% -5.59% (p=0.000 n=7+10) KV/Scan/SQL/rows=10-16 256 ± 0% 244 ± 0% -4.84% (p=0.000 n=10+7) This change opens us to the possibility of no longer shipping the whole table descriptor proto in the DistSQL spec. Release note: None sql: add missing error check in GetHydratedZoneConfigForNamedZone Closes #74606 Release note: None sql: rename pg_cast_provolatile_dump.csv to pg_cast_dump.csv Release note: None sql: remove castInfo `castMap` now contains all volatility information and `castInfo` is no longer needed. `castInfo` has been removed. For backwards compatibility, some invalid casts between VARBIT and integer types have been added to `castMap`. These casts have unexpected behavior before and after this commit. They are not supported by Postgres. They should be disallowed in the future. See #74577 and 74580 Casts between tuple and string types are now dynamically handled in `lookupCast` support casts with named record types that have different OIDs than `oid.T_record`. Casts from OID and REG* types to INT2 are now allowed to maintain backward compatibility. Release note: None sql: update pg_cast_dump.csv This commit updates the `pg_cast_dump.csv` file with new rows for casts from the JSONB type (OID 3802). It also adds the Postgres version as a column to annotate the version of Postgres that the CSV was generated from. This is important because the `pg_cast` table can change from version to version. This CSV was generated from Postgres version 13.5. Release note: None sql: consistently order pg_cast_dump.csv Release note: None sql: check cast contexts against Postgres's pg_cast table This commit includes the `pg_cast.castcontext` column in `pg_cast_dump.csv` and uses it to validate the `maxContext` field in `castMap`. Note that Postgres's `pg_cast` table does not include all possible casts, such as automatic I/O conversions, so this test is not comprehensive. Release note: None ci: bazelize nightly pebble ycsb, write-throughput benchmarks We still have to take care of the metamorphic nightly. Part of #67335. Release note: None authors: add bardin to authors Release note: None opt: fix corner case with lookup joins and provided orderings This commit fixes a case where the lookup join was passing through a provided ordering with unnecessary columns. This was caused by imperfect FDs at the join level such that the ordering cannot be simplified at the join level but it can be simplified at the level of its input. Note that the case causes an internal error in test builds but there are no known examples of user-visible problems in non-test builds (hence no release note). Fixes #73968. Release note: None rpc: fix span use-after-Finish in internalClientAdapter The internalClientAdapter performs some local "RPCs" by directly calling the server method, without going through gRPC. The streaming RPC calls are made on a different goroutine, and these goroutines were using the callers tracing span. These goroutines could outlive the caller's span, resulting in a use-after-Finish crash. This patch fixes them by creating dedicated RPC spans, mimicking what our gRPC interceptors do. Fixes #74326 Release note: None vendor: Bump pebble to 3d0ff924d13a3d5fdf6e56a391c5c178c18ff196 Changes pulled in: ``` 3d0ff924d13a3d5fdf6e56a391c5c178c18ff196 *: Add trySeekUsingNext optimization to SeekGE 0c503048eb0365981929177c30178add8a56ae3e sstable: add (*sstable.Writer).RangeKey{Set,Unset,Delete} methods fe52b49cc28df62dce9b00c382a5ce217936be56 tool/logs: aggregate compaction logs by node and store ID 8ab4358bc59dfa62e5e34e4b0e5ce81a68f5fe91 sstable: return err from `(BlockPropertyCollector).FinishTable` 91c18ef0ee999980c2869d11e5ce468410acbe8d internal/keyspan: add FragmentIterator interface 953fdb078ff0585489206ae96e1d80ca9f6f90c7 internal/keyspan: implement SetBounds on Iter aa376a819bf67cd6766ee827feed4bf0bd508f1f tool: add compaction log event aggregation tool ``` Release note: None. storage: Use optimized SeekGE in CheckSSTConflicts This nearly reverts #73514 by moving back to calling SeekGE on the engine to skip past any empty spans on either the engine or the SSTable. This is the more optimal approach on average, and given optimizations in cockroachdb/pebble#1412 which this change depends on, it also ends up performing better than a SeekPrefixGE-driven appraoch and the pre-#73514 approach. Improvement when running BenchmarkCheckSSTConflicts against the pre-#73514 revision (vs. this one): ``` name old time/op new time/op delta CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=false-24 72.6µs ±11% 66.3µs ± 1% -8.67% (p=0.008 n=5+5) CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=true-24 12.2ms ± 1% 1.7ms ± 1% -86.41% (p=0.008 n=5+5) CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=false-24 69.8µs ± 2% 67.4µs ± 1% -3.48% (p=0.008 n=5+5) CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=true-24 13.3ms ± 3% 2.8ms ± 1% -78.97% (p=0.008 n=5+5) CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=false-24 75.8µs ± 3% 63.8µs ± 1% -15.86% (p=0.008 n=5+5) CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=true-24 13.0ms ± 1% 1.9ms ± 1% -85.11% (p=0.008 n=5+5) CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=false-24 69.8µs ±11% 64.6µs ± 1% -7.45% (p=0.008 n=5+5) CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=true-24 14.8ms ± 9% 3.1ms ± 2% -79.05% (p=0.008 n=5+5) CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=false-24 66.1µs ± 2% 63.7µs ± 1% -3.65% (p=0.008 n=5+5) CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=true-24 14.2ms ± 9% 1.9ms ± 1% -86.55% (p=0.008 n=5+5) CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=false-24 72.3µs ±10% 64.5µs ± 0% -10.77% (p=0.008 n=5+5) CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=true-24 122ms ± 2% 17ms ± 1% -86.03% (p=0.008 n=5+5) CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=false-24 69.0µs ± 9% 62.4µs ± 1% -9.57% (p=0.032 n=5+5) CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=true-24 14.0ms ± 1% 2.3ms ± 2% -83.46% (p=0.016 n=4+5) CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=false-24 69.4µs ± 9% 62.7µs ± 1% -9.63% (p=0.016 n=5+5) CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=true-24 140ms ± 5% 26ms ± 1% -81.70% (p=0.008 n=5+5) CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=false-24 69.2µs ±10% 62.5µs ± 1% -9.66% (p=0.008 n=5+5) CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=true-24 15.3ms ±11% 2.3ms ± 1% -85.21% (p=0.008 n=5+5) CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=false-24 69.7µs ±12% 63.6µs ± 1% ~ (p=0.095 n=5+5) CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=true-24 148ms ± 6% 28ms ± 2% -80.90% (p=0.008 n=5+5) CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=false-24 67.1µs ±10% 61.1µs ± 2% -8.93% (p=0.016 n=5+5) CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=true-24 14.4ms ± 2% 2.5ms ± 5% -82.45% (p=0.016 n=4+5) CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=false-24 68.9µs ±21% 62.2µs ± 1% -9.76% (p=0.008 n=5+5) CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=true-24 204ms ±14% 42ms ± 5% -79.44% (p=0.008 n=5+5) ``` Fixes #66410. Release note: None. execinfrapb: add a helper for index joins based on the JoinReaderSpec Release note: None rowexec: refactor the joinReader to not exceed the batch size The joinReader operates by buffering the input rows until a certain size limit (which is dependent on the strategy). Previously, the buffering would stop right after the size limit is reached or exceeded, and this commit refactors the code to not exceed the limit except in a case of a single large row. This is what we already do for vectorized index joins. Release note: None sql,kv: introduce Streamer API and use it for index joins in some cases This commit introduces the Streamer API (see https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210617_index_lookups_memory_limits.md) as well as its implementation for the simplest case - when requests are unique and can be served in any order. It additionally hooks up the implementation to be used by the index joins in both execution engines. There are three main pieces that this commit adds: 1. the `Streamer` struct itself. It is largely the same as described in the RFC. Some notable changes are: - `Cancel` is renamed to `Close` and is made blocking to ensure that all goroutines kicked off by the `Streamer` exit before `Close` returns. - `Shrink` has been removed from the `budget` struct (see below for more details). - furthermore, the `budget` interface has been unexported and the `streamer` has been tightly coupled with the `budget`'s implementation. - the TODO about collecting DistSQL metadata is removed because we are collecting the LeafTxnFinalState already when using the LeafTxn. 2. the limited `Streamer` implementation - only `OutOfOrder` mode is supported when the requests are unique. Notably, buffering and caching of the results is not implemented yet. 3. `TxnKVStreamer` component that sits below the SQL fetchers, uses the `Streamer`, and is an adapter from `BatchResponse`s to key/value pairs that fetchers understand. Although not used at the moment, `TxnKVStreamer` is written under the assumption that a single result can satisfy multiple requests. The memory budget of the `Streamer` is utilized lazily. The RFC was calling for making a large reservation upfront and then shrinking the budget if we see that we don't need that large reservation; however, while working on this I realized that lazy reservations are a better fit for this. The `Streamer` can reserve up to the specified limit (determined by `distsql_workmem` variable) against the root monitor (in the degenerate case of a single large row more memory will be reserved). The reservation then never shrinks under the assumption that if the reservation has gotten large, it means it was needed for higher concurrency (or large responses), and it is likely to be needed for the same reasons in the future. The layout of the main components of the `Streamer` implementation: - in `Enqueue` we have a logic similar to what DistSender does in order to split each request (that might span multiple ranges) into single-range requests. Those sub-requests are batched together to be evaluated by a single `BatchRequest`. - `workerCoordinator.mainLoop` is responsible for taking single-range batches, estimating the corresponding response size, and issuing requests to be evaluated in parallel while adhering to the provided memory budget. - `workerCoordinator.performRequestAsync` is responsible for populating the `BatchRequest` and then processing the results while updating the memory budget. Current known limitations that will be addressed in the follow-up work: - at the moment a single row can be split across multiple BatchResponses when TargetBytes limit is reached when the table has multiple column families; therefore, we use the streamer only for single column family cases. We will expand the KV API shortly to not split the rows in multiple column family cases. - manual refresh of spans when `ReadWithinUncertaintyIntervalError` is encountered by a single streamer in a single flow is not implemented. It is an optimization that is considered a must for the final implementation in order to not regress on simple cases in terms of retriable errors. This will be implemented shortly as a follow-up. - I'm thinking that eventually we probably want to disable the batch splitting done by the DistSender to eliminate unnecessary blocking when the streamer's splitting was incorrect. This would give us some performance improvements in face of range boundary changes, but it doesn't seem important enough for the initial implementation. Release note: None sql: check equivalent constraint when creating hash index Fixes #68031 Previously we only try to create constraint for shard column if it's newly created. We check duplicate constraint for shard column when `Alter Primary Key` and `Create Index`, however the check is simply a name check. This pr adds logic to check equivalent constraint by checking if the formatted expression string is the same. With this logic we can try to create the constraint no matter if a shard column is newly created or not. With this fix, we also don't need to expose the constraint through `SHOW CREATE TABLE` result since we make sure the constraint is created or skipped if one already exists. Release note (sql change): Before this change, the check constraint on shard column used by hash sharded index was printed in the corresponding `SHOW CREATE TABLE`. The constraint had been shown because cockroach lacked logic to ensure that shard columns which are part of hash sharded indexs always have the check constraint which the optimizer relies on to achieve properly optimized plans on hash sharded indexes. We no longer display this constraint in `SHOW CREATE TABLE` as it is now implied by the `USING HASH` clause on the relevant index. colexec: clean up the usage of the binary overload helper Previously, for projection and selection operators we would always pass in `execgen.OverloadHelper`. Up until recently it served several purposes, but now it is only used in order to pass down the binary function (as well as the eval context) for the cases when we fallback to the row-by-row computation. This commit takes advantage of this observation and cleans up the situation: now the helper is only passed when it is needed which allows us to remove a lot of redundant code. Additionally, the helper itself has been renamed from `OverloadHelper` to `BinaryOverloadHelper`. Release note: None build: Add PATH to .bazelrc for dev builds. Release note: none opt: intern provided physical properties This commit changes the provided physical properties to be referenced with a pointer inside bestProps, in preparation for adding more fields to physical.Provided. Release note: None opt: add Distribution physical property and Distribute enforcer This commit adds a new physical property called "Distribution" to expressions in the optimizer. Currently, Distribution is just the list of regions that are touched by the expression. This commit also adds a new enforcer called "Distribute", which enforces a particular Distribution on its input. This is similar to the way Sort enforces a particular Ordering. Currently, Distribute is only used to enforce that the results of a query end up in the same region as the gateway node. The Distribution property and Distribute enforcer will enable us to perform better costing of query plans in future commits. This commit adds a tiny cost for the Distribute enforcer, but otherwise does not yet take advantage of distribution for costing. In the future, we can use Distribution to better cost distributed joins and aggregations, and accurately cost the Distribute enforcer so as to avoid scanning data in remote regions if possible. This commit represents a departure from the approach taken in the earlier prototype in #43831. Instead of using a "neighborhood" abstraction to represent arbitrary collections of nodes, this commit sticks to the existing abstractions of region and locality. If we need a new abstration in the future, we should be able to simply modify the representation of Distribution. Additionally, this commit does not make an attempt to represent exactly how data is partitioned across regions (e.g., which column is the partitioning key, whether the data is hash or range partitioned, etc.); it only tracks *which* regions contain data. This greatly simplifies the implementation, and I don't think it will significantly reduce the accuracy of costing. Informs #47226 Release note: None cli: add mt cert create-tenant-signing command This key/cert will be used to generate session revival tokens. No release note since this is only intended to be used internally. Release note: None sql/sem/catid: make a low-level package which defines ID types I'm not sure where exactly in the tree this belongs. Release note: None sql/schemachanger/scpb: adopt catid Release note: None sql/catalog/catpb: add package below scpb and descpb This doesn't go all the way to being principled about what is in catpb and what is in descpb, but it pulls the relevant symbols referenced in scpb down into the new package. Release note: None kvserver: fix bug causing spurious rebalance attempts by the store rebalancer When rebalancing a range, the store rebalancer also tries to move the lease to the voter with the lowest QPS. However, previously, the store rebalancer was doing this even when it found no better replacement targets for the existing stores. This meant that we were redundantly transferring leases over to the coldest voter, even when we weren't rebalancing the range. This was likely contributing to the thrashing observed in https://github.com/cockroachdb/cockroach/issues/69817. Release note (bug fix): A bug that could previously cause redundant lease transfers has now been fixed. schemachanger: delete comments when dropping schemas Previously, the declarative schema changer did not clean comments for schema objects when dropping them. This was inadequate because the system.comment table would have rows for dropped schema objects left over when using the declarative schema changer versus the legacy schema changer. To address this, this patch will remove rows from the system.comment table for schema objects that are dropped. Release note: None sql: adopt CommentUpdater from declarative schema changer. Previously, there was duplicate code for adding/deleting comments on schema objects inside the legacy schema changer. So, each comment type would have similar code for setting up an upsert/delete statements. This patch adopts the CommentUpdater interface in the legacy schema changer, so that logic to update/delete comments can be shared with the declarative schema changer, and the simplify code in the legacy schema changer. Release note: None kvserver: don't use `ClearRange` point deletes with estimated MVCC stats `ClearRange` avoids dropping a Pebble range tombstone if the amount of data that's deleted is small (<=512 KB), instead dropping point deletions. It uses MVCC statistics to determine this. However, when clearing an entire range, it will rely on the existing range MVCC stats rather than computing them. These range statistics can be highly inaccurate -- in some cases so inaccurate that they even become negative. This in turn can cause `ClearRange` to submit a huge write batch, which gets rejected by Raft with `command too large`. This patch avoids dropping point deletes if the statistics are estimated (which is only the case when clearing an entire range). Alternatively, it could do a full stats recomputation in this case, but entire range deletions seem likely to be large and/or rare enough that dropping a range tombstone is fine. Release note (bug fix): Fixed a bug where deleting data via schema changes (e.g. when dropping an index or table) could fail with a "command too large" error. backup: add memory monitor to manifest loading Release note: none. sql: add assignment casts for UPSERTs Assignment casts are now added to query plans for upserts, including `UPSERT`, `INSERT .. ON CONFLICT DO NOTHING`, and `INSERT .. ON CONFLICT DO UPDATE ..` statements. Assignment casts are a more general form of the logic for rounding decimal values, so the use of `round_decimal_values` in mutations is no longer needed. This logic has been removed. Fixes #67083 There is no release note because the behavior of upserts should not change with this commit. Release note: None sql: add logic tests for assignment casts of ON UPDATE expressions Release note: None sql: schema changer not to validate shard column constraint Fixes #67613 The shard column constraint is created internally and should be automatically upheld. So not need to verify it when backfilling hash sharded indexes. Release not: None schemachanger: fix error messages for drop dependencies. 1) When a view depends on another view the message text generated was incorrect. 2) When a sequence is OWNED BY a table, it could be dropped even if there were other dependencies. 3) When RESTRICT was not specified DROP DATABASE would generate the wrong error. 4) If the database name is empty we would generate the wrong error. Release note: None sql: modify message for drop schema privilege errors Previously, the legacy schema changer used a less clear message for when a DROP SCHEMA failed due to a privilege error. This patch will improve the message generated to be more clear by using message from the declarative schema changer. The new message will have the form "must be owner of schema <schema name>". Release note: None schemachanger: avoid leased descriptors for event logging Previously, when fully resolving descriptor names, we would fetch leased descriptors which could cause the schema change transaction to get pushed out. This was inadequate because in certain scenarios we would perpetually retry transactions, since attempts to generate event log entries acquire leases pushing out the schema change transaction. To address this, this patch intentionally avoids leasing descriptors for event logging. Release note: None build: put regular file at bazel-out by default Release note: none. bulk,backup,import: add Write-at-now settings for RESTORE and IMPORT This adds a new arguemnt to BulkAdderOptions and MakeSSTBatcher control the WriteAtBatchTS field in the AddSSTableRequests it sends. This flag is then optionally set in RESTORE and IMPORT based on the the new hidden cluster settigs 'bulkio.restore_at_current_time.enabled' and 'bulkio.import_at_current_time.enabled' respectively. These settings default to off for now, for testing usage only as setting this flag is known to _significantly_ reduce the performance of those jobs until further work is done. Release note: none. sql/row: remove an allocation for Get responses ``` name old time/op new time/op delta IndexJoin/Cockroach-24 8.02ms ± 9% 7.92ms ± 2% ~ (p=1.000 n=9+10) name old alloc/op new alloc/op delta IndexJoin/Cockroach-24 1.68MB ± 1% 1.62MB ± 1% -3.83% (p=0.000 n=10+9) name old allocs/op new allocs/op delta IndexJoin/Cockroach-24 11.1k ± 1% 10.1k ± 0% -8.85% (p=0.000 n=9+9) ``` Release note: None kvstreamer: refactor memory tokens to reduce allocations ``` name old time/op new time/op delta IndexJoin/Cockroach-24 7.72ms ± 7% 7.51ms ± 3% -2.75% (p=0.001 n=10+9) name old alloc/op new alloc/op delta IndexJoin/Cockroach-24 1.61MB ± 1% 1.60MB ± 1% -0.91% (p=0.001 n=10+9) name old allocs/op new allocs/op delta IndexJoin/Cockroach-24 10.1k ± 1% 9.1k ± 1% -10.24% (p=0.000 n=10+10) ``` Release note: None spanconfig: drop 'experimental_' suffix from settings We're going to start to use this infrastructure for realsies soon. Release note: None spanconfig/testcluster: plumb in testing knobs for tenants We were previously using an empty one with no control at the caller to override specific values. This comes in handy when looking to control knobs for all tenants in these tests. Release note: None migrationsccl: disable reconciliation job creation in tests In a future commit we'll enable the span configs infrastructure by default for all crdb unit tests. Doing so will surface the need to have the reconciliation job disabled for the specific migration tests that assert on the contents of `system.span_configurations` (also written to by the reconciliation job/reconciler). Release note: None spanconfig/sqltranslator: deflake datadriven test This is a follow up to #73531, there we forgot to update package tests to also use consistent reads when looking up descriptors. Looking at the surrounding commentary in these datadriven files and comparing against the actual results, there was a mismatch (now no longer so). Release note: None sql: future-proof TestTruncatePreservesSplitPoints This test previously relied on range split decisions to happen near-instantaneously (especially for the single node variant). This was a fair assumption with the gossiped SystemConfigSpan, but is no longer true with the span configs infrastructure where (i) updated descriptors/zone configs make its way to `system.span_configurations` asynchronously, and (ii) KV learns about learns about `system.span_configurations` updates asynchronously. We update the test to be agnostic to either subsystem (tl;dr - throw in a SucceedsSoon block at the right places). Release note: None sql: future-proof TestScatterResponse This is Yet Another Test that made timing assumptions on how instantaneously range split decisions appear, assumptions that no longer hold under the span configs infrastructure. Adding compatibility is a matter of waiting for splits to appear instead of just expecting it. Release note: None kvserver: future-proof TestElectionAfterRestart This is Yet Another Test that made timing assumptions on how instantaneously range split decisions appear, assumptions that don't hold under the span configs infrastructure. Adding compatibility is a matter of waiting for splits to appear instead of only expecting it. Release note: None multiregionccl: future-proof TestEnsureLocalReadsOnGlobalTables This is Yet Another Test that made timing assumptions on how instantaneously range split decisions appear, assumptions that don't hold under the span configs infrastructure. Adding compatibility is a matter of waiting for splits to appear instead of only expecting it. Release note: None server: future-proof TestAdminAPIDatabaseDetails This is Yet Another Test that made timing assumptions on how instantaneously range split decisions appear, assumptions that don't hold under the span configs infrastructure. Adding compatibility is a matter of waiting for splits to appear instead of only expecting it. Release note: None changefeedccl: future-proof TestChangefeedProtectedTimestamps The low gc.ttlseconds in this test that applies to system.{descriptor,zones}, when run with span configs enabled (in a future commit), runs into errors introduced in #73086. The span configs infrastructure makes use of rangefeeds against these tables within the spanconfig.SQLWatcher process. These rangefeeds error out if the timestamp they're started with is already GC-ed, something that's very likely with low GC TTLs. To accommodate, we simply bump the TTL to a more reasonable 100s. Release note: None kvfollowerreadsccl: future-proof TestBoundedStalenessDataDriven This is Yet Another Test that made timing assumptions on how instantaneously range config decisions are applied, assumptions that don't hold under the span configs infrastructure. Adding compatibility is a matter of waiting for the right configs to appear instead of only expecting it. Release note: None changefeedccl: future-proof TestChangefeedBackfillCheckpoint This testing knob was added in #68374 but I'm not sure that it was necessary? Brief stress runs with an without this flag did not surface anything. In a future commit where we enable span configs by default, we'll actually rely on the reconciliation job running, so we get rid of this flag now. Release note: None liveness: future-proof TestNodeLivenessStatusMap Instead of using hooks that directly mutate the system config span, using SQL statements to tweak zone configs future proofs this test for compatibility with the span configs infrastructure. Release note: None sql: migrate has_database_privilege from evalPrivilegeCheck to ctx.Planner.HasPrivilege refs https://github.com/cockroachdb/cockroach/issues/66173 HasPrivilege is required to support WITH GRANT OPTION Release note: None roachtest: fix sequelize nightly Upstream changed how imports are done, so this library had to be updated. Release note: None kvstreamer: fix potential deadlock We have two different locks in the `Streamer` infrastructure: one is for the `Streamer` itself and another one for the `budget`. Previously, there was no contract about which mutex needs to be acquired first which led to the deadlock detector thinking that there is a potential deadlock situation. This is now fixed by requiring that the `budget`'s mutex is acquired first and by releasing the `Streamer`'s mutex in `Enqueue` early in order to not overlap with the interaction with the `budget`. I believe that it was a false positive (i.e. the deadlock cannot actually occur) because without the support of pipelining, `Enqueue` calls and asynchronous requests evaluation never overlap in time. Still, it's good to fix the order of mutex acquisitions. Release note: None sql: remove PHYSICAL scrub code The PHYSICAL scrub code is experimental and not considered production ready. It complicates a lot of code paths involved in normal query execution (it significantly overloads the semantics of TableReader and of the Fetcher) and is getting in the way of some improvements in how the fetchers work. In particular, we are trying to reduce the amount of information passed to TableReader/Fetcher (which in the non-scrubbing case should be a lot less than the full table descriptor). There are some proposals for a better design floating around, e.g. provide a facility for returning KVs as results from DistSQL and have some higher-level code run the scrub checks. This change removes the code for the PHYSICAL scrub for now. Release note (sql change): the experimental SCRUB PHYSICAL is no longer implemented. vars: add placeholder session variable for xmloption Release note (sql change): The `xmloption` session variable is now accepted, only taking in `content`. Note this does not do anything. clusterversion: improve a version comment Gets rid of a squiggly line in Goland. Release note: None kvserver: plumb in a context into (*Store).GetConfReader We'll use it in a future commit. Release note: None spanconfig/reconciler: export the checkpoint timestamp We'll make use of it in a future commit. Release note: None spanconfig: get rid of ReconciliationDependencies interface It was hollow, simply embedding the spanconfig.Reconciler interface. In a future commit we end up relying on each pod's span con…

rytaft · 2022-07-20T18:12:20Z

I just updated the issue description above to more closely represent the new reality as of July 2022. The first two tasks were completed in #74349.

rytaft · 2022-07-20T18:15:26Z

FYI @msirek, this issue gives an overview of the scope of work we were discussing yesterday

msirek · 2023-02-06T19:38:11Z

Notes:
Regarding

take into account the different localities in the input and output distributions and the latency between them.

If we measure latencies between regions at every cluster startup, we might get query plans which change too frequently.
Some choices are:

Measure latencies upon the first cluster startup and store them in a system table. When a node in a new region joins a cluster, detect that, and compute latencies between that region and all other regions.
Don't use precise latencies, but ranges of latencies. If latencies need to be remeasured, we don't want small differences between 2 measures to cause query plan changes. For example, ranges could be 0-5ms, 5-15ms, 15-60ms, 60-195ms, 195-600ms, above 600ms, the cost of the next range being triple that of the previous range.

rytaft added the A-sql-optimizer SQL logical planning and optimizations. label Apr 8, 2020

rytaft added this to the 20.2 milestone Apr 8, 2020

rytaft self-assigned this Apr 8, 2020

rytaft added this to Triage in BACKLOG, NO NEW ISSUES: SQL Optimizer via automation Apr 8, 2020

RaduBerinde moved this from Triage to Plan enhancements (higher priority) in BACKLOG, NO NEW ISSUES: SQL Optimizer Apr 18, 2020

This was referenced Jun 7, 2021

sql: SELECT queries are not utilizing indexes that complete faster with hints #42248

Closed

sql: Distribution awareness in the CBO/route local #41207

Closed

jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021

rytaft mentioned this issue Jul 9, 2021

Add DistSql Enforcer Expr in Memo #67416

Closed

rytaft added this to Triage in SQL Queries via automation Dec 30, 2021

rytaft mentioned this issue Dec 30, 2021

opt: add Distribution physical property and Distribute enforcer #74349

Merged

mgartner moved this from Triage to 22.1 December Milestone in SQL Queries Jan 4, 2022

rytaft moved this from 22.1 December Milestone to 22.1 January Milestone in SQL Queries Jan 5, 2022

rytaft moved this from 22.1 January Milestone to 22.1 February Milestone in SQL Queries Feb 2, 2022

rytaft moved this from 22.1 February Milestone to 22.1 March Milestone in SQL Queries Mar 7, 2022

rytaft moved this from 22.1 March Milestone to 22.2 Low Likelihood (10%) in SQL Queries Apr 7, 2022

rytaft moved this from 22.2 Low Likelihood (10%) to 22.2 Release in SQL Queries May 25, 2022

mgartner moved this from 22.2 Release to Backlog in SQL Queries Jul 7, 2022

rytaft mentioned this issue Jul 20, 2022

sql, opt: support broadcast join #84731

Open

rytaft removed their assignment Jul 20, 2022

DrewKimball mentioned this issue Aug 31, 2022

opt: locality-optimized search prevents plan distribution #87234

Open

rytaft moved this from Backlog (DO NOT ADD NEW ISSUES) to 23.2 Release in SQL Queries Mar 30, 2023

rytaft moved this from 23.2 Release to New Backlog in SQL Queries May 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

opt: make the optimizer "distribution aware" #47226

opt: make the optimizer "distribution aware" #47226

rytaft commented Apr 8, 2020 •

edited

rytaft commented Jul 20, 2022

rytaft commented Jul 20, 2022

msirek commented Feb 6, 2023

opt: make the optimizer "distribution aware" #47226

opt: make the optimizer "distribution aware" #47226

Comments

rytaft commented Apr 8, 2020 • edited

rytaft commented Jul 20, 2022

rytaft commented Jul 20, 2022

msirek commented Feb 6, 2023

rytaft commented Apr 8, 2020 •

edited