Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: make the optimizer "distribution aware" #47226

Open
2 of 8 tasks
rytaft opened this issue Apr 8, 2020 · 3 comments
Open
2 of 8 tasks

opt: make the optimizer "distribution aware" #47226

rytaft opened this issue Apr 8, 2020 · 3 comments
Labels
A-sql-optimizer SQL logical planning and optimizations. T-sql-queries SQL Queries Team
Milestone

Comments

@rytaft
Copy link
Collaborator

rytaft commented Apr 8, 2020

CockroachDB is a geo-distributed database, but the optimizer currently operates with the assumption that all data is located on the same physical machine. There are a couple of exceptions to this rule, such as the deprecated feature called "duplicated indexes". To use this feature, a user can create two (or more) indexes that have exactly the same columns but are pinned to different localities. For example, a user might choose to pin one index to a data center in the East coast of the US, and another identical index to a data center in the West coast. In this case, with all other things being equal, the optimizer will choose whichever index is closest to the gateway node of the SQL query.

There are lots of other cases, however, where the optimizer does not take data locality into account when it should. For example, the optimizer does not consider locality when deciding between two non-identical indexes (i.e., indexes with different columns). This can lead to suboptimal plans in many cases. The purpose of this issue is to track the work needed to make the optimizer aware of data distribution when planning all queries, not just those that use the duplicated indexes feature.

A few of the features we plan to add are:

  • Support a "distribution" physical property in the optimizer, which describes the localities (e.g., regions) that are touched by a given relational operator.
  • Support a "distribute" operator in the optimizer, which enforces a particular distribution by routing data from one set of localities to another set. This is similar to how the sort operator enforces a particular ordering. In the execution engine, "distribute" can either represent DistSQL routers or the lookups performed by the DistSender. If data is spread across localities, a "distribute" operator should be used to enforce that all results are ultimately returned to the gateway node.
  • Update the optimizer cost model to give a realistic cost to the "distribute" operator. This should take into account the different localities in the input and output distributions and the latency between them.
  • Since lookup and index joins rely on the DistSender to fetch data from remote nodes, update the cost model for lookup joins and index joins to include the distribution cost. This will require identifying which localities will be visited by the DistSender when fetching data and using a cost model similar to the "distribute" operator.
  • Since distributed hash and merge joins require data shuffling, update the cost model for those operators to include the distribution cost. To prevent the optimizer from always choosing a non-distributed join, account for the benefits of parallel computation in the cost of distributed joins.
  • Similarly, update the cost model for distributed hash group by and distinct operators to include the distribution cost. To prevent the optimizer from always choosing a non-distributed operation, account for the benefits of parallel computation.
  • Add transformation rules in the optimizer to support different partitioning schemes for distributed joins. For example, the optimizer should explore broadcast joins in addition to hash-partitioned joins (see sql, opt: support broadcast join #84731). The optimizer will be able to choose the best partitioning scheme by using the distributed cost model supported by the above steps.
  • Add more region-aware transformations in the optimizer that can leverage the distributed cost model. We already have transformations to support locality optimized search, but there are others we could add. For example, if a distributed join is spread across multiple regions and the partitioning column (e.g., crdb_region) is one of the join keys, it may be possible to hash-partition data within a region and perform the join phase before transferring any data across regions. In other words, the optimizer could transform a single join across regions into a union of joins, where each child of the union is a join within a single region. Another possible transformation where we turn a scan of a REGIONAL BY ROW table into a pk-fk join with a GLOBAL table is described in opt: infer lookup join equality conditions from unique key and foreign key #69617 (comment).

This is not an exhaustive list, but gives a sense of the scope of work.

gz#9256

Jira issue: CRDB-5025

@rytaft rytaft added the A-sql-optimizer SQL logical planning and optimizations. label Apr 8, 2020
@rytaft rytaft added this to the 20.2 milestone Apr 8, 2020
@rytaft rytaft self-assigned this Apr 8, 2020
@rytaft rytaft added this to Triage in BACKLOG, NO NEW ISSUES: SQL Optimizer via automation Apr 8, 2020
@RaduBerinde RaduBerinde moved this from Triage to Plan enhancements (higher priority) in BACKLOG, NO NEW ISSUES: SQL Optimizer Apr 18, 2020
@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@rytaft rytaft added this to Triage in SQL Queries via automation Dec 30, 2021
rytaft added a commit to rytaft/cockroach that referenced this issue Dec 30, 2021
This commit adds a new physical property called "Distribution" to
expressions in the optimizer. Currently, Distribution is just the
list of regions that are touched by the expression.

This commit also adds a new enforcer called "Distribute", which enforces
a particular Distribution on its input. This is similar to the way
Sort enforces a particular Ordering. Currently, Distribute is only
used to enforce that the results of a query end up in the same region
as the gateway node.

The Distribution property and Distribute enforcer will enable us to
perform better costing of query plans in future commits. This commit
adds a tiny cost for the Distribute enforcer, but otherwise does not
yet take advantage of distribution for costing. In the future, we can use
Distribution to better cost distributed joins and aggregations, and
accurately cost the Distribute enforcer so as to avoid scanning data
in remote regions if possible.

This commit represents a departure from the approach taken in the earlier
prototype in cockroachdb#43831. Instead of using a "neighborhood" abstraction to
represent arbitrary collections of nodes, this commit sticks to the
existing abstractions of region and locality. If we need a new abstration
in the future, we should be able to simply modify the representation of
Distribution.

Additionally, this commit does not make an attempt to represent exactly
how data is partitioned across regions; it only tracks *which* regions
contain data. This greatly simplifies the implementation, and I don't
think it will significantly reduce the accuracy of costing.

Informs cockroachdb#47226

Release note: None
rytaft added a commit to rytaft/cockroach that referenced this issue Jan 4, 2022
This commit adds a new physical property called "Distribution" to
expressions in the optimizer. Currently, Distribution is just the
list of regions that are touched by the expression.

This commit also adds a new enforcer called "Distribute", which enforces
a particular Distribution on its input. This is similar to the way
Sort enforces a particular Ordering. Currently, Distribute is only
used to enforce that the results of a query end up in the same region
as the gateway node.

The Distribution property and Distribute enforcer will enable us to
perform better costing of query plans in future commits. This commit
adds a tiny cost for the Distribute enforcer, but otherwise does not
yet take advantage of distribution for costing. In the future, we can use
Distribution to better cost distributed joins and aggregations, and
accurately cost the Distribute enforcer so as to avoid scanning data
in remote regions if possible.

This commit represents a departure from the approach taken in the earlier
prototype in cockroachdb#43831. Instead of using a "neighborhood" abstraction to
represent arbitrary collections of nodes, this commit sticks to the
existing abstractions of region and locality. If we need a new abstration
in the future, we should be able to simply modify the representation of
Distribution.

Additionally, this commit does not make an attempt to represent exactly
how data is partitioned across regions; it only tracks *which* regions
contain data. This greatly simplifies the implementation, and I don't
think it will significantly reduce the accuracy of costing.

Informs cockroachdb#47226

Release note: None
@mgartner mgartner moved this from Triage to 22.1 December Milestone in SQL Queries Jan 4, 2022
@rytaft rytaft moved this from 22.1 December Milestone to 22.1 January Milestone in SQL Queries Jan 5, 2022
rytaft added a commit to rytaft/cockroach that referenced this issue Jan 6, 2022
This commit adds a new physical property called "Distribution" to
expressions in the optimizer. Currently, Distribution is just the
list of regions that are touched by the expression.

This commit also adds a new enforcer called "Distribute", which enforces
a particular Distribution on its input. This is similar to the way
Sort enforces a particular Ordering. Currently, Distribute is only
used to enforce that the results of a query end up in the same region
as the gateway node.

The Distribution property and Distribute enforcer will enable us to
perform better costing of query plans in future commits. This commit
adds a tiny cost for the Distribute enforcer, but otherwise does not
yet take advantage of distribution for costing. In the future, we can use
Distribution to better cost distributed joins and aggregations, and
accurately cost the Distribute enforcer so as to avoid scanning data
in remote regions if possible.

This commit represents a departure from the approach taken in the earlier
prototype in cockroachdb#43831. Instead of using a "neighborhood" abstraction to
represent arbitrary collections of nodes, this commit sticks to the
existing abstractions of region and locality. If we need a new abstration
in the future, we should be able to simply modify the representation of
Distribution.

Additionally, this commit does not make an attempt to represent exactly
how data is partitioned across regions; it only tracks *which* regions
contain data. This greatly simplifies the implementation, and I don't
think it will significantly reduce the accuracy of costing.

Informs cockroachdb#47226

Release note: None
rytaft added a commit to rytaft/cockroach that referenced this issue Jan 12, 2022
This commit adds a new physical property called "Distribution" to
expressions in the optimizer. Currently, Distribution is just the
list of regions that are touched by the expression.

This commit also adds a new enforcer called "Distribute", which enforces
a particular Distribution on its input. This is similar to the way
Sort enforces a particular Ordering. Currently, Distribute is only
used to enforce that the results of a query end up in the same region
as the gateway node.

The Distribution property and Distribute enforcer will enable us to
perform better costing of query plans in future commits. This commit
adds a tiny cost for the Distribute enforcer, but otherwise does not
yet take advantage of distribution for costing. In the future, we can use
Distribution to better cost distributed joins and aggregations, and
accurately cost the Distribute enforcer so as to avoid scanning data
in remote regions if possible.

This commit represents a departure from the approach taken in the earlier
prototype in cockroachdb#43831. Instead of using a "neighborhood" abstraction to
represent arbitrary collections of nodes, this commit sticks to the
existing abstractions of region and locality. If we need a new abstration
in the future, we should be able to simply modify the representation of
Distribution.

Additionally, this commit does not make an attempt to represent exactly
how data is partitioned across regions (e.g., which column is the
partitioning key, whether the data is hash or range partitioned, etc.);
it only tracks *which* regions contain data. This greatly simplifies the
implementation, and I don't think it will significantly reduce the
accuracy of costing.

Informs cockroachdb#47226

Release note: None
craig bot pushed a commit that referenced this issue Jan 12, 2022
74349: opt: add Distribution physical property and Distribute enforcer r=rytaft a=rytaft

This commit adds a new physical property called `Distribution` to
expressions in the optimizer. Currently, `Distribution` is just the
list of regions that are touched by the expression.

This commit also adds a new enforcer called `Distribute`, which enforces
a particular `Distribution` on its input. This is similar to the way
`Sort` enforces a particular `Ordering`. Currently, `Distribute` is only
used to enforce that the results of a query end up in the same region
as the gateway node.

The `Distribution` property and `Distribute` enforcer will enable us to
perform better costing of query plans in future commits. This commit
adds a tiny cost for the `Distribute` enforcer, but otherwise does not
yet take advantage of distribution for costing. In the future, we can use
`Distribution` to better cost distributed joins and aggregations, and
accurately cost the `Distribute` enforcer so as to avoid scanning data
in remote regions if possible.

This commit represents a departure from the approach taken in the earlier
prototype in #43831. Instead of using a "neighborhood" abstraction to
represent arbitrary collections of nodes, this commit sticks to the
existing abstractions of region and locality. If we need a new abstration
in the future, we should be able to simply modify the representation of
`Distribution`.

Additionally, this commit does not make an attempt to represent exactly
how data is partitioned across regions (e.g., which column is the
partitioning key, whether the data is hash or range partitioned, etc.);
it only tracks *which* regions contain data. This greatly simplifies the
implementation, and I don't think it will significantly reduce the
accuracy of costing.

Informs #47226

Release note: None

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
RajivTS added a commit to RajivTS/cockroach that referenced this issue Jan 17, 2022
Updated planProjectionOperators and planSelectionOperators to include case for tree.NotExpr

Replaced RunTests with RunTestsWithoutAllNullsInjection

Fixes for build & lint issues

Removed unused field

Addressing PR review comments

backupccl: use SpanGroup.Sub instead of CoveringMerge

Release note: none.

rowenc: reduce reliance on the entire TableDescriptor

This change reduces uses of the entire TableDescriptor.

Release note: None

descpb: add and use NoColumnID constant

This change adds a `descpb.NoColumnID` constant so we don't have to
use `descpb.ColumnID(encoding.NoColumnID)` everywhere.

Release note: None

rowenc: remove DatumAlloc.scratch

This commit removes the `DatumAlloc.scratch` field which is always
nil.

Release note: None

sql: move `DatumAlloc` to sem/tree

This commit moves `DatumAlloc` from `sql/rowenc` to `sql/sem/tree`.
This is a fairly low-level construct and is not used exclusively for
row encoding/decoding.

Release note: None

rowenc: move low-level key encoding functions to subpackage

The `rowenc` package contains a hefty mix of functions operating at
different conceptual levels - some use higher level constructs like
table and index descriptors, others are lower level utilities. The
naming of these functions doesn't help, for example `EncodeTableKey`
sounds like a top-level function but is actually a low level utility
that appends a single value to a key

This commit moves the lower level utilities for encoding datum values
into keys to the `rowenc/keyside` package.

Release note: None

rowenc: minor cleanup of array decoding code

The code around array decoding was confusing; we now have a single
`decodeArray` variant which is the inverse of `encodeArray`.

Release note: None

rowenc: move low-level value encoding functions to subpackage

This commit moves the lower level utilities for encoding datum values
into values to the `rowenc/valueside` package.

Release note: None

valueside: introduce ColumnIDDelta type

When encoding multiple values, we encode the differences between the
ColumnIDs. This difference is passed to `valueside.Encode`, but it is
passed as a `descpb.ColumnID`. This commit introduces a new type to
make this distinction more evident.

Release note: None

valueside: add Decoder helper

We have a few copies of the same logic of decoding multiple SQL values
from a Value, in particular in range feed handling code.

This commit adds a helper type that centralizes this logic, reducing
duplication.

Note that the query executione engines retain their own specialized
implementations.

Release note: None

sql: release BatchFlowCoordinator objects to pool

This commit adds `BatchFlowCoordinator` objects to their `vectorizedFlowCreator`'s
`releasables` slice, which ensures that their `Release` method is called and that
they are properly recycled.

Before this change, heap profiles and instrumentation both indicated that these
objects were not being recycled as intended. For example, heap profiles looked like:
```
      Type: alloc_objects
Time: Dec 30, 2021 at 2:10pm (EST)
Active filters:
   focus=Pool\).Get
Showing nodes accounting for 92189, 0.57% of 16223048 total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                             71505 79.69% |   github.com/cockroachdb/cockroach/pkg/sql/colflow.NewBatchFlowCoordinator /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:218
                                             10923 12.17% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.newLockTableGuardImpl /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table.go:452
                                              2048  2.28% |   github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree.(*Map).maybeInitialize /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree/map.go:112
                                              1170  1.30% |   github.com/cockroachdb/cockroach/pkg/sql/colexec.newMaterializerInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:179
                                              1030  1.15% |   github.com/cockroachdb/pebble.(*Iterator).Clone /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:1228
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/sql.MakeDistSQLReceiver /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:841
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/sql/colfetcher.NewColBatchScan /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colfetcher/colbatch_scan.go:223
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/sql/span.MakeBuilder /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/span/span_builder.go:61
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/storage.mvccGet /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:859
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/storage.mvccScanToBytes /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2357
                                               128  0.14% |   github.com/cockroachdb/pebble.(*DB).newIterInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:765
         0     0%     0%      89729  0.55%                | sync.(*Pool).Get /Users/nathan/Go/go/src/sync/pool.go:148
```
Notice that `sync.(*Pool).Get` is responsible for *0.55%* of heap allocations and
**79.69%** of these are from `colflow.NewBatchFlowCoordinator`.

After this change, that source of allocations goes away and we see the following
impact on micro-benchmarks:
```
name                    old time/op    new time/op    delta
KV/Scan/SQL/rows=1-10     95.1µs ± 7%    95.9µs ± 5%    ~     (p=0.579 n=10+10)
KV/Scan/SQL/rows=10-10     100µs ± 3%     103µs ±12%    ~     (p=0.829 n=8+10)

name                    old alloc/op   new alloc/op   delta
KV/Scan/SQL/rows=10-10    21.7kB ± 0%    21.5kB ± 0%  -0.76%  (p=0.000 n=10+10)
KV/Scan/SQL/rows=1-10     20.1kB ± 0%    19.9kB ± 0%  -0.70%  (p=0.000 n=10+9)

name                    old allocs/op  new allocs/op  delta
KV/Scan/SQL/rows=1-10        245 ± 0%       244 ± 0%  -0.41%  (p=0.000 n=10+10)
KV/Scan/SQL/rows=10-10       280 ± 0%       279 ± 0%  -0.36%  (p=0.001 n=8+9)
```

sql/catalog: restore fast-path in FullIndexColumnIDs

This commit restores a [fast-path](https://github.com/cockroachdb/cockroach/commit/c9e116e586f24c5f3a831ac653f14fd03f588b93#diff-19625608f4a6e23e6fe0818f3a621e716615765cb338d18fe34b43f0a535f06dL140)
in `FullIndexColumnIDs` which was lost in c9e116e. The fast-path avoided
the allocation of a `ColumnID` slice and a `IndexDescriptor_Direction`
slice in `FullIndexColumnIDs` when given a unique index. In such cases,
these slices are already stored on the `IndexDescriptor`.

```
name                   old time/op    new time/op    delta
KV/Scan/SQL/rows=1-10    94.9µs ±10%    94.9µs ± 8%    ~     (p=0.739 n=10+10)

name                   old alloc/op   new alloc/op   delta
KV/Scan/SQL/rows=1-10    20.1kB ± 0%    20.1kB ± 1%    ~     (p=0.424 n=10+10)

name                   old allocs/op  new allocs/op  delta
KV/Scan/SQL/rows=1-10       245 ± 0%       241 ± 0%  -1.63%  (p=0.000 n=10+8)
```

kv: protect Replica's lastToReplica and lastFromReplica fields with raftMu

This commit moves the Replica's lastToReplica and lastFromReplica from
under the `Replica.mu` mutex to the `Replica.raftMu` mutex. These are
strictly Raft-specific pieces of state, so we don't need fine-grained
locking around them. As a reward, we don't need to grab the `Replica.mu`
exclusively (or at all) when setting the fields in
`Store.withReplicaForRequest`.

The locking in `setLastReplicaDescriptors` showed up in a mutex profile
under a write-heavy workload. It was responsible for **3.44%** of mutex
wait time. Grabbing the mutex was probably also slowing down request
processing, as the exclusive lock acquisition had to wait for read locks
to be dropped.

coldata: operate on Nulls value, not reference

This commit changes `col.Vec.SetNulls` to accept a `Nulls` struct by
value instead of by pointer. This lets us avoid a heap allocation on
each call to `Nulls.Or`.

Releaes note: None

roachtest: fix silly bug in tlp roachtest

I thought that we'd get some basic coverage of roachtests in unit tests
at low duration or something, but I guess not.

Release note: None

sql: support readonly default_with_oids var

This is just one less error when importing PGDUMP.

Release note (sql change): We now support `default_with_oids` which
only accepts being false.

importccl,jobs: truncate all job system errors

In #73303 we truncated the row data in this error message to prevent
problems with large row data preventing the job status from being
saved.

However, truncating the row means that our
"experimental_save_rejected" option does not work as expected. This
feature previously saved entire rows and now it would have truncated
rows in some case.

Now, we only truncate the error when producing the error
string. Further, we also add truncation to the job system to prevent
other parts of the system from saving too much data to the job system.

Release note: None

importccl: remove unused argument

Release note: None

sql/logictest: add diff support to expectation output

This adds support for generating a unified diff for expectation
mismatches. This can make it easier to spot the differences between
two results when the result contains many rows.

For example:

```
    ../../sql/logictest/testdata/logic_test/system_namespace:54: SELECT * FROM system.namespace
    expected:
        0   0   defaultdb                        50
        0   0   postgres                         51
        0   0   system                           1
        0   0   test                             52
        1   0   public                           29
        1   29  comments                         24
        1   29  database_role_settings           44
        1   29  descriptor                       3
        1   29  descriptor_id_seq                7
        1   29  eventlog                         12
        1   29  jobs                             15
        1   29  join_tokens                      41
        1   29  lease                            11
        1   29  locations                        21
        1   29  migrations                       40
        1   29  namespace                        30
        1   29  protected_ts_meta                31
        1   29  protected_ts_records             32
        1   29  rangelog                         13
        1   29  replication_constraint_stats     25
        1   29  replication_critical_localities  2
        1   29  replication_stats                27
        1   29  reports_meta                     28
        1   29  role_members                     23
        1   29  role_options                     33
        1   29  scheduled_jobs                   37
        1   29  settings                         6
        1   29  sql_instances                    46
        1   29  sqlliveness                      39
        1   29  statement_bundle_chunks          34
        1   29  statement_diagnostics            36
        1   29  statement_diagnostics_requests   35
        1   29  statement_statistics             42
        1   29  table_statistics                 20
        1   29  transaction_statistics           43
        1   29  ui                               14
        1   29  users                            4
        1   29  web_sessions                     19
        1   29  zones                            5
        50  0   public                           29
        51  0   public                           29
        52  0   public                           29

    but found (query options: "rowsort" -> ignore the following ordering of rows) :
        0   0   defaultdb                        50
        0   0   postgres                         51
        0   0   system                           1
        0   0   test                             52
        1   0   public                           29
        1   29  comments                         24
        1   29  database_role_settings           44
        1   29  descriptor                       3
        1   29  descriptor_id_seq                7
        1   29  eventlog                         12
        1   29  jobs                             15
        1   29  join_tokens                      41
        1   29  lease                            11
        1   29  locations                        21
        1   29  migrations                       40
        1   29  namespace                        30
        1   29  protected_ts_meta                31
        1   29  protected_ts_records             32
        1   29  rangelog                         13
        1   29  replication_constraint_stats     25
        1   29  replication_critical_localities  26
        1   29  replication_stats                27
        1   29  reports_meta                     28
        1   29  role_members                     23
        1   29  role_options                     33
        1   29  scheduled_jobs                   37
        1   29  settings                         6
        1   29  sql_instances                    46
        1   29  sqlliveness                      39
        1   29  statement_bundle_chunks          34
        1   29  statement_diagnostics            36
        1   29  statement_diagnostics_requests   35
        1   29  statement_statistics             42
        1   29  table_statistics                 20
        1   29  transaction_statistics           43
        1   29  ui                               14
        1   29  users                            4
        1   29  web_sessions                     19
        1   29  zones                            5
        50  0   public                           29
        51  0   public                           29
        52  0   public                           29

    Diff:
    --- Expected
    +++ Actual
    @@ -20,3 +20,3 @@
         1   29  replication_constraint_stats     25
    -    1   29  replication_critical_localities  2
    +    1   29  replication_critical_localities  26
         1   29  replication_stats                27
logic.go:3340:
    ../../sql/logictest/testdata/logic_test/system_namespace:100: error while processing
```

The diff output is only added when the `-show-diff` flag is provided
because in some cases it can produce more noise than it is worth.

The diff library used here is the same one already used by the testify
libraries we depend on.

Release note: None

teamcity-trigger: give `kvserver` package higher stress timeout

Closes https://github.com/cockroachdb/cockroach/issues/69519.

Release note: None

backupccl: allow columns of type array in EXPORT PARQUET

Previously, EXPORT PARQUET only supported CRDB relations whose columns were
scalars of type int, string, float, or  boolean. This change allows columns of
type array whose values can be int, string, float, boolean. Following the CRDB
ARRAY documentation, a value in an exported array can be NULL as well.

Informs: #67710

Release note (sql change): EXPORT PARQUET can export columns of type array

backupccl: remove old makeImportSpans

Release note: none.

backupccl: expand some comments on covering code

Release note: none.

security: make the bcrypt cost configurable

Release note (security update): For context, when configuring
passwords for SQL users, if the client presents the password in
cleartext via ALTER/CREATE USER/ROLE WITH PASSWORD, CockroachDB is
responsible for hashing this password before storing it.
By default, this hashing uses CockroachDB's bespoke `crdb-bcrypt`
algorithm, itself based off the standard Bcrypt algorithm.

The cost of this hashing function is now configurable via the new
cluster setting
`server.user_login.password_hashes.default_cost.crdb_bcrypt`.
Its default value is 10, which corresponds to an approximate
password check latency of 50-100ms on modern hardware.

This value should be increased over time to reflect improvements to
CPU performance: the latency should not become so small that it
becomes feasible to bruteforce passwords via repeated login attempts.

Future versions of CockroachDB will likely update the default accordingly.

security: make `server.user_login.min_password_length` visible in doc gen

This cluster setting was meant to be exported for visibility in
auto-generated docs (we've documented it before). This was an oversight.

Release note: None

authors: add Fenil Patel to authors

Release note: None

bench: force the index join in BenchmarkIndexJoin

Without using the index hint, we now choose to perform the full scan
over the primary index and put the filter on top of that rather than
performing a limited scan of the secondary followed by an index join.
I confirmed that this is the case on 21.1 and 21.2 binaries, so I'll
backport to the latter (in order to make the comparison between releases
sane).

Release note: None

dev: a few improvements to the script

* Add `set -e` so if the `dev` build fails, it doesn't try to run the
  binary anyway.
* Add a way to force recompilation of `dev` (useful for developers).

Release note: None

changefeedccl: Shutdown tenant before main server.

Shutdown tenant server before stopping main test server.
This elliminates some of the error messages we see in the test
output when tenant attempts to connect to some ranges which are no
longer accessible.

Release Notes: None

importccl: fix import pgdump target column bug

Previously, if a COPY FROM statement had less columns than
the CREATE TABLE schema defined in the dump file,
 we would get a nil pointer exception. This is because
we were not filling the non-targeted columns with a NULL datum.
This change fixes that and aligns behvaiour with how INSERT
handles non-targeted columns.

Release note (bug fix): IMPORT TABLE ... PGDUMP with a
COPY FROM statement in the dump file that has less target columns
than the CREATE TABLE schema definition would result in a
nil pointer exception.

roachtest: add new passing tests for pgjdbc nightly

Release note: None

go.mod: fix ordering

The first `require` is for direct dependencies. The second one for
indirect dependencies. There's no need for a third one.

Release note: None

Makefile: specify that code needs to be generated for `roach{test,prod}`

Release note: None

setting: clean up slot indexes

The internal setting code passes around slot indexes as bare integers,
and there are confusingly two variants: one is 1-indexed, another is
0-indexed.

This change makes all slot indexes 0-indexed and adds a `slotIdx`
type.

Note: the 1-indexed choice was perhaps useful at the time to help find
code paths where the slot index is not initialized, but now the slot
index is set along with other mandatory fields by `init()`.

Release note: None

setting: prevent use of SystemOnly settings from tenant code

This change implements the `system-only` semantics in the RFC
(#73349).

All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.

Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.

build: fix Pebble nightly benchmarks

The Pebble nightly benchmarks stopped running due to a build error
stemming from the removal of the generated code from the codebase.

Release note: None

scpb: move Node to screl, introduce scpb.TargetState

This refactoring commit moves scpb.Node to screl and groups the targets
into a new scpb.TargetState. Separating the quasi-constant target
elements and statuses from the elements' intermediate statuses
ultimately makes for cleaner code in the declarative schema changer.

Release note: None

scplan: make child packages internal, including scgraph and scgraphviz

This commit moves scgraph and scgraphviz under scplan, and makes all its
child packages internal.

Release note: None

<pkg>: <short description - lowercase, no final period>

authors: add gtr to authors

Release note: None

sql: format statements for the UI

Previously, statements sent to the UI were not formatted, making them difficult
to read. With this change, statements are now formatted using a builtin
function that prettifies statements (using existing pretty-printing logic).
Statements are formatted using pre-determined pretty-printing configuration
options mentioned in this issue: #71544.

Resolves: #71544

Release note (sql change): statements are now formatted prior to being send to
the UI, this is done using a new builtin function that formats statements.

ui: added formatting to statements on the details pages

Previously, statements displayed on the statement/transaction/index details
pages were not formatted.  Formatting was added to allow for better readability
of statements on these detail pages.  Requested statements are now formatted on
the server using the existing pretty-printing logic. Statements returned from
the statement handler are now formatted.

Formatting is done via a new builtin function 'prettify_statement', a wrapper
function over the existing pretty-printing logic.

Resolves: #71544

Release note (ui change): added formatting to statements on the statement,
transaction and index details pages.

roachpb: InternalServer API for tenant settings

This commit introduces the API that the tenants will use to obtain and
list for updates to tenant setting overrides. The API was designed to
allow for maximum flexibility on the server side so that it can be
implemented as a range feed without any extra state (if necessary).

Release note: None

schemachanger: fully qualify object names inside event log entries

Previously, the original SQL was displayed inside the declarative schema
changers event logs, which was both missing redactions and full name resolutions.
This was inadequate because the existing schema changer always fully resolved
missing names within its entries. To address this, this patch adds new metadata
that contains the fully resolved and redacted text.

Release note: None

schemachanger: full resolve names inside the AST for event logs and errors

Previously, the declarative schema changer left the AST
untouched when resolving names. This was inadequate because
both event log entry generation and some error messages generated
expect the fully resolved names inside the AST. To address this,
this patch adds support for copying and altering the AST including
support for adding annotations. Additionally, all resolved names
are now propagated back into the AST and extra validation is introduced
to make sure that no unresolved names are left.

Release note: None

schemachanger: add statement tag to event log entries

Previously, the tag field inside the event log entries
generated by the declarative schema changer were empty.
This was inadequate because the existing schema changer
always populated these fields. To address this, this
patch will now populate the statement tag field inside
the element metadata and into event log entries.

Release note: None

rfc: token-based authentication for SQL session revival

Release note: None

vendor: upgrade cockroachdb/apd to v3

This commit picks up the following changes to `cockroachdb/apd`:
- https://github.com/cockroachdb/apd/pull/103
- https://github.com/cockroachdb/apd/pull/104
- https://github.com/cockroachdb/apd/pull/107
- https://github.com/cockroachdb/apd/pull/108
- https://github.com/cockroachdb/apd/pull/109
- https://github.com/cockroachdb/apd/pull/110
- https://github.com/cockroachdb/apd/pull/111

Release note (performance improvement): The memory representation of
DECIMAL datums has been optimized to save space, avoid heap allocations,
and eliminate indirection. This increases the speed of DECIMAL arithmetic
and aggregation by up to 20% on large data sets.

sql/sem: remove EvalContext.getTmpDec

`apd.Decimal` can now be entirely stack allocated during arithmetic, so
there's no longer any need for this.

With https://github.com/cockroachdb/apd/pull/104, this does not introduce
any new heap allocations:
```
➜ (cd pkg/sql/sem/tree && goescape . | grep moved | wc -l)
     328
```

scripts: add sgrep

@aliher1911 showed me this handy awk trick to filter goroutines [here].

I find myself reaching for this frequently, but it always takes me a
minute or two to get it right. This helper will make it a lot easier
and will perhaps enable many others to reap the benefits as well.

[here]: https://github.com/cockroachdb/cockroach/pull/66761#pullrequestreview-691529838

Release note: None

sqlproxyccl: better test output

This moves the logging output to files and enables inspection of the
authentication results in logs crdb-side.

Release note: None

storage/metamorphic: update clearRangeOp to not conflict with in-flight writes

This change updates clearRangeOp to only run on key spans
that do not have any transactional writes in-progress. This more
closely matches behaviour in KV and avoids cases where we'd trample on
intents and other in-flight keys. Also makes a similar adjustment
to mvccClearTimeRangeOp.

Fixes #72476. Speculatively. Reproduction was nearly impossible.

Release note: None.

Revert "roachpb: change `Lease.String()`/`SafeFormat()` to support refs."

This reverts commit 33fe9d15b2660cc40d255a3d448d88bb35c2c39d.

storage,roachpb: Add Key field to WriteTooOldError

Currently, WriteTooOldErrors that happen as part of a
ranged put operation (eg. MVCCClearTimeRange) can be very
opaque, as we don't know what key we tried to "write too old"
on. This change addresses that by adding a Key field to WriteTooOldError
to store one of the keys the error pertains to.

Informs #72476.

Release note: None.

dev: enumerate explicit list of all generated `.go` files when hoisting

We should be able to `find _bazel/bin/go_path -type f -name '*.go'` and
list the generated files that way, but bazelbuild/rules_go#3041 is in
the way and I can't figure out how to resolve it. Instead we have to do
things the hard way: run `bazel aquery` and parse out the paths to all
the generated files. In this way we avoid accidentally hoisting out
stale symlinks from a previous build. If bazelbuild/rules_go#3041 is
resolved upstream then this change can be reverted.

Release note: None

sql: public schema long running migration

Release note: None

sql: insert missing public schema namespace entry

When restoring a database, a namespace entry for the public
schema was not created.

Release note: None

authors: add msirek to authors

Release note: None

sql: Add hints to CREATE/ALTER table errors for MR

Add some hints when trying to create (or alter to) a multi-region table which
indicate that if the database is not multi-region enabled, it should be
converted to a multi-region enabled database via a "ALTER DATABASE ... SET
PRIMARY REGION <region>" statement.

Release note: None

catalog: add Index*Columns methods to table descriptor

These methods return the columns referenced in indexes as
[]catalog.Column slices.

Release note: None

catalog: add Index*ColumnDirections methods to table descriptor

This complements the previous commit which added Index*Column methods to
catalog.TableDescriptor.

Release note: None

catalog: remove FullIndexColumnIDs

Calls to this function can now be replaced with recently-added
IndexFullColumns and IndexFullColumnDirections method calls on the table
descriptor.

Release note: None

tabledesc: improve memory-efficiency of Index*Column* methods

This commit removes IndexCompositeColumns (which wasn't actually used
and is unlikely to ever be) and generates the indexColumnCache with less
memory allocations. The current scheme is causing
a statistically-significant performance regression.

Release note: None

sql: allow the 1PC optimization in some cases in the extended protocol

A previous commit (7e2cbf51869fc326974a5665db80da8b29422631) fixed our
pgwire implementation so that it does not auto-commit a statement
executed in the extended protocol until a Sync message is received. That
change also had the undesired effect of disabling the 1PC ("insert fast
path") optimization that is critical for write-heavy workloads.

With this current commit, the 1PC optimization is allowed again, as long
as the statement execution is immediately followed by a Sync message.
This still has the correct bugfix semantics, but allows the optimization
for the common case of how the extended protocol is used.

No release note since this only affects unreleased versions.

Release note: None

execgen: remove temporary decimals from the helper

This commit removes two temporary decimals from `execgen.OverloadHelper`
since after the upgrade of the decimal library it can use
stack-allocated temporary objects without them escaping to the heap.

Release note: None

colexec: batch allocate datums for projectTupleOp

This has a profound impact on the amount of garbage generated by the delivery
query in TPC-C.

```
name                    old time/op    new time/op    delta
TPCC/mix=delivery=1-16    38.0ms ± 2%    35.8ms ± 1%   -5.76%  (p=0.000 n=9+8)

name                    old alloc/op   new alloc/op   delta
TPCC/mix=delivery=1-16    8.17MB ± 1%    7.97MB ± 1%   -2.36%  (p=0.000 n=9+10)

name                    old allocs/op  new allocs/op  delta
TPCC/mix=delivery=1-16     80.2k ± 0%     20.3k ± 1%  -74.65%  (p=0.000 n=10+9)
```

Leveraged https://github.com/cockroachdb/cockroach/pull/74443 to find this.

Release note: None

kvserver/loqrecovery: check full key coverage in quorum recovery

Previously when doing unsafe replica recovery, if some ranges are
missing or represented by stale replicas that were split or merged,
recovery will change cluster to inconsistent state with gaps or
overlaps in keyspace.
This change adds checks for range completeness as well as adds a
preference for replicas with higher range applied index.

Release note: None

sql: fix enum hydration in distsql expression evaluation

Fixes: #74442

Previously in some circumstances we could fail to hydrate enum types
used in join predicate expressions and possibly other situations. Now
types used in ExprHelper are always hydrated during Init phase when a
distsql type resolver is being used. Also add a test case for the lookup
semi join repro case.

Release note (bug fix): Fix panic's possible in some distributed queries
using enum's in join predicates.

backupccl: add libgeos to BUILD.bazel

This is required when using randgen so that libgeos can be initialized
when a geometric type is generated.

Fixes #73895

Release note: None

sql: refactor pg_builtin to use actual grant options

Refactor builtins.priv -> privilege.Privilege.

Replace privilege.Kind with privilege.Privilege in functions that need
access to privilege.Privilege.GrantOption.

Release note: None

sql: refactor pg_builtin to use actual grant options

The builtins has_table_privilege, has_column_privilege,
has_any_column_privilege now use privileges.Priv.GrantOption instead
of privileges.Kind.GRANT.

Release note: None

dev: introduce common benchmarking flags

Specifically:
  --bench-mem (to print out memory allocations);
  --bench-time (to control how long each benchmark is run for);
  --count (how many times to run it, also added to `dev test`);
  -v and --show-logs (similar to `dev test`)

We also add supports for args-after-double-dash-are-for-bazel within
`dev bench`. This commit is light on testing (read: there isn't any), so
doesn't bump DEV_VERSION to roll it out to everyone just yet.

Release note: None

execgen: skip encoding/decoding JSON when hashing it

Previously, in order to hash a JSON object we would encode and decode it
(due to the behavior of `coldata.JSONs.Get` and the setup of the
execgen). This commit refactors how the hash functions are handled in
the execgen which allows us to peek inside of `coldata.Bytes` (that is
under the `coldata.JSONs`) in order to get direct access to the
underlying `[]byte`. This should be a minor performance improvement but
also allows us to remove the need for the overload helper in the hashing
context.

It is possible (although I'm not certain) that the hash computation is
now different, so this commit bumps the DistSQL versions to be safe.

Release note: None

colexechash: fix BCEs in some cases

Previously, we were not getting (and not asserting) some of the bounds
check eliminations in `rehash` function because we were accessing
`.Sliceable` property in the wrong context. This is now fixed (by
accessing it via `.Global`) as well as assertions are added correctly.
Additionally, this commit pulls out the call to the cancel checker out
of the templated block to reduce the amount of code duplication.

Release note: None

execinfra: prefer leased table descriptors

This commit makes better use of the table descriptor protos in the
DistSQL specs by first checking if they've already been leased. If so,
then we use those instead of re-generating catalog.TableDescriptor
instances.

This has a statistically-significant impact on memory allocations, as
illustrated with this microbenchmark which I ran on my development
machine:

    name                    old time/op    new time/op    delta
    KV/Scan/SQL/rows=1-16      190µs ± 7%     184µs ± 4%   -3.60%  (p=0.016 n=10+8)
    KV/Scan/SQL/rows=10-16     196µs ± 4%     198µs ±12%     ~     (p=0.762 n=8+10)

    name                    old alloc/op   new alloc/op   delta
    KV/Scan/SQL/rows=1-16     19.5kB ± 1%    17.4kB ± 1%  -11.12%  (p=0.000 n=9+10)
    KV/Scan/SQL/rows=10-16    21.0kB ± 1%    18.9kB ± 1%  -10.20%  (p=0.000 n=10+10)

    name                    old allocs/op  new allocs/op  delta
    KV/Scan/SQL/rows=1-16        222 ± 0%       210 ± 1%   -5.59%  (p=0.000 n=7+10)
    KV/Scan/SQL/rows=10-16       256 ± 0%       244 ± 0%   -4.84%  (p=0.000 n=10+7)

This change opens us to the possibility of no longer shipping the whole
table descriptor proto in the DistSQL spec.

Release note: None

sql: add missing error check in GetHydratedZoneConfigForNamedZone

Closes #74606

Release note: None

sql: rename pg_cast_provolatile_dump.csv to pg_cast_dump.csv

Release note: None

sql: remove castInfo

`castMap` now contains all volatility information and `castInfo` is no
longer needed. `castInfo` has been removed.

For backwards compatibility, some invalid casts between VARBIT and
integer types have been added to `castMap`. These casts have unexpected
behavior before and after this commit. They are not supported by
Postgres.  They should be disallowed in the future.
See #74577 and 74580

Casts between tuple and string types are now dynamically handled in
`lookupCast` support casts with named record types that have different
OIDs than `oid.T_record`.

Casts from OID and REG* types to INT2 are now allowed to maintain
backward compatibility.

Release note: None

sql: update pg_cast_dump.csv

This commit updates the `pg_cast_dump.csv` file with new rows for casts
from the JSONB type (OID 3802). It also adds the Postgres version as a
column to annotate the version of Postgres that the CSV was generated
from. This is important because the `pg_cast` table can change from
version to version. This CSV was generated from Postgres version 13.5.

Release note: None

sql: consistently order pg_cast_dump.csv

Release note: None

sql: check cast contexts against Postgres's pg_cast table

This commit includes the `pg_cast.castcontext` column in
`pg_cast_dump.csv` and uses it to validate the `maxContext` field in
`castMap`. Note that Postgres's `pg_cast` table does not include all
possible casts, such as automatic I/O conversions, so this test is not
comprehensive.

Release note: None

ci: bazelize nightly pebble ycsb, write-throughput benchmarks

We still have to take care of the metamorphic nightly.

Part of #67335.

Release note: None

authors: add bardin to authors

Release note: None

opt: fix corner case with lookup joins and provided orderings

This commit fixes a case where the lookup join was passing through a
provided ordering with unnecessary columns. This was caused by
imperfect FDs at the join level such that the ordering cannot be
simplified at the join level but it can be simplified at the level of
its input.

Note that the case causes an internal error in test builds but there
are no known examples of user-visible problems in non-test builds
(hence no release note).

Fixes #73968.

Release note: None

rpc: fix span use-after-Finish in internalClientAdapter

The internalClientAdapter performs some local "RPCs" by directly calling
the server method, without going through gRPC. The streaming RPC calls
are made on a different goroutine, and these goroutines were using the
callers tracing span. These goroutines could outlive the caller's span,
resulting in a use-after-Finish crash. This patch fixes them by creating
dedicated RPC spans, mimicking what our gRPC interceptors do.

Fixes #74326

Release note: None

vendor: Bump pebble to 3d0ff924d13a3d5fdf6e56a391c5c178c18ff196

Changes pulled in:

```
3d0ff924d13a3d5fdf6e56a391c5c178c18ff196 *: Add trySeekUsingNext optimization to SeekGE
0c503048eb0365981929177c30178add8a56ae3e sstable: add (*sstable.Writer).RangeKey{Set,Unset,Delete} methods
fe52b49cc28df62dce9b00c382a5ce217936be56 tool/logs: aggregate compaction logs by node and store ID
8ab4358bc59dfa62e5e34e4b0e5ce81a68f5fe91 sstable: return err from `(BlockPropertyCollector).FinishTable`
91c18ef0ee999980c2869d11e5ce468410acbe8d internal/keyspan: add FragmentIterator interface
953fdb078ff0585489206ae96e1d80ca9f6f90c7 internal/keyspan: implement SetBounds on Iter
aa376a819bf67cd6766ee827feed4bf0bd508f1f tool: add compaction log event aggregation tool
```

Release note: None.

storage: Use optimized SeekGE in CheckSSTConflicts

This nearly reverts #73514 by moving back to calling SeekGE on
the engine to skip past any empty spans on either the engine
or the SSTable. This is the more optimal approach on average,
and given optimizations in cockroachdb/pebble#1412 which this
change depends on, it also ends up performing better than
a SeekPrefixGE-driven appraoch and the pre-#73514 approach.

Improvement when running BenchmarkCheckSSTConflicts
against the pre-#73514 revision (vs. this one):

```
name                                                                      old time/op  new time/op  delta
CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=false-24      72.6µs ±11%  66.3µs ± 1%   -8.67%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=true-24       12.2ms ± 1%   1.7ms ± 1%  -86.41%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=false-24     69.8µs ± 2%  67.4µs ± 1%   -3.48%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=true-24      13.3ms ± 3%   2.8ms ± 1%  -78.97%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=false-24     75.8µs ± 3%  63.8µs ± 1%  -15.86%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=true-24      13.0ms ± 1%   1.9ms ± 1%  -85.11%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=false-24    69.8µs ±11%  64.6µs ± 1%   -7.45%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=true-24     14.8ms ± 9%   3.1ms ± 2%  -79.05%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=false-24     66.1µs ± 2%  63.7µs ± 1%   -3.65%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=true-24      14.2ms ± 9%   1.9ms ± 1%  -86.55%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=false-24    72.3µs ±10%  64.5µs ± 0%  -10.77%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=true-24      122ms ± 2%    17ms ± 1%  -86.03%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=false-24    69.0µs ± 9%  62.4µs ± 1%   -9.57%  (p=0.032 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=true-24     14.0ms ± 1%   2.3ms ± 2%  -83.46%  (p=0.016 n=4+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=false-24   69.4µs ± 9%  62.7µs ± 1%   -9.63%  (p=0.016 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=true-24     140ms ± 5%    26ms ± 1%  -81.70%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=false-24    69.2µs ±10%  62.5µs ± 1%   -9.66%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=true-24     15.3ms ±11%   2.3ms ± 1%  -85.21%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=false-24   69.7µs ±12%  63.6µs ± 1%     ~     (p=0.095 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=true-24     148ms ± 6%    28ms ± 2%  -80.90%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=false-24   67.1µs ±10%  61.1µs ± 2%   -8.93%  (p=0.016 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=true-24    14.4ms ± 2%   2.5ms ± 5%  -82.45%  (p=0.016 n=4+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=false-24  68.9µs ±21%  62.2µs ± 1%   -9.76%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=true-24    204ms ±14%    42ms ± 5%  -79.44%  (p=0.008 n=5+5)
```

Fixes #66410.

Release note: None.

execinfrapb: add a helper for index joins based on the JoinReaderSpec

Release note: None

rowexec: refactor the joinReader to not exceed the batch size

The joinReader operates by buffering the input rows until a certain size
limit (which is dependent on the strategy). Previously, the buffering
would stop right after the size limit is reached or exceeded, and this
commit refactors the code to not exceed the limit except in a case of
a single large row. This is what we already do for vectorized index
joins.

Release note: None

sql,kv: introduce Streamer API and use it for index joins in some cases

This commit introduces the Streamer API (see
https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210617_index_lookups_memory_limits.md)
as well as its implementation for the simplest case - when requests are
unique and can be served in any order. It additionally hooks up the
implementation to be used by the index joins in both execution engines.

There are three main pieces that this commit adds:
1. the `Streamer` struct itself. It is largely the same as described in
the RFC. Some notable changes are:
- `Cancel` is renamed to `Close` and is made blocking to ensure that all
goroutines kicked off by the `Streamer` exit before `Close` returns.
- `Shrink` has been removed from the `budget` struct (see below for
more details).
- furthermore, the `budget` interface has been unexported and the
`streamer` has been tightly coupled with the `budget`'s implementation.
- the TODO about collecting DistSQL metadata is removed because we are
collecting the LeafTxnFinalState already when using the LeafTxn.
2. the limited `Streamer` implementation - only `OutOfOrder` mode is
supported when the requests are unique. Notably, buffering and caching
of the results is not implemented yet.
3. `TxnKVStreamer` component that sits below the SQL fetchers, uses the
`Streamer`, and is an adapter from `BatchResponse`s to key/value pairs
that fetchers understand. Although not used at the moment,
`TxnKVStreamer` is written under the assumption that a single result can
satisfy multiple requests.

The memory budget of the `Streamer` is utilized lazily. The RFC was
calling for making a large reservation upfront and then shrinking the
budget if we see that we don't need that large reservation; however,
while working on this I realized that lazy reservations are a better fit
for this. The `Streamer` can reserve up to the specified limit
(determined by `distsql_workmem` variable) against the root monitor (in
the degenerate case of a single large row more memory will be reserved).
The reservation then never shrinks under the assumption that if the
reservation has gotten large, it means it was needed for higher
concurrency (or large responses), and it is likely to be needed for the
same reasons in the future.

The layout of the main components of the `Streamer` implementation:
- in `Enqueue` we have a logic similar to what DistSender does in order
to split each request (that might span multiple ranges) into
single-range requests. Those sub-requests are batched together to be
evaluated by a single `BatchRequest`.
- `workerCoordinator.mainLoop` is responsible for taking single-range
batches, estimating the corresponding response size, and issuing
requests to be evaluated in parallel while adhering to the provided
memory budget.
- `workerCoordinator.performRequestAsync` is responsible for populating
the `BatchRequest` and then processing the results while updating the
memory budget.

Current known limitations that will be addressed in the follow-up work:
- at the moment a single row can be split across multiple BatchResponses
when TargetBytes limit is reached when the table has multiple column
families; therefore, we use the streamer only for single column family
cases. We will expand the KV API shortly to not split the rows in
multiple column family cases.
- manual refresh of spans when `ReadWithinUncertaintyIntervalError` is
encountered by a single streamer in a single flow is not implemented. It
is an optimization that is considered a must for the final
implementation in order to not regress on simple cases in terms of
retriable errors. This will be implemented shortly as a follow-up.
- I'm thinking that eventually we probably want to disable the batch
splitting done by the DistSender to eliminate unnecessary blocking when
the streamer's splitting was incorrect. This would give us some
performance improvements in face of range boundary changes, but it
doesn't seem important enough for the initial implementation.

Release note: None

sql: check equivalent constraint when creating hash index

Fixes #68031

Previously we only try to create constraint for shard column if
it's newly created. We check duplicate constraint for shard
column when `Alter Primary Key` and `Create Index`, however the
check is simply a name check. This pr adds logic to check
equivalent constraint by checking if the formatted expression
string is the same. With this logic we can try to create the
constraint no matter if a shard column is newly created or not.
With this fix, we also don't need to expose the constraint through
`SHOW CREATE TABLE` result since we make sure the constraint is
created or skipped if one already exists.

Release note (sql change): Before this change, the check constraint
on shard column used by hash sharded index was printed in the
corresponding `SHOW CREATE TABLE`. The constraint had been shown
because cockroach lacked logic to ensure that shard columns which
are part of hash sharded indexs always have the check constraint
which the optimizer relies on to achieve properly optimized plans
on hash sharded indexes. We no longer display this constraint in
`SHOW CREATE TABLE` as it is now implied by the `USING HASH` clause
on the relevant index.

colexec: clean up the usage of the binary overload helper

Previously, for projection and selection operators we would always pass
in `execgen.OverloadHelper`. Up until recently it served several
purposes, but now it is only used in order to pass down the binary
function (as well as the eval context) for the cases when we fallback
to the row-by-row computation.

This commit takes advantage of this observation and cleans up the
situation: now the helper is only passed when it is needed which allows
us to remove a lot of redundant code. Additionally, the helper itself
has been renamed from `OverloadHelper` to `BinaryOverloadHelper`.

Release note: None

build: Add PATH to .bazelrc for dev builds.

Release note: none

opt: intern provided physical properties

This commit changes the provided physical properties to be referenced
with a pointer inside bestProps, in preparation for adding more fields
to physical.Provided.

Release note: None

opt: add Distribution physical property and Distribute enforcer

This commit adds a new physical property called "Distribution" to
expressions in the optimizer. Currently, Distribution is just the
list of regions that are touched by the expression.

This commit also adds a new enforcer called "Distribute", which enforces
a particular Distribution on its input. This is similar to the way
Sort enforces a particular Ordering. Currently, Distribute is only
used to enforce that the results of a query end up in the same region
as the gateway node.

The Distribution property and Distribute enforcer will enable us to
perform better costing of query plans in future commits. This commit
adds a tiny cost for the Distribute enforcer, but otherwise does not
yet take advantage of distribution for costing. In the future, we can use
Distribution to better cost distributed joins and aggregations, and
accurately cost the Distribute enforcer so as to avoid scanning data
in remote regions if possible.

This commit represents a departure from the approach taken in the earlier
prototype in #43831. Instead of using a "neighborhood" abstraction to
represent arbitrary collections of nodes, this commit sticks to the
existing abstractions of region and locality. If we need a new abstration
in the future, we should be able to simply modify the representation of
Distribution.

Additionally, this commit does not make an attempt to represent exactly
how data is partitioned across regions (e.g., which column is the
partitioning key, whether the data is hash or range partitioned, etc.);
it only tracks *which* regions contain data. This greatly simplifies the
implementation, and I don't think it will significantly reduce the
accuracy of costing.

Informs #47226

Release note: None

cli: add mt cert create-tenant-signing command

This key/cert will be used to generate session revival tokens.

No release note since this is only intended to be used internally.

Release note: None

sql/sem/catid: make a low-level package which defines ID types

I'm not sure where exactly in the tree this belongs.

Release note: None

sql/schemachanger/scpb: adopt catid

Release note: None

sql/catalog/catpb: add package below scpb and descpb

This doesn't go all the way to being principled about what is in catpb
and what is in descpb, but it pulls the relevant symbols referenced in
scpb down into the new package.

Release note: None

kvserver: fix bug causing spurious rebalance attempts by the store rebalancer

When rebalancing a range, the store rebalancer also tries to move the lease to
the voter with the lowest QPS. However, previously, the store rebalancer was
doing this even when it found no better replacement targets for the existing
stores. This meant that we were redundantly transferring leases over to the
coldest voter, even when we weren't rebalancing the range. This was likely
contributing to the thrashing observed in
https://github.com/cockroachdb/cockroach/issues/69817.

Release note (bug fix): A bug that could previously cause redundant lease
transfers has now been fixed.

schemachanger: delete comments when dropping schemas

Previously, the declarative schema changer did not clean
comments for schema objects when dropping them. This was
inadequate because the system.comment table would have
rows for dropped schema objects left over when using
the declarative schema changer versus the legacy schema
changer. To address this, this patch will remove rows
from the system.comment table for schema objects that
are dropped.

Release note: None

sql: adopt CommentUpdater from declarative schema changer.

Previously, there was duplicate code for adding/deleting comments
on schema objects inside the legacy schema changer. So, each
comment type would have similar code for setting up an
upsert/delete statements. This patch adopts the CommentUpdater
interface in the legacy schema changer, so that logic to
update/delete comments can be shared with the declarative
schema changer, and the simplify code in the legacy schema
changer.

Release note: None

kvserver: don't use `ClearRange` point deletes with estimated MVCC stats

`ClearRange` avoids dropping a Pebble range tombstone if the amount of
data that's deleted is small (<=512 KB), instead dropping point
deletions. It uses MVCC statistics to determine this. However, when
clearing an entire range, it will rely on the existing range MVCC stats
rather than computing them.

These range statistics can be highly inaccurate -- in some cases so
inaccurate that they even become negative. This in turn can cause
`ClearRange` to submit a huge write batch, which gets rejected by Raft
with `command too large`.

This patch avoids dropping point deletes if the statistics are estimated
(which is only the case when clearing an entire range). Alternatively,
it could do a full stats recomputation in this case, but entire range
deletions seem likely to be large and/or rare enough that dropping a
range tombstone is fine.

Release note (bug fix): Fixed a bug where deleting data via schema
changes (e.g. when dropping an index or table) could fail with a
"command too large" error.

backup: add memory monitor to manifest loading

Release note: none.

sql: add assignment casts for UPSERTs

Assignment casts are now added to query plans for upserts, including
`UPSERT`, `INSERT .. ON CONFLICT DO NOTHING`, and
`INSERT .. ON CONFLICT DO UPDATE ..` statements.

Assignment casts are a more general form of the logic for rounding
decimal values, so the use of `round_decimal_values` in mutations is no
longer needed. This logic has been removed.

Fixes #67083

There is no release note because the behavior of upserts should not
change with this commit.

Release note: None

sql: add logic tests for assignment casts of ON UPDATE expressions

Release note: None

sql: schema changer not to validate shard column constraint

Fixes #67613
The shard column constraint is created internally and should
be automatically upheld. So not need to verify it when backfilling
hash sharded indexes.

Release not: None

schemachanger: fix error messages for drop dependencies.

1) When a view depends on another view the message text generated was
   incorrect.
2) When a sequence is OWNED BY a table, it could be dropped even
   if there were other dependencies.
3) When RESTRICT was not specified DROP DATABASE would generate the wrong
   error.
4) If the database name is empty we would generate the wrong error.

Release note: None

sql: modify message for drop schema privilege errors

Previously, the legacy schema changer used a less clear message
for when a DROP SCHEMA failed due to a privilege error. This
patch will improve the message generated to be more clear
by using  message from the declarative schema changer.
The new message will have the form "must be owner
of schema <schema name>".

Release note: None

schemachanger: avoid leased descriptors for event logging

Previously, when fully resolving descriptor names, we would
fetch leased descriptors which could cause the schema change
transaction to get pushed out. This was inadequate because in
certain scenarios we would perpetually retry transactions,
since attempts to generate event log entries acquire leases
pushing out the schema change transaction. To address this,
this patch intentionally avoids leasing descriptors for event logging.

Release note: None

build: put regular file at bazel-out by default

Release note: none.

bulk,backup,import: add Write-at-now settings for RESTORE and IMPORT

This adds a new arguemnt to BulkAdderOptions and MakeSSTBatcher control
the WriteAtBatchTS field in the AddSSTableRequests it sends.

This flag is then optionally set in RESTORE and IMPORT based on the the
new hidden cluster settigs 'bulkio.restore_at_current_time.enabled' and
'bulkio.import_at_current_time.enabled' respectively.

These settings default to off for now, for testing usage only as setting
this flag is known to _significantly_ reduce the performance of those
jobs until further work is done.

Release note: none.

sql/row: remove an allocation for Get responses

```
name                    old time/op    new time/op    delta
IndexJoin/Cockroach-24    8.02ms ± 9%    7.92ms ± 2%    ~     (p=1.000 n=9+10)

name                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24    1.68MB ± 1%    1.62MB ± 1%  -3.83%  (p=0.000 n=10+9)

name                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24     11.1k ± 1%     10.1k ± 0%  -8.85%  (p=0.000 n=9+9)
```

Release note: None

kvstreamer: refactor memory tokens to reduce allocations

```
name                    old time/op    new time/op    delta
IndexJoin/Cockroach-24    7.72ms ± 7%    7.51ms ± 3%   -2.75%  (p=0.001 n=10+9)

name                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24    1.61MB ± 1%    1.60MB ± 1%   -0.91%  (p=0.001 n=10+9)

name                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24     10.1k ± 1%      9.1k ± 1%  -10.24%  (p=0.000 n=10+10)
```

Release note: None

spanconfig: drop 'experimental_' suffix from settings

We're going to start to use this infrastructure for realsies soon.

Release note: None

spanconfig/testcluster: plumb in testing knobs for tenants

We were previously using an empty one with no control at the caller to
override specific values. This comes in handy when looking to control
knobs for all tenants in these tests.

Release note: None

migrationsccl: disable reconciliation job creation in tests

In a future commit we'll enable the span configs infrastructure by
default for all crdb unit tests. Doing so will surface the need to have
the reconciliation job disabled for the specific migration tests that
assert on the contents of `system.span_configurations` (also written to
by the reconciliation job/reconciler).

Release note: None

spanconfig/sqltranslator: deflake datadriven test

This is a follow up to #73531, there we forgot to update package tests
to also use consistent reads when looking up descriptors. Looking at the
surrounding commentary in these datadriven files and comparing against
the actual results, there was a mismatch (now no longer so).

Release note: None

sql: future-proof TestTruncatePreservesSplitPoints

This test previously relied on range split decisions to happen
near-instantaneously (especially for the single node variant). This was
a fair assumption with the gossiped SystemConfigSpan, but is no longer
true with the span configs infrastructure where
(i)  updated descriptors/zone configs make its way to
     `system.span_configurations` asynchronously, and
(ii) KV learns about learns about `system.span_configurations` updates
     asynchronously.

We update the test to be agnostic to either subsystem (tl;dr - throw in
a SucceedsSoon block at the right places).

Release note: None

sql: future-proof TestScatterResponse

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that no longer
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of just expecting it.

Release note: None

kvserver: future-proof TestElectionAfterRestart

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.

Release note: None

multiregionccl: future-proof TestEnsureLocalReadsOnGlobalTables

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.

Release note: None

server: future-proof TestAdminAPIDatabaseDetails

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.

Release note: None

changefeedccl: future-proof TestChangefeedProtectedTimestamps

The low gc.ttlseconds in this test that applies to
system.{descriptor,zones}, when run with span configs enabled (in a
future commit), runs into errors introduced in #73086. The span configs
infrastructure makes use of rangefeeds against these tables within the
spanconfig.SQLWatcher process. These rangefeeds error out if the
timestamp they're started with is already GC-ed, something that's very
likely with low GC TTLs. To accommodate, we simply bump the TTL to a
more reasonable 100s.

Release note: None

kvfollowerreadsccl: future-proof TestBoundedStalenessDataDriven

This is Yet Another Test that made timing assumptions on how
instantaneously range config decisions are applied, assumptions that
don't hold under the span configs infrastructure. Adding compatibility
is a matter of waiting for the right configs to appear instead of only
expecting it.

Release note: None

changefeedccl: future-proof TestChangefeedBackfillCheckpoint

This testing knob was added in #68374 but I'm not sure that it was
necessary? Brief stress runs with an without this flag did not surface
anything. In a future commit where we enable span configs by default,
we'll actually rely on the reconciliation job running, so we get rid of
this flag now.

Release note: None

liveness: future-proof TestNodeLivenessStatusMap

Instead of using hooks that directly mutate the system config span,
using SQL statements to tweak zone configs future proofs this test for
compatibility with the span configs infrastructure.

Release note: None

sql: migrate has_database_privilege from evalPrivilegeCheck to
ctx.Planner.HasPrivilege

refs https://github.com/cockroachdb/cockroach/issues/66173

HasPrivilege is required to support WITH GRANT OPTION

Release note: None

roachtest: fix sequelize nightly

Upstream changed how imports are done, so this library had to be
updated.

Release note: None

kvstreamer: fix potential deadlock

We have two different locks in the `Streamer` infrastructure: one is for
the `Streamer` itself and another one for the `budget`. Previously,
there was no contract about which mutex needs to be acquired first which
led to the deadlock detector thinking that there is a potential deadlock
situation. This is now fixed by requiring that the `budget`'s mutex is
acquired first and by releasing the `Streamer`'s mutex in `Enqueue`
early in order to not overlap with the interaction with the `budget`.

I believe that it was a false positive (i.e. the deadlock cannot
actually occur) because without the support of pipelining, `Enqueue`
calls and asynchronous requests evaluation never overlap in time.
Still, it's good to fix the order of mutex acquisitions.

Release note: None

sql: remove PHYSICAL scrub code

The PHYSICAL scrub code is experimental and not considered production
ready. It complicates a lot of code paths involved in normal query
execution (it significantly overloads the semantics of TableReader and
of the Fetcher) and is getting in the way of some improvements in how
the fetchers work. In particular, we are trying to reduce the amount
of information passed to TableReader/Fetcher (which in the
non-scrubbing case should be a lot less than the full table
descriptor).

There are some proposals for a better design floating around, e.g.
provide a facility for returning KVs as results from DistSQL and have
some higher-level code run the scrub checks.

This change removes the code for the PHYSICAL scrub for now.

Release note (sql change): the experimental SCRUB PHYSICAL is no
longer implemented.

vars: add placeholder session variable for xmloption

Release note (sql change): The `xmloption` session variable is now
accepted, only taking in `content`. Note this does not do anything.

clusterversion: improve a version comment

Gets rid of a squiggly line in Goland.

Release note: None

kvserver: plumb in a context into (*Store).GetConfReader

We'll use it in a future commit.

Release note: None

spanconfig/reconciler: export the checkpoint timestamp

We'll make use of it in a future commit.

Release note: None

spanconfig: get rid of ReconciliationDependencies interface

It was hollow, simply embedding the spanconfig.Reconciler interface. In
a future commit we end up relying on each pod's span con…
@rytaft rytaft moved this from 22.1 January Milestone to 22.1 February Milestone in SQL Queries Feb 2, 2022
@rytaft rytaft moved this from 22.1 February Milestone to 22.1 March Milestone in SQL Queries Mar 7, 2022
@rytaft rytaft moved this from 22.1 March Milestone to 22.2 Low Likelihood (10%) in SQL Queries Apr 7, 2022
@rytaft rytaft moved this from 22.2 Low Likelihood (10%) to 22.2 Release in SQL Queries May 25, 2022
@mgartner mgartner moved this from 22.2 Release to Backlog in SQL Queries Jul 7, 2022
@rytaft
Copy link
Collaborator Author

rytaft commented Jul 20, 2022

I just updated the issue description above to more closely represent the new reality as of July 2022. The first two tasks were completed in #74349.

@rytaft rytaft removed their assignment Jul 20, 2022
@rytaft
Copy link
Collaborator Author

rytaft commented Jul 20, 2022

FYI @msirek, this issue gives an overview of the scope of work we were discussing yesterday

@msirek
Copy link
Contributor

msirek commented Feb 6, 2023

Notes:
Regarding

take into account the different localities in the input and output distributions and the latency between them.

If we measure latencies between regions at every cluster startup, we might get query plans which change too frequently.
Some choices are:

  • Measure latencies upon the first cluster startup and store them in a system table. When a node in a new region joins a cluster, detect that, and compute latencies between that region and all other regions.
  • Don't use precise latencies, but ranges of latencies. If latencies need to be remeasured, we don't want small differences between 2 measures to cause query plan changes. For example, ranges could be 0-5ms, 5-15ms, 15-60ms, 60-195ms, 195-600ms, above 600ms, the cost of the next range being triple that of the previous range.

@rytaft rytaft moved this from Backlog (DO NOT ADD NEW ISSUES) to 23.2 Release in SQL Queries Mar 30, 2023
@rytaft rytaft moved this from 23.2 Release to New Backlog in SQL Queries May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-optimizer SQL logical planning and optimizations. T-sql-queries SQL Queries Team
Projects
BACKLOG, NO NEW ISSUES: SQL Optimizer
Plan enhancements (higher priority)
Status: Backlog
SQL Queries
New Backlog
Development

No branches or pull requests

3 participants