- Start Date: 2017-09-08
- Authors: Peter Mattis, Radu Berinde
- RFC PR: [#18399](https://github.com/cockroachdb/cockroach/pull/18399)
- Cockroach Issue: (one or more # from the issue tracker)

# Summary

This RFC describes the motivations and mechanisms for collecting
statistics for use in powering SQL optimization decisions. The
potential performance impact is very large as the decisions an
optimizer makes can provide orders of magnitude speed-ups to queries.

# Motivation

Modern SQL optimizers seek the lowest cost for a query. The cost for a
query is usually related to time, but is not directly expressed in
units of time. For example, the cost might be in units of disk seeks
or RPCs or some unnamed unit related to I/O. A cost model estimates
the cost for a particular query plan and the optimizer seeks to find
the lowest cost plan from among many alternatives. The inputs to the
cost model are the query plan and statistics about the table data
that can guide selection between alternatives.

One example of where statistics guide query optimization is in join
ordering. Consider the natural join:

`SELECT * FROM a NATURAL JOIN b`

In the absence of other opportunities, this might be implemented as a
hash join. With a hash join, we want to load the smaller set of rows
(either from `a` or `b`) into the hash table and then query that table
while looping through the larger set of rows. How do we know whether
`a` or `b` is larger? We keep statistics about the cardinality of `a`
and `b`.

Simple table cardinality is sufficient for the above query but fails
in other queries. Consider:

`SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y > 10`

Table statistics might indicate that `a` contains 10x more data than
`b`, but the predicate `a.y > 10` filters out a chunk of the
table. What we care about is whether the scan of `a` after filtering
returns more rows than the scan of `b`. This can be estimated by
determining the selectivity of the predicate `a.y > 10` and then
multiplying that selectivity by the cardinality of `a`. The common
technique for estimating selectivity is to collect a histogram on
`a.y` (prior to running the query).

A final example is the query:

`SELECT * FROM system.rangelog WHERE rangeID = ? ORDER BY timestamp DESC LIMIT 100`

Currently, `system.rangelog` does not have an index on `rangeID`, but
does have an index (the primary key) on `timestamp`. This query is
planned so that it sequentially walks through the ranges in descending
timestamp order until it fills up the limit. If there are a large
number of ranges in the system, the selectivity of `rangeID = ?` will
be low and we can estimate how many ranges we'll need to query in
order to find 100 rows. Rather than walking through the ranges
sequentially, we can query batches of ranges concurrently, where the
size of the batches is calculated from the expected number of matches
per range.

These examples are simple and clear. The academic literature is
littered with additional details. How do you estimate the selectivity
of multiple conjunctive or disjunctive predicates? How do you estimate
the number of rows output from various operators such as joins and
aggregations? Some of these questions are beyond the scope of this
RFC, which is focused on the collection of basic statistics.

# Terminology

* Selectivity: The number of values that pass a predicate divided by
  the total number of values. The selectivity multiplied by the
  cardinality can be used to compute the total number of values after
  applying the predicate to a set of values. The selectivity is
  interesting independent of the number of values because
  selectivities for multiple predicates can be combined.

* Histogram: Histograms are the classic data structure for computing
  selectivity. One kind that is frequently used is the equi-depth
  histogram, where each bucket contains approximately (or exactly) the
  same number of values. More complex histograms exist that aim to
  provide more accuracy for various data distributions, e.g. "max-diff"
  histograms. A simple one-pass algorithm exists for constructing an
  equi-depth histogram from ordered values. A histogram for unordered
  values can be constructed by first sampling the data (e.g. using
  reservoir sampling), sorting the sample, and then using the one-pass
  algorithm on the sorted data (see the sketch after this list).

* Cardinality: the number of distinct values (on a single column, or
  on a group of columns).

* Sketch: Used for cardinality estimation. Algorithms such as
  [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
  use a hash of the value to estimate the number of distinct values.
  Sketches can be merged, allowing for parallel computation. For
  example, the sketches for distinct values on different ranges can be
  combined into a single sketch for an entire table.
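
To make the one-pass construction concrete, here is a minimal sketch in
Go (illustrative only; the type and function names are hypothetical and
values are simplified to `int64`) that builds an equi-depth histogram
from a sorted sample:

```
package stats

import "sort"

// histBucket is a hypothetical in-memory bucket: numEq counts sample values
// equal to upperBound, numRange counts the remaining values in the bucket.
type histBucket struct {
	upperBound int64
	numEq      int64
	numRange   int64
}

// equiDepthHistogram builds an equi-depth histogram from a sample of values,
// targeting roughly len(sample)/numBuckets values per bucket.
func equiDepthHistogram(sample []int64, numBuckets int) []histBucket {
	if len(sample) == 0 || numBuckets <= 0 {
		return nil
	}
	sort.Slice(sample, func(i, j int) bool { return sample[i] < sample[j] })
	perBucket := (len(sample) + numBuckets - 1) / numBuckets
	var buckets []histBucket
	for i := 0; i < len(sample); {
		// Tentative end of this bucket.
		end := i + perBucket
		if end > len(sample) {
			end = len(sample)
		}
		// Extend the bucket so equal values never straddle a boundary.
		ub := sample[end-1]
		for end < len(sample) && sample[end] == ub {
			end++
		}
		b := histBucket{upperBound: ub}
		for j := i; j < end; j++ {
			if sample[j] == ub {
				b.numEq++
			} else {
				b.numRange++
			}
		}
		buckets = append(buckets, b)
		i = end
	}
	return buckets
}
```

Keeping equal values inside a single bucket means each bucket carries an
exact count for its upper bound, which lines up with the
`eq_rows`/`range_rows` split in the histogram format described later in
this RFC.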

# Design considerations

* Statistics need to be available on every node performing query
  optimization (i.e. every node receiving SQL queries).

* Statistics collection needs to be low overhead but not necessarily
  the lowest overhead. Statistics do not need to be updated in real
  time.

* Statistics collection should be decoupled from consumption. A simple
  initial implementation of statistics collection should not preclude
  more complex collection in the future.

* The desired statistics to collect are:
  - a histogram for a given column
  - the count of distinct values for a column or group of columns

* The count of distinct values for a column or group of columns can be
  computed using a sketch algorithm such as
  [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). Such
  sketches can be merged, making it feasible to collect a sketch at the
  range level and combine the sketches to create a table-level sketch.

* Optimizing `AS OF SYSTEM TIME` queries implies being able to
  retrieve statistics at a point in time. The statistics do not need
  to be perfectly accurate, but should be relatively close. On the
  other hand, historical queries are relatively rare right now.

# Design overview

This RFC focuses on the infrastructure and does not set a policy for
what stats to collect or when to collect them; these are discussed
under future work.

A single node per table will act as the central coordinator for
gathering table-level stats. This node sets up a DistSQL operation
that involves reading the primary index and sampling rows (as well as
computing sketches). While performed at a specific timestamp, the
table read operations explicitly skip the timestamp cache in order to
avoid disturbing foreground traffic.

[TBD: which node acts as the stats coordinator for a table? Presumably
there is some sort of lease based on the table ID. Can possibly reuse
or copy whatever mechanism is used for schema changes.]

A new system table provides storage for stats:
`system.table_statistics`. This table contains the count of distinct
values, NULL values, and the histogram buckets for a column/index.

During query optimization, the stats for a table are retrieved from a
per-node stats cache. The cache is populated on demand from the
`system.table_statistics` table and refreshed periodically.

Pseudo-stats are used whenever actual statistics are unavailable. For
example, the selectivity of a less-than or greater-than predicate is
estimated as 1/3. The academic literature provides additional
heuristics to use when stats are unavailable.

An Admin UI endpoint is provided for visualizing the statistics for a
table.

## Detailed design

### `system.table_statistics`

The table-level statistics are stored in a new
`system.table_statistics` table. This table contains a row per
attribute or group of attributes that we maintain stats for. The
schema is:

```
CREATE TABLE system.table_statistics (
  table_id INT,
  statistic_id INT,
  column_ids INT[],
  created_at TIMESTAMP,
  row_count INT,
  cardinality INT,
  null_values INT,
  histogram BYTES,
  PRIMARY KEY (table_id, statistic_id)
)
```

Each table has zero or more rows in the `table_statistics` table,
corresponding to the histograms maintained for the table.

* `table_id` is the ID of the table.
* `statistic_id` is an ID that identifies each particular
  statistic for a table.
* `column_ids` stores the IDs of the columns for which the statistic
  is generated.
* `created_at` is the time at which the statistic was generated.
* `row_count` is the total number of rows in the table.
* `cardinality` is the estimated cardinality (on all columns).
* `null_values` is the number of rows that have a NULL on any
  of the columns in `column_ids`; these rows don't contribute
  to the cardinality.
* `histogram` is optional and can only be set if there is a single
  column; it encodes a proto defined as:

```
message HistogramData {
  message Bucket {
    // The estimated number of rows that are equal to upper_bound.
    optional int64 eq_rows;

    // The estimated number of rows in the bucket (excluding those
    // that are equal to upper_bound). Splitting the count into two
    // makes the histogram effectively equivalent to a histogram with
    // twice as many buckets, with every other bucket containing a
    // single value. This might be particularly advantageous if the
    // histogram algorithm makes sure the top "heavy hitters" (most
    // frequent elements) are bucket boundaries (similar to a
    // compressed histogram).
    optional int64 range_rows;

    // The upper boundary of the bucket. The column values for the
    // upper bound are encoded using the same ordered encoding used
    // for index keys.
    optional bytes upper_bound;
  }

  // Histogram buckets. Note that NULL values are excluded from the
  // histogram.
  repeated Bucket buckets;
}
```

[TBD: using a proto for the histogram data is convenient but makes
ad-hoc querying of the stats data more difficult. We could store the
buckets in a series of parallel SQL arrays. Or we could store them in
a separate table.]

### Estimating selectivity

Given a predicate such as `a > 1` and a histogram on `a`, how do we
compute the selectivity of the predicate? For range predicates such as
`>` and `<`, we determine which bucket the value lies in, interpolate
within the bucket to estimate the fraction of the bucket values that
match the predicate (assuming a uniform data distribution), and perform
a straightforward calculation to determine the fraction of values in
the entire histogram that the predicate will allow through.

An equality predicate deserves special mention as it is common and
somewhat more challenging to estimate the selectivity for. The count
of distinct values in a bucket is assumed to have a uniform
distribution across the bucket. For example, let's consider a
histogram on `a` where the bucket containing the value `1` has a
`count` of `100` and a `distinct_count` of `2`. This means there were
`100` rows in the table with values for `a` that fell in the bucket,
but only `2` distinct values of `a`. We estimate the result of `a = 1`
as `100 / 2 == 50` rows (an additional division by the table
cardinality can provide a fraction). In general, the estimate for
equality is `bucket_count / bucket_distinct_count`.
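
As an illustration only (not the proposed implementation), a
range-predicate estimate over already-decoded `int64` buckets might look
like the following; `lowerBound` is a hypothetical extra piece of
metadata marking the minimum value covered by the first bucket:

```
package stats

// bucket mirrors the HistogramData.Bucket fields above, with the upper
// bound already decoded to an int64 for illustration.
type bucket struct {
	upperBound int64
	eqRows     int64
	rangeRows  int64
}

// estimateLE returns the estimated fraction of rows satisfying `col <= val`,
// assuming a uniform distribution within each bucket.
func estimateLE(buckets []bucket, lowerBound, val int64) float64 {
	var total, matched float64
	prevUpper := lowerBound
	for _, b := range buckets {
		bucketRows := float64(b.eqRows + b.rangeRows)
		total += bucketRows
		switch {
		case val >= b.upperBound:
			// The whole bucket passes the predicate.
			matched += bucketRows
		case val > prevUpper:
			// Interpolate within the bucket; eq_rows are excluded since
			// they sit exactly at upper_bound, which is greater than val.
			frac := float64(val-prevUpper) / float64(b.upperBound-prevUpper)
			matched += frac * float64(b.rangeRows)
		}
		prevUpper = b.upperBound
	}
	if total == 0 {
		return 0
	}
	return matched / total
}
```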

### Collecting statistics

Consider the table:

```
CREATE TABLE test (
  k INT PRIMARY KEY,
  v STRING,
  INDEX (v, k)
)
```

At a high level, we will maintain statistics about each column and
each tuple of columns composing an index. For the above table, that
translates into statistics about `k`, `v` and the tuple `(v, k)`. The
per-column statistics allow estimation of the selectivity of a
predicate on that column. Similarly, the per-index statistics allow
estimation of the selectivity of a predicate on the columns in the
index. Note that in general the number of possible permutations and
combinations of columns is vastly larger than the number of indexes,
and we don't maintain statistics on most of them. The intuition behind
collecting statistics on indexes is that the existence of the index is
a hint that the table will be queried in such a way that the columns
in the index will be used.

The collection of statistics for a table span will be provided by a
`Sampler` processor. This processor receives rows from a TableReader
and:
- calculates sketches for estimating the density vector, and
- selects a random sample of rows of a given size; it does this by
  generating a random number for each row and choosing the rows with
  the top K values. The random values are returned with each row,
  allowing multiple sample sets to be combined into a single set.

```
enum SketchType {
  HLL_PLUS_PLUS_v1 = 0;
}

// SamplerSpec is the specification of a "sampler" processor which
// returns a sample (random subset) of the input columns and computes
// cardinality estimation sketches on sets of columns.
//
// The sampler is configured with a sample size and sets of columns
// for the sketches. It produces one row with global statistics, one
// row with sketch information for each sketch, plus at most
// sample_size sampled rows.
//
// The internal schema of the processor is formed of two column
// groups:
//   1. sampled row columns:
//      - columns that map 1-1 to the columns in the input (same
//        schema as the input).
//      - an INT column with the random value associated with the row
//        (this is necessary for combining sample sets).
//   2. sketch columns:
//      - an INT column indicating the sketch index
//        (0 to len(sketches) - 1).
//      - an INT column indicating the number of rows processed.
//      - an INT column indicating the number of NULL values
//        on the first column of the sketch.
//      - a BYTES column with the binary sketch data (format
//        dependent on the sketch type).
// Rows have NULLs on either all the sampled row columns or on all the
// sketch columns.
message SamplerSpec {
  optional uint32 sample_size;

  message SketchSpec {
    optional SketchType sketch_type;

    // Each value is an index identifying a column in the input stream.
    repeated uint32 columns;
  }
  repeated SketchSpec sketches;
}
```

A different `SampleAggregator` processor aggregates the results from
multiple Samplers and generates the histogram and other statistics
data. The processor is configured to populate the relevant rows in
`system.table_statistics`.

```
message SampleAggregator {
  optional sqlbase.TableDescriptor table;

  // The processor merges reservoir sample sets into a single
  // sample set of this size. This must match the sample size
  // used for each Sampler.
  optional uint32 sample_size;

  message SketchSpec {
    optional SketchType sketch_type;

    // Each value is a sqlbase.ColumnID.
    repeated uint32 column_ids;

    // If set, we generate a histogram for the first column in the sketch.
    optional bool generate_histogram;
  }
  repeated SketchSpec sketches;

  // The i-th value indicates the column ID of the i-th sampled row
  // column.
  repeated uint32 sampled_column_ids;

  // Indicates the columns for which we want to generate histograms.
  // Must be a subset of the sampled column IDs.
  repeated uint32 histogram_column_ids;
}
```

The SampleAggregator combines the samples into a single sample set (by
choosing the rows with the top K random values across all the sets);
the histogram is constructed from this sample set. In terms of
implementation, both the Sampler and SampleAggregator can use the same
method of obtaining the top-K rows (see the sketch below). Some
efficient methods:
- maintain the top K rows in a heap. Updating the heap is a
  logarithmic operation, but it only needs to happen when we find a
  new top element (for the initial sampling, we have only O(K log K)
  operations).
- maintain the top 2K (or 1.5K) rows; when the set is full, select
  the Kth element (which takes O(K)) and trim the set down to K.
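
A minimal sketch of the heap-based method (illustrative only; the row
payload is elided and the random rank is simplified to a `float64`).
The same structure can serve both the Sampler, which generates ranks
locally, and the SampleAggregator, which reuses the ranks carried with
the incoming sampled rows:

```
package stats

import (
	"container/heap"
	"math/rand"
)

// sampledRow pairs a row (elided here to an opaque payload) with the
// random rank used for top-K selection.
type sampledRow struct {
	rank float64
	row  interface{}
}

// sampleReservoir keeps the K rows with the largest ranks. It is a
// min-heap on rank, so the next row to evict sits at the root.
type sampleReservoir struct {
	k    int
	rows []sampledRow
}

func (r *sampleReservoir) Len() int           { return len(r.rows) }
func (r *sampleReservoir) Less(i, j int) bool { return r.rows[i].rank < r.rows[j].rank }
func (r *sampleReservoir) Swap(i, j int)      { r.rows[i], r.rows[j] = r.rows[j], r.rows[i] }
func (r *sampleReservoir) Push(x interface{}) { r.rows = append(r.rows, x.(sampledRow)) }
func (r *sampleReservoir) Pop() interface{} {
	old := r.rows
	n := len(old)
	x := old[n-1]
	r.rows = old[:n-1]
	return x
}

// sampleLocal is what a Sampler does per input row: tag it with a fresh
// random rank and offer it to the reservoir.
func (r *sampleReservoir) sampleLocal(row interface{}) {
	r.offer(sampledRow{rank: rand.Float64(), row: row})
}

// merge is what the SampleAggregator does: offer already-ranked rows
// from another Sampler's sample set, reusing their original ranks.
func (r *sampleReservoir) merge(other []sampledRow) {
	for _, sr := range other {
		r.offer(sr)
	}
}

func (r *sampleReservoir) offer(sr sampledRow) {
	if len(r.rows) < r.k {
		heap.Push(r, sr)
		return
	}
	// Only replace when the new row outranks the current minimum.
	if sr.rank > r.rows[0].rank {
		r.rows[0] = sr
		heap.Fix(r, 0)
	}
}
```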

This RFC is not proposing a specific algorithm for constructing the
histogram. The literature indicates that a very good choice is the
Max-Diff histogram with the "area" diff metric; this has a fairly
simple construction algorithm. We may implement a simpler one (e.g. an
equi-depth histogram) as a first iteration. Note that how we generate
the histogram does not affect the histogram data format, so this can
be changed arbitrarily.

[TBD: How many buckets to use for the histogram? SQL Server uses at
most 200.]

Good choices for the sketch algorithm are
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
and its improved version
[HLL++](http://research.google.com/pubs/pub40671.html). There are a
few Go implementations of both on GitHub with permissive licenses that
we can use. The infrastructure allows adding better sketch algorithms
later.
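
To make the mergeability requirement concrete, the processors could be
written against a small interface along these lines (hypothetical; a
concrete implementation would wrap whichever HLL/HLL++ library is
chosen):

```
package stats

// Sketch is a hypothetical abstraction over a cardinality estimation
// sketch such as HLL++. Implementations must be mergeable so that
// per-range (or per-Sampler) sketches can be combined into a single
// table-level sketch.
type Sketch interface {
	// Insert adds a value (e.g. the encoded datums of the sketched
	// columns) to the sketch.
	Insert(encoded []byte)

	// Estimate returns the estimated number of distinct values
	// inserted so far.
	Estimate() uint64

	// Merge folds another sketch of the same type into this one; the
	// result estimates the cardinality of the union of the two inputs.
	Merge(other Sketch) error

	// Marshal serializes the sketch so it can be shipped to the
	// SampleAggregator in the BYTES sketch column.
	Marshal() ([]byte, error)
}
```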

A DistSQL plan for this operation will look like this:

*(diagram omitted)*

### Stats cache

To cache statistics, we employ two caches:
1. *statistics cache*: keyed by table; it stores information for all
   statistics available for each table. The information excludes the
   histogram data.
2. *histogram cache*: each entry is a `(table_id, statistic_id)`
   pair and stores the full statistic information, including the
   histogram data.

The statistics cache is populated whenever we need the statistics for
a table. It gets populated by reading the relevant rows from
`table_statistics`; whenever this happens, we can choose to add the
histograms to the histogram cache as well (since we're scanning that
data anyway). The statistics cache can hold a large number of tables,
as the metadata it stores is fairly small.

The histogram cache is more limited because histograms are expected to
be 5-10KB in size. Whenever a histogram is required, the cache is
populated by reading the one row from `table_statistics`.

TBD: how do we update the statistics cache when a new statistic is
available? One option is to expire entries from the statistics cache
periodically to force a rescan; another option is to gossip a key
whenever a new table statistic is available.

The API for accessing the cache mirrors the two levels of caching:
there will be a function to obtain information (metadata) for all
statistics of a table, and a function to obtain the full information
for a given statistic. We can also have a function that returns all
statistics that include certain columns (which can potentially be more
efficient than retrieving each histogram separately).
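
A rough sketch of that API surface (all names hypothetical; shown only
to illustrate the two cache levels):

```
package stats

import "context"

// TableStatistic is the cached metadata for one row of
// system.table_statistics (histogram omitted at this level).
type TableStatistic struct {
	TableID     uint32
	StatisticID uint32
	ColumnIDs   []uint32
	RowCount    uint64
	Cardinality uint64
	NullValues  uint64
}

// HistogramBucket mirrors HistogramData.Bucket.
type HistogramBucket struct {
	EqRows     uint64
	RangeRows  uint64
	UpperBound []byte
}

// Histogram is the decoded histogram for a single-column statistic.
type Histogram struct {
	Buckets []HistogramBucket
}

// StatsCache mirrors the two cache levels described above.
type StatsCache interface {
	// GetTableStats returns the metadata for all statistics collected
	// for a table, reading from system.table_statistics on a miss.
	GetTableStats(ctx context.Context, tableID uint32) ([]TableStatistic, error)

	// GetHistogram returns the histogram for one statistic, reading the
	// single corresponding row on a miss.
	GetHistogram(ctx context.Context, tableID, statisticID uint32) (*Histogram, error)

	// GetStatsForColumns returns the metadata for all statistics whose
	// column set includes the given columns (potentially more efficient
	// than retrieving each statistic separately).
	GetStatsForColumns(ctx context.Context, tableID uint32, cols []uint32) ([]TableStatistic, error)
}
```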

## Drawbacks

* There are a number of items left for Future Work. It is possible some
  of these should be addressed sooner.

* Are there any additional simplifications that could be made for the
  initial implementation which wouldn't constrain Future Work?

## Rationale and Alternatives

* Range-level statistics could be computed by a new KV operation.
  While this may yield some performance benefits, it would require
  pushing down more knowledge of row encoding, as well as the sketch
  algorithms themselves. The Sampler processor model is quite a bit
  cleaner in terms of separating the implementation.

* Range-level statistics could be computed by a new storage-level
  `StatsQueue` that periodically refreshes the stats for a
  range. Pacing of the queue could be driven by the number of
  adds/deletes to the range since the last refresh. We'd want the
  range-level stats to generate histograms because back-of-the-envelope
  calculations indicate that storing samples for cluster-level
  aggregation would be too large. Storing histograms at the range
  level might also be too large, as we'd have one histogram per
  column, which could get expensive for rows with lots of columns.

  - How do we combine range-level histograms into cluster-level
    histograms?

  - Is it problematic for the range-level stats for different ranges of
    a table to be gathered at different times? How does that affect
    later cluster-level aggregation?

  - We split ranges at table boundaries, but not at index boundaries. So
    a range may have to provide more than one set of statistics. The
    range-level statistics collection will probably need to know about
    the format of keys to identify table/index boundaries.

* Range-level statistics could be gathered during RocksDB
  compactions. The upside of doing so is that the I/O is essentially
  free. The downside is that we'd need to perform the range-level
  statistics gathering in C++. This decision seems independent of much
  of the rest of the machinery. Nothing precludes us from gathering
  range-level statistics during RocksDB compactions in the future.

* The centralized stats table provides a new failure mode. If the
  stats table is unavailable we would proceed without the stats, but
  doing so might make a query run orders of magnitude slower. An
  alternative is to store a table's stats next to the table data
  (e.g. in a separate range). We can then require that the stats be
  present for complex queries (e.g. queries for which the best plan,
  without stats, has very high cost); this is similar to the existing
  failure mode where if you lose some ranges from a table, you might
  not be able to run queries on that table. This concern is not unique
  to the stats table: most of the system metadata tables have the same
  concern. Storing the stats for a table near the table data could be
  done using a special key (e.g. `/Table/TableID/MaxIndexID`). This
  would help the backup/restore scenario as well.

* There are other ways to obtain and combine reservoir sets, e.g.
  [this blog post](https://ballsandbins.wordpress.com/2014/04/13/distributedparallel-reservoir-sampling/).
  Note though that we would still need to pass the sampling rate with
  each row; otherwise the SampleAggregator can't tell which rows are
  coming from which set. The "top K random numbers" idea is cleaner.

## Future work

* Determining when and what stats to collect. We will probably start
  with an explicit command for calculating statistics. Later we can
  work on automatic stat generation, and automatically choosing which
  statistics to gather. Columns that form indexes are a good signal
  that we may see constraints on those columns, so we can start by
  choosing those columns (or column groups).

* We don't need stats for every column in a table. For example,
  consider the "second line" of an address, the "address complement".
  This is the kind of column that mostly gets printed, and seldom gets
  queried. Determining which columns to collect stats on is a tricky
  problem. Perhaps there should be feedback from query execution to
  indicate which columns and groups of columns are being used in
  predicates. Perhaps there should be DBA control to indicate that
  stats on certain columns should not be collected.

* For columns that are already sorted (i.e. the first column in an
  index), we can build the histogram directly instead of using a
  sample. Doing so can generate a more accurate histogram. We're
  leaving this to future work as it can be considered purely an
  optimization.

* Provide support for historical stats in order to better optimize `AS
  OF SYSTEM TIME` queries. The relative rarity of such queries
  motivates not supporting them initially and just using the current
  stats.

* While executing a query we might discover that the expectations we
  inferred from the statistics are inaccurate. At a minimum, a
  significant discrepancy should trigger a stats refresh. The actual
  data retrieved by the query could also be used to update the
  statistics, though this complicates normal query execution. See also
  [self-tuning
  histograms](https://ashraf.aboulnaga.me/pubs/sigmod99sthist.pdf).

* If the stats haven't been updated in a long time and we happen to be
  doing a full table-range scan for some query, we might as well
  recalculate the stats too in the same pass.

* Implement a strategy for detecting "heavy hitters" (a.k.a. the N
  most frequent values). The literature says this can be done by
  performing a second pass of sampling. The first pass performs a
  normal reservoir sample. The N most frequent values are very likely
  to appear in this sample, so sort the sample by frequency and
  perform a second pass where we compute the exact counts for the N
  most frequent values in the first-pass sample. The "heavy hitters"
  would be subtracted from the histogram and maintained in a separate
  list. Note that a second pass over the table data would also allow a
  much better estimation of the distinct values per histogram
  bucket. The downside is that we'd be performing a second pass over
  the table, which would likely double the cost of stats
  collection. Perhaps we can find a query-driven means of determining
  whether this second pass is worthwhile.

* For interleaved tables, we can optimize the stats work so that we
  collect the stats for both the parent and the interleaved table at
  the same time. It is possible that this will not fall into future
  work but will be required in the initial implementation in order to
  support interleaved tables.

## Unresolved questions

* How is the central coordinator selected? Should there be a different
  coordinator per table? How is a request to analyze a table directed
  to the central coordinator?

* Should stats be segregated by partition? Doing so could help if we
  expect significant partition-specific data distributions. Would
  doing so make cross-partition queries more difficult to optimize?
  See the [table partitioning](20170208_sql_partitioning.md)
  RFC. Segregating stats by partition might make repartitioning more
  difficult. Or perhaps it makes stats inaccurate for a period of time
  after repartitioning. If partitions do not have a partition ID (one
  of the proposals) we'll need to find some way to identify the
  partition for stats storage.

* What if a column has very large values? Perhaps we want to adjust
  the number of buckets depending on that.